Wikipedia and Information Theory -
     
 
The Wikipedia Review: A forum for discussion and criticism of Wikipedia
Wikipedia Review Op-Ed Pages


> Wikipedia and Information Theory
anthony
post
Post #1


Postmaster
*******

Group: Regulars
Posts: 2,034
Joined:
Member No.: 2,132



The English Wikipedia database, uncompressed: 5.34 terabytes
The English Wikipedia database, compressed: 32 gigabytes
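For a rough sense of scale, the two figures above can be turned into a compression ratio. This is only a back-of-the-envelope sketch: it assumes decimal units (1 TB = 10^12 bytes, 1 GB = 10^9 bytes), and the function names are made up for illustration.

```python
# Back-of-the-envelope compression ratio for the two sizes quoted above.
# Assumption: decimal units (1 GB = 10**9 bytes, 1 TB = 1000 GB).
GB = 10**9

def compression_ratio(uncompressed_bytes, compressed_bytes):
    """Return how many original bytes each compressed byte stands for."""
    return uncompressed_bytes / compressed_bytes

def bits_per_original_byte(uncompressed_bytes, compressed_bytes):
    """Return the average number of compressed bits spent per original byte."""
    return 8 * compressed_bytes / uncompressed_bytes

uncompressed = 5_340 * GB  # 5.34 terabytes
compressed = 32 * GB       # 32 gigabytes

ratio = compression_ratio(uncompressed, compressed)
bits = bits_per_original_byte(uncompressed, compressed)
print(f"ratio: about {ratio:.0f}:1")
print(f"about {bits:.3f} compressed bits per original byte")
```

Under those assumptions the ratio comes out at roughly 167:1, or about a twentieth of a bit per original byte.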
Replies
MZMcBride
post
Post #2


Über Member
*****

Group: Regulars
Posts: 671
Joined:
Member No.: 10,962



anthony is almost certainly referring to pages-meta-history.xml.7z, which is "all pages with complete edit history" and weighs in at 31.9 GB. 7-Zip is pretty nifty compression, so much so that it's been seriously suggested lately to drop the bzip2 dumps altogether.
anthony
post
Post #3


Postmaster
*******

Group: Regulars
Posts: 2,034
Joined:
Member No.: 2,132



QUOTE(MZMcBride @ Sat 3rd April 2010, 6:07am) *

anthony is almost certainly referring to pages-meta-history.xml.7z, which is "all pages with complete edit history" and weighs in at 31.9 GB.


Yes, vs. the uncompressed version of that same file.

QUOTE(Milton Roe @ Sat 3rd April 2010, 5:25pm) *

It still boggles my mind that all of WP, all past pages and non-oversighted history included, minus images, can be compressed and stored on one flash drive.


Yep. Mine too. And with a non-customized, off-the-shelf compression algorithm at that. I could geek on about it for hours. But I shouldn't.
Milton Roe
post
Post #4


Known alias of J. Random Troll
*********

Group: Regulars
Posts: 10,209
Joined:
Member No.: 5,156



QUOTE(anthony @ Sat 3rd April 2010, 4:05pm) *

QUOTE(Milton Roe @ Sat 3rd April 2010, 5:25pm) *

It still boggles my mind that all of WP, all past pages and non-oversighted history included, minus images, can be compressed and stored on one flash drive.

Yep. Mine too. And with a non-customized, off-the-shelf compression algorithm at that. I could geek on about it for hours. But I shouldn't.

Fortunately, I have no such inhibitions.

I have little doubt that most of WP (one copy of each article, uncompressed), minus illustrations, could go on a 32 GB microSD card, and I'm a bit shocked (but not much) to find that the latest generation of smartphones has a microSD port, into which the memory chip can be entirely inserted, just as in small cameras. So you could, in theory, have access to all of WP even without net access, that way. The WP card could even be bundled with new phones.

Now, if I were the WMF, I'd be thinking about this. Most people with a smart phone would pay $5 to have it include a microSD card with all of text WP on it. If you sell a couple of million of these things each year as encyclopedia updates (not hard at all, or triple the number and halve the price) you could forget all that endless fundraising hoopla every year. It would all be replaced by a giant push to "get your quota of articles promoted to microSD-memory-card-status, in time for the Yearly December 1 edition" (which goes out then, in time for Christmas, etc).

The only problem is in flagging or promoting a "clean" version of each article to go on the yearly microSD edition. There are far too many articles to make this an admin job (it would be about 4,000 articles per admin), but not too many if you made everybody with more than N thousand edits a "promoter" with the power to press a button to give any given version of any article "Hardchip ready" status. Pick your N thousand edit threshold and the number n of votes needed from separate promoters, so that you get the job done each year. I would even include a "Not PG-13" toggle so promoters could vote articles off the microSD as not being suitable for kids. More than X of these votes from X different promoters, and the article never gets onto the chip version, no matter how many other people want it there.

And that's it. A new batch of chips goes out for the new edition every year. People compete to make sure their favorite articles get in. Articles that nobody cares about, don't get in, and probably shouldn't (all those unwatched BLPs...). The porn also is filtered out.

Once the cell phone of the dusty child in Africa has advanced to having a microSD card port (I don't think it will be that long) even he or she might actually benefit from this.

Damn, it's such an interesting idea. Erik Moeller and Jimbo will be so pleased they thought of it. Eventually....

Of course, WMF doesn't have to do this. Any private company with the parallel editor processing power to promote enough articles to microSD status, could do it. The problem is, reading and passing a million articles (let's pretend 1/3rd of WP articles are even close to being worthy to include) is real work. Without Jimbo's army of volunteer minions, I don't think any company could profitably DO it. But I'd love to be proven wrong.

Maybe you could hire enough people to do it in Ireland. And if not there, perhaps India.
anthony
post
Post #5


Postmaster
*******

Group: Regulars
Posts: 2,034
Joined:
Member No.: 2,132



QUOTE(Milton Roe @ Sun 4th April 2010, 1:32am) *

Most people with a smart phone would pay $5 to have it include a microSD card with all of text WP on it.


How about $8.99?

The 32 gig file I mentioned is not capable of random access. But "pages-articles.xml.bz2 5.7 GB" (*) with a little bit of indexing and some intelligent (but fairly simple) software, is.

(*) Just the "current" versions of just the "articles".
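One way the "little bit of indexing" could work, sketched in Python. This is not how the actual dump file is laid out; it assumes you re-pack the pages yourself into small, independently compressed bz2 blocks and keep a title index next to them. The names `build_archive` and `read_page` are made up for illustration, and the NUL separator assumes page text never contains a NUL character.

```python
import bz2
import io

def build_archive(pages, pages_per_block=2):
    """Pack pages into independent bz2 blocks and index them by title.

    Returns (archive_bytes, index), where index maps
    title -> (block offset, block length, position inside the block).
    """
    archive = io.BytesIO()
    index = {}
    items = list(pages.items())
    for start in range(0, len(items), pages_per_block):
        block = items[start:start + pages_per_block]
        # Assumption: page text never contains NUL, so it can act as a separator.
        raw = "\x00".join(text for _, text in block).encode("utf-8")
        # Each block is a complete bz2 stream, so it can be decompressed alone.
        payload = bz2.compress(raw)
        offset = archive.tell()
        archive.write(payload)
        for position, (title, _) in enumerate(block):
            index[title] = (offset, len(payload), position)
    return archive.getvalue(), index

def read_page(archive_bytes, index, title):
    """Decompress only the block that holds the requested page."""
    offset, length, position = index[title]
    block = bz2.decompress(archive_bytes[offset:offset + length])
    return block.decode("utf-8").split("\x00")[position]

# Tiny made-up example data, not real article text.
pages = {"Alpha": "first page", "Beta": "second page", "Gamma": "third page"}
archive_bytes, index = build_archive(pages)
print(read_page(archive_bytes, index, "Gamma"))
```

On a real card the slice would be a seek and a read on the file rather than an in-memory slice. Smaller blocks mean faster lookups but worse compression, so the block size is a trade-off.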

QUOTE(Milton Roe @ Sun 4th April 2010, 1:32am) *

The problem is, reading and passing a million articles (let's pretend 1/3rd of WP articles are even close to being worthy to include) is real work.


Well, yeah, that's the problem with Wikipedia. Of course, I don't think reading and passing a million articles will do much good. I suppose offline Wikipedia is useful for people who are going for long periods of time without Internet access (and then, it's useful even without being vetted). For people with nearby (free) Internet access, it's fairly useless.

Maybe in China... I don't know how good their access to the uncensored Internet is, but if it's difficult, maybe offline Wikipedia is part of a solution (for those brave enough to risk prosecution, anyway).