
MediaWiki Software: Wikipedia and Information Theory

Posted by: anthony

The English Wikipedia database, uncompressed: 5.34 terabytes
The English Wikipedia database, compressed: 32 gigabytes

Posted by: Moulton

QUOTE(anthony @ Fri 2nd April 2010, 11:09pm) *
The English Wikipedia database, uncompressed: 5.34 terabytes
The English Wikipedia database, compressed: 32 gigabytes

My Talk Page Signature is 666 Terror Bytes, repressed...

Posted by: Milton Roe

QUOTE(anthony @ Fri 2nd April 2010, 8:09pm) *

The English Wikipedia database, uncompressed: 5.34 terabytes
The English Wikipedia database, compressed: 32 gigabytes

Weird. 167 to 1.

The weirdness is that the best English text compression is roughly 9 to 1 (without any external dictionary for common words like "the").

So what the heck compresses down 167 times? I would assume the database is mostly English text, no? It's too small to include any images. So what's the rest of the crud on WP that takes up so much space but is just.... space? blink.gif

And no, this is not a snarky comment on content. Even crappy content and vandalism should take up the same space as the best and finest prose. Provided it's not run-on letter vandalism à la zzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzz.
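To put rough numbers on the head-scratching, a quick back-of-the-envelope in Python, using only the figures quoted above:

CODE
# Figures quoted earlier in the thread.
uncompressed_tb = 5.34
compressed_gb = 32.0

ratio = uncompressed_tb * 1000 / compressed_gb
print(round(ratio))          # ~167, the "weird" ratio

# Against the roughly 9:1 that the best English-text compression manages on a
# single document, that leaves a factor of about 18-19 to be explained by
# something other than ordinary English redundancy.
print(round(ratio / 9, 1))   # ~18.5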


Posted by: Moulton

QUOTE(Milton Roe @ Sat 3rd April 2010, 12:12am) *
So what the heck compresses down 167 times?

167 replicates of the same old same old?

Posted by: Somey

QUOTE(Milton Roe @ Fri 2nd April 2010, 11:12pm) *
So what the heck compresses down 167 times? I would assume the database is mostly English text, no? It's too small to include any images. So what's the rest of the crud on WP that takes up so much space but is just.... space? blink.gif

When Mr. Anthony says "database," what exactly are we talking about, format-wise? A MySQL dump (either in the form of SQL commands or XML) would have a tremendous amount of textual repetition. The actual MySQL file itself, on the server, would have tons of null space and also a lot of indexes, including full-text indexes, which take up a great deal of space as well...

As I recall, the table layouts in MediaWiki are pretty generous in terms of field-lengths... It's still an unusually high ratio, but not unheard of.

Posted by: Kelly Martin

The full dump contains full-text versions of each revision of each article. That's a great deal of redundancy right there. Then you have all the "redundancy" of the XML structure wrapping, which uses far more bits than strictly needed and adds very little entropy. And I don't know how much of that 5-odd terabytes is indexing; indexes obviously add no information at all.

Posted by: Milton Roe

QUOTE(Moulton @ Fri 2nd April 2010, 9:17pm) *

QUOTE(Milton Roe @ Sat 3rd April 2010, 12:12am) *
So what the heck compresses down 167 times?

167 replicates of the same old same old?

Sure, but I assume WP doesn't take advantage of that. They really do store one entire copy of the page every time you make a single-letter change, BUT that redundancy does them no good for compression, because to exploit it they'd need some method of storing diffs only, along with gold-standard flagged whole versions every so often.

They don't do that, so their high compression ratio does NOT reflect all that redundancy of many versions that are almost the same. blink.gif If they could tap into THAT, their text compression would be orders of magnitude BETTER. Note that this doesn't mean their retrieval time would be that much faster, though (see below).

Kelly and Anthony discussed this some time ago, I think. If you had decent flagged versions you trusted, you could actually use the flagged versions to give you whole-image marker points to store completely, with diffs in between (assumed to be mostly crap, anyway). The worst that happens then, if something goes bzzzzzt, is that you have to go back to your last good whole backup copy, which is your last good flagged version.

So, anyway, if you do all this stuff at the same time, flagged revisions actually help your data compression by a huge amount. Kelly opined that it's not as big a win as you'd think, since it takes computation cycles to restore versions from the last good image plus the series of diffs since, but still, you save a lot. Anthony gave a link to some kind of mini-max tradeoff algorithm between compression and the computation time needed to expand it from diffs.
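To make the idea concrete, here is a minimal sketch of that kind of keyframe-plus-diffs storage, in Python using difflib. The class name and keyframe interval are invented for illustration, and a real system would also compress what it stores:

CODE
import difflib

class RevisionStore:
    """Keyframe-plus-diff storage, roughly as described above: every Nth
    revision is kept in full, the rest as line-level deltas against the
    previous revision. Reading a revision back means replaying the deltas
    since the last keyframe, which is the computation-time side of the
    tradeoff."""

    def __init__(self, keyframe_every=20):
        self.keyframe_every = keyframe_every
        self.revisions = []              # each entry: ('full', lines) or ('delta', ops)

    def add(self, text):
        lines = text.splitlines(keepends=True)
        if len(self.revisions) % self.keyframe_every == 0:
            self.revisions.append(('full', lines))   # gold-standard whole copy
            return
        prev = self._rebuild(len(self.revisions) - 1)
        matcher = difflib.SequenceMatcher(a=prev, b=lines, autojunk=False)
        ops = []
        for tag, i1, i2, j1, j2 in matcher.get_opcodes():
            if tag == 'equal':
                ops.append(('copy', i1, i2))         # reference base lines, store no text
            elif tag in ('replace', 'insert'):
                ops.append(('add', lines[j1:j2]))    # store only the new lines
            # 'delete': the base lines simply aren't copied, so nothing is stored
        self.revisions.append(('delta', ops))

    def _rebuild(self, index):
        kind, payload = self.revisions[index]
        if kind == 'full':
            return list(payload)
        base = self._rebuild(index - 1)              # replay back to the last keyframe
        out = []
        for op in payload:
            if op[0] == 'copy':
                out.extend(base[op[1]:op[2]])
            else:
                out.extend(op[1])
        return out

    def get(self, index):
        return ''.join(self._rebuild(index))

Round-tripping a few edited versions of the same text through add() and get() shows each small edit costing roughly the size of its changed lines, while reads pay for a chain of delta replays back to the last keyframe, which is exactly the tradeoff being described.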

Actually, the whole problem is fascinating, and because of its diff-edit structure, WP is a unique case that has its own unique mini-max solution. For some bunch of gurus this could be a really cool academic project.

Too bad WMF's whole mindset is never to pay for anything you can get somebody to do for you free. unhappy.gif

But eventually, somebody will tackle this for a database like WP.

Posted by: MZMcBride

anthony is almost certainly referring to http://dumps.wikimedia.org/enwiki/20100130/, which is "all pages with complete edit history" and weighs in at 31.9 GB. 7-Zip is pretty nifty compression, so much so that it's been seriously suggested lately to drop the bzip2 dumps altogether.

Posted by: CharlotteWebb

QUOTE(Milton Roe @ Sat 3rd April 2010, 4:46am) *

They don't do that, so their high compression ratio does NOT reflect all that redundancy of many versions that are almost the same. blink.gif If they could tap into THAT, their text compression would be orders of magnitude BETTER. Note that this doesn't mean their retrieval time would be that much faster, though (see below).

Kelly and Anthony discussed this some time ago, I think. If you had decent flagged versions you trusted, you could actually use the flagged versions to give you whole-image marker points to store completely, with diffs in between (assumed to be mostly crap, anyway). The worst that happens then, if something goes bzzzzzt, is that you have to go back to your last good whole backup copy, which is your last good flagged version.

So, anyway, if you do all this stuff at the same time, flagged revisions actually help your data compression by a huge amount. Kelly opined that it's not as big a win as you'd think, since it takes computation cycles to restore versions from the last good image plus the series of diffs since, but still, you save a lot. Anthony gave a link to some kind of mini-max tradeoff algorithm between compression and the computation time needed to expand it from diffs.

http://wikipediareview.com/index.php?act=findpost&pid=196968.

Not the "good" revisions necessarily, just the ones heuristically deemed attractive based on other factors. Actually key-framing only the "good" revs would be a drain on efficiency because in most cases the ones flagged "good" will be in chronologically adjacent groups where the intervening changes are trivial, making them more redundant to one another than to the unflagged edits where perhaps several new paragraphs were added and rejected.

Also note that public statements implying a dependency relationship between flagged revs and any other technical suggestion (including those http://www.urbandictionary.com/define.php?term=gfys) will delay the enablement of flagged revs by an average of 5.12 additional months.

Posted by: Milton Roe

QUOTE(MZMcBride @ Fri 2nd April 2010, 11:07pm) *

anthony is almost certainly referring to http://dumps.wikimedia.org/enwiki/20100130/, which is "all pages with complete edit history" and weighs in at 31.9 GB. 7-Zip is pretty nifty compression, so much so that it's been seriously suggested lately to drop the bzip2 dumps altogether.


It still boggles my mind that all of WP, all past pages and non-oversighted history included, minus images, can be compressed and stored on one flash drive.

If we could identify one "stable" certified version of each article, and only store THAT, perhaps it would be only 1% of this? Or 1/167th of this? wink.gif

(How many edits does the average article have? Or, a better question: what is the ratio of the size of the LAST version of all the articles to the size of their entire edit history? Does anybody know?)
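One way to actually answer that would be to stream a full-history dump and add up text sizes; a rough sketch in Python (the file name is just illustrative, and it assumes the standard export layout of <page> elements holding chronologically ordered <revision><text> children):

CODE
import bz2
import xml.etree.ElementTree as ET

def history_to_current_ratio(dump_path="pages-meta-history.xml.bz2"):
    """Ratio of total text bytes across every revision to the text bytes of
    only the newest revision of each page."""
    all_bytes = 0
    current_bytes = 0
    last_seen = 0
    with bz2.open(dump_path, "rb") as stream:
        for _, elem in ET.iterparse(stream, events=("end",)):
            tag = elem.tag.rsplit("}", 1)[-1]        # strip the XML namespace
            if tag == "text":
                size = len((elem.text or "").encode("utf-8"))
                all_bytes += size
                last_seen = size                     # overwritten until the page ends
            elif tag == "page":
                current_bytes += last_seen           # newest revision of this page
                last_seen = 0
                elem.clear()                         # keep memory use bounded
    return all_bytes / max(1, current_bytes)

On the real dump that would be one long single pass, but it answers the question without loading anything into a database.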

The point being that if we were looking at only the last "good" version, we might be able to store all WP text on one flash drive UNCOMPRESSED. Something like that has been done, but it's not user-friendly (some business person naturally wants to sell you a KINDLE-type dedicated reader).

The world changes when you can get it all on a flash drive that is readable by any hand-held computer-driven device, including a smartphone (these things all badly need flash-drive ports). That would make access to WP independent of web access, and (as I've said before) actually have a chance of being of some use to the kid in the third world (for example, the people of the island of Yap, whom I visited), where internet access is very, very expensive. But everybody there speaks English and has a crying need for Western education (they go to Guam or Hawaii for secondary schooling).

Posted by: Milton Roe

QUOTE(CharlotteWebb @ Sat 3rd April 2010, 1:24am) *

Not the "good" revisions necessarily, just the ones heuristically deemed attractive based on other factors. Actually key-framing only the "good" revs would be a drain on efficiency because in most cases the ones flagged "good" will be in chronologically adjacent groups where the intervening changes are trivial, making them more redundant to one another than to the unflagged edits where perhaps several new paragraphs were added and rejected.

Good point, but the obvious fix for that would be artificial time frames, or rather (since edit frequency varies and goes through rushes of edit-excitement), artificial edit-number frames. These would take one "flagged" version per artificially set "number frame" of 20, or 100, or whatever number of edits you like. You can set this edit-number frame as N/n, where N is the total number of edits in an article's history and n is the number of times you want to key-frame-sample it.

There are other obvious compression fixes. For example, when there's a long series of uninterrupted edits by the same editor (as often happens), you can set the algorithm to simply ignore all but the last one -- including the contribution they make to the raw N number of edits to the article. The LAST one in a series by the same editor is the one that goes into the pot of "let's see if it's a flagged version that we can key-frame." The others are essentially treated as if they never existed.
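Putting those two rules together, the selection step might look something like this (purely illustrative; the function name and sample history are invented):

CODE
from math import ceil

def keyframe_candidates(revisions, n_keyframes):
    """Collapse each uninterrupted run of edits by the same editor down to its
    last edit, then sample the survivors at a fixed edit-number interval.
    `revisions` is a chronological list of (editor, rev_id) pairs."""
    collapsed = [rev for i, rev in enumerate(revisions)
                 if i + 1 == len(revisions) or rev[0] != revisions[i + 1][0]]
    N = len(collapsed)
    if N == 0 or n_keyframes <= 0:
        return []
    step = max(1, ceil(N / n_keyframes))      # the "N/n" edit-number frame
    return collapsed[step - 1::step]          # one candidate per frame

# Example: Bob's uninterrupted run of three edits counts only once.
history = [("Alice", 1), ("Bob", 2), ("Bob", 3), ("Bob", 4), ("Carol", 5), ("Alice", 6)]
print(keyframe_candidates(history, n_keyframes=2))   # [('Bob', 4), ('Alice', 6)]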

Posted by: anthony

QUOTE(MZMcBride @ Sat 3rd April 2010, 6:07am) *

anthony is almost certainly referring to http://dumps.wikimedia.org/enwiki/20100130/, which is "all pages with complete edit history" and weighs in at 31.9 GB.


Yes, vs. the uncompressed version of that same file.

QUOTE(Milton Roe @ Sat 3rd April 2010, 5:25pm) *

It still boggles my mind that all of WP, all past pages and non-oversighted history included, minus images, can be compressed and stored on one flash drive.


Yep. Mine too. And with a non-customized, off-the-shelf compression algorithm at that. I could geek on about it for hours. But I shouldn't.

Posted by: Kelly Martin

QUOTE(MZMcBride @ Sat 3rd April 2010, 1:07am) *
anthony is almost certainly referring to http://dumps.wikimedia.org/enwiki/20100130/, which is "all pages with complete edit history" and weighs in at 31.9 GB. 7-Zip is pretty nifty compression, so much so that it's been seriously suggested lately to drop the bzip2 dumps altogether.
If that's what's being discussed, then yeah, a good deal of the predictability of the dump's character stream comes from the inherent redundancy of English text, the extremely low SNR of the XML markup (XML is typically horrid in that regard), and the high degree of repetitiveness in the content, because each revision appears in the dump in full and is thus likely to be substantially similar to text previously included in the dump. I'm not familiar with 7zip's specific compression approach, but most general-purpose stream compressors are likely to achieve significant compression when asked to compress a sequence of data which repeats itself either exactly or nearly exactly, as long as the repetitions are not too far separated from one another. Since the dumps are hierarchical, with all of the revisions of a given article appearing in the dump in sequential order without interleaving revisions from other articles, that condition will tend to hold for most frequently edited articles.
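The effect is easy to reproduce with an off-the-shelf compressor. A small synthetic illustration in Python, with made-up text standing in for wikitext and the standard library's LZMA standing in for 7-Zip's default method:

CODE
import lzma
import random
import string

random.seed(0)

# A fake 2,000-word "article" built from a small vocabulary.
vocab = ["".join(random.choices(string.ascii_lowercase, k=random.randint(3, 9)))
         for _ in range(2000)]
words = random.choices(vocab, k=2000)

# A fake full-history dump: 200 successive revisions of the same article, each
# differing from the last by a handful of word-level edits, concatenated in
# order just as the revisions of one article are in the real dump.
revisions = []
for _ in range(200):
    for _ in range(5):
        words[random.randrange(len(words))] = random.choice(vocab)
    revisions.append("<revision><text>" + " ".join(words) + "</text></revision>")

dump = "\n".join(revisions).encode("utf-8")
packed = lzma.compress(dump, preset=9)

print(len(dump) // 1024, "KiB raw,", len(packed) // 1024, "KiB compressed,",
      "ratio ~%d:1" % (len(dump) // len(packed)))

The exact ratio will vary with the parameters, but most of it comes from the cross-revision repetition rather than from English-like redundancy within any single revision.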

The dump does not reflect how the revisions are stored in the running database; there, a variety of compression methods are used to reduce storage needs. The wheel has been reinvented many times in MediaWiki land, and they're still working on rounding off the corners.

Posted by: Milton Roe

QUOTE(anthony @ Sat 3rd April 2010, 4:05pm) *

QUOTE(Milton Roe @ Sat 3rd April 2010, 5:25pm) *

It still boggles my mind that all of WP, all past pages and non-oversighted history included, minus images, can be compressed and stored on one flash drive.

Yep. Mine too. And with a non-customized, off-the-shelf compression algorithm at that. I could geek on about it for hours. But I shouldn't.

Fortunately, I have no such inhibitions.

I have little doubt that most of WP (one copy of each article, uncompressed), minus illustrations, could go on a 32 GB microSD card, and I'm a bit shocked (but not much) to find that the latest generation of smartphones has a microSD port, into which the memory chip can be inserted entirely, just as in small cameras. So you could, in theory, have access to all of WP even without net access, that way. The WP card could even be bundled with new phones.

Now, if I were the WMF, I'd be thinking about this. Most people with a smartphone would pay $5 to have it include a microSD card with all of WP's text on it. If you sell a couple of million of these things each year as encyclopedia updates (not hard at all; or triple the number and halve the price), you could forget all that endless fundraising hoopla every year. It would all be replaced by a giant push to "get your quota of articles promoted to microSD-memory-card status, in time for the yearly December 1 edition" (which goes out then, in time for Christmas, etc.).

The only problem is in flagging or promoting a "clean" version of each article to go on the yearly microSD edition. There are far too many articles to make this an admin job (it would be about 4,000 articles per admin), but not too many if you made everybody with more than N thousand edits a "promoter" with the power to press a button to put any given version of any article into "hardchip-ready" status. Pick your N-thousand-edit number and your number n of needed votes from separate promoters, so that the job gets done each year. I would even include a "Not PG-13" toggle so promoters could vote articles off the microSD as not being suitable for kids. More than X of these votes from X different promoters, and the article never gets onto the chip version, no matter how many other people want it there.
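Boiled down, that is just a couple of thresholds; a toy sketch (the names and numbers are placeholders, not anything Wikipedia actually uses):

CODE
def chip_worthy(promote_voters, not_pg13_voters, n_needed=3, x_veto=2):
    """A version makes the yearly card if at least n distinct promoters vote
    it in, unless at least X distinct promoters flag it as unsuitable for kids."""
    if len(set(not_pg13_voters)) >= x_veto:
        return False                              # voted off the chip entirely
    return len(set(promote_voters)) >= n_needed

# Example: four promote votes, one "Not PG-13" flag -> goes on the card.
print(chip_worthy({"A", "B", "C", "D"}, {"E"}))   # True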

And that's it. A new batch of chips goes out for the new edition every year. People compete to make sure their favorite articles get in. Articles that nobody cares about, don't get in, and probably shouldn't (all those unwatched BLPs...). The porn also is filtered out.

Once the cell phone of the dusty child in Africa has advanced to having a microSD card port (I don't think it will be that long) even he or she might actually benefit from this. blink.gif

Damn, it's such an interesting idea. Erik Möller and Jimbo will be so pleased they thought of it. Eventually....

Of course, WMF doesn't have to do this. Any private company with the parallel editor-processing power to promote enough articles to microSD status could do it. The problem is, reading and passing a million articles (let's pretend a third of WP articles are even close to being worthy of inclusion) is real work. Without Jimbo's army of volunteer minions, I don't think any company could profitably DO it. But I'd love to be proven wrong.

Maybe you could hire enough people to do it in Ireland. And if not there, perhaps India. smile.gif

Posted by: anthony

QUOTE(Milton Roe @ Sun 4th April 2010, 1:32am) *

Most people with a smartphone would pay $5 to have it include a microSD card with all of WP's text on it.


http://itunes.apple.com/app/encyclopedia/id288141564?mt=8?

The 32 gig file I mentioned is not capable of random access. But "pages-articles.xml.bz2 5.7 GB" (*) with a little bit of indexing and some intelligent (but fairly simple) software, is.

(*) Just the "current" versions of just the "articles".
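That "little bit of indexing and some intelligent (but fairly simple) software" can be sketched: compress the articles in independent chunks and keep a title-to-offset index, so fetching one article costs one seek and one small decompression instead of unpacking a 5.7 GB stream from the start. A minimal, hypothetical Python version (file names, chunk size, and JSON framing are all just for illustration):

CODE
import bz2
import json

def build_store(pages, chunk_size=100, data_path="articles.dat", index_path="articles.idx"):
    """Write (title, wikitext) pairs as independently bz2-compressed chunks,
    plus a title -> (offset, length) index."""
    index = {}
    with open(data_path, "wb") as data:
        chunk, members = [], []
        def flush():
            if not chunk:
                return
            offset = data.tell()
            blob = bz2.compress(json.dumps(dict(chunk)).encode("utf-8"))
            data.write(blob)
            for title in members:
                index[title] = (offset, len(blob))
            chunk.clear()
            members.clear()
        for title, text in pages:
            chunk.append((title, text))
            members.append(title)
            if len(chunk) >= chunk_size:
                flush()
        flush()
    with open(index_path, "w") as f:
        json.dump(index, f)

def lookup(title, data_path="articles.dat", index_path="articles.idx"):
    """Seek straight to the chunk holding `title` and decompress only that chunk."""
    with open(index_path) as f:
        offset, length = json.load(f)[title]
    with open(data_path, "rb") as data:
        data.seek(offset)
        chunk = json.loads(bz2.decompress(data.read(length)).decode("utf-8"))
    return chunk[title]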

QUOTE(Milton Roe @ Sun 4th April 2010, 1:32am) *

The problem is, reading and passing a million articles (let's pretend a third of WP articles are even close to being worthy of inclusion) is real work.


Well, yeah, that's the problem with Wikipedia. Of course, I don't think reading and passing a million articles will do much good. I suppose offline Wikipedia is useful for people who are going for long periods of time without Internet access (and then, it's useful even without being vetted). For people with nearby (free) Internet access, it's fairly useless.

Maybe in China... I don't know how hard it is for them to access the uncensored Internet, but if it's difficult, maybe offline Wikipedia is part of a solution (for those brave enough to risk prosecution, anyway).