QUOTE(Moulton @ Fri 2nd April 2010, 9:17pm)
QUOTE(Milton Roe @ Sat 3rd April 2010, 12:12am)
So what the heck compresses down 167 times?
167 replicates of the same old same old?
Sure, but I assume WP doesn't do that. They really do store one entire copy of the article every time you make a single-letter change, BUT that does them no good in compression, because to exploit it they'd need some method of storing diffs only, along with gold-standard flagged whole versions every so often.
They don't do that, so their high compression ratio does NOT reflect all that redundancy of many versions that are almost the same. If they could tap into THAT, their text compression would be several orders of magnitude BETTER. Note that this doesn't mean their retrieval time would be that much faster, though (see below).
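To put a rough number on the redundancy point, here is a toy sketch in Python. It assumes nothing about how WP actually stores or compresses text; the made-up base text, the one-sentence edits, and the use of zlib are all my own illustration. It compresses 167 nearly identical revisions one at a time, then all together in one stream, and compares the totals.

```python
import zlib

# Hypothetical starting article: 60 short, slightly different sentences.
BASE = " ".join(f"Sentence {i} about topic {i * i % 97}." for i in range(60))

def make_revisions(base: str, count: int) -> list[str]:
    """Build `count` revisions; each one appends a short sentence to the last."""
    revisions = [base]
    for i in range(1, count):
        revisions.append(revisions[-1] + f" Edit number {i}.")
    return revisions

def compressed_sizes(revisions: list[str]) -> tuple[int, int]:
    """Return (bytes when each revision is compressed alone,
    bytes when all revisions are compressed as one stream)."""
    separate = sum(len(zlib.compress(r.encode("utf-8"))) for r in revisions)
    together = len(zlib.compress("\n".join(revisions).encode("utf-8")))
    return separate, together

separate, together = compressed_sizes(make_revisions(BASE, 167))
print(f"compressed one by one: {separate} bytes")
print(f"compressed as one stream: {together} bytes")
```

One caveat on the toy: zlib only looks back about 32 KB, so the cross-revision saving shows up here because each revision is small. For long articles you would need explicit diffs or a compressor with a larger window to get the same effect.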
Kelly and Anthony discussed this some time ago, I think. If you had decent flagged versions you trusted, you could actually use the flagged versions as whole-image marker points to store completely, with only diffs in between (assumed to be mostly crap, anyway). The worst that happens then if something goes bzzzzzt is you have to go back to your last good whole back-up copy, which is your last good flagged version.
So, anyway, if you do all this stuff at the same time, flagged revisions actually help your data compression by a huge amount. Kelly opined that the win is not as big as you'd think, since it takes computation cycles to restore a version from the last good image plus the series of diffs since, but still, you save a lot. Anthony gave a link to some kind of mini-max tradeoff algorithm between compression and the computation time needed to expand a version from diffs.
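For anyone who wants to see the shape of the idea, here is a minimal sketch in Python of "whole copies at trusted points, diffs in between". It is not how MediaWiki works, and it is not the algorithm Anthony linked to. The names `RevisionStore`, `max_chain`, and `flagged` are hypothetical, and the diffs are plain copy/insert lists built with the standard-library `difflib.SequenceMatcher`.

```python
from difflib import SequenceMatcher

def make_delta(old: str, new: str) -> list[tuple]:
    """Encode `new` as copy/insert operations against `old`."""
    ops = []
    matcher = SequenceMatcher(None, old, new, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            ops.append(("copy", i1, i2))
        elif tag in ("replace", "insert"):
            ops.append(("insert", new[j1:j2]))
        # "delete" needs no operation: that range of `old` is never copied.
    return ops

def apply_delta(old: str, ops: list[tuple]) -> str:
    """Rebuild the newer text from the older text plus its delta."""
    parts = []
    for op in ops:
        if op[0] == "copy":
            parts.append(old[op[1]:op[2]])
        else:
            parts.append(op[1])
    return "".join(parts)

class RevisionStore:
    """Full copies at flagged revisions (or when a diff chain reaches
    `max_chain`), diffs against the previous revision everywhere else."""

    def __init__(self, max_chain: int = 10):
        self.max_chain = max_chain
        self.entries = []      # each entry is ("full", text) or ("delta", ops)
        self._latest = None    # text of the most recent revision
        self._since_full = 0   # diffs stored since the last full copy

    def add(self, text: str, flagged: bool = False) -> int:
        """Store a revision and return its revision number."""
        if not self.entries or flagged or self._since_full >= self.max_chain:
            self.entries.append(("full", text))
            self._since_full = 0
        else:
            self.entries.append(("delta", make_delta(self._latest, text)))
            self._since_full += 1
        self._latest = text
        return len(self.entries) - 1

    def get(self, rev: int) -> tuple[str, int]:
        """Return (text, number of diffs replayed to rebuild it)."""
        start = rev
        while self.entries[start][0] != "full":
            start -= 1                      # walk back to the last full copy
        text = self.entries[start][1]
        for i in range(start + 1, rev + 1):
            text = apply_delta(text, self.entries[i][1])
        return text, rev - start
```

The tradeoff Kelly pointed at lives in `max_chain`: a larger value means fewer full copies on disk but more diffs to replay on retrieval, and the replay count returned by `get` is a crude stand-in for that cost. Forcing a full copy whenever `flagged=True` is the "trusted marker point" part.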
Actually, the whole problem is fascinating, and because of its diff-edit structure, WP is a unique case that has its own unique mini-max solution. For some bunch of gurus this could be a really cool academic project.
Too bad WMF's whole mindset is never to pay for anything you can get somebody to do for you free.
But eventually, somebody will tackle this for a database like WP.