Wikipedia and Information Theory
The Wikipedia Review: A forum for discussion and criticism of Wikipedia
Wikipedia Review Op-Ed Pages


anthony
Post #1

The English Wikipedia database, uncompressed: 5.34 terabytes
The English Wikipedia database, compressed: 32 gigabytes
 
Milton Roe
Post #2

QUOTE(anthony @ Fri 2nd April 2010, 8:09pm) *

The English Wikipedia database, uncompressed: 5.34 terabytes
The English Wikipedia database, compressed: 32 gigabytes

Weird. 167 to 1.
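
For anyone checking the arithmetic, here is a minimal sketch of where the 167 comes from. It assumes the two sizes are in decimal units (1 terabyte = 1000 gigabytes), which the original post does not specify; the variable names are just for illustration.

```python
# Sizes as quoted in the original post.
# Assumption: decimal units, so 5.34 TB is taken as 5340 GB.
uncompressed_gb = 5340
compressed_gb = 32

ratio = uncompressed_gb / compressed_gb
print(f"{ratio:.1f} to 1")
```

With binary units (1 TB = 1024 GB) the figure would come out slightly higher, but still in the same ballpark.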

The weirdness is that the best English text compression is roughly 9 to 1 (without any external dictionary for common words like "the").

So what the heck compresses down 167 times? I would assume the database is mostly English text, no? It's too small to include any images. So what's the rest of the crud on WP that takes up so much space but is just... space?

And no, this is not a snarky comment on content. Even crappy content and vandalism should take up the same space as the best and finest prose, provided it's not run-on letter vandalism à la zzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzz.

Moulton
Post #3

QUOTE(Milton Roe @ Sat 3rd April 2010, 12:12am) *
So what the heck compresses down 167 times?

167 replicates of the same old same old?
Milton Roe
Post #4

QUOTE(Moulton @ Fri 2nd April 2010, 9:17pm) *

QUOTE(Milton Roe @ Sat 3rd April 2010, 12:12am) *
So what the heck compresses down 167 times?

167 replicates of the same old same old?

Sure, but I assume WP doesn't do that. They really do store one entire wiki page every time you make a single-letter change, BUT it does them no good in compression: to exploit it, they'd need some method of storing diffs only, along with gold-standard, flagged whole versions every so often.

They don't do that, so their high compression ratio does NOT reflect all that redundancy of many versions that are almost the same. If they could tap into THAT, their text compression would be several orders of magnitude BETTER. Note that this doesn't mean their retrieval time would be that much faster, though (see below).
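
Whether a general-purpose compressor can tap into near-identical versions depends on how far back it is able to look. The following is a toy sketch on synthetic data, not a measurement of the real dump: the "article" is random letters, the revision count and sizes are made up, and nothing here says which compressor or settings Wikipedia actually uses. It compares zlib, whose DEFLATE window is 32 KiB, with Python's lzma module, whose default preset uses a dictionary of several megabytes.

```python
import lzma
import random
import zlib

random.seed(0)  # fixed seed so the toy data is reproducible
alphabet = "abcdefghijklmnopqrstuvwxyz "

# One synthetic 100 kB "article": random letters, not real wiki text.
article = random.choices(alphabet, k=100_000)

# 20 "revisions", each differing from the previous one by at most one character.
revisions = []
for _ in range(20):
    pos = random.randrange(len(article))
    article[pos] = random.choice(alphabet)
    revisions.append("".join(article))

# Full history stored the naive way: every revision in full, back to back.
history = "".join(revisions).encode("ascii")

# zlib cannot reach back 100 kB to the previous revision; lzma can.
zlib_ratio = len(history) / len(zlib.compress(history, 9))
lzma_ratio = len(history) / len(lzma.compress(history))

print(f"zlib: {zlib_ratio:.1f} to 1")
print(f"lzma: {lzma_ratio:.1f} to 1")
```

In this toy setup the short-window compressor treats every revision as new text, while the long-window one pays for the first revision and gets the later ones almost for free. How much of the 167 to 1 comes from that effect in the real database is not something this sketch can answer.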

Kelly and Anthony discussed this some time ago, I think. If you had decent flagged versions you trusted, you could actually use them as whole-image marker points to store completely, with diffs in between (assumed to be mostly crap, anyway). The worst that happens then if something goes bzzzzzt is that you have to go back to your last good whole back-up Wiki copy, which is your last good flagged version.

So, anyway, if you do all this stuff at the same time, flagged revisions actually help your data compression by a huge amount. Kelly opined that the gain isn't as big as you'd think, since it takes computation cycles to restore a version from the last good image plus the series of diffs since, but still, you save a lot. Anthony gave a link to some kind of mini-max tradeoff algorithm between compression and the computation time needed to expand a version from diffs.
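
To make the whole-image-plus-diffs idea concrete, here is a minimal sketch, not anything WP actually runs. All names (`make_delta`, `apply_delta`, `RevisionStore`, `keyframe_interval`) are hypothetical, and a fixed interval stands in for "the last good flagged version", which is an assumption made only to keep the example short. It stores a full copy every few revisions and copy/insert diffs in between, and reports how many diffs had to be replayed to get a revision back.

```python
import difflib


def make_delta(old: str, new: str):
    """Encode `new` as copy/insert operations against `old`."""
    ops = []
    matcher = difflib.SequenceMatcher(None, old, new, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            ops.append(("copy", i1, i2))        # reuse old[i1:i2]
        elif tag in ("replace", "insert"):
            ops.append(("insert", new[j1:j2]))  # literal new text
        # "delete" needs no op: the deleted span is simply never copied.
    return ops


def apply_delta(old: str, ops) -> str:
    """Rebuild the newer text from `old` plus the stored operations."""
    parts = []
    for op in ops:
        if op[0] == "copy":
            parts.append(old[op[1]:op[2]])
        else:
            parts.append(op[1])
    return "".join(parts)


class RevisionStore:
    """Full copy every `keyframe_interval` revisions, diffs in between."""

    def __init__(self, keyframe_interval: int):
        self.keyframe_interval = keyframe_interval
        self.entries = []    # ("full", text) or ("delta", ops)
        self._latest = None  # most recent full text, used to build the next diff

    def append(self, text: str) -> None:
        if len(self.entries) % self.keyframe_interval == 0:
            self.entries.append(("full", text))
        else:
            self.entries.append(("delta", make_delta(self._latest, text)))
        self._latest = text

    def get(self, index: int):
        """Return (text, number of diffs replayed to rebuild it)."""
        start = index - index % self.keyframe_interval
        text = self.entries[start][1]
        applied = 0
        for _kind, ops in self.entries[start + 1:index + 1]:
            text = apply_delta(text, ops)
            applied += 1
        return text, applied


# Ten toy revisions, each adding one short sentence to the previous one.
revisions = ["The quick brown fox."]
for i in range(1, 10):
    revisions.append(revisions[-1] + f" Edit {i}.")

store = RevisionStore(keyframe_interval=4)
for rev in revisions:
    store.append(rev)

text, applied = store.get(7)
print(applied, text == revisions[7])
```

The interval is the knob in the tradeoff described above: a larger interval means fewer full copies to store but more diffs to replay on retrieval, and a smaller one means the reverse.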

Actually, the whole problem is fascinating, and because of its diff-edit structure, WP is a unique case that has its own unique mini-max solution. For some bunch of gurus this could be a really cool academic project.

Too bad WMF's whole mindset is never to pay for anything you can get somebody to do for you free.

But eventually, somebody will tackle this for a database like WP.