The Wikipedia Review: A forum for discussion and criticism of Wikipedia

> Wikipedia and Information Theory
anthony
Post #1

Postmaster
Group: Regulars
Posts: 2,034
Member No.: 2,132

The English Wikipedia database, uncompressed: 5.34 terabytes
The English Wikipedia database, compressed: 32 gigabytes
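
Back-of-the-envelope, assuming the figures above are in decimal units (1 TB = 1,000 GB), that works out to a compression ratio of roughly 170:1:

CODE
uncompressed_tb = 5.34   # figure quoted above, terabytes
compressed_gb = 32       # figure quoted above, gigabytes

ratio = uncompressed_tb * 1000 / compressed_gb   # convert TB to GB (decimal units)
print(f"compression ratio ~ {ratio:.0f}:1")      # prints: compression ratio ~ 167:1
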
MZMcBride
Post #2

Über Member
Group: Regulars
Posts: 671
Member No.: 10,962

anthony is almost certainly referring to pages-meta-history.xml.7z, which is "all pages with complete edit history" and weighs in at 31.9 GB. 7-Zip compression is pretty nifty, so much so that it's been seriously suggested lately to drop the bzip2 dumps altogether.
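
A rough way to see why 7-Zip's LZMA pulls ahead of bzip2 on this kind of input (a sketch with synthetic data, not the real dump): bzip2 compresses independent blocks of at most 900 KB, so a repeat that lies further back than that is invisible to it, while LZMA's dictionary can span tens of megabytes and can back-reference an earlier copy of an entire revision. The sizes below are arbitrary choices for the demonstration.

CODE
import bz2
import lzma
import os

# Hypothetical stand-in for one page's history: a 300 KB random "base revision"
# repeated 20 times with a tiny edit appended each time. Random bytes are
# incompressible by themselves, so all savings come from cross-revision repeats.
base = os.urandom(300_000)
history = b"".join(base + b" edit %d" % i for i in range(20))

bz2_size = len(bz2.compress(history, compresslevel=9))   # 900 KB blocks
lzma_size = len(lzma.compress(history, preset=9))        # dictionary far larger than 900 KB

print(f"raw:  {len(history):>9,} bytes")
print(f"bz2:  {bz2_size:>9,} bytes")
print(f"lzma: {lzma_size:>9,} bytes")
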
Kelly Martin
Post #3

Bring back the guttersnipes!
Group: Regulars
Posts: 3,270
From: EN61bw
Member No.: 6,696

QUOTE(MZMcBride @ Sat 3rd April 2010, 1:07am)
anthony is almost certainly referring to pages-meta-history.xml.7z, which is "all pages with complete edit history" and weighs in at 31.9 GB. 7-Zip compression is pretty nifty, so much so that it's been seriously suggested lately to drop the bzip2 dumps altogether.
If that's what's being discussed, then yeah, a good deal of the predictability of the dump's character stream comes from the inherent redundancy of English text, the extremely low signal-to-noise ratio of the XML markup (XML is typically horrid in that regard), and the high degree of repetitiveness in the content: each revision appears in the dump in full, and is therefore likely to be substantially similar to text included earlier in the dump.

I'm not familiar with 7-Zip's specific compression approach, but most general-purpose stream compressors are likely to achieve significant compression when asked to compress a sequence of data that repeats itself exactly or nearly exactly, as long as the repetitions are not too far separated from one another. Since the dumps are hierarchical, with all of the revisions of a given article appearing in sequential order without interleaving revisions from other articles, that condition will tend to hold for most frequently-edited articles.

The dump does not reflect how the revisions are stored in the running database; there, a variety of compression methods are used to reduce storage needs. The wheel has been reinvented many times in MediaWiki land, and they're still working on rounding off the corners.
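
To make the "not too far separated" condition above concrete, here is a minimal sketch with purely synthetic data (random bytes standing in for article text, nothing to do with the real dump format), using Python's zlib because its small 32 KB window makes the distance effect visible at toy scale: when an article's revisions sit next to each other, the compressor turns later copies into back-references, but interleave the articles and the identical copies land outside the window, so the savings vanish. 7-Zip's LZMA window is vastly larger, but the same locality principle applies at its scale.

CODE
import os
import zlib

# Synthetic stand-in for a dump: 10 "articles", each with 5 near-identical
# "revisions" (a random 20 KB base plus a tiny appended edit). Random bytes
# are incompressible on their own, so any savings come purely from the
# compressor spotting repeats of earlier revisions.
bases = [os.urandom(20_000) for _ in range(10)]
revisions = [[base + b" edit %d" % r for r in range(5)] for base in bases]

# Grouped: all revisions of an article appear consecutively, as in the dump.
grouped = b"".join(rev for article in revisions for rev in article)
# Interleaved: revisions of different articles alternate, so identical copies
# end up roughly 200 KB apart -- far outside zlib's 32 KB window.
interleaved = b"".join(revisions[a][r] for r in range(5) for a in range(10))

for name, stream in (("grouped", grouped), ("interleaved", interleaved)):
    ratio = len(stream) / len(zlib.compress(stream, 9))
    print(f"{name:>11}: {len(stream):,} bytes, ratio {ratio:.1f}x")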