Printable Version of Topic

Click here to view this topic in its original format

_ General Discussion _ My upcoming plagiarism report

Posted by: Daniel Brandt

I need suggestions on how to present my plagiarism report at wikipedia-watch.org. I still have several weeks of work to do, despite the fact that I've been working a few hours a day on it for the last three weeks.

I'm far enough along in terms of separating the signal from the noise, that I can now predict that the report will end up with between 100 and 300 examples. Here's a throwaway example, that will probably get corrected as soon as someone from Wikipedia sees this post:

http://en.wikipedia.org/w/index.php?title=Alain_LeRoy_Locke&oldid=71014124

http://www.britannica.com/ebc/article-9048696

Most of my examples are similar to this -- except they're not from Britannica, but rather from everywhere imaginable. Almost all of the original sources have clear copyright notices on them, and the source is not acknowledged on the Wikipedia article, and anywhere from several sentences to several paragraphs are plagiarized.

My question is, "How can I format the report so that anyone looking at it will get the picture, within a few clicks, that Wikipedia has a plagiarism problem?"

So far my best idea is to have a doorway page explaining that my examples were culled from a sampling of slightly less than one percent of the 1.4 million English-language Wikipedia articles. If I have 200 examples, then we can presume that there are about 20,000 plagiarized articles in Wikipedia that no one has yet discovered. No one has made any attempt to discover them, and no one ever will. It's just too hard. Even for programmers with a pipeline into automated Google inquiries, it's still too hard. There's an amazing amount of manual checking that's required to reduce the noise without throwing out the signal.

This doorway page will link to 200 subpages (Example 001, Example 002, ... Example 200). Each of the subpages will be titled "Plagiarism on Wikipedia - Example 001" and have a link to the source, plus a link to the version on Wikipedia as of mid-September when I grabbed the page. Then below this, the text-portion only from that page (this is easy to strip out of the XML versions of the article that I already have) will be reproduced, and the sections that are plagiarized from the source will be in highlighted in background yellow.

The effect will be that the visitor to the doorway page is given some information on how the examples were found, and is invited to click randomly on any of the 200 examples to see for themselves. I'm linking to the mid-September version, since it's possible that many editors will start cleaning up these 200 examples. One way they will try to clean it up is to acknowledge the source, but that still doesn't solve the problem that entire paragraphs were copied verbatim. They'll have to change sentences around too.

Therefore, I predict that Jimmy will claim that Wikipedia is amazingly free from plagiarism, because Wikipedia has always had a zero-tolerance policy. (This will be a lie -- there have been no efforts to identify plagiarism at Wikipedia.) Then he will zap totally all 200 articles (no history, no nothing) so that the links to the September version on my subpages won't work. That's why I have to reproduce the text from the article and highlight the plagiarized material. If I don't, my report will not be convincing after Jimmy zaps the 200 articles.

Any other ideas?

Posted by: EuroSceptic

1. Provide link, but also save all WP versions locally, replace if needed.
2. Make two column pages, with start of each paragraph at equal height.
3. Highlight (near) identical sections.
4. Give name of article in title.

Posted by: Jonny Cache

What EuroSceptic suggests sounds like the first thing that occurred to me -- if you have the technical means to recreate a Diff display or mock up a close facsimile any way you can. There should be version control systems around that would do this for you.

D'oh! Now that I thunk on it, you might even be able to use the comparison tool in the WP edit window to do this, which you can use on preview mode without saving.

Jonny

Posted by: Joey

Posted by: Skyrocket

Plagiarism? It's trivial.

What about copyright infringement? (It ain't fair use if it's repetetive, widespread, unacknowledged, etc.)

Posted by: poopooball

whats scary is taht the plagerizer here says hes a librarian. yowza.

Posted by: Daniel Brandt

Most of the plagiarism in my examples will also serve as examples of copyright infringement. Most of the original sources have copyright notices at the bottom of the page. Some even have bylines. A few are from public domain sites (government sites, the 1911 Britannica, etc.) where it is not copyright infringement, but when it is unattributed it is still plagiarism.

My examples that show that the Foundation's copyright infringement situation is a problem. What Wikipedia is doing is grabbing copyrighted material, publishing it under their own terms (GDFL), and telling the world that anyone can help themselves and do the same thing. They invite dozens of mirrors to grab the material. And many more scrapers than mirrors grab the material without even mentioning "Wikipedia" or "wiki" anywhere on their page.

No systematic effort has ever been made by Wikimedia Foundation to police new articles for possible plagiarism and/or copyright problems. They do this for images, but not for text. They should have been screening all new articles for whole-sentence hits in Google, and manually checking anything suspicious. (Actually, Google used to have a ten-word limit, but fairly recently changed to a 32-word limit in their search box, so a couple years ago this might have been difficult to do.)

Now it's too late to go back and check all the articles. For every instance of plagiarism and/or copyright violation that can be discovered in Wikipedia articles, you have to filter out about 10 to 20 Google hits on entire sentences that are noise instead of signal. In other words, these are the scrapers -- where the copying is going the other direction. By that I mean that some site grabbed Wikipedia's article, rather than some Wikipedia editor starting an article by plagiarizing another site. Google cannot tell the difference -- it takes a real person to compare the two articles. You can compile a list of scraper domains and reject them up front (it took me two weeks to do this, and I ended up with 950 scraper domains), but you still get rid of only 75 percent of the noise this way, and are left with lots of manual inspection. And my Google searches tried to exclude the obvious mirrors to start with, by placing -wikipedia -wiki in front of the complete sentence.

Some of these scraper pages even have bogus copyright notices on the bottom of what they scraped, because they use web tools that slap a notice on every page just to make it look cool (usually these say Copyright 2006, which is a giveaway). There is no way to find the needles in the haystack without manual effort.

The copyright thing is serious. If the Foundation had more money, they'd find a bunch of lawsuits at their door. YouTube has staff that do nothing but take down stuff under DMCA. Wikimedia Foundation has even a bigger problem, because their structure is designed to exercise more control over content. By virtue of this, they might have more legal liability than YouTube or a Google, where no one expects them to screen stuff before their users post it or the Googlebot grabs it. YouTube or Google can escape liability under DMCA by acting after the fact, but Wikimedia Foundation might be in a different position because it is more directly involved in generating and shaping the content to begin with. (Actually, YouTube could be in trouble now that they are attractive as a lawsuit target. A court could decide that they should be pre-approving the content.)

I thought about researching who the users were behind the plagiarism, but that's too much work. Besides, I maintain that the Foundation should be doing this themselves, and that the Foundation and the admins are responsible if they don't police for plagiarism. I think it's best to keep the emphasis on Wikipedia as a whole, instead of picking off specific editors. My hivemind page was different -- there the Foundation supports its editors. But there is no way the Foundation can support copyright infringement. Rogue editors are their problem, not mine.

Posted by: Somey

Well, I'm certainly impressed! Nice work!

I guess my advice FWIW would be to preserve the evidence in any way you can manage, including WP screen shots, possibly even presented as thumbnails. (Let me know if you need a volunteer to capture a few - you could always just send me a few article titles at a time, just in case I'm actually a Wikipedia spy who plans on reporting the contents of the list to Jimbo. ) Most browsers these days also have a "Save as Complete Web Page" feature, too, if I'm not mistaken...?

I agree, they'll delete most of the articles as soon as you release the list, so we might want to organize a vigil over the deletion logs and AfD pages to time how long it takes them to do it. Be sure to heavily emphasize the sample size, too - they'll try to say something like, "he could only find a few hundred out of 1.4 million, proving there isn't really a problem," or some-such nonsense. Then they'll claim that most of the material was "in the public domain anyway" and "no real harm was done." And they'll feel secure in the knowledge that nobody can afford to sue them because of both the bad publicity, and the fact that the plaintiffs would gain practically nothing from it financially, so that six months later they can say "sure, we found a few instances, but nobody took it all that seriously."

And they'll continue to plagiarize, of course. Some things never change...

Posted by: Somey

One more thing...

Another way they'll probably attack this will be to delete any mention of it in Wikipedia itself, under the pretense that Brandt "self-publishes" and is therefore not a "reliable source." As it turns out, plans are already underway to replace WP:NOR and WP:V with a new, "more cohesive" policy that will help them squelch any dissent over the plagiarism issue, ensuring that they can sweep the whole thing under the rug as soon as possible:

http://en.wikipedia.org/wiki/Wikipedia:Attribution#Self-published_sources

And I'll give you one guess as to who's the author of this new proposed policy:

QUOTE(Slimmy @ Right about now)

A self-published source is a published source, online or on paper, that has not been subject to any form of independent fact-checking, or where no one stands between the writer and the act of publication. Anyone can create a website or pay to have a book published and then claim to be an expert in a certain field. For that reason, self-published books, personal websites, and blogs are usually not acceptable as reliable sources.

They're awfully predictable sometimes, aren't they? It's kind of sad, in a way.

Posted by: Daniel Brandt

Actually, Wikipedia lacks tools to convert individual files from the XML version to a static HTML page. There are some tools, but they are not suitable for grabbing just a couple hundred files. They presume that you want the whole Wikipedia, and then you run a tool to shove everything into an SQL database, and the pages get parsed on the fly from the database. Total overkill for what I need -- I treat my servers with much more respect than that.

I think what I'll do is a two-column plain-text display of the text portion only, and write an offline program to highlight duplicate complete sentences in the Wikipedia version in the right-side column. It will need checking, but it will be faster to find the sentences this way than without a program. I'll link to the actual Wikipedia version as of September, and to the actual original version that was plagiarized.

I might grab cache versions from MSN as backup. Believe it or not, a cache version from MSN with the MSN header stripped out, saved as a static HTML file, is about as close as I can get to serving up a Wikipedia page from my own server. It even looks like a Wikipedia article.

If you try to save an actual page from Wikipedia as a static file, all the style formatting disappears. That may be fine if you're trying to print it out on that dot-matrix printer you bought in 1985, but when displayed in a browser it simply doesn't look like it came from Wikipedia.

Posted by: Somey

QUOTE(Daniel Brandt @ Wed 11th October 2006, 1:18am)

If you try to save an actual page from Wikipedia as a static file, all the style formatting disappears. That may be fine if you're trying to print it out on that dot-matrix printer you bought in 1985, but when displayed in a browser it simply doesn't look like it came from Wikipedia.

Are you familiar with Internet Explorer's .MHT "Single File Web Archive" file format? (Please, no "get a real browser" jokes...) They're not all that suitable for online viewing since (AFAIK) they require IE5 or higher to open them, but they could be placed on a website for downloading. All the image files are stored in the one file, along with the text.

I just saved the article on "St. Augustine Monster" to my hard drive, disconnected myself for a smidge, and opened it from there - it looked exactly the same, fonts, images, link colors, javascript, CSS, everything all embedded. It was 319K, but zipped down to 132K. Might help in cases where they delete the article(s) completely... Whaddya think?

Posted by: Uly

You'll probably want to prepare an argument for why your hosting of the copyvio examples isn't a copyvio itself.

I expect you can easily argue fair use and proper attribution - but I'm also sure this'll be one of the first attacks levelled at you.

Posted by: Daniel Brandt

Somey: If you need Explorer to read them, that means I'd have to reverse-engineer the format to serve the files from a Linux box. The MSN cache copy looks a lot easier. I know you do Microsoft stuff, but my work is more generic. I use XP on my desktop (my servers are Linux), but at least half the time on my XP I'm doing stuff from a command window. I tried saving a web page once from IE, and it took me three minutes to track down all the directories it created to save the various types of content. That was the last time I saved anything from IE without doing a "view source" first. I do use IE to see how the web pages I code by hand look on IE, but I've got everything disabled in it for security reasons, which means I can't use it online apart from looking at my own sites.

QUOTE(Uly @ Wed 11th October 2006, 9:30am)

You'll probably want to prepare an argument for why your hosting of the copyvio examples isn't a copyvio itself. I expect you can easily argue fair use and proper attribution - but I'm also sure this'll be one of the first attacks levelled at you.

That has occurred to me. But then I thought, "Well, I sure won't get any criticism from the copyright holder, because I'm trying to defend their copyright."

And the next thing I thought was, "Wow, look at all those pathetic editors at Wikipedia who insist on calling me a privacy advocate, without any evidence that I'm any such thing, just so they can launch into a clever remark about how ironic it is that I'm violating the privacy of Wikipedia editors on hivemind."

Here is the latest example, from WP:ANI, about my http://www.wikipedia-watch.org/findchat.html:

QUOTE

We can all try and amuse ourselves with the irony that Brandt is supposed to be a leading internet privacy advocate. We can also mention this irony in the press next time someone asks us about critics. --bainer (talk) 14:23, 8 October 2006 (UTC)

You are correct about this Uly. If you give the average Wikipedian any sort of opening at all, they will instantly stick their tongue into it. A single column from Wikipedia with the duplicate sentences highlighted, and just a link to the original (which, after all, will be stable), is probably the smart thing to do.

Posted by: guy

Let's hope they do say that. Daniel can point out that he's not a privacy advocate, and have a good laugh about the accuracy of Wikipedia articles.

Posted by: Daniel Brandt

QUOTE(guy @ Wed 11th October 2006, 11:14am)

Let's hope they do say that. Daniel can point out that he's not a privacy advocate, and have a good laught about the accuracy of Wikipedia articles.

They called me that for five months on my bio. It took me five months of edit wars to get the word "privacy" deleted in their description of me, and just call me an "activist." And now I get to watch that article for the rest of my life to see if it sneaks back in.

Who won? I don't feel like I won, that's for sure. I'm still not laughing over my bio. When it's number one on all the engines in a search for my name, every defense of myself that I make online is countered by 100 people reading what Wikipedia says about me. It's not a fair fight, and it's no laughing matter.

Posted by: Joey

Posted by: poopooball

looks like plagarist librerian fixed it.

http://en.wikipedia.org/w/index.php?title=Alain_LeRoy_Locke&curid=518480&diff=80839171&oldid=79514403

Posted by: Daniel Brandt

QUOTE(Joey @ Wed 11th October 2006, 11:33am)

The approach here might be to submit the matter for publication. You could either write it yourself for some reputable publication, or offer your facts to a reputable publication. Since it is a well-organized and documented study, some professional journal might take it.

It was Seigenthaler's idea to do a plagiarism study of Wikipedia in the first place. He asked a couple of journalism professors to assign a couple of students to it, and mentioned this to me in an email. I thought about this, and soon realized that any student wouldn't have a snowball's chance in hell of separating the signal from the noise. You need experience with Wikipedia, you need programming skills, you need a pipeline into automated Google queries, and you need a huge amount of time -- much more time than a student has for a single course.

I decided that Seigenthaler's idea had merit, and realized that I was one of the few people who could do this. He knows I'm doing it, and he's delighted that I'm doing it. You can be sure that before the report is made live on my website, he and I will work together to find an interested journalist from mainstream media (maybe the AP or the NYT) who would like a scoop. When the scoop is published, at that instant the study goes live. Until then, there's nothing Jimmy and Brad and Danny can do, because they don't know which 200 articles will end up as my examples.

Posted by: taiwopanfob

I guess the obvious should be said if it hasn't already: the 20,000 estimate is for web-articles. How many more are from that old-style paper stuff? (Guesses are possible.)

I don't recommend publishing all 200 you have, but only 50 or even fewer. This is because the follow-up study should track just how effective current or even future anti-plagiarism tactics employed by WP.

Posted by: Daniel Brandt

Look http://www.wikipedia-watch.org/junk/msncache.html -- I'm picking up the MSN cache copy from my wikipedia-watch server. Here's how I did it:

First I got the cache URL ID number from MSN by doing a search for: site:wikipedia.org alain leroy locke

Then I pasted this cache ID into a one-line script that ran on Linux:

CODE

curl -A "Mozilla/4.0 (compatible; MSIE 6.0; Windows 98)" -o "msncache.html" "http://cc.msnscache.com/cache.aspx?q=4081175923447"

That should all be on one line; this board most likely wrapped it. The only change I made to my msncache.html that curl fetched for me, was to delete the MSN header at the top.

The page looks like it came right out of Wikipedia, but the important content -- namely the text -- came from my server. The stuff that came from Wikipedia are the templates and the image -- things that Wikipedia is unable to change. This will work even if Wikipedia zaps the article and history. They could delete the image, but that's no big deal. Many of my samples don't have images. The other stuff is used so widely on Wikipedia that they have to leave it alone. They could block my server, but how embarrassing would that be for them?

I think I'll have to grab as many MSN cache copies as I can before I go live.

Posted by: Surfer

For presentation: I like EuroÂ´s suggestion, too, but take a look here: http://en.wikipedia.org/wiki/Wikipedia:Requests_for_arbitration/Deir_Yassin_massacre/Evidence#Copyright_violations (where KimvdL. proves Guy MontagÂ´s copyright infringement. (as a side-note: since GM is one of SlimVirgins proteges, this has of course increased Slims hatred of Kim..) -using colors for the copyright infringements makes it very visible.

Users Zeq and Guy Montag have on a regular basis copied material from the web, without attributing.

Posted by: Joey

Posted by: guy

That's unlikely to work for old but still in copyright books, like the 1960s edition of Encyclopaedia Britannica.

Posted by: Ashibaka

QUOTE(Daniel Brandt @ Wed 11th October 2006, 4:48pm)

It was Seigenthaler's idea to do a plagiarism study of Wikipedia in the first place. (...) You can be sure that before the report is made live on my website, he and I will work together to find an interested journalist from mainstream media (maybe the AP or the NYT) who would like a scoop.

I gotta say, that's pretty cool! Make sure to pose for a photo to improve your Wikipedia article.

Posted by: Daniel Brandt

QUOTE(Ashibaka @ Thu 12th October 2006, 11:51am)

I gotta say, that's pretty cool! Make sure to pose for a photo to improve your Wikipedia article.

Fat chance. Seigenthaler is much prettier than I, even though he has 20 years on me. The last time I declined a request from AP to send their photographer over (this was before I had ever heard of SlimVirgin), they dropped my name from their article about Google while still using information I gave them in the article. That was after I spent an hour on the phone with the AP reporter. This time I'll just refer them to Seigenthaler. I'll do the research, and he can do the talking. I hope it will be an effective combination. It worked okay last December.

Posted by: JohnA

The only problem I can see is that Wikipedia may go for the "Connelley Red Button" and remove not only the plagiarism but the evidence that the plagiarism ever existed.

I nearly put this in the discussion for Alan Locke, but I thought: "Fuck it. Put it where Wikipedophiles can see it but not delete it"

===Plagiarism===

As a lot of us suspected, Wikipedia is chock full not just of lies and propaganda, but also the lazy man's substitute for scholarship: plagiarism. But just remember the source: Daniel Brandt. He is doing to Wikipedia what Wikipedia is doing to scholarship and the historical record.

Posted by: guy

I expect they'll say that there are a handful of trolls who commit plagiarism but the alert admins have caught it - Mr Brandt will be airbrushed out.

Posted by: Daniel Brandt

Here's how I'm planning on doing each example: http://www.wikipedia-watch.org/plagiarism/0002.html

Posted by: Joey

Posted by: guy

QUOTE(Joey @ Sat 14th October 2006, 2:25am)

I'm not certain what relevance the fact that it would not work for all books has anyway.

I was replying to the following point, which suggested that it would work for all books. I quite agree that it's better than nothing.

QUOTE(Joey @ Thu 12th October 2006, 12:38am)

The same methods can be used to find stolen copy from texts if text-book publishers will cooperate by sharing electronic files of their text content.

Posted by: Joey

Posted by: guy

QUOTE(Joey @ Sat 14th October 2006, 6:43pm)

Absent a definite article that would expose the intended scope the object "texts" and the indirect object "text-book publishers" a reader has no basis to conclude that a writer writing about books was referrring to every book ever published at any time in any language.

Immediate block for Wikilawyering!

Posted by: Joey

Posted by: Somey

I say we all block ourselves for 45 minutes, go make a batch of chocolate-chip cookies, and then do some crossword puzzles.

But just in case you want my opinion, here it is. I was at a bookfair three weeks ago, shortly after Daniel mentioned that he was thinking of starting his project to track down plagiarism in Wikipedia. I bought three little "desk encyclopedias" there - all very thick paperbacks in very small print. I was trying to think like a plagiarist, you see - I figured that someone looking to add a bunch of articles to Wikipedia without having to do any original writing might simply copy something out of a little book like that, maybe paraphrasing and rewording things a little bit just to be safe.

The fact is, there are a limited number of ways to write encyclopedic articles about most subjects. A lot of the articles in these books not only look like their Wikipedia counterparts, they look like each other, too. They might not be exactly the same, but if someone is going to copy material out of a book, and type it into a Wikipedia edit box, there isn't much additional effort involved in making the kind of minor changes that would avoid at least an automated match on existing source text.

So I wouldn't sweat it, personally. If the material isn't online and accessible already, I'd say the chances of finding an exact, or near-exact match in Wikipedia are negligible. It's just too easy to make those minor alterations.

Posted by: Daniel Brandt

More tips for Wikipedia critics with their own servers:

I discovered that if you want to serve a Wikipedia article perfectly from your own server, you don't have to look for a cache copy unless you need an earlier version that cannot be found elsewhere.

Initially the page doesn't look good in your browser when you do this:

CODE

curl -O "http://en.wikipedia.org/wiki/Alain_LeRoy_Locke"

(the -O -- that's an upper-case alpha, not a zero -- means the same filename is used for the local file saved by curl)

The text is there, but none of the surrounding Wikipedia fluff that readers expect. This is fixed by adding a single line to your local file. If you add this line at the top (it should be the first line inside of the header), it will now look good because all the fluff is picked up directly from Wikipedia:

CODE

If you want an older copy and the history has been wiped, it turns out that MSN had slightly less than half of the Wikipedia articles I wanted (I checked 75 total). Google had all the ones that MSN didn't, which means that Google indexes Wikipedia better than MSN. On the Google cache copy, if you delete that last portion of Google's URL that is used to fetch the cache copy, the search-term highlighting will be gone. This is 100 times easier than taking out the highlighting later.

Posted by: Joey

Posted by: JohnA

QUOTE(Joey @ Sun 15th October 2006, 8:36pm)

QUOTE(Somey @ Sun 15th October 2006, 3:48am)

So I wouldn't sweat it, personally. If the material isn't online and accessible already, I'd say the chances of finding an exact, or near-exact match in Wikipedia are negligible. It's just too easy to make those minor alterations.

This depends on whether one is investigating plagiarism for the purpose of providing general information to the public, or for providing actionable information to a damaged publisher. The easier to find and systematic plagiarism can be more likely found in comparing word-strings from wm foundation sites with strings found though search engines. Substantive libel where the organization, structure, focus and content of a W foundation article is similar to that of an expensive text is more likely to be found reviewing expensive texts that are not available on line.

Desk-top encyclopedias are not as likely to be harmed as would an advanced text on language or science where the copied content can only be found in texts that required extensive research investments to compile. It is a safe guess that W Foundation articles that do not include verifiable citations on advanced topics were substantially copied from a small sample of related texts. If the content was substantially copied, rearranging a few words might complicate the effort to detect plagiarism, but it does not mitigate the greater harm caused by theft of more costly content, compared to the harm done by theft of content already on-line and available for free from other sources.

It would be difficult to prove that Wikipedia's plagiarism was systematic, since that would employ organization that Wikipedia clearly doesn't have. On the other hand, if Wikipedia is clearly chock full of plagiarism then it puts Jimbo in a very difficult situation - how do you deal with this can of worms?

Where does this leave Sanger's Citizendium fork of Wikipedia? Not a good start, I'd say......

Posted by: Daniel Brandt

Citizendium is one example. Another example is the often-proposed distribution of Wikipedia in various forms that will be frozen at a particular point in time. For example, I just read that a Flash version of Wikipedia will soon be available. There has been talk about a print version. Those laptops for hungry African children will have Wikipedia pre-installed.

These new forms of distribution for Wikipedia mean that errors, copyright violations, and plagiarism will also be distributed -- and frozen in time. Right now (well, not right now, but after my plagiarism report goes live), no publisher will consider a bound version of Wikipedia without factoring in the cost of hiring editors and researchers to double-check every article they plan to publish. When you add that in, it makes no sense to have a print version at all. It's hard enough to sell a print version that has to compete with free versions online, but adding the cost of screening every single article is prohibitive. You cannot risk a print run if you're going to be printing plagiarism.

The problem is that Wikipedia is too big, and by now it's too late to install meaningful controls on the freedoms that rogue editors enjoy. Forget about plagiarisim and copyright violations and errors for a second. Just the task of taking out all the pop-culture trivia, fancruft, porn, gaming esoterica -- it's overwhelming. Going through 1.4 million articles is not anyone's idea of a good time.

I think Jimmy has hyped himself into a corner. And if Larry Sanger thinks he can do better, then the only way for him to start is to delete stuff like crazy. Maybe there are 200,000 articles worth keeping out of the 1.4 million.

And the Internet noise that's been generated, by Wikipedia plus Google, with all the scrapers looking to sell something, is just shameful. I found 965 domains that scrape Wikipedia. These are the ones that don't give Wikipedia credit on their site. I had no idea that there were so many, because normally I'd think of scrapers in terms of those couple dozen sites that scrape as much of Wikipedia as they can for the ad revenue. But for every one of those, there are dozens of niche scrapers that are pushing particular types of merchandise. Art galleries scrape biographies of artists, for example, because it looks cool on their website next to an image of something by that artist that they're trying to sell. Tourism agencies add a little historical flavor to their packages by scraping some information about famous people who lived in the area. And on and on.

The GFDL is a crummy idea. There should be a "noncommercial use only" stipulation in it. Now it's probably too late for Wikipedia to change it without starting over.

What a mess. I remember that set of World Book encyclopedias that I had as a child. Pure signal, zero noise.

Posted by: Somey

QUOTE(Daniel Brandt @ Mon 16th October 2006, 7:54am)

It's hard enough to sell a print version that has to compete with free versions online, but adding the cost of screening every single article is prohibitive. You cannot risk a print run if you're going to be printing plagiarism.

I'm amazed that anyone would do it anyway. The market trend is away from print, paper costs continue to rise... To me, the question is to what extent does current copyright law apply to the redistributed electronic (i.e., Flash drive, iPod, etc.) versions. I would think it applies quite fully...

QUOTE(Daniel Brandt @ Mon 16th October 2006, 7:54am)

The GFDL is a crummy idea. There should be a "noncommercial use only" stipulation in it. Now it's probably too late for Wikipedia to change it without starting over.

Actually, the "nc" in "CC-by-nc-sa" stands for non-commercial (and "by" for the attribution requirement, and "sa" for share-alike, I believe). That's the license they should be using (http://creativecommons.org/)... In fact, it's probably the license we should be using. But by changing the license now, it would essentially require anyone who wanted to reproduce WP content commercially to research each individual article's page history, to determine at which point the article fell under the non-commercial license. Then they could only reproduce the versions prior to that point. It would be a big mess, basically... I think they've even discussed it on occasion, but I can't imagine they'd ever actually do it, and Jimbo would obviously oppose it in any case.

Either way, it doesn't protect them from well-substantiated plagiarism allegations!

Posted by: JohnA

QUOTE(Daniel Brandt @ Mon 16th October 2006, 1:54pm)

The problem is that Wikipedia is too big, and by now it's too late to install meaningful controls on the freedoms that rogue editors enjoy. Forget about plagiarisim and copyright violations and errors for a second. Just the task of taking out all the pop-culture trivia, fancruft, porn, gaming esoterica -- it's overwhelming. Going through 1.4 million articles is not anyone's idea of a good time.

I think Jimmy has hyped himself into a corner. And if Larry Sanger thinks he can do better, then the only way for him to start is to delete stuff like crazy. Maybe there are 200,000 articles worth keeping out of the 1.4 million.

I think there are. There are some excellent articles on Wikipedia, but unfortunately the people who wrote them are tarred with the same brush as the rest of Wikipedia's output.

QUOTE

What a mess. I remember that set of World Book encyclopedias that I had as a child. Pure signal, zero noise.

I learnt most in my childhood from Children's Britannica and a set of Encyclopedia Britannica from about 1960. Talk about zero noise, pure scholarship and clear language!

The trouble is Wikipedia's S/N is poor, and I don't think Sanger has a hope in hell of replacing Wikipedia.

I'm sorry but Sanger has completely missed the point of why Wikipedia is successful and why people sacrifice their time to write for Wikipedia. I think Citizendium will fail, the latest in a long line of attempted Wikiforks that went nowhere.