I need suggestions on how to present my plagiarism report at wikipedia-watch.org. I still have several weeks of work to do, even though I've been working on it a few hours a day for the last three weeks.
I'm far enough along in separating the signal from the noise that I can now predict the report will end up with between 100 and 300 examples. Here's a throwaway example that will probably get corrected as soon as someone from Wikipedia sees this post:
http://en.wikipedia.org/w/index.php?title=Alain_LeRoy_Locke&oldid=71014124
http://www.britannica.com/ebc/article-9048696
Most of my examples are similar to this -- except they're not from Britannica, but from everywhere imaginable. Almost all of the original sources carry clear copyright notices, the source is not acknowledged in the Wikipedia article, and anywhere from several sentences to several paragraphs are plagiarized.
My question is, "How can I format the report so that anyone looking at it will get the picture, within a few clicks, that Wikipedia has a plagiarism problem?"
So far my best idea is to have a doorway page explaining that my examples were culled from a sample of slightly less than one percent of the 1.4 million English-language Wikipedia articles. If I have 200 examples, then scaling up by a factor of roughly 100 suggests there are about 20,000 plagiarized articles in Wikipedia that no one has yet discovered. No one has made any attempt to discover them, and no one ever will. It's just too hard. Even for programmers with a pipeline into automated Google queries, it's still too hard. There's an amazing amount of manual checking required to reduce the noise without throwing out the signal.
This doorway page will link to 200 subpages (Example 001, Example 002, ..., Example 200). Each subpage will be titled "Plagiarism on Wikipedia - Example 001" and will have a link to the source, plus a link to the version on Wikipedia as of mid-September, when I grabbed the page. Below this, the text portion of the page (easy to strip out of the XML versions of the articles I already have) will be reproduced, with the sections plagiarized from the source highlighted with a yellow background.
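Here's a rough sketch in Python of how such a subpage could be generated, assuming the highlighted article text has already been produced (see the highlighter sketch further down the thread). The template and file naming here are placeholders, not the actual wikipedia-watch layout:

    # Sketch only -- template and file names are illustrative guesses.
    TEMPLATE = """<html><head>
    <title>Plagiarism on Wikipedia - Example %(num)s</title></head><body>
    <p><a href="%(source_url)s">Original source</a> |
    <a href="%(wiki_url)s">Wikipedia version, mid-September 2006</a></p>
    %(body)s
    </body></html>"""

    def write_example(num, source_url, wiki_url, body):
        # body is the article text with plagiarized spans already wrapped
        # in yellow-background <span> tags.
        with open("example%03d.html" % num, "w", encoding="utf-8") as f:
            f.write(TEMPLATE % {"num": "%03d" % num,
                                "source_url": source_url,
                                "wiki_url": wiki_url,
                                "body": body})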
The effect will be that the visitor to the doorway page is given some information on how the examples were found, and is invited to click randomly on any of the 200 examples to see for themselves. I'm linking to the mid-September version, since it's possible that many editors will start cleaning up these 200 examples. One way they will try to clean it up is to acknowledge the source, but that still doesn't solve the problem that entire paragraphs were copied verbatim. They'll have to change sentences around too.
Therefore, I predict that Jimmy will claim that Wikipedia is amazingly free from plagiarism, because Wikipedia has always had a zero-tolerance policy. (This will be a lie -- there have been no efforts to identify plagiarism at Wikipedia.) Then he will totally zap all 200 articles (no history, no nothing) so that the links to the September versions on my subpages won't work. That's why I have to reproduce the text from each article and highlight the plagiarized material. If I don't, my report won't be convincing after Jimmy zaps the 200 articles.
Any other ideas?
1. Provide link, but also save all WP versions locally, replace if needed.
2. Make two column pages, with start of each paragraph at equal height.
3. Highlight (near) identical sections.
4. Give name of article in title.
What EuroSceptic suggests sounds like the first thing that occurred to me -- if you have the technical means to recreate a Diff display or mock up a close facsimile any way you can. There should be version control systems around that would do this for you.
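For instance, Python's standard difflib module will generate a two-column HTML comparison out of the box -- a minimal sketch, with placeholder file names:

    # Produce a side-by-side HTML diff of the source text and the
    # Wikipedia text. File names are just examples.
    import difflib

    with open("source.txt") as f:
        source_lines = f.readlines()
    with open("wikipedia.txt") as f:
        wiki_lines = f.readlines()

    page = difflib.HtmlDiff(wrapcolumn=70).make_file(
        source_lines, wiki_lines,
        fromdesc="Original source", todesc="Wikipedia article")
    with open("diff.html", "w") as f:
        f.write(page)

The table difflib produces highlights the changed text automatically, which gets you most of the way to the yellow-highlight idea.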
D'oh! Now that I thunk on it, you might even be able to use the comparison tool in the WP edit window to do this, which you can use on preview mode without saving.
Jonny
/
Plagiarism? It's trivial.
What about copyright infringement? (It ain't fair use if it's repetitive, widespread, unacknowledged, etc.)
what's scary is that the plagiarizer here says he's a librarian. yowza.
Most of the plagiarism in my examples will also serve as examples of copyright infringement. Most of the original sources have copyright notices at the bottom of the page. Some even have bylines. A few are from public domain sites (government sites, the 1911 Britannica, etc.) where it is not copyright infringement, but when it is unattributed it is still plagiarism.
My examples show that the Foundation's copyright infringement situation is a problem. What Wikipedia is doing is grabbing copyrighted material, publishing it under its own terms (the GFDL), and telling the world that anyone can help themselves and do the same thing. They invite dozens of mirrors to grab the material. And many more scrapers than mirrors grab the material without even mentioning "Wikipedia" or "wiki" anywhere on their pages.
No systematic effort has ever been made by the Wikimedia Foundation to police new articles for possible plagiarism and/or copyright problems. They do this for images, but not for text. They should have been screening all new articles for whole-sentence hits in Google, and manually checking anything suspicious. (Actually, Google used to have a ten-word limit in its search box, but fairly recently changed to a 32-word limit, so a couple of years ago this might have been difficult to do.)
Now it's too late to go back and check all the articles. For every instance of plagiarism and/or copyright violation that can be discovered in Wikipedia articles, you have to filter out about 10 to 20 Google hits on entire sentences that are noise instead of signal. In other words, these are the scrapers -- where the copying is going the other direction. By that I mean that some site grabbed Wikipedia's article, rather than some Wikipedia editor starting an article by plagiarizing another site. Google cannot tell the difference -- it takes a real person to compare the two articles. You can compile a list of scraper domains and reject them up front (it took me two weeks to do this, and I ended up with 950 scraper domains), but you still get rid of only 75 percent of the noise this way, and are left with lots of manual inspection. And my Google searches tried to exclude the obvious mirrors to start with, by placing -wikipedia -wiki in front of the complete sentence.
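To give an idea of the shape of that filtering step, here's a rough Python sketch. The file name and helper names are placeholders, not my actual pipeline:

    # Sketch of the scraper-domain filtering described above.
    from urllib.parse import urlparse

    def load_scraper_domains(path="scraper_domains.txt"):
        """One known scraper domain per line."""
        with open(path) as f:
            return {line.strip().lower() for line in f if line.strip()}

    def build_query(sentence):
        # Exclude the obvious mirrors up front, as described in the post.
        return '-wikipedia -wiki "%s"' % sentence

    def filter_hits(result_urls, scraper_domains):
        """Drop hits whose domain is a known scraper. Whatever survives
        still needs a human to tell original source from scraper."""
        keep = []
        for url in result_urls:
            host = urlparse(url).netloc.lower()
            if not any(host == d or host.endswith("." + d)
                       for d in scraper_domains):
                keep.append(url)
        return keep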
Some of these scraper pages even have bogus copyright notices on the bottom of what they scraped, because they use web tools that slap a notice on every page just to make it look cool (usually these say Copyright 2006, which is a giveaway). There is no way to find the needles in the haystack without manual effort.
The copyright thing is serious. If the Foundation had more money, they'd find a bunch of lawsuits at their door. YouTube has staff who do nothing but take down material under the DMCA. The Wikimedia Foundation has an even bigger problem, because its structure is designed to exercise more control over content. By virtue of this, it might have more legal liability than YouTube or Google, where no one expects them to screen material before their users post it or the Googlebot grabs it. YouTube or Google can escape liability under the DMCA by acting after the fact, but the Wikimedia Foundation might be in a different position because it is more directly involved in generating and shaping the content to begin with. (Actually, YouTube could be in trouble now that it's attractive as a lawsuit target. A court could decide that it should be pre-approving content.)
I thought about researching who the users were behind the plagiarism, but that's too much work. Besides, I maintain that the Foundation should be doing this themselves, and that the Foundation and the admins are responsible if they don't police for plagiarism. I think it's best to keep the emphasis on Wikipedia as a whole, instead of picking off specific editors. My hivemind page was different -- there the Foundation supports its editors. But there is no way the Foundation can support copyright infringement. Rogue editors are their problem, not mine.
Well, I'm certainly impressed! Nice work!
I guess my advice FWIW would be to preserve the evidence any way you can manage, including WP screen shots, possibly even presented as thumbnails. (Let me know if you need a volunteer to capture a few - you could always just send me a few article titles at a time, just in case I'm actually a Wikipedia spy who plans on reporting the contents of the list to Jimbo.) Most browsers these days also have a "Save as Complete Web Page" feature, if I'm not mistaken...?
I agree, they'll delete most of the articles as soon as you release the list, so we might want to organize a vigil over the deletion logs and AfD pages to time how long it takes them to do it. Be sure to heavily emphasize the sample size, too - they'll try to say something like, "he could only find a few hundred out of 1.4 million, proving there isn't really a problem," or some-such nonsense. Then they'll claim that most of the material was "in the public domain anyway" and "no real harm was done." And they'll feel secure in the knowledge that nobody can afford to sue them because of both the bad publicity, and the fact that the plaintiffs would gain practically nothing from it financially, so that six months later they can say "sure, we found a few instances, but nobody took it all that seriously."
And they'll continue to plagiarize, of course. Some things never change...
One more thing...
Another way they'll probably attack this will be to delete any mention of it in Wikipedia itself, under the pretense that Brandt "self-publishes" and is therefore not a "reliable source." As it turns out, plans are already underway to replace WP:NOR and WP:V with a new, "more cohesive" policy that will help them squelch any dissent over the plagiarism issue, ensuring that they can sweep the whole thing under the rug as soon as possible:
http://en.wikipedia.org/wiki/Wikipedia:Attribution#Self-published_sources
And I'll give you one guess as to who's the author of this new proposed policy:
Actually, Wikipedia lacks tools to convert individual files from the XML version to a static HTML page. There are some tools, but they are not suitable for grabbing just a couple hundred files. They presume that you want the whole Wikipedia, and then you run a tool to shove everything into an SQL database, and the pages get parsed on the fly from the database. Total overkill for what I need -- I treat my servers with much more respect than that.
I think what I'll do is a two-column plain-text display of the text portion only, and write an offline program to highlight duplicate complete sentences in the Wikipedia version in the right-side column. It will need checking, but it will be faster to find the sentences this way than without a program. I'll link to the actual Wikipedia version as of September, and to the actual original version that was plagiarized.
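A minimal sketch of what that offline highlighter might look like in Python -- the naive sentence splitter and normalization are stand-ins, and as I said, the output will still need manual checking:

    # Split both texts into sentences, then wrap any Wikipedia sentence
    # that also appears in the source in a yellow-background span.
    import re
    import html

    def sentences(text):
        # Naive splitter: break on ., !, ? followed by whitespace.
        return [s.strip() for s in re.split(r'(?<=[.!?])\s+', text)
                if s.strip()]

    def normalize(s):
        return re.sub(r'\s+', ' ', s.lower())

    def highlight_duplicates(source_text, wiki_text):
        source_set = {normalize(s) for s in sentences(source_text)}
        out = []
        for s in sentences(wiki_text):
            escaped = html.escape(s)
            if normalize(s) in source_set:
                out.append('<span style="background: yellow">%s</span>'
                           % escaped)
            else:
                out.append(escaped)
        # Simplified: paragraph breaks are not preserved here.
        return ' '.join(out)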
I might grab cache versions from MSN as backup. Believe it or not, a cache version from MSN with the MSN header stripped out, saved as a static HTML file, is about as close as I can get to serving up a Wikipedia page from my own server. It even looks like a Wikipedia article.
If you try to save an actual page from Wikipedia as a static file, all the style formatting disappears. That may be fine if you're trying to print it out on that dot-matrix printer you bought in 1985, but when displayed in a browser it simply doesn't look like it came from Wikipedia.
You'll probably want to prepare an argument for why your hosting of the copyvio examples isn't a copyvio itself.
I expect you can easily argue fair use and proper attribution - but I'm also sure this'll be one of the first attacks levelled at you.
Somey: If you need Explorer to read them, that means I'd have to reverse-engineer the format to serve the files from a Linux box. The MSN cache copy looks a lot easier. I know you do Microsoft stuff, but my work is more generic. I use XP on my desktop (my servers are Linux), but at least half the time on my XP I'm doing stuff from a command window. I tried saving a web page once from IE, and it took me three minutes to track down all the directories it created to save the various types of content. That was the last time I saved anything from IE without doing a "view source" first. I do use IE to see how the web pages I code by hand look on IE, but I've got everything disabled in it for security reasons, which means I can't use it online apart from looking at my own sites.
Let's hope they do say that. Daniel can point out that he's not a privacy advocate, and have a good laugh about the accuracy of Wikipedia articles.
/
looks like the plagiarist librarian fixed it.
http://en.wikipedia.org/w/index.php?title=Alain_LeRoy_Locke&curid=518480&diff=80839171&oldid=79514403
I guess the obvious should be said if it hasn't already: the 20,000 estimate is for web sources. How many more articles were plagiarized from old-style paper sources? (Guesses are possible.)
I don't recommend publishing all 200 you have, but only 50 or even fewer. This is because a follow-up study should track just how effective any current or future anti-plagiarism tactics employed by WP actually are.
Look at http://www.wikipedia-watch.org/junk/msncache.html -- I'm serving the MSN cache copy from my wikipedia-watch server. Here's how I did it:
First I got the cache URL ID number from MSN by doing a search for: site:wikipedia.org alain leroy locke
Then I pasted this cache ID into a one-line script that ran on Linux:
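(The original one-liner isn't reproduced here. A hypothetical Python equivalent is below -- the cache URL pattern is a placeholder, not the real 2006 MSN format:)

    # Hypothetical reconstruction only; the actual script and the exact
    # MSN cache URL format are not shown in the original post.
    import sys
    import urllib.request

    cache_id = sys.argv[1]  # the cache ID copied from the MSN result
    url = "http://cc.msnscache.com/cache.aspx?q=" + cache_id  # placeholder
    page = urllib.request.urlopen(url).read()
    with open("msncache.html", "wb") as f:
        f.write(page)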
For presentation: I like Euro's suggestion too, but take a look here: http://en.wikipedia.org/wiki/Wikipedia:Requests_for_arbitration/Deir_Yassin_massacre/Evidence#Copyright_violations (where KimvdL. proves Guy Montag's copyright infringement; as a side note, since GM is one of SlimVirgin's proteges, this has of course increased Slim's hatred of Kim). Using colors for the copyright infringements makes it very visible.
Users Zeq and Guy Montag have regularly copied material from the web without attribution.
/
That's unlikely to work for old but still-in-copyright books, like the 1960s edition of the Encyclopaedia Britannica.
The only problem I can see is that Wikipedia may go for the "Connelley Red Button" and remove not only the plagiarism but the evidence that the plagiarism ever existed.
I nearly put this in the discussion page for Alain Locke, but I thought: "Fuck it. Put it where Wikipedophiles can see it but not delete it"
===Plagiarism===
As a lot of us suspected, Wikipedia is chock full not just of lies and propaganda, but also the lazy man's substitute for scholarship: plagiarism. But just remember the source: Daniel Brandt. He is doing to Wikipedia what Wikipedia is doing to scholarship and the historical record.
I expect they'll say that there are a handful of trolls who commit plagiarism but the alert admins have caught it - Mr Brandt will be airbrushed out.
Here's how I'm planning on doing each example: http://www.wikipedia-watch.org/plagiarism/0002.html
/
I say we all block ourselves for 45 minutes, go make a batch of chocolate-chip cookies, and then do some crossword puzzles.
But just in case you want my opinion, here it is. I was at a bookfair three weeks ago, shortly after Daniel mentioned that he was thinking of starting his project to track down plagiarism in Wikipedia. I bought three little "desk encyclopedias" there - all very thick paperbacks in very small print. I was trying to think like a plagiarist, you see - I figured that someone looking to add a bunch of articles to Wikipedia without having to do any original writing might simply copy something out of a little book like that, maybe paraphrasing and rewording things a little bit just to be safe.
The fact is, there are a limited number of ways to write encyclopedic articles about most subjects. A lot of the articles in these books not only look like their Wikipedia counterparts, they look like each other, too. They might not be exactly the same, but if someone is going to copy material out of a book, and type it into a Wikipedia edit box, there isn't much additional effort involved in making the kind of minor changes that would avoid at least an automated match on existing source text.
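A quick Python illustration of the point -- the sentences are made up, but it shows why exact-phrase searching misses light rewording while a fuzzy matcher (or a human) still catches it:

    # Lightly reworded text defeats an exact match but stays obviously
    # similar. Example sentences are invented for illustration.
    import difflib

    original = "Locke was a distinguished philosopher and writer."
    reworded = "Locke was a noted philosopher as well as a writer."

    print(original == reworded)  # False: nothing for an exact search to hit
    ratio = difflib.SequenceMatcher(None, original, reworded).ratio()
    print(round(ratio, 2))       # roughly 0.7-0.8: clearly related text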
So I wouldn't sweat it, personally. If the material isn't online and accessible already, I'd say the chances of finding an exact, or near-exact match in Wikipedia are negligible. It's just too easy to make those minor alterations.
More tips for Wikipedia critics with their own servers:
I discovered that if you want to serve a Wikipedia article perfectly from your own server, you don't have to look for a cache copy unless you need an earlier version that cannot be found elsewhere.
Initially the page doesn't look good in your browser when you do this, because the style formatting breaks the same way it does when you save a page as a static file; the links to the stylesheets have to be fixed up.
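A guess at the kind of fix involved, assuming the problem is root-relative links that break once the page is served from another host -- the exact commands aren't shown in the original post:

    # Fetch the article and point root-relative href/src links back at
    # Wikipedia's own servers so the local copy renders properly.
    import re
    import urllib.request

    url = "http://en.wikipedia.org/wiki/Alain_LeRoy_Locke"
    req = urllib.request.Request(url, headers={"User-Agent": "Mozilla/5.0"})
    page = urllib.request.urlopen(req).read().decode("utf-8", "replace")

    fixed = re.sub(r'(href|src)="/(?!/)',
                   r'\1="http://en.wikipedia.org/', page)

    with open("locke.html", "w", encoding="utf-8") as f:
        f.write(fixed)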
Citizendium is one example. Another example is the often-proposed distribution of Wikipedia in various forms that will be frozen at a particular point in time. For example, I just read that a Flash version of Wikipedia will soon be available. There has been talk about a print version. Those laptops for hungry African children will have Wikipedia pre-installed.
These new forms of distribution for Wikipedia mean that errors, copyright violations, and plagiarism will also be distributed -- and frozen in time. Right now (well, not right now, but after my plagiarism report goes live), no publisher will consider a bound version of Wikipedia without factoring in the cost of hiring editors and researchers to double-check every article they plan to publish. When you add that in, it makes no sense to have a print version at all. It's hard enough to sell a print version that has to compete with free versions online, but adding the cost of screening every single article is prohibitive. You cannot risk a print run if you're going to be printing plagiarism.
The problem is that Wikipedia is too big, and by now it's too late to install meaningful controls on the freedoms that rogue editors enjoy. Forget about plagiarism and copyright violations and errors for a second. Just the task of taking out all the pop-culture trivia, fancruft, porn, gaming esoterica -- it's overwhelming. Going through 1.4 million articles is not anyone's idea of a good time.
I think Jimmy has hyped himself into a corner. And if Larry Sanger thinks he can do better, then the only way for him to start is to delete stuff like crazy. Maybe there are 200,000 articles worth keeping out of the 1.4 million.
And the Internet noise that's been generated, by Wikipedia plus Google, with all the scrapers looking to sell something, is just shameful. I found 965 domains that scrape Wikipedia. These are the ones that don't give Wikipedia credit on their site. I had no idea that there were so many, because normally I'd think of scrapers in terms of those couple dozen sites that scrape as much of Wikipedia as they can for the ad revenue. But for every one of those, there are dozens of niche scrapers that are pushing particular types of merchandise. Art galleries scrape biographies of artists, for example, because it looks cool on their website next to an image of something by that artist that they're trying to sell. Tourism agencies add a little historical flavor to their packages by scraping some information about famous people who lived in the area. And on and on.
The GFDL is a crummy idea. There should be a "noncommercial use only" stipulation in it. Now it's probably too late for Wikipedia to change it without starting over.
What a mess. I remember that set of World Book encyclopedias that I had as a child. Pure signal, zero noise.