The Wikipedia Review: A forum for discussion and criticism of Wikipedia
Wikipedia Review Op-Ed Pages


> General Discussion? What's that all about?

This subforum is for general discussion of Wikipedia and other Wikimedia projects. For a glossary of terms frequently used in such discussions, please refer to Wikipedia:Glossary. For a glossary of musical terms, see here. Other useful links:

Akahele.org • Wikipedia-Watch • Wikitruth • WP:AN • WikiEN-L/Foundation-L (mailing lists) • Citizendium forums

2 Pages V  1 2 >  
Reply to this topic • Start new topic
> My upcoming plagiarism report: how should I present it?
Daniel Brandt
post Tue 10th October 2006, 6:47pm
Post #1


Postmaster
*******

Group: Regulars
Posts: 2,473
Joined: Fri 24th Mar 2006, 12:23am
Member No.: 77



I need suggestions on how to present my plagiarism report at wikipedia-watch.org. I still have several weeks of work to do, even though I've been working on it a few hours a day for the last three weeks.

I'm far enough along in separating the signal from the noise that I can now predict the report will end up with between 100 and 300 examples. Here's a throwaway example, which will probably get corrected as soon as someone from Wikipedia sees this post:

Wikipedia version as of mid-September, 2006

Source that was plagiarized

Most of my examples are similar to this -- except they're not from Britannica, but from everywhere imaginable. Almost all of the original sources have clear copyright notices on them, the source is not acknowledged in the Wikipedia article, and anywhere from several sentences to several paragraphs are plagiarized.

My question is, "How can I format the report so that anyone looking at it will get the picture, within a few clicks, that Wikipedia has a plagiarism problem?"

So far my best idea is to have a doorway page explaining that my examples were culled from a sampling of slightly less than one percent of the 1.4 million English-language Wikipedia articles. If I have 200 examples, then we can presume that there are about 20,000 plagiarized articles in Wikipedia that no one has yet discovered. No one has made any attempt to discover them, and no one ever will. It's just too hard. Even for programmers with a pipeline into automated Google queries, it's still too hard. There's an amazing amount of manual checking required to reduce the noise without throwing out the signal.
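The extrapolation behind the 20,000 figure is simple back-of-the-envelope arithmetic; a quick sketch (treating the "slightly less than one percent" sample as roughly 1%, which is an assumption):

```python
# Back-of-the-envelope extrapolation for the plagiarism estimate.
TOTAL_ARTICLES = 1_400_000   # English Wikipedia, mid-2006
SAMPLE_FRACTION = 0.01       # "slightly less than one percent", assumed ~1%
EXAMPLES_FOUND = 200         # confirmed plagiarism examples in the sample

sample_size = TOTAL_ARTICLES * SAMPLE_FRACTION   # ~14,000 articles checked
hit_rate = EXAMPLES_FOUND / sample_size          # fraction found plagiarized
estimated_total = hit_rate * TOTAL_ARTICLES      # scaled to all of Wikipedia

print(round(estimated_total))  # -> 20000
```

The estimate scales linearly with the example count, so 100 examples would imply roughly 10,000 articles and 300 would imply roughly 30,000.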

This doorway page will link to 200 subpages (Example 001, Example 002, ... Example 200). Each of the subpages will be titled "Plagiarism on Wikipedia - Example 001" and have a link to the source, plus a link to the version on Wikipedia as of mid-September, when I grabbed the page. Then below this, the text portion only from that page (this is easy to strip out of the XML versions of the articles that I already have) will be reproduced, and the sections that are plagiarized from the source will be highlighted with a yellow background.

The effect will be that the visitor to the doorway page is given some information on how the examples were found, and is invited to click randomly on any of the 200 examples to see for themselves. I'm linking to the mid-September version, since it's possible that many editors will start cleaning up these 200 examples. One way they will try to clean it up is to acknowledge the source, but that still doesn't solve the problem that entire paragraphs were copied verbatim. They'll have to change sentences around too.

Therefore, I predict that Jimmy will claim that Wikipedia is amazingly free from plagiarism because Wikipedia has always had a zero-tolerance policy. (This will be a lie -- there have been no efforts to identify plagiarism at Wikipedia.) Then he will totally zap all 200 articles (no history, no nothing) so that the links to the September versions on my subpages won't work. That's why I have to reproduce the text from each article and highlight the plagiarized material. If I don't, my report will not be convincing after Jimmy zaps the 200 articles.

Any other ideas?
User is offline • Profile Card • PM
Go to the top of the page
+Quote Post
EuroSceptic
post Tue 10th October 2006, 7:02pm
Post #2


Member
***

Group: Contributors
Posts: 134
Joined: Mon 7th Aug 2006, 2:50pm
From: Europe
Member No.: 322



1. Provide link, but also save all WP versions locally, replace if needed.
2. Make two column pages, with start of each paragraph at equal height.
3. Highlight (near) identical sections.
4. Give name of article in title.
Jonny Cache
post Tue 10th October 2006, 7:13pm
Post #3


τα δε μοι παθήματα μαθήματα γέγονε
*********

Group: Contributors
Posts: 5,100
Joined: Sat 9th Sep 2006, 1:52am
Member No.: 398

WP user page - talk
check - contribs



What EuroSceptic suggests sounds like the first thing that occurred to me -- if you have the technical means to recreate a Diff display or mock up a close facsimile any way you can. There should be version control systems around that would do this for you.

D'oh! Now that I thunk on it, you might even be able to use the comparison tool in the WP edit window to do this, which you can use on preview mode without saving.

Jonny cool.gif

This post has been edited by Jonny Cache: Tue 10th October 2006, 7:27pm
Placeholder
post Wed 11th October 2006, 1:23am
Post #4


Member
***

Group: On Vacation
Posts: 204
Joined: Sun 25th Jun 2006, 7:29pm
Member No.: 287



/

This post has been edited by Joey: Sun 15th October 2006, 7:17pm
Skyrocket
post Wed 11th October 2006, 1:41am
Post #5


Member
***

Group: Contributors
Posts: 104
Joined: Fri 6th Oct 2006, 3:20pm
From: Bishkek
Member No.: 460



Plagiarism? It's trivial.

What about copyright infringement? (It ain't fair use if it's repetitive, widespread, unacknowledged, etc.)
poopooball
post Wed 11th October 2006, 2:32am
Post #6


Junior Member
**

Group: Contributors
Posts: 51
Joined: Thu 10th Aug 2006, 3:11pm
Member No.: 329



what's scary is that the plagiarizer here says he's a librarian. yowza.
Daniel Brandt
post Wed 11th October 2006, 2:54am
Post #7


Postmaster
*******

Group: Regulars
Posts: 2,473
Joined: Fri 24th Mar 2006, 12:23am
Member No.: 77



Most of the plagiarism in my examples will also serve as examples of copyright infringement. Most of the original sources have copyright notices at the bottom of the page. Some even have bylines. A few are from public domain sites (government sites, the 1911 Britannica, etc.) where it is not copyright infringement, but when it is unattributed it is still plagiarism.

My examples show that the Foundation's copyright-infringement situation is a problem. What Wikipedia is doing is grabbing copyrighted material, publishing it under its own terms (the GFDL), and telling the world that anyone can help themselves and do the same thing. They invite dozens of mirrors to grab the material. And many more scrapers than mirrors grab the material without even mentioning "Wikipedia" or "wiki" anywhere on their pages.

No systematic effort has ever been made by the Wikimedia Foundation to police new articles for possible plagiarism and/or copyright problems. They do this for images, but not for text. They should have been screening all new articles for whole-sentence hits in Google, and manually checking anything suspicious. (Actually, Google used to have a ten-word limit, but fairly recently changed to a 32-word limit in its search box, so a couple of years ago this might have been difficult to do.)

Now it's too late to go back and check all the articles. For every instance of plagiarism and/or copyright violation that can be discovered in Wikipedia articles, you have to filter out about 10 to 20 Google hits on entire sentences that are noise instead of signal. In other words, these are the scrapers -- where the copying is going the other direction. By that I mean that some site grabbed Wikipedia's article, rather than some Wikipedia editor starting an article by plagiarizing another site. Google cannot tell the difference -- it takes a real person to compare the two articles. You can compile a list of scraper domains and reject them up front (it took me two weeks to do this, and I ended up with 950 scraper domains), but you still get rid of only 75 percent of the noise this way, and are left with lots of manual inspection. And my Google searches tried to exclude the obvious mirrors to start with, by placing -wikipedia -wiki in front of the complete sentence.
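The scraper-rejection step described above can be sketched as a simple domain filter. This is a minimal illustration, not the actual tooling: the domain names and URLs below are hypothetical, and the real blocklist had about 950 entries.

```python
from urllib.parse import urlparse

# Hypothetical blocklist of known scraper/mirror domains
# (the real list, compiled over two weeks, had ~950 entries).
SCRAPER_DOMAINS = {"encyclopedia.example.com", "wikimirror.example.net"}

def filter_scrapers(hit_urls):
    """Drop Google hits that point at known Wikipedia scrapers.
    Whatever survives still needs manual side-by-side inspection,
    since Google can't tell which direction the copying went."""
    kept = []
    for url in hit_urls:
        domain = urlparse(url).netloc.lower()
        if domain not in SCRAPER_DOMAINS:
            kept.append(url)
    return kept

hits = [
    "http://encyclopedia.example.com/Alain_Locke",          # known scraper: rejected
    "http://some-original-source.example.org/locke.html",   # possible source: kept
]
print(filter_scrapers(hits))
```

As the post notes, a blocklist like this only removes about 75 percent of the noise up front; the rest is manual work.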

Some of these scraper pages even have bogus copyright notices on the bottom of what they scraped, because they use web tools that slap a notice on every page just to make it look cool (usually these say Copyright 2006, which is a giveaway). There is no way to find the needles in the haystack without manual effort.

The copyright thing is serious. If the Foundation had more money, they'd find a bunch of lawsuits at their door. YouTube has staff that do nothing but take down stuff under the DMCA. The Wikimedia Foundation has an even bigger problem, because its structure is designed to exercise more control over content. By virtue of this, it might have more legal liability than YouTube or Google, where no one expects them to screen stuff before their users post it or the Googlebot grabs it. YouTube or Google can escape liability under the DMCA by acting after the fact, but the Wikimedia Foundation might be in a different position because it is more directly involved in generating and shaping the content to begin with. (Actually, YouTube could be in trouble now that they are attractive as a lawsuit target. A court could decide that they should be pre-approving the content.)

I thought about researching who the users were behind the plagiarism, but that's too much work. Besides, I maintain that the Foundation should be doing this themselves, and that the Foundation and the admins are responsible if they don't police for plagiarism. I think it's best to keep the emphasis on Wikipedia as a whole, instead of picking off specific editors. My hivemind page was different -- there the Foundation supports its editors. But there is no way the Foundation can support copyright infringement. Rogue editors are their problem, not mine.
Somey
post Wed 11th October 2006, 4:15am
Post #8


Can't actually moderate (or even post)
*********

Group: Moderators
Posts: 11,815
Joined: Sat 17th Jun 2006, 7:47pm
From: Dreamland
Member No.: 275



Well, I'm certainly impressed! Nice work!

I guess my advice FWIW would be to preserve the evidence in any way you can manage, including WP screen shots, possibly even presented as thumbnails. (Let me know if you need a volunteer to capture a few - you could always just send me a few article titles at a time, just in case I'm actually a Wikipedia spy who plans on reporting the contents of the list to Jimbo. unsure.gif ) Most browsers these days also have a "Save as Complete Web Page" feature, too, if I'm not mistaken...?

I agree, they'll delete most of the articles as soon as you release the list, so we might want to organize a vigil over the deletion logs and AfD pages to time how long it takes them to do it. Be sure to heavily emphasize the sample size, too - they'll try to say something like, "he could only find a few hundred out of 1.4 million, proving there isn't really a problem," or some-such nonsense. Then they'll claim that most of the material was "in the public domain anyway" and "no real harm was done." And they'll feel secure in the knowledge that nobody can afford to sue them because of both the bad publicity, and the fact that the plaintiffs would gain practically nothing from it financially, so that six months later they can say "sure, we found a few instances, but nobody took it all that seriously."

And they'll continue to plagiarize, of course. Some things never change...
Somey
post Wed 11th October 2006, 5:54am
Post #9


Can't actually moderate (or even post)
*********

Group: Moderators
Posts: 11,815
Joined: Sat 17th Jun 2006, 7:47pm
From: Dreamland
Member No.: 275



One more thing...

Another way they'll probably attack this will be to delete any mention of it in Wikipedia itself, under the pretense that Brandt "self-publishes" and is therefore not a "reliable source." As it turns out, plans are already underway to replace WP:NOR and WP:V with a new, "more cohesive" policy that will help them squelch any dissent over the plagiarism issue, ensuring that they can sweep the whole thing under the rug as soon as possible:

http://en.wikipedia.org/wiki/Wikipedia:Att...blished_sources

And I'll give you one guess as to who's the author of this new proposed policy:

QUOTE(Slimmy @ Right about now)
A self-published source is a published source, online or on paper, that has not been subject to any form of independent fact-checking, or where no one stands between the writer and the act of publication. Anyone can create a website or pay to have a book published and then claim to be an expert in a certain field. For that reason, self-published books, personal websites, and blogs are usually not acceptable as reliable sources.

They're awfully predictable sometimes, aren't they? It's kind of sad, in a way.
Daniel Brandt
post Wed 11th October 2006, 6:18am
Post #10


Postmaster
*******

Group: Regulars
Posts: 2,473
Joined: Fri 24th Mar 2006, 12:23am
Member No.: 77



Actually, Wikipedia lacks tools to convert individual files from the XML version to a static HTML page. There are some tools, but they are not suitable for grabbing just a couple hundred files. They presume that you want the whole Wikipedia, and then you run a tool to shove everything into an SQL database, and the pages get parsed on the fly from the database. Total overkill for what I need -- I treat my servers with much more respect than that.

I think what I'll do is a two-column plain-text display of the text portion only, and write an offline program to highlight duplicate complete sentences in the Wikipedia version in the right-side column. It will need checking, but it will be faster to find the sentences this way than without a program. I'll link to the actual Wikipedia version as of September, and to the actual original version that was plagiarized.
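A minimal sketch of the offline highlighting program described above, under stated assumptions: the sentence splitter is naive (which is why the output "will need checking"), and the yellow-background mark-up is just one plausible way to render the highlight.

```python
import re

def sentences(text):
    # Naive sentence splitter: split after ., !, or ? followed by whitespace.
    # Good enough for a first pass; the results need manual checking.
    return [s.strip() for s in re.split(r'(?<=[.!?])\s+', text) if s.strip()]

def highlight_duplicates(source_text, wikipedia_text):
    """Wrap each sentence of the Wikipedia text that also appears
    verbatim in the source in a yellow-background span."""
    source_set = set(sentences(source_text))
    out = []
    for s in sentences(wikipedia_text):
        if s in source_set:
            out.append('<span style="background: yellow">%s</span>' % s)
        else:
            out.append(s)
    return ' '.join(out)

# Hypothetical two-sentence example:
src = "Locke was born in Philadelphia. He studied at Harvard."
wp = "Locke was born in Philadelphia. He later taught at Howard."
print(highlight_duplicates(src, wp))
```

Only the first sentence is flagged here, since the second differs. Matching whole sentences exactly keeps false positives low but misses lightly reworded copying, which is exactly the kind of thing the manual pass has to catch.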

I might grab cache versions from MSN as backup. Believe it or not, a cache version from MSN with the MSN header stripped out, saved as a static HTML file, is about as close as I can get to serving up a Wikipedia page from my own server. It even looks like a Wikipedia article.

If you try to save an actual page from Wikipedia as a static file, all the style formatting disappears. That may be fine if you're trying to print it out on that dot-matrix printer you bought in 1985, but when displayed in a browser it simply doesn't look like it came from Wikipedia.
Somey
post Wed 11th October 2006, 6:34am
Post #11


Can't actually moderate (or even post)
*********

Group: Moderators
Posts: 11,815
Joined: Sat 17th Jun 2006, 7:47pm
From: Dreamland
Member No.: 275



QUOTE(Daniel Brandt @ Wed 11th October 2006, 1:18am) *
If you try to save an actual page from Wikipedia as a static file, all the style formatting disappears. That may be fine if you're trying to print it out on that dot-matrix printer you bought in 1985, but when displayed in a browser it simply doesn't look like it came from Wikipedia.

Are you familiar with Internet Explorer's .MHT "Single File Web Archive" file format? (Please, no "get a real browser" jokes...) They're not all that suitable for online viewing since (AFAIK) they require IE5 or higher to open them, but they could be placed on a website for downloading. All the image files are stored in the one file, along with the text.

I just saved the article on "St. Augustine Monster" to my hard drive, disconnected myself for a smidge, and opened it from there - it looked exactly the same, fonts, images, link colors, javascript, CSS, everything all embedded. It was 319K, but zipped down to 132K. Might help in cases where they delete the article(s) completely... Whaddya think?
Uly
post Wed 11th October 2006, 2:30pm
Post #12


Junior Member
**

Group: Contributors
Posts: 80
Joined: Wed 7th Jun 2006, 8:01pm
Member No.: 250



You'll probably want to prepare an argument for why your hosting of the copyvio examples isn't a copyvio itself.

I expect you can easily argue fair use and proper attribution - but I'm also sure this'll be one of the first attacks levelled at you.
Daniel Brandt
post Wed 11th October 2006, 3:04pm
Post #13


Postmaster
*******

Group: Regulars
Posts: 2,473
Joined: Fri 24th Mar 2006, 12:23am
Member No.: 77



Somey: If you need Explorer to read them, that means I'd have to reverse-engineer the format to serve the files from a Linux box. The MSN cache copy looks a lot easier. I know you do Microsoft stuff, but my work is more generic. I use XP on my desktop (my servers are Linux), but at least half the time on my XP I'm doing stuff from a command window. I tried saving a web page once from IE, and it took me three minutes to track down all the directories it created to save the various types of content. That was the last time I saved anything from IE without doing a "view source" first. I do use IE to see how the web pages I code by hand look on IE, but I've got everything disabled in it for security reasons, which means I can't use it online apart from looking at my own sites.

QUOTE(Uly @ Wed 11th October 2006, 9:30am) *
You'll probably want to prepare an argument for why your hosting of the copyvio examples isn't a copyvio itself. I expect you can easily argue fair use and proper attribution - but I'm also sure this'll be one of the first attacks levelled at you.

That has occurred to me. But then I thought, "Well, I sure won't get any criticism from the copyright holder, because I'm trying to defend their copyright."

And the next thing I thought was, "Wow, look at all those pathetic editors at Wikipedia who insist on calling me a privacy advocate, without any evidence that I'm any such thing, just so they can launch into a clever remark about how ironic it is that I'm violating the privacy of Wikipedia editors on hivemind."

Here is the latest example, from WP:ANI, about my IRC logs:
QUOTE
We can all try and amuse ourselves with the irony that Brandt is supposed to be a leading internet privacy advocate. We can also mention this irony in the press next time someone asks us about critics. --bainer (talk) 14:23, 8 October 2006 (UTC)

You are correct about this Uly. If you give the average Wikipedian any sort of opening at all, they will instantly stick their tongue into it. A single column from Wikipedia with the duplicate sentences highlighted, and just a link to the original (which, after all, will be stable), is probably the smart thing to do.
guy
post Wed 11th October 2006, 4:14pm
Post #14


Postmaster General
*********

Group: Inactive
Posts: 4,294
Joined: Mon 27th Feb 2006, 8:52pm
From: London
Member No.: 23



Let's hope they do say that. Daniel can point out that he's not a privacy advocate, and have a good laugh about the accuracy of Wikipedia articles.
Daniel Brandt
post Wed 11th October 2006, 4:30pm
Post #15


Postmaster
*******

Group: Regulars
Posts: 2,473
Joined: Fri 24th Mar 2006, 12:23am
Member No.: 77



QUOTE(guy @ Wed 11th October 2006, 11:14am) *
Let's hope they do say that. Daniel can point out that he's not a privacy advocate, and have a good laugh about the accuracy of Wikipedia articles.

They called me that for five months on my bio. It took me five months of edit wars to get the word "privacy" deleted in their description of me, and just call me an "activist." And now I get to watch that article for the rest of my life to see if it sneaks back in.

Who won? I don't feel like I won, that's for sure. I'm still not laughing over my bio. When it's number one on all the engines in a search for my name, every defense of myself that I make online is countered by 100 people reading what Wikipedia says about me. It's not a fair fight, and it's no laughing matter.
Placeholder
post Wed 11th October 2006, 4:33pm
Post #16


Member
***

Group: On Vacation
Posts: 204
Joined: Sun 25th Jun 2006, 7:29pm
Member No.: 287



/

This post has been edited by Joey: Sun 15th October 2006, 7:17pm
poopooball
post Wed 11th October 2006, 4:45pm
Post #17


Junior Member
**

Group: Contributors
Posts: 51
Joined: Thu 10th Aug 2006, 3:11pm
Member No.: 329



looks like the plagiarist librarian fixed it.

http://en.wikipedia.org/w/index.php?title=...&oldid=79514403
Daniel Brandt
post Wed 11th October 2006, 4:48pm
Post #18


Postmaster
*******

Group: Regulars
Posts: 2,473
Joined: Fri 24th Mar 2006, 12:23am
Member No.: 77



QUOTE(Joey @ Wed 11th October 2006, 11:33am) *
The approach here might be to submit the matter for publication. You could either write it yourself for some reputable publication, or offer your facts to a reputable publication. Since it is a well-organized and documented study, some professional journal might take it.

It was Seigenthaler's idea to do a plagiarism study of Wikipedia in the first place. He asked a couple of journalism professors to assign a couple of students to it, and mentioned this to me in an email. I thought about this, and soon realized that no student would have a snowball's chance in hell of separating the signal from the noise. You need experience with Wikipedia, you need programming skills, you need a pipeline into automated Google queries, and you need a huge amount of time -- much more time than a student has for a single course.

I decided that Seigenthaler's idea had merit, and realized that I was one of the few people who could do this. He knows I'm doing it, and he's delighted that I'm doing it. You can be sure that before the report is made live on my website, he and I will work together to find an interested journalist from mainstream media (maybe the AP or the NYT) who would like a scoop. When the scoop is published, at that instant the study goes live. Until then, there's nothing Jimmy and Brad and Danny can do, because they don't know which 200 articles will end up as my examples.
taiwopanfob
post Wed 11th October 2006, 5:22pm
Post #19


Über Member
*****

Group: Regulars
Posts: 643
Joined: Fri 26th May 2006, 12:21pm
Member No.: 214



I guess the obvious should be said if it hasn't already: the 20,000 estimate is for web articles. How many more were taken from old-style paper sources? (Guesses are possible.)

I don't recommend publishing all 200 examples you have, but only 50 or even fewer. This is because a follow-up study should track just how effective WP's current (or even future) anti-plagiarism tactics turn out to be.

This post has been edited by taiwopanfob: Wed 11th October 2006, 5:23pm
Daniel Brandt
post Wed 11th October 2006, 6:16pm
Post #20


Postmaster
*******

Group: Regulars
Posts: 2,473
Joined: Fri 24th Mar 2006, 12:23am
Member No.: 77



Look here -- I'm picking up the MSN cache copy from my wikipedia-watch server. Here's how I did it:

First I got the cache URL ID number from MSN by doing a search for: site:wikipedia.org alain leroy locke

Then I pasted this cache ID into a one-line script that ran on Linux:
CODE
curl -A "Mozilla/4.0 (compatible; MSIE 6.0; Windows 98)" -o "msncache.html" "http://cc.msnscache.com/cache.aspx?q=4081175923447"

That should all be on one line; this board most likely wrapped it. The only change I made to the msncache.html file that curl fetched was to delete the MSN header at the top.

The page looks like it came right out of Wikipedia, but the important content -- namely the text -- came from my server. The stuff that came from Wikipedia is the templates and the image -- things that Wikipedia is unable to change. This will work even if Wikipedia zaps the article and its history. They could delete the image, but that's no big deal. Many of my samples don't have images. The other stuff is used so widely on Wikipedia that they have to leave it alone. They could block my server, but how embarrassing would that be for them?

I think I'll have to grab as many MSN cache copies as I can before I go live.
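The header-stripping step described above might look something like this. The boundary marker below is purely an assumption for illustration; finding the real end of MSN's cache banner would take a quick "view source" on an actual cached page.

```python
def strip_msn_header(html, boundary="<!-- end msn header -->"):
    """Remove everything up to and including the (assumed) boundary
    marker that ends the cache-engine banner at the top of the page."""
    pos = html.find(boundary)
    if pos == -1:
        return html  # boundary not found: leave the page untouched
    return html[pos + len(boundary):].lstrip()

# Hypothetical cached page: banner, boundary comment, then the article.
page = "<div>MSN cache banner</div><!-- end msn header --><html>article</html>"
print(strip_msn_header(page))  # -> <html>article</html>
```

Since the stripped file is then served as static HTML, the article's CSS and images still load from Wikipedia's own servers, which is why the result looks like a genuine Wikipedia page.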
