My upcoming plagiarism report
The Wikipedia Review: A forum for discussion and criticism of Wikipedia
Wikipedia Review Op-Ed Pages



> My upcoming plagiarism report, How should I present it?
Daniel Brandt
Post #1


Postmaster
*******

Group: Regulars
Posts: 2,473
Joined:
Member No.: 77



I need suggestions on how to present my plagiarism report at wikipedia-watch.org. I still have several weeks of work to do, despite the fact that I've been working a few hours a day on it for the last three weeks.

I'm far enough along in separating the signal from the noise that I can now predict the report will end up with between 100 and 300 examples. Here's a throwaway example that will probably get corrected as soon as someone from Wikipedia sees this post:

Wikipedia version as of mid-September, 2006

Source that was plagiarized

Most of my examples are similar to this -- except they're not from Britannica, but rather from everywhere imaginable. Almost all of the original sources have clear copyright notices on them, and the source is not acknowledged on the Wikipedia article, and anywhere from several sentences to several paragraphs are plagiarized.

My question is, "How can I format the report so that anyone looking at it will get the picture, within a few clicks, that Wikipedia has a plagiarism problem?"

So far my best idea is to have a doorway page explaining that my examples were culled from a sampling of slightly less than one percent of the 1.4 million English-language Wikipedia articles. If I have 200 examples, then we can presume that there are about 20,000 plagiarized articles in Wikipedia that no one has yet discovered. No one has made any attempt to discover them, and no one ever will. It's just too hard. Even for programmers with a pipeline into automated Google inquiries, it's still too hard. There's an amazing amount of manual checking that's required to reduce the noise without throwing out the signal.
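The extrapolation above can be checked with a few lines of arithmetic. This is only a sketch: it treats "slightly less than one percent" as exactly 0.01 and assumes the sample is representative of the whole encyclopedia, neither of which the post guarantees. The function name is made up for illustration.

```python
def estimate_total(found_in_sample: int, sample_fraction: float) -> int:
    """Scale a count found in a sample up to the whole population.

    Assumes the sample is representative (an assumption, not a given).
    """
    if not 0 < sample_fraction <= 1:
        raise ValueError("sample_fraction must be in (0, 1]")
    return round(found_in_sample / sample_fraction)


TOTAL_ARTICLES = 1_400_000   # English-language articles, per the post
SAMPLE_FRACTION = 0.01       # "slightly less than one percent", rounded up

sample_size = round(TOTAL_ARTICLES * SAMPLE_FRACTION)
estimate = estimate_total(200, SAMPLE_FRACTION)

print(f"Sampled about {sample_size} articles")
print(f"Estimated plagiarized articles overall: about {estimate}")
```

Since the real sample is a bit under one percent, the true scaled-up figure would be slightly higher than this; with 100 or 300 examples instead of 200 the same function gives roughly 10,000 or 30,000.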

This doorway page will link to 200 subpages (Example 001, Example 002, ... Example 200). Each subpage will be titled "Plagiarism on Wikipedia - Example 001" (numbered accordingly) and have a link to the source, plus a link to the version on Wikipedia as of mid-September when I grabbed the page. Below this, the text-only portion of that page (this is easy to strip out of the XML versions of the article that I already have) will be reproduced, and the sections that are plagiarized from the source will be highlighted with a yellow background.
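One way the subpages described above could be generated is sketched here. It assumes the plagiarized passages have already been identified by hand and appear verbatim in the article text; the function names, the inline yellow style, and the page layout are illustrative choices, not anything the report actually uses.

```python
from html import escape
from typing import Iterable

HIGHLIGHT_OPEN = '<span style="background-color: yellow">'
HIGHLIGHT_CLOSE = "</span>"


def highlight(article_text: str, copied_passages: Iterable[str]) -> str:
    """Return HTML-escaped text with each copied passage on a yellow background."""
    # Collect (start, end) offsets of every verbatim occurrence.
    spans = []
    for passage in copied_passages:
        if not passage:
            continue  # an empty passage would match everywhere
        start = article_text.find(passage)
        while start != -1:
            end = start + len(passage)
            spans.append((start, end))
            start = article_text.find(passage, end)

    # Merge overlapping spans so the highlight tags never nest.
    merged = []
    for start, end in sorted(spans):
        if merged and start <= merged[-1][1]:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))

    # Stitch escaped plain text and highlighted text back together.
    parts = []
    pos = 0
    for start, end in merged:
        parts.append(escape(article_text[pos:start]))
        parts.append(HIGHLIGHT_OPEN + escape(article_text[start:end]) + HIGHLIGHT_CLOSE)
        pos = end
    parts.append(escape(article_text[pos:]))
    return "".join(parts)


def example_page(number, source_url, wikipedia_url, article_text, copied_passages):
    """Build one subpage: title, two links, then the highlighted article text."""
    title = f"Plagiarism on Wikipedia - Example {number:03d}"
    return (
        f"<html><head><title>{title}</title></head><body>\n"
        f"<h1>{title}</h1>\n"
        f'<p><a href="{escape(source_url)}">Source</a> | '
        f'<a href="{escape(wikipedia_url)}">Wikipedia version (mid-September)</a></p>\n'
        f"<p>{highlight(article_text, copied_passages)}</p>\n"
        "</body></html>\n"
    )
```

Because the article text is embedded in the page itself, the highlighted copy still makes the case even if the linked Wikipedia revision later disappears.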

The effect will be that the visitor to the doorway page is given some information on how the examples were found, and is invited to click randomly on any of the 200 examples to see for themselves. I'm linking to the mid-September version, since it's possible that many editors will start cleaning up these 200 examples. One way they will try to clean it up is to acknowledge the source, but that still doesn't solve the problem that entire paragraphs were copied verbatim. They'll have to change sentences around too.

Therefore, I predict that Jimmy will claim that Wikipedia is amazingly free from plagiarism, because Wikipedia has always had a zero-tolerance policy. (This will be a lie -- there have been no efforts to identify plagiarism at Wikipedia.) Then he will totally zap all 200 articles (no history, no nothing) so that the links to the September version on my subpages won't work. That's why I have to reproduce the text from the article and highlight the plagiarized material. If I don't, my report will not be convincing after Jimmy zaps the 200 articles.

Any other ideas?
Replies
Daniel Brandt
Post #2





Most of the plagiarism in my examples will also serve as examples of copyright infringement. Most of the original sources have copyright notices at the bottom of the page. Some even have bylines. A few are from public domain sites (government sites, the 1911 Britannica, etc.) where it is not copyright infringement, but when it is unattributed it is still plagiarism.

My examples show that the Foundation's copyright infringement situation is a problem. What Wikipedia is doing is grabbing copyrighted material, publishing it under their own terms (GFDL), and telling the world that anyone can help themselves and do the same thing. They invite dozens of mirrors to grab the material. And many more scrapers than mirrors grab the material without even mentioning "Wikipedia" or "wiki" anywhere on their page.

No systematic effort has ever been made by the Wikimedia Foundation to police new articles for possible plagiarism and/or copyright problems. They do this for images, but not for text. They should have been screening all new articles for whole-sentence hits in Google, and manually checking anything suspicious. (Actually, Google used to have a ten-word limit, but fairly recently changed to a 32-word limit in their search box, so a couple of years ago this might have been difficult to do.)
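As a rough illustration of the whole-sentence screening idea, the sketch below picks out sentences from an article that are short enough to fit in a 32-word search box. The sentence splitter is deliberately naive, and the minimum length of 8 words is an arbitrary assumption meant to skip sentences too short to be distinctive; neither comes from the post.

```python
import re

MAX_QUERY_WORDS = 32  # search-box limit mentioned in the post


def candidate_sentences(text, min_words=8, max_words=MAX_QUERY_WORDS):
    """Return sentences whose word count fits in a single search query.

    Splitting on ., ! or ? followed by whitespace is a crude heuristic;
    abbreviations such as "Dr." will produce false breaks.
    """
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    picked = []
    for sentence in sentences:
        word_count = len(sentence.split())
        if min_words <= word_count <= max_words:
            picked.append(sentence)
    return picked
```

Each sentence returned would then be searched as an exact phrase, with anything suspicious passed on for manual checking.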

Now it's too late to go back and check all the articles. For every instance of plagiarism and/or copyright violation that can be discovered in Wikipedia articles, you have to filter out about 10 to 20 Google hits on entire sentences that are noise instead of signal. In other words, these are the scrapers -- where the copying is going the other direction. By that I mean that some site grabbed Wikipedia's article, rather than some Wikipedia editor starting an article by plagiarizing another site. Google cannot tell the difference -- it takes a real person to compare the two articles. You can compile a list of scraper domains and reject them up front (it took me two weeks to do this, and I ended up with 950 scraper domains), but you still get rid of only 75 percent of the noise this way, and are left with lots of manual inspection. And my Google searches tried to exclude the obvious mirrors to start with, by placing -wikipedia -wiki in front of the complete sentence.
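The two filtering steps described above (excluding obvious mirrors in the query, then rejecting known scraper domains up front) might look something like this. It is a sketch only: no search API is called, the hit URLs are assumed to come from wherever the search results were collected, and the domain names in the usage example are placeholders rather than entries from the real 950-domain list.

```python
from urllib.parse import urlparse


def build_query(sentence):
    """Exact-phrase query with the mirror-excluding terms placed in front."""
    phrase = sentence.replace('"', "")  # inner quotes would break the phrase
    return f'-wikipedia -wiki "{phrase}"'


def domain_of(url):
    """Lower-cased host name of a URL, without a leading 'www.'."""
    host = urlparse(url).hostname or ""
    return host[4:] if host.startswith("www.") else host


def filter_hits(hit_urls, scraper_domains):
    """Drop hits whose host is a listed scraper domain or a subdomain of one."""
    kept = []
    for url in hit_urls:
        domain = domain_of(url)
        is_scraper = any(
            domain == s or domain.endswith("." + s) for s in scraper_domains
        )
        if not is_scraper:
            kept.append(url)
    return kept


# Usage with placeholder data:
scrapers = {"scraper-one.example", "scraper-two.example"}
hits = [
    "http://www.scraper-one.example/copy.html",
    "http://mirror.scraper-two.example/page",
    "http://original-source.example/article",
]
print(build_query("An entire sentence lifted from the article."))
print(filter_hits(hits, scrapers))
```

Whatever survives the filter still needs a person to compare the two pages and decide which direction the copying went.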

Some of these scraper pages even have bogus copyright notices on the bottom of what they scraped, because they use web tools that slap a notice on every page just to make it look cool (usually these say Copyright 2006, which is a giveaway). There is no way to find the needles in the haystack without manual effort.

The copyright thing is serious. If the Foundation had more money, they'd find a bunch of lawsuits at their door. YouTube has staff that do nothing but take down stuff under DMCA. Wikimedia Foundation has an even bigger problem, because their structure is designed to exercise more control over content. By virtue of this, they might have more legal liability than YouTube or Google, where no one expects them to screen stuff before their users post it or the Googlebot grabs it. YouTube or Google can escape liability under DMCA by acting after the fact, but Wikimedia Foundation might be in a different position because it is more directly involved in generating and shaping the content to begin with. (Actually, YouTube could be in trouble now that they are attractive as a lawsuit target. A court could decide that they should be pre-approving the content.)

I thought about researching who the users were behind the plagiarism, but that's too much work. Besides, I maintain that the Foundation should be doing this themselves, and that the Foundation and the admins are responsible if they don't police for plagiarism. I think it's best to keep the emphasis on Wikipedia as a whole, instead of picking off specific editors. My hivemind page was different -- there the Foundation supports its editors. But there is no way the Foundation can support copyright infringement. Rogue editors are their problem, not mine.


