Full Version: A couple of requests for Daniel Brandt
JohnA
Daniel,

Can I make a couple of requests?

1. From your research, you claim that there are 965 domains scraping Wikipedia. Can you list them (on your website)? Is it possible to exclude such a list of domains from a search of Google?

2. Would it be possible for (some of) us to see the next list of plagiarism so that a) we can check who added the plagiarism b) capture the edit histories for the articles before the admins wield the "Connelley Red Button"?

Somey: can you create a sooper-sekrit private forum on this board so that Daniel can do this (if he's willing)?
Somey
QUOTE(JohnA @ Wed 25th October 2006, 9:16am)
Somey: can you create a sooper-sekrit private forum on this board so that Daniel can do this (if he's willing)?

Sure, sooper-sekrit private forums are possible, and it's OK by me I guess. But it's Brandt's research, so it'll have to be OK with him too, along with who gets invited in.

Hey, this is almost cabal-like! Kewl!

Btw, I think the only really feasible way to exclude that many domains from a search would be to write a special script/program that uses Google's fancy API. That appears to be how Daniel did it... I've not tried it myself, but I've been told it's not excessively hard to use once you get it set up... Here's a couple of links:

http://code.google.com/apis.html

http://code.google.com/enterprise/
Daniel Brandt
QUOTE(JohnA @ Wed 25th October 2006, 9:16am)

1. From your research, you claim that there are 965 domains scraping Wikipedia. Can you list them (on your website)? Is it possible to exclude such a list of domains from a search of Google?

2. Would it be possible for (some of) us to see the next list of plagiarism so that a) we can check who added the plagiarism b) capture the edit histories for the articles before the admins wield the "Connelley Red Button"?

Sorry, but I can't help you.

1. Google has a 32-word limit in the search box. Yahoo has a 50-word limit. You do the search with -wikipedia -wiki in front of the sentence, with the sentence in quotes, then you throw out any URLs from that search that contain one of the 965 domains. If you are left without any results after doing this, then that means your search came up with nothing. Based on my experience, when you investigate your results manually after using this method, you will still reject 90 percent of the Wikipedia articles that survive this process, leaving you with ten percent that are worthy of further action.

2. I have only about 45 titles yet to list. Admins are working on identifying the plagiarizers. Wikipedians would just call you (or me) a troll if we identified a plagiarizer. They'd delete the evidence and then claim that you falsely accused an innocent teenager who is only trying to help poor children in Africa. I've been around the block on this.

For example, my bio talks about Wikipedia-Watch. It mainly mentions two things about the site: hivemind and the IRC logging. They mention these because it makes me look naughty. Last night someone tried to add the fact that there's an important new page that lists dozens of examples of plagiarism by Wikipedia editors. REVERT! REVERT! You see, that additional item makes Wikipedia look naughty instead of making me look naughty. My entire biography is spun the same way.

Sorry, but you cannot have my list of 965 domains. It took me two weeks to compile it, and that was after a week of writing programs and collecting data. I'm holding my list hostage.

If Wikipedia takes down my bio, I'll take down hivemind.html (but not hive2.html), and I'll install a search engine on Wikipedia-Watch that will do all this within a second or two:

1) accept an exact title of a Wikipedia article
2) fetch the XML version of the article
3) extract from one to five clean sentences from the text, if possible (they have to be clean or they will trip up Google)
4) run up to five simultaneous searches on Google (one search for each extracted sentence)
5) purge the results against my list of scrapers
6) report any URLs found by Google that survived this purge
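
For anyone who wants to experiment, steps 3 through 6 above can be roughly sketched in Python as follows. This is only an illustration under stated assumptions, not the actual program behind the research. The `search` callable is a hypothetical stand-in for whatever search API or scraper you use, the definition of a "clean" sentence (plain letters, digits, spaces and commas, ending in a period) is a guess, and step 2 (fetching the article) is left out, so the article text is passed in directly.

```python
import re
from concurrent.futures import ThreadPoolExecutor
from urllib.parse import urlparse

WORD_LIMIT = 32  # Google search-box limit quoted earlier in this post
# Assumed definition of "clean": plain letters, digits, spaces, commas, final period.
CLEAN = re.compile(r"[A-Za-z0-9 ,]+\.")

def extract_clean_sentences(text, limit=5, min_words=8):
    """Step 3: keep up to `limit` sentences that are unlikely to trip up the engine."""
    candidates = (s.strip() for s in re.split(r"(?<=[.!?])\s+", text))
    clean = [s for s in candidates
             if CLEAN.fullmatch(s)
             # Assumption: the two exclusion terms count toward the word limit.
             and min_words <= len(s.split()) <= WORD_LIMIT - 2]
    return clean[:limit]

def build_query(sentence):
    """Prefix the exclusion terms and put the sentence in quotes."""
    return f'-wikipedia -wiki "{sentence}"'

def is_scraper(url, scraper_domains):
    """True if the URL's host is a listed domain or a subdomain of one."""
    host = (urlparse(url).hostname or "").lower()
    return any(host == d or host.endswith("." + d) for d in scraper_domains)

def check_article(text, search, scraper_domains):
    """Steps 3-6: extract sentences, search in parallel, purge, report survivors."""
    queries = [build_query(s) for s in extract_clean_sentences(text)]
    if not queries:
        return []
    # Step 4: at most five queries, so five workers run them simultaneously.
    with ThreadPoolExecutor(max_workers=5) as pool:
        result_lists = list(pool.map(search, queries))
    survivors = []
    for urls in result_lists:
        for url in urls:
            # Step 5: drop known scrapers; also drop duplicate URLs.
            if url not in survivors and not is_scraper(url, scraper_domains):
                survivors.append(url)
    return survivors  # Step 6: anything left still needs manual investigation
```

Passing `search` in as a parameter keeps the sketch independent of any one engine, so the same code could be tried against Google, Yahoo, or a local test stub.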

By the way, Yahoo's API is easier to set up and has a limit of 5,000 searches a day. Google's API uses SOAP and has a limit of 1,000 a day. I have never used either, so I don't know if they would work for this task. You'd first want to do some experiments with Yahoo and Google and determine how easily each gets tripped up by subtle variations in a sentence. In other words, the complete sentence you extract from a Wikipedia article, and use inside of quotes to search Yahoo or Google, may hit or may not hit. It depends on the respective algorithms used by Yahoo or Google. A punctuation mark or an accent over a character could mean the difference between a hit and no-hit. Also, you should determine whether Yahoo crawls as deeply as Google. I think they're fairly close. MSN, on the other hand, doesn't go deep.
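
One way to run the experiment described above is to generate a few variants of each sentence and see which ones an engine matches. The sketch below is a hypothetical helper that only builds the variants (original, accents stripped, punctuation stripped, both); how any given engine treats them is exactly what would need testing, and nothing here is based on knowledge of either engine's algorithm.

```python
import string
import unicodedata

def strip_accents(s):
    """Decompose characters and drop the combining marks (e.g. an acute accent)."""
    decomposed = unicodedata.normalize("NFKD", s)
    return "".join(c for c in decomposed if not unicodedata.combining(c))

def strip_punctuation(s):
    """Remove ASCII punctuation and collapse any leftover runs of whitespace."""
    no_punct = s.translate(str.maketrans("", "", string.punctuation))
    return " ".join(no_punct.split())

def query_variants(sentence):
    """Return the distinct variants of a sentence, original first."""
    candidates = [
        sentence,
        strip_accents(sentence),
        strip_punctuation(sentence),
        strip_punctuation(strip_accents(sentence)),
    ]
    variants = []
    for c in candidates:
        if c not in variants:  # keep order, skip duplicates
            variants.append(c)
    return variants
```

Searching each variant in quotes on both engines and comparing the hits would show how sensitive each one is to these differences.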

I haven't made these comparisons because I had to use Google, due to the volume of searching I needed. Scroogle.org was already doing a fine job of scraping Google, and can handle high volume.
Somey
QUOTE(Daniel Brandt @ Wed 25th October 2006, 11:50am)
Wikipedians would just call you (or me) a troll if we identified a plagiarizer. They'd delete the evidence and then claim that you falsely accused an innocent teenager who is only trying to help poor children in Africa.

When of course, the real point for them is to keep the all-night party train of anonymity and consequence-free, morality-free "participation" rolling as long as they possibly can!

QUOTE
Sorry, but you cannot have my list of 965 domains. It took me two weeks to compile it, and that was after a week of writing programs and collecting data. I'm holding my list hostage.

I wonder if something like that could be worth something, to the right company? I guess it would depend on how far a publisher like Britannica was willing to go in taking action against Wikipedia. The problem for them is that they probably wouldn't make a lot of money from it, and there might be negative PR implications.

Anyway, if you should change your mind about having some of us research the next list of articles prior to general release, the offer still stands as far as I'm concerned. I'm a little busy this weekend, but I figure 3 or 4 of us can still probably get through 45 articles in a couple of days, and still manage to maintain some semblance of a life.

QUOTE
By the way, Yahoo's API is easier to set up and has a limit of 5,000 searches a day. Google's API uses SOAP and has a limit of 1,000 a day. I have never used either, so I don't know if they would work for this task.

I got the impression that they just might, though again, not from personal experience. I can't say it would make for an interesting project compared to, say, my utterly diabolical master plan for turning the world's oceans into masses of flaming slime, but it might make the top ten or so...?
JohnA
If I could make another request to Daniel Brandt, it would be interesting to survey the "Featured Articles" for plagiarism, wouldn't it?