Daniel Brandt
Wed 25th October 2006, 4:50pm
QUOTE(JohnA @ Wed 25th October 2006, 9:16am)

1. From your research, you claim that there are 965 domains scraping Wikipedia. Can you list them (on your website)? Is it possible to exclude such a list of domains from a search of Google?
2. Would it be possible for (some of) us to see the next list of plagiarism so that a) we can check who added the plagiarism and b) capture the edit histories for the articles before the admins wield the "Connelley Red Button"?
Sorry, but I can't help you.
1. Google has a 32-word limit in the search box. Yahoo has a 50-word limit. You do the search with -wikipedia -wiki in front of the sentence, with the sentence in quotes, then you throw out any URLs from that search that contain one of the 965 domains. If you are left without any results after doing this, then that means your search came up with nothing. Based on my experience, when you investigate your results manually after using this method, you will still reject 90 percent of the Wikipedia articles that survive this process, leaving you with ten percent that are worthy of further action.
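For anyone who wants to try the manual method in item 1, here is a minimal Python sketch of the two mechanical parts: building the query string and throwing out scraper URLs. The function names are made up for illustration, the scraper set is whatever list you compile yourself, and counting the two exclusion terms toward the 32-word limit is an assumption, not something that has been verified against Google.

```python
from urllib.parse import urlparse

GOOGLE_WORD_LIMIT = 32  # search-box limit quoted in the post


def build_query(sentence, word_limit=GOOGLE_WORD_LIMIT):
    """Return '-wikipedia -wiki "sentence"', trimmed to the word limit."""
    # Embedded double quotes would break the exact-phrase search.
    words = sentence.replace('"', " ").split()
    # Assumption: the two exclusion terms count toward the limit.
    words = words[: word_limit - 2]
    return '-wikipedia -wiki "' + " ".join(words) + '"'


def drop_scrapers(urls, scraper_domains):
    """Keep only URLs whose host is not a listed domain or a subdomain of one."""
    kept = []
    for url in urls:
        host = (urlparse(url).hostname or "").lower()
        if not any(host == d or host.endswith("." + d) for d in scraper_domains):
            kept.append(url)
    return kept
```

An empty list back from drop_scrapers corresponds to "your search came up with nothing"; anything that survives still needs the manual check described above.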
2. I have only about 45 titles yet to list.
Admins are working on identifying the plagiarizers. Wikipedians would just call you (or me) a troll if we identified a plagiarizer. They'd delete the evidence and then claim that you falsely accused an innocent teenager who is only trying to help poor children in Africa. I've been around the block on this.
For example, my bio talks about Wikipedia-Watch. It mainly mentions two things about the site: hivemind and the IRC logging. They mention these because it makes me look naughty. Last night someone tried to add the fact that there's an important new page that lists dozens of examples of plagiarism by Wikipedia editors. REVERT! REVERT! You see, that additional item makes Wikipedia look naughty instead of making me look naughty. My entire biography is spun the same way.
Sorry, but you cannot have my list of 965 domains. It took me two weeks to compile it, and that was after a week of writing programs and collecting data. I'm holding my list hostage.
If Wikipedia takes down my bio, I'll take down hivemind.html (but not hive2.html), and I'll install a search engine on Wikipedia-Watch that will do all this within a second or two:
1) accept an exact title of a Wikipedia article
2) fetch the XML version of the article
3) extract from one to five clean sentences from the text, if possible (they have to be clean or they will trip up Google)
4) run up to five simultaneous searches on Google (one search for each extracted sentence)
5) purge the results against my list of scrapers
6) report any URLs found by Google that survived this purge
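The six steps above could be wired together roughly like this. It is only a sketch under stated assumptions: fetch_text and search are hypothetical callables you would supply yourself (one returns the article's plain text for a title, the other returns a list of result URLs for a query), and the definition of a "clean" sentence used here (ASCII letters, spaces and commas only, 8 to 30 words) is a guess at what would avoid tripping up a search engine, not a tested rule.

```python
import re
from concurrent.futures import ThreadPoolExecutor
from urllib.parse import urlparse

# Hypothetical notion of "clean": letters, spaces and commas, ending in a period.
CLEAN = re.compile(r"^[A-Za-z][A-Za-z ,]*\.$")


def clean_sentences(text, limit=5, min_words=8, max_words=30):
    """Step 3: pick up to `limit` clean sentences from the article text."""
    candidates = [s.strip() for s in re.split(r"(?<=\.)\s+", text)]
    picked = [s for s in candidates
              if CLEAN.match(s) and min_words <= len(s.split()) <= max_words]
    return picked[:limit]


def is_scraper(url, scraper_domains):
    host = (urlparse(url).hostname or "").lower()
    return any(host == d or host.endswith("." + d) for d in scraper_domains)


def check_article(title, fetch_text, search, scraper_domains):
    text = fetch_text(title)                         # steps 1-2
    sentences = clean_sentences(text)                # step 3
    if not sentences:
        return []
    queries = ['-wikipedia -wiki "%s"' % s for s in sentences]
    with ThreadPoolExecutor(max_workers=5) as pool:  # step 4: parallel searches
        result_lists = list(pool.map(search, queries))
    survivors = []
    for urls in result_lists:                        # step 5: purge scrapers
        for url in urls:
            if not is_scraper(url, scraper_domains) and url not in survivors:
                survivors.append(url)
    return survivors                                 # step 6: report
```

Passing the fetcher and the search function in as arguments keeps the sketch independent of any particular search API, so the same code could be pointed at Yahoo, Google, or a scraper.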
By the way, Yahoo's API is easier to set up and has a limit of 5,000 searches a day. Google's API uses SOAP and has a limit of 1,000 a day. I have never used either, so I don't know if they would work for this task. You'd first want to do some experiments with Yahoo and Google and determine how easily each gets tripped up by subtle variations in a sentence. In other words, the complete sentence you extract from a Wikipedia article, and use inside of quotes to search Yahoo or Google, may hit or may not hit. It depends on the respective algorithms used by Yahoo or Google. A punctuation mark or an accent over a character could mean the difference between a hit and no-hit. Also, you should determine whether Yahoo crawls as deeply as Google. I think they're fairly close. MSN, on the other hand, doesn't go deep.
I haven't made these comparisons because I had to use Google, due to the volume of searching I needed.
Scroogle.org was already doing a fine job of scraping Google, and can handle high volume.
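For the hit-or-miss experiments suggested above, one way to start is to generate a few variants of each sentence (accents stripped, punctuation stripped, both) and record which ones each engine matches. This is a minimal sketch with hypothetical helper names; it only produces the variants and says nothing about how Yahoo or Google actually treat them.

```python
import string
import unicodedata


def strip_accents(s):
    """Decompose accented characters and drop the combining marks."""
    decomposed = unicodedata.normalize("NFKD", s)
    return "".join(c for c in decomposed if not unicodedata.combining(c))


def strip_punctuation(s):
    """Replace ASCII punctuation with spaces and collapse whitespace."""
    table = str.maketrans(string.punctuation, " " * len(string.punctuation))
    return " ".join(s.translate(table).split())


def variants(sentence):
    """Return the distinct forms of a sentence worth testing, original first."""
    forms = [
        sentence,
        strip_accents(sentence),
        strip_punctuation(sentence),
        strip_punctuation(strip_accents(sentence)),
    ]
    distinct = []
    for form in forms:
        if form not in distinct:
            distinct.append(form)
    return distinct
```

Searching each variant in quotes and comparing the results per engine would show how sensitive that engine is to punctuation and accents.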