The Wikipedia Review: A forum for discussion and criticism of Wikipedia
Wikipedia Review Op-Ed Pages


> Wikipedia-sponsored denial of service attack on this site, organized from the #wikipedia IRC channel: info on an illegal attack planned by Wikipedia users against this site
Locke85
Post #1

Neophyte

Group: Contributors
Posts: 12
Joined:
Member No.: 210

I just wanted to suggest that we do more to weed out fake usernames that are either meant to be used as read-only accounts (to get access to the pit and other places that are now, or may in the future be, available only to members) or as accounts used for write access but only to post things that are all but trolling. I know the blatant troll accounts are immediately blocked, as I have seen happen to admins who have tried to infiltrate this site just to spam us, but over IRC (#wikipedia on irc.freenode.net, for those of you who are interested) I heard a discussion about creating fake usernames specifically to spam this site and, if nothing else, to crowd the userlist with useless names with no posts.

This post has been edited by Locke85:

Replies
Daniel Brandt
Post #2

Postmaster

Group: Regulars
Posts: 2,473
Joined:
Member No.: 77

Wget will do recursive crawling too, but looking at the site, I don't think a recursive crawl would work well. Even if it did, you'd end up with too much garbage.

I'd write a program that shells out to: lynx -dump URL > outputfile
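
The shell-out itself is only a couple of lines in something like Python (just a sketch; it assumes lynx is installed and on the PATH, and the function name is my own):

import subprocess

def fetch_page(url, outputfile):
    # lynx -dump renders the page as plain text, stripping the HTML
    with open(outputfile, "w") as out:
        subprocess.run(["lynx", "-dump", url], stdout=out, check=True)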

The first step is to make a text file listing each of the 28 days, plus how many pages there are for each day.

Use this file to drive the program. This info is all you need to construct the URL for each fetch. By using Lynx, you already avoid about 35 percent of the characters in each file, because Lynx strips out the HTML.
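
As a rough sketch of the driver in Python (the days-file format and the URL pattern below are placeholders I made up for illustration; the real log URLs go in their place):

import subprocess

BASE = "http://example.org/logs"  # placeholder, not the real log location

# days.txt: one line per day, e.g. "20060101 14" meaning that day has 14 pages
with open("days.txt") as days:
    for entry in days:
        if not entry.strip():
            continue
        day, pagecount = entry.split()
        for page in range(1, int(pagecount) + 1):
            url = f"{BASE}/{day}/page{page}.html"        # placeholder pattern
            outputfile = f"{day}-{page:03d}.txt"
            with open(outputfile, "w") as out:
                # lynx -dump strips the HTML, as noted above
                subprocess.run(["lynx", "-dump", url], stdout=out, check=True)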

After you get all your files, you can write a routine to parse out more noise at the top and bottom of each file that Lynx didn't delete. Make each line flush left by deleting any leading tabs or whitespace. Add the date to the time if it isn't there already. (The latest files include the date, but the early ones I looked at have only the time on each line.) Then concatenate all the pages for each of the 28 days into a single file for that day.
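
In rough Python terms, the cleanup for one day might look like this (a sketch only: the timestamp pattern and how much header/footer junk to trim are guesses until you see what Lynx actually leaves behind):

import glob
import re

# guess at a time-only line, e.g. "14:05 <somebody> ..." -- adjust to the real format
TIME_ONLY = re.compile(r"^\d{1,2}:\d{2}")

def merge_day(day):
    # merge every page dump for one day (day-001.txt, day-002.txt, ...) into day.log
    with open(f"{day}.log", "w") as merged:
        for path in sorted(glob.glob(f"{day}-*.txt")):
            with open(path) as page:
                for raw in page:
                    line = raw.strip()            # flush left: drop tabs and spaces
                    if not line:
                        continue                  # stand-in for trimming leftover noise
                    if TIME_ONLY.match(line):     # time only, so prepend the date
                        line = f"{day} {line}"
                    merged.write(line + "\n")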

It would take a day of work, but it's probably worth doing. I just saw some of the stuff they were saying about me today on #wikipedia, and it's not very kind. Keeping a log like this could be evidence of their intent.

I know that Wikipedia considers these logs private, but as far as I know, there is no legal standing behind this policy. Is anyone aware of any legal problems with logging this stuff and making it searchable?

Finally, does anyone have a Linux program that can run stand-alone (no browser) that will log the channel? I don't know much about IRC -- have only played with it for a few weeks, and only with ChatZilla.

Better PM me on this; I'm sure they're listening.