Sun 30th October 2011, 8:28pm
Quite correct. That is evident from the pie chart: certain editors have a high 'blue' proportion, which represents the WP:-prefixed pages. There is no way round that except by selectively querying the database to get only article contributions.
Well, there's no perfect way of doing it, but you could just subtract off the blue to get what's probably a better estimate.
So edits per article page would be (1 - (%wikipedia + %wikipedia talk)) * average edits per page.
Ideally you'd want to adjust the number of "pages" as well by subtracting AN/I and AE or whatever, but since there aren't that many of these pages, it won't get too skewed.
The only possible exception is FAR pages, which also count as "wikipedia" (blue) even though a lot of that work is obviously content-related.
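To make the arithmetic concrete, here's a minimal sketch of that calculation in Python. All the figures in the example call are invented for illustration, not anyone's real stats:

```python
# Crude estimate: strip the 'blue' (WP:-prefixed) fraction out of the
# numerator, leaving the page count alone. Hypothetical figures only.

def adjusted_edits_per_article_page(total_edits, distinct_pages,
                                    pct_wikipedia, pct_wikipedia_talk):
    """(1 - (%wikipedia + %wikipedia talk)) * average edits per page."""
    article_fraction = 1 - (pct_wikipedia + pct_wikipedia_talk)
    return article_fraction * (total_edits / distinct_pages)

# Say 50,000 edits over 10,000 distinct pages, with 20% to Wikipedia:
# and 5% to Wikipedia talk: (made-up numbers).
# Raw average: 5.0 edits/page; adjusted: 0.75 * 5.0 = 3.75.
print(adjusted_edits_per_article_page(50_000, 10_000, 0.20, 0.05))
```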
The real difficulty is adjusting for the number of edits on user talk pages, since there's no way to tell how many different user talk pages a particular person posted to. And a lot of these admins basically spend the majority of their time politickin' on each other's talk pages, so that's really something which should be taken into account. For example, Fetchcommons has 28.26% of his posts to user talk. Sandstein has 24.71%. SarekOfVulcan has 24.09%. Jechochman (who has a pretty high average edits per page, but that's not because he edits articles a lot) has 28.93%, Georgewilliamherbert 34.69%, BWilkins 39.26%, etc.
For comparison, only 6.62% of my edits are to user talk.
So none of those user talk edits have anything to do with average edits per ARTICLE page. Again, the difficulty is in adjusting both the numerator and the denominator here.
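The numerator side of that is easy enough to extend (just subtract the user talk percentage as well), but the denominator can't be fixed the same way, since the stats don't report how many distinct user talk pages someone edited. A sketch, again with invented figures:

```python
# Extend the numerator adjustment to drop user talk edits too.
# Percentages here are hypothetical.

def estimated_article_edits(total_edits, pct_wikipedia,
                            pct_wikipedia_talk, pct_user_talk):
    """Edits left over after removing the WP:, WP talk:, and
    user talk fractions."""
    return total_edits * (1 - pct_wikipedia - pct_wikipedia_talk
                          - pct_user_talk)

# The denominator is the sticking point: 'distinct pages' lumps user
# talk pages in with articles, and there's no per-namespace page count
# to subtract, so true edits per ARTICLE page stays out of reach.
print(estimated_article_edits(50_000, 0.20, 0.05, 0.25))  # 25000.0
```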
Still, I think the formula above would give a somewhat better picture of actual edits per article page.
And there are many other ways this figure is skewed. E.g. YellowMonkey has the highest number of FAs, yet a (relatively) low average e.p.p. of 3.69. All I can hope to give is a blunt figure that shows some correlation with our intuitive idea of 'content': something that cannot be produced by flitting from page to page, and which requires a long look at a single article, at its summary and the meaning of its parts.
Yes, and some people will work on articles in their word processor or sandbox and then just post the finished thing. Others (like me) like to do it bit by bit. So the measure is obviously going to miss that.
Yes, you could use the % of namespace totals as a proxy, but I can think of several reasons why that might be skewed.
At the end of the day, I am trying to give one of many reasons why the concept of 'crowdsourcing' is badly flawed.
Well, any statistic summarizes information, almost by definition. And when you summarize information, by definition, you're going to lose some of it (the only alternative is to somehow look at every single edit ever made on Wikipedia simultaneously). That doesn't mean that describing data with statistics is useless.