The Wikipedia Review: A forum for discussion and criticism of Wikipedia
Wikipedia Review Op-Ed Pages


> General Discussion? What's that all about?

This subforum is for general discussion of Wikipedia and other Wikimedia projects. For a glossary of terms frequently used in such discussions, please refer to Wikipedia:Glossary. For a glossary of musical terms, see here. Other useful links:

Akahele.org • Wikipedia-Watch • Wikitruth • WP:AN • WikiEN-L/Foundation-L (mailing lists) • Citizendium forums

> It's slowly dying
Kelly Martin
post Mon 31st October 2011, 1:44am
Post #41


Bring back the guttersnipes!
********

Group: Regulars
Posts: 3,270
Joined: Sun 22nd Jun 2008, 4:41am
From: EN61bw
Member No.: 6,696



QUOTE(Guido den Broeder @ Sun 30th October 2011, 8:27pm) *

Note that the random number generator that the software uses is very poor, with high autocorrelation (and note that no article on Wikipedia tells you anything about the quality of RNGs).
In addition, the randomization process that MediaWiki uses results in a nonuniform distribution: some articles will be returned *far* more often than other articles.

Each page, when initially created, is assigned a random number between 0 and 1. The "random page" function works by generating another random number between 0 and 1 and finding the page whose assigned number is the smallest one greater than the number just generated. Because the numbers assigned to pages are randomly scattered rather than evenly spaced, the gaps between them differ in size, so some pages will be returned more often than others. This is in addition to the fairly poor PRNG that PHP uses.

From a statistical sampling standpoint this bias can probably be ignored because it's almost certainly uncorrelated with any metric of actual interest. The only exception is that I would expect slightly higher view and edit rates for pages with greater random exposure, on the grounds that they'll get slightly more views from people using "random page", but I suspect this effect is extremely small.
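
The selection scheme described above can be sketched as a small simulation. This is only an illustration, not MediaWiki's code: the function names are made up for the example, and it assumes that a draw larger than every page's number wraps around to the first page. Under those assumptions, a page's chance of being picked equals the gap between its number and the previous page's number.

```python
import bisect
import random


def assign_page_keys(n_pages, rng):
    """Give each page a key drawn uniformly from [0, 1), kept in sorted order."""
    return sorted(rng.random() for _ in range(n_pages))


def selection_probabilities(keys):
    """Probability that each page is returned by a single 'random page' draw.

    A draw r selects the page with the smallest key greater than r, so each
    page's probability is the gap between its key and the previous key.
    Assumption: draws above the largest key wrap around to the first page.
    """
    probs = []
    previous = 0.0
    for key in keys:
        probs.append(key - previous)
        previous = key
    probs[0] += 1.0 - keys[-1]  # wrap-around assumption
    return probs


def pick_page(keys, r):
    """Index of the page with the least key strictly greater than r (wrapping)."""
    return bisect.bisect_right(keys, r) % len(keys)


rng = random.Random(2011)  # fixed seed so the sketch is repeatable
keys = assign_page_keys(1000, rng)
probs = selection_probabilities(keys)
uniform = 1.0 / len(keys)
print(f"most favoured page:  {max(probs) / uniform:.1f}x the uniform share")
print(f"least favoured page: {min(probs) / uniform:.4f}x the uniform share")
```

With a thousand pages the largest gap tends to be several times the uniform share and the smallest gap a tiny fraction of it, which is the nonuniformity the post refers to.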
radek
post Mon 31st October 2011, 2:00am
Post #42


Über Member
*****

Group: Regulars
Posts: 699
Joined: Sat 28th Nov 2009, 10:40pm
Member No.: 15,651




QUOTE(Guido den Broeder @ Sun 30th October 2011, 8:27pm) *

Note that the random number generator that the software uses is very poor, with high autocorrelation (and note that no article on Wikipedia tells you anything about the quality of RNGs).

A proper sample should therefore contain far more than 100 articles.


I've been wondering about that! One thing I noticed while clicking through the thing is that if you get an article named "xxx blah blah blah blah ARCTIC blah blah blah xxx" then quite often the next article you get is named "blah xxx blah blah xxx ARCTIC blah xxx blah". Which does suggest some weird autocorrelation in the algorithm. So it might very well be that if you click ten articles and only 4 of them show up as stubs, then it is quite likely that out of the next 10 the most likely value is 4 as well.

Any idea on how the Random Article feature actually works? Like what's the script or the random number generator underlying it?

Having said that, my sample is pretty close to 1000 now and the figure of 60% of articles being stubs is fairly robust, so I'm pretty sure it's representative.
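
As a rough check on how much a sample of 100 versus 1000 pins down a 60% figure, here is a minimal sketch using the standard normal-approximation interval for a proportion. It assumes the draws are independent, which is exactly the point under dispute in this thread, and the function name is made up for the example.

```python
import math


def proportion_interval(p_hat, n, z=1.96):
    """Normal-approximation (Wald) interval for a sample proportion.

    Assumes n independent draws; z=1.96 corresponds to roughly 95% coverage.
    """
    se = math.sqrt(p_hat * (1.0 - p_hat) / n)
    return p_hat - z * se, p_hat + z * se


for n in (100, 1000):
    low, high = proportion_interval(0.60, n)
    print(f"n={n}: {low:.3f} to {high:.3f}")
```

Under the independence assumption the interval is roughly plus or minus ten percentage points at 100 articles and roughly three at 1000.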

QUOTE(Kelly Martin @ Sun 30th October 2011, 8:44pm) *

QUOTE(Guido den Broeder @ Sun 30th October 2011, 8:27pm) *

Note that the random number generator that the software uses is very poor, with high autocorrelation (and note that no article on Wikipedia tells you anything about the quality of RNGs).
In addition, the randomization process that MediaWiki uses results in a nonuniform distribution: some articles will be returned *far* more often than other articles.

Each page, when initially created, is assigned a random number between 0 and 1. The "random page" function works by generating another random number between 0 and 1 and finding the page whose assigned number is the smallest one greater than the number just generated. Because the numbers assigned to pages are randomly scattered rather than evenly spaced, the gaps between them differ in size, so some pages will be returned more often than others. This is in addition to the fairly poor PRNG that PHP uses.

From a statistical sampling standpoint this bias can probably be ignored because it's almost certainly uncorrelated with any metric of actual interest. The only exception is that I would expect slightly higher view and edit rates for pages with greater random exposure, on the grounds that they'll get slightly more views from people using "random page", but I suspect this effect is extremely small.


Yeah, this is why I invoked the Central Limit Theorem above - with a sufficiently large sample (which isn't that large, 30 or so) the underlying distribution of the randomization process shouldn't matter, uniform or not. But autocorrelation CAN.
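
To illustrate why autocorrelation matters even when the sample is large, here is a minimal simulation sketch. It is a toy model, not a claim about how Special:Random behaves: each draw simply repeats the previous stub/non-stub result with a hypothetical probability `stickiness`, and is otherwise a fresh independent draw with a 60% stub rate.

```python
import random
import statistics


def sticky_sample(n, p, stickiness, rng):
    """Draw n stub/non-stub flags where each draw repeats the previous one
    with probability `stickiness`, otherwise is a fresh Bernoulli(p) draw."""
    flags = [rng.random() < p]
    for _ in range(n - 1):
        if rng.random() < stickiness:
            flags.append(flags[-1])
        else:
            flags.append(rng.random() < p)
    return flags


def spread_of_estimates(n, p, stickiness, trials, rng):
    """Standard deviation of the estimated stub share across repeated samples."""
    estimates = [
        sum(sticky_sample(n, p, stickiness, rng)) / n for _ in range(trials)
    ]
    return statistics.pstdev(estimates)


rng = random.Random(42)  # fixed seed so the sketch is repeatable
independent = spread_of_estimates(100, 0.6, 0.0, 2000, rng)
correlated = spread_of_estimates(100, 0.6, 0.5, 2000, rng)
print(f"independent draws: sd = {independent:.3f}")
print(f"sticky draws:      sd = {correlated:.3f}")
```

In this toy model the long-run stub share is still 60% either way, but the estimate from 100 sticky draws scatters noticeably more than the estimate from 100 independent ones, so the sample is worth fewer effective observations.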
Silver seren
post Mon 31st October 2011, 2:32am
Post #43


Senior Member
****

Group: Contributors
Posts: 470
Joined: Thu 30th Dec 2010, 2:09pm
Member No.: 36,940




I felt like trying this too with 100 articles. The Not Sure's are ones that are in between a stub and not being a stub (around 2-3 paragraphs long). I was rather surprised at how many of the biographies were not contemporary people, but actually historic figures and such.

Stub - 29

Not stub - 46

Disambiguation - 8

List - 6

Not sure - 10

Good article - 1

Types:

Places - 20

Living things - 4

Television/Film/etc. - 1

Company - 4

Music - 9

Biography - 27

Dance - 1

Politics - 4

Transportation - 1

Sports - 2

Conflict - 1

Buildings/Monuments/etc. - 2

Space - 2

Education - 1

Television/Radio/etc. Station - 2

Books/Magazines/etc. - 2

Dates - 1

Laws - 1

Newspapers - 1

Board games - 1

Internet - 1

Ships - 1

Weather - 2

Genetics - 1

This post has been edited by Silver seren: Mon 31st October 2011, 2:33am
Kelly Martin
post Mon 31st October 2011, 2:37am
Post #44


Bring back the guttersnipes!
********

Group: Regulars
Posts: 3,270
Joined: Sun 22nd Jun 2008, 4:41am
From: EN61bw
Member No.: 6,696



QUOTE(radek @ Sun 30th October 2011, 9:00pm) *
Yeah, this is why I invoked the Central Limit Theorem above - with a sufficiently large sample (which isn't that large, 30 or so) the underlying distribution of the randomization process shouldn't matter, uniform or not. But autocorrelation CAN.
I would not expect there to be any observable autocorrelation in a sequence of invocations of Special:Random. Each request is going to be processed by a new instance of the PHP engine, with a new random seed initialized from system entropy (probably the system clock) at the start of the request. The random number generator is reseeded for each request, rather than being maintained and reused, as far as I know. Also, because of the large number of webservers in use at the Wikimedia farm, any given request is unlikely to be serviced by the same webserver that serviced a previous request, and the random seed is definitely not stored in portable session data. There is no call to srand (or mt_srand) in MediaWiki's code, at least not in code normally used during routine operations (there are calls to srand in certain maintenance and test procedures, but those are not used in normal operations).
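
One way to test the expectation above is to measure the lag-1 autocorrelation of a sequence of draws. The sketch below is only a model of the argument, written in Python rather than PHP: each simulated request creates a brand-new generator (Python's `random.Random()` seeds itself from the operating system when no seed is given) and uses it once. The function names are made up for the example.

```python
import random


def lag1_autocorrelation(values):
    """Sample lag-1 autocorrelation of a sequence of numbers."""
    n = len(values)
    mean = sum(values) / n
    var = sum((v - mean) ** 2 for v in values)
    cov = sum((values[i] - mean) * (values[i + 1] - mean) for i in range(n - 1))
    return cov / var


def draw_per_request():
    """Model one request: a new generator, seeded by the OS, used for one draw."""
    return random.Random().random()


draws = [draw_per_request() for _ in range(5000)]
print(f"lag-1 autocorrelation: {lag1_autocorrelation(draws):+.3f}")
```

If every request really does get an independently seeded generator, the printed value should sit close to zero; a value well away from zero on real data would point to state being carried over between requests.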
radek
post Mon 31st October 2011, 2:44am
Post #45


Über Member
*****

Group: Regulars
Posts: 699
Joined: Sat 28th Nov 2009, 10:40pm
Member No.: 15,651




QUOTE(Silver seren @ Sun 30th October 2011, 9:32pm) *

I felt like trying this too with 100 articles. The Not Sure's are ones that are in between a stub and not being a stub (around 2-3 paragraphs long). I was rather surprised at how many of the biographies were not contemporary people, but actually historic figures and such.

Stub - 29

Not stub - 46

Disambiguation - 8

List - 6

Not sure - 10

Good article - 1


Ignoring disambigs, lists, and your "not sure" we get 29/47=61.7% which is again close to 60% stubs

This post has been edited by radek: Mon 31st October 2011, 2:48am
Silver seren
post Mon 31st October 2011, 3:05am
Post #46


Senior Member
****

Group: Contributors
Posts: 470
Joined: Thu 30th Dec 2010, 2:09pm
Member No.: 36,940




QUOTE(radek @ Mon 31st October 2011, 2:44am) *

QUOTE(Silver seren @ Sun 30th October 2011, 9:32pm) *

I felt like trying this too with 100 articles. The Not Sure's are ones that are in between a stub and not being a stub (around 2-3 paragraphs long). I was rather surprised at how many of the biographies were not contemporary people, but actually historic figures and such.

Stub - 29

Not stub - 46

Disambiguation - 8

List - 6

Not sure - 10

Good article - 1


Ignoring disambigs, lists, and your "not sure" we get 29/47=61.7% which is again close to 60% stubs


Um...I think you have it backwards. It would be 40% stubs and 60% not stubs.
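
The two figures being argued over can be worked out directly from the counts quoted above. This sketch assumes the 47 in "29/47" is the 46 non-stubs plus the 1 good article, which is the reading that matches the numbers; disambiguation pages, lists and "not sure" are left out, as in the quoted calculation.

```python
# Counts taken from the quoted 100-article sample.
counts = {"stub": 29, "not stub": 46, "good article": 1}

stubs = counts["stub"]
# Assumption: the good article is counted with the non-stubs (46 + 1 = 47).
non_stubs = counts["not stub"] + counts["good article"]

ratio = stubs / non_stubs            # stubs per non-stub article
share = stubs / (stubs + non_stubs)  # stubs as a fraction of classified articles

print(f"stubs / non-stubs = {ratio:.3f}")
print(f"stubs / (stubs + non-stubs) = {share:.3f}")
```

The first figure is a ratio of stubs to non-stubs (about 0.62), while the second is the share of stubs among the classified articles (about 0.38), which is where the "40% stubs and 60% not stubs" reading comes from.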
radek
post Mon 31st October 2011, 3:15am
Post #47


Über Member
*****

Group: Regulars
Posts: 699
Joined: Sat 28th Nov 2009, 10:40pm
Member No.: 15,651




QUOTE(Silver seren @ Sun 30th October 2011, 10:05pm) *

QUOTE(radek @ Mon 31st October 2011, 2:44am) *

QUOTE(Silver seren @ Sun 30th October 2011, 9:32pm) *

I felt like trying this too with 100 articles. The Not Sure's are ones that are in between a stub and not being a stub (around 2-3 paragraphs long). I was rather surprised at how many of the biographies were not contemporary people, but actually historic figures and such.

Stub - 29

Not stub - 46

Disambiguation - 8

List - 6

Not sure - 10

Good article - 1



You're right, take it as stubs/not stubs+not sure. Same thing.
Ignoring disambigs, lists, and your "not sure" we get 29/47=61.7% which is again close to 60% stubs


Um...I think you have it backwards. It would be 40% stubs and 60% not stubs.


You're right, but take it as stubs/not stubs + not sure. Same thing.

This post has been edited by radek: Mon 31st October 2011, 3:16am
Guido den Broeder
post Mon 31st October 2011, 11:41am
Post #48


Senior Member
****

Group: Regulars
Posts: 425
Joined: Thu 19th Feb 2009, 7:31pm
Member No.: 10,371



QUOTE(Kelly Martin @ Mon 31st October 2011, 3:37am) *

QUOTE(radek @ Sun 30th October 2011, 9:00pm) *
Yeah, this is why I invoked the Central Limit Theorem above - with a sufficiently large sample (which isn't that large, 30 or so) the underlying distribution of the randomization process shouldn't matter, uniform or not. But autocorrelation CAN.
I would not expect there to be any observable autocorrelation in a sequence of invocations of Special:Random. Each request is going to be processed by a new instance of the PHP engine, with a new random seed initialized from system entropy (probably the system clock) at the start of the request. The random number generator is reseeded for each request, rather than being maintained and reused, as far as I know. Also, because of the large number of webservers in use at the Wikimedia farm, any given request is unlikely to be serviced by the same webserver that serviced a previous request, and the random seed is definitely not stored in portable session data. There is no call to srand (or mt_srand) in MediaWiki's code, at least not in code normally used during routine operations (there are calls to srand in certain maintenance and test procedures, but those are not used in normal operations).

I didn't expect it either, but it is clearly observable. Could the cache interfere with the reseeding?
Kelly Martin
post Mon 31st October 2011, 1:06pm
Post #49


Bring back the guttersnipes!
********

Group: Regulars
Posts: 3,270
Joined: Sun 22nd Jun 2008, 4:41am
From: EN61bw
Member No.: 6,696



QUOTE(Guido den Broeder @ Mon 31st October 2011, 6:41am) *
I didn't expect it either, but it is clearly observable. Could the cache interfere with the reseeding?
Special page results are uncacheable, so that shouldn't be happening. The only thing that makes sense is that the squids are using session affinity to send you back to the same webserver for each request, and the servers are preserving the random seed instead of reseeding. Wikimedia is running an accelerated version of PHP; it's possible that the engine doesn't reseed at the start of *every* script run, and is instead persisting the random seed between sessions. I looked around the web last night for some information on how the random seed is managed when PHP is embedded in a webserver, but couldn't find anything.

I still don't understand why they don't just periodically reseed from /dev/random. /dev/random is a fairly high quality entropy source, and on a system with a high level of network activity the rate at which entropy is created is fairly high, so there should be enough entropy available to reseed fairly often.
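
The periodic-reseeding idea can be sketched as follows. This is an illustration in Python, not PHP or MediaWiki code: the class name and the `interval` parameter are made up for the example, and it uses `os.urandom` (which asks the operating system for random bytes without blocking) in place of reading /dev/random directly.

```python
import os
import random


class PeriodicallyReseededRandom:
    """PRNG wrapper that reseeds from the operating system every `interval` draws."""

    def __init__(self, interval=1000):
        self.interval = interval
        self.draws_since_reseed = 0
        self.reseed_count = 0
        self._rng = random.Random()
        self._reseed()

    def _reseed(self):
        # 16 bytes (128 bits) of OS-provided randomness become the new seed.
        seed = int.from_bytes(os.urandom(16), "big")
        self._rng.seed(seed)
        self.draws_since_reseed = 0
        self.reseed_count += 1

    def random(self):
        """Return a float in [0, 1), reseeding first if the interval is used up."""
        if self.draws_since_reseed >= self.interval:
            self._reseed()
        self.draws_since_reseed += 1
        return self._rng.random()


gen = PeriodicallyReseededRandom(interval=3)
values = [gen.random() for _ in range(10)]
print(f"reseeded {gen.reseed_count} times over {len(values)} draws")
```

Reseeding every few draws costs only a handful of bytes of entropy each time, so the trade-off is mainly between how stale the generator state may get and how often the entropy source is consulted.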
radek
post Mon 31st October 2011, 5:26pm
Post #50


Über Member
*****

Group: Regulars
Posts: 699
Joined: Sat 28th Nov 2009, 10:40pm
Member No.: 15,651




QUOTE(radek @ Sun 30th October 2011, 10:15pm) *

QUOTE(Silver seren @ Sun 30th October 2011, 10:05pm) *

QUOTE(radek @ Mon 31st October 2011, 2:44am) *

QUOTE(Silver seren @ Sun 30th October 2011, 9:32pm) *

I felt like trying this too with 100 articles. The Not Sure's are ones that are in between a stub and not being a stub (around 2-3 paragraphs long). I was rather surprised at how many of the biographies were not contemporary people, but actually historic figures and such.

Stub - 29

Not stub - 46

Disambiguation - 8

List - 6

Not sure - 10

Good article - 1



You're right, take it as stubs/not stubs+not sure. Same thing.
Ignoring disambigs, lists, and your "not sure" we get 29/47=61.7% which is again close to 60% stubs


Um...I think you have it backwards. It would be 40% stubs and 60% not stubs.


You're right, but take it as stubs/not stubs + not sure. Same thing.


Actually you're right. In fact I woke up this morning thinking "Silver seren is right", strangely enough, which is a weird first thought to have upon waking up. Anyway, the difference is that you're taking stubs out of ALL articles while I'm talking about stubs out of non-disambig, non-list articles. So really the % of Wikipedia articles that are stubs should be something like 60%*85% (with the other 15% being disambigs and lists), so 51%.
radek
post Mon 31st October 2011, 7:54pm
Post #51


Über Member
*****

Group: Regulars
Posts: 699
Joined: Sat 28th Nov 2009, 10:40pm
Member No.: 15,651




QUOTE


Yeah, ok, I just said that to see if my off-Wiki activities were being stalked, which they apparently are. Some of those pictures are pretty neat-o.
Abd
post Mon 31st October 2011, 8:46pm
Post #52


Postmaster
*******

Group: Regulars
Posts: 1,915
Joined: Tue 18th Nov 2008, 10:52pm
From: Northampton, MA, USA
Member No.: 9,019




QUOTE(radek @ Mon 31st October 2011, 2:54pm) *
QUOTE
Yeah, ok, I just said that to see if my off-Wiki activities were being stalked, which they apparently are. Some of those pictures are pretty neat-o.
Yeah, the response. Bitchipedia = Wikipedia Review. Apparently they allow that kind of thing on Wikipedia. If someone wrote that on Wikiversity, it would be considered uncivil. It's importing an external dispute. It's one of the few things we warn or even block non-vandals/non-spammers for.

Will anyone on Wikipedia even notice?

Actually, now I notice that Russavia is blocked, and he's using his ability to edit his talk page to troll for conflict on-wiki. And he's been joined by Arsenikk. Let me put it this way: he's almost demanding to be indeffed, talk page access removed, and Arsenikk is tossing gasoline on the smouldering embers. Just saying.

This post has been edited by Abd: Mon 31st October 2011, 8:55pm
EricBarbour
post Tue 1st November 2011, 2:27am
Post #53


blah
*********

Group: Regulars
Posts: 5,919
Joined: Mon 25th Feb 2008, 2:31am
Member No.: 5,066




Would anyone like to see my test results of Wikipedia article length?
Or is Russavia's idiocy more entertaining?
