| Kelly Martin |
Mon 31st October 2011, 1:44am
Post
#41
|
|
Bring back the guttersnipes! Group: Regulars Posts: 3,270 Joined: Sun 22nd Jun 2008, 4:41am From: EN61bw Member No.: 6,696 |
QUOTE: Note that the random number generator that the software uses is very poor, with high autocorrelation (and note that no article on Wikipedia tells you anything about the quality of RNGs).

Each page, when initially created, is assigned a random number between 0 and 1. The "random page" function works by generating another random number between 0 and 1 and finding the page whose assigned number is the smallest one greater than the number just generated. Because the numbers assigned to pages are themselves random rather than evenly spaced, some pages will be returned more often than others. This is in addition to the fairly poor PRNG that PHP uses. From a statistical sampling standpoint this bias can probably be ignored, because it's almost certainly uncorrelated with any metric of actual interest. The only exception is that I would expect slightly higher view and edit rates for pages with greater random exposure, on the grounds that they'll get slightly more views from people using "random page", but I suspect this effect is extremely small. |
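A minimal sketch of the lookup described above, written in Python rather than MediaWiki's PHP. The function names, the page count, and the wrap-around to the first page when the draw exceeds every key are assumptions made for illustration, not details from the thread. It shows that each page's chance of being picked equals the gap between its key and the previous key, so unevenly spaced keys give uneven exposure.

```python
import bisect
import random
from collections import Counter


def build_index(n_pages, rng):
    """Assign each page a random key in [0, 1) and return the keys sorted."""
    return sorted(rng.random() for _ in range(n_pages))


def random_page(keys, rng):
    """Return the index of the page with the smallest key >= a fresh draw.

    Wrapping to the first page when the draw exceeds every key is an
    assumption of this sketch.
    """
    draw = rng.random()
    i = bisect.bisect_left(keys, draw)
    return i % len(keys)


def selection_probabilities(keys):
    """Exact chance of each page: the gap between its key and the previous one."""
    probs = [keys[0] + (1.0 - keys[-1])]  # first page also catches the wrap-around
    probs += [b - a for a, b in zip(keys, keys[1:])]
    return probs


if __name__ == "__main__":
    rng = random.Random(2011)  # fixed seed so the run is repeatable
    keys = build_index(1000, rng)
    probs = selection_probabilities(keys)
    hits = Counter(random_page(keys, rng) for _ in range(200_000))
    print("uniform share per page:", 1 / len(keys))
    print("largest exact share:   ", max(probs))
    print("smallest exact share:  ", min(probs))
    print("most-hit page count:   ", hits.most_common(1)[0][1])
```

The spread between the largest and smallest share is the bias being discussed; whether it matters for a stub survey depends on whether key gaps correlate with article quality, which there is no obvious reason to expect.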
| radek |
Mon 31st October 2011, 2:00am
Post
#42
|
|
Über Member Group: Regulars Posts: 699 Joined: Sat 28th Nov 2009, 10:40pm Member No.: 15,651 |
QUOTE: Note that the random number generator that the software uses is very poor, with high autocorrelation (and note that no article on Wikipedia tells you anything about the quality of RNGs). A proper sample should therefore contain far more than 100 articles.

QUOTE (radek): I've been wondering about that! One thing I noticed while clicking through the thing is that if you get an article named "xxx blah blah blah blah ARCTIC blah blah blah xxx", then quite often the next article you get is named "blah xxx blah blah xxx ARCTIC blah xxx blah", which does suggest some weird autocorrelation in the algorithm. So it might very well be that if you click ten articles and only 4 of them show up as stubs, then the most likely value for the next 10 is 4 as well. Any idea how the Random Article feature actually works? Like what's the script or the random number generator underlying it? Having said that, my sample is pretty close to 1000 now and the 60% share of stubs is fairly robust, so I'm pretty sure it's representative.

QUOTE (Kelly Martin): Each page, when initially created, is assigned a random number between 0 and 1. The "random page" function works by generating another random number between 0 and 1 and finding the page whose assigned number is the smallest one greater than the number just generated. Because the numbers assigned to pages are themselves random rather than evenly spaced, some pages will be returned more often than others. This is in addition to the fairly poor PRNG that PHP uses. From a statistical sampling standpoint this bias can probably be ignored, because it's almost certainly uncorrelated with any metric of actual interest. The only exception is that I would expect slightly higher view and edit rates for pages with greater random exposure, on the grounds that they'll get slightly more views from people using "random page", but I suspect this effect is extremely small.

Yeah, this is why I invoked the Central Limit Theorem above - with a sufficiently large sample (which doesn't need to be that large, 30 or so) the underlying distribution of the randomization process shouldn't matter, uniform or not. But autocorrelation CAN. |
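One way to check the autocorrelation worry is to record each random article as 1 (stub) or 0 (not stub) in click order and estimate the lag-1 autocorrelation of that sequence. The sketch below is an illustration only: the labels are simulated independent draws, not real Wikipedia data, and the function names are made up for this example. For independent draws the estimate should usually fall within roughly 2/sqrt(n) of zero; the standard error of the stub share is only valid under that same independence assumption.

```python
import random


def lag1_autocorrelation(xs):
    """Sample lag-1 autocorrelation of a numeric sequence."""
    n = len(xs)
    mean = sum(xs) / n
    var = sum((x - mean) ** 2 for x in xs)
    if var == 0:
        raise ValueError("sequence is constant; autocorrelation is undefined")
    cov = sum((xs[i] - mean) * (xs[i + 1] - mean) for i in range(n - 1))
    return cov / var


def proportion_standard_error(p, n):
    """Standard error of a sample proportion, assuming independent draws."""
    return (p * (1 - p) / n) ** 0.5


if __name__ == "__main__":
    rng = random.Random(42)
    # Simulated, independent stub/not-stub labels; NOT real Wikipedia data.
    labels = [1 if rng.random() < 0.6 else 0 for _ in range(1000)]
    r1 = lag1_autocorrelation(labels)
    band = 2 / len(labels) ** 0.5
    print(f"lag-1 autocorrelation: {r1:+.3f} (rough independence band: +/-{band:.3f})")
    se = proportion_standard_error(0.6, 1000)
    print(f"standard error of a 60% share at n=1000: {se:.4f}")
```

To use it on a real survey, replace `labels` with the recorded sequence in the order the articles came up; a value well outside the band would support the claim that consecutive results are not independent.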
| Silver seren |
Mon 31st October 2011, 2:32am
Post
#43
|
|
Senior Member Group: Contributors Posts: 470 Joined: Thu 30th Dec 2010, 2:09pm Member No.: 36,940 |
I felt like trying this too with 100 articles. The "Not sure" ones fall in between a stub and not a stub (around 2-3 paragraphs long). I was rather surprised at how many of the biographies were not of contemporary people but of historic figures and such.

Stub - 29
Not stub - 46
Disambiguation - 8
List - 6
Not sure - 10
Good article - 1

Types:
Places - 20
Living things - 4
Television/Film/etc. - 1
Company - 4
Music - 9
Biography - 27
Dance - 1
Politics - 4
Transportation - 1
Sports - 2
Conflict - 1
Buildings/Monuments/etc. - 2
Space - 2
Education - 1
Television/Radio/etc. Station - 2
Books/Magazines/etc. - 2
Dates - 1
Laws - 1
Newspapers - 1
Board games - 1
Internet - 1
Ships - 1
Weather - 2
Genetics - 1

This post has been edited by Silver seren: Mon 31st October 2011, 2:33am |
| Kelly Martin |
Mon 31st October 2011, 2:37am
Post
#44
|
|
Bring back the guttersnipes! Group: Regulars Posts: 3,270 Joined: Sun 22nd Jun 2008, 4:41am From: EN61bw Member No.: 6,696 |
QUOTE (radek): Yeah, this is why I invoked the Central Limit Theorem above - with a sufficiently large sample (which doesn't need to be that large, 30 or so) the underlying distribution of the randomization process shouldn't matter, uniform or not. But autocorrelation CAN.

I would not expect there to be any observable autocorrelation in a sequence of invocations of Special:Random. Each request is going to be processed by a new instance of the PHP engine, with a new random seed initialized from system entropy (probably the system clock) at the start of the request. As far as I know, the random number generator is reseeded for each request rather than being maintained and reused. Also, because of the large number of webservers in use at the Wikimedia farm, any given request is unlikely to be serviced by the same webserver that serviced a previous request, and the random seed is definitely not stored in portable session data. There is no call to srand (or mt_srand) in MediaWiki's code, at least not in code used during routine operations (there are calls to srand in certain maintenance and test procedures, but those are not used in normal operations). |
| radek |
Mon 31st October 2011, 2:44am
Post
#45
|
|
Über Member Group: Regulars Posts: 699 Joined: Sat 28th Nov 2009, 10:40pm Member No.: 15,651 |
QUOTE (Silver seren): I felt like trying this too with 100 articles. The "Not sure" ones fall in between a stub and not a stub (around 2-3 paragraphs long). I was rather surprised at how many of the biographies were not of contemporary people but of historic figures and such. Stub - 29, Not stub - 46, Disambiguation - 8, List - 6, Not sure - 10, Good article - 1

Ignoring disambigs, lists, and your "not sure" we get 29/47=61.7%, which is again close to 60% stubs.

This post has been edited by radek: Mon 31st October 2011, 2:48am |
| Silver seren |
Mon 31st October 2011, 3:05am
Post
#46
|
|
Senior Member Group: Contributors Posts: 470 Joined: Thu 30th Dec 2010, 2:09pm Member No.: 36,940 |
QUOTE (Silver seren): I felt like trying this too with 100 articles. The "Not sure" ones fall in between a stub and not a stub (around 2-3 paragraphs long). I was rather surprised at how many of the biographies were not of contemporary people but of historic figures and such. Stub - 29, Not stub - 46, Disambiguation - 8, List - 6, Not sure - 10, Good article - 1

QUOTE (radek): Ignoring disambigs, lists, and your "not sure" we get 29/47=61.7%, which is again close to 60% stubs.

Um... I think you have it backwards. It would be 40% stubs and 60% not stubs. |
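The disagreement here comes down to which denominator is used. The sketch below recomputes the stub share from the counts posted in the thread under a few choices of what to exclude; the function name and the reading of the 29/47 figure as stubs divided by (not stub + good article) are assumptions for illustration, not something the posters confirmed.

```python
# Tallies from the 100-article sample posted in the thread.
counts = {
    "stub": 29,
    "not stub": 46,
    "disambiguation": 8,
    "list": 6,
    "not sure": 10,
    "good article": 1,
}


def stub_share(counts, excluded=()):
    """Fraction of stubs among all categories not listed in `excluded`."""
    total = sum(n for name, n in counts.items() if name not in excluded)
    return counts["stub"] / total


if __name__ == "__main__":
    no_lists = ("disambiguation", "list")
    print(f"of all 100 pages:               {stub_share(counts):.1%}")
    print(f"excluding disambigs and lists:  {stub_share(counts, no_lists):.1%}")
    print(f"also excluding 'not sure':      {stub_share(counts, no_lists + ('not sure',)):.1%}")
    # Assumed origin of 29/47: stubs divided by the non-stub count (46 + 1),
    # which is a stub-to-non-stub ratio rather than a share of articles.
    ratio = counts["stub"] / (counts["not stub"] + counts["good article"])
    print(f"stubs per non-stub (29/47):     {ratio:.1%}")
```

With disambiguation pages, lists, and "not sure" entries excluded, the share works out to 29/76, which is close to the 40% figure in the reply above.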
| radek |
Mon 31st October 2011, 3:15am
Post
#47
|
|
Über Member Group: Regulars Posts: 699 Joined: Sat 28th Nov 2009, 10:40pm Member No.: 15,651 |
QUOTE (Silver seren): I felt like trying this too with 100 articles. The "Not sure" ones fall in between a stub and not a stub (around 2-3 paragraphs long). I was rather surprised at how many of the biographies were not of contemporary people but of historic figures and such. Stub - 29, Not stub - 46, Disambiguation - 8, List - 6, Not sure - 10, Good article - 1

QUOTE (radek): Ignoring disambigs, lists, and your "not sure" we get 29/47=61.7%, which is again close to 60% stubs.

QUOTE (Silver seren): Um... I think you have it backwards. It would be 40% stubs and 60% not stubs.

You're right, but take it as stubs/not stubs + not sure. Same thing.

This post has been edited by radek: Mon 31st October 2011, 3:16am |
| Guido den Broeder |
Mon 31st October 2011, 11:41am
Post
#48
|
|
Senior Member Group: Regulars Posts: 425 Joined: Thu 19th Feb 2009, 7:31pm Member No.: 10,371 |
QUOTE (radek): Yeah, this is why I invoked the Central Limit Theorem above - with a sufficiently large sample (which doesn't need to be that large, 30 or so) the underlying distribution of the randomization process shouldn't matter, uniform or not. But autocorrelation CAN.

QUOTE (Kelly Martin): I would not expect there to be any observable autocorrelation in a sequence of invocations of Special:Random. Each request is going to be processed by a new instance of the PHP engine, with a new random seed initialized from system entropy (probably the system clock) at the start of the request. As far as I know, the random number generator is reseeded for each request rather than being maintained and reused. Also, because of the large number of webservers in use at the Wikimedia farm, any given request is unlikely to be serviced by the same webserver that serviced a previous request, and the random seed is definitely not stored in portable session data. There is no call to srand (or mt_srand) in MediaWiki's code, at least not in code used during routine operations (there are calls to srand in certain maintenance and test procedures, but those are not used in normal operations).

I didn't expect it either, but it is clearly observable. Could the cache interfere with the reseeding? |
| Kelly Martin |
Mon 31st October 2011, 1:06pm
Post
#49
|
|
Bring back the guttersnipes! Group: Regulars Posts: 3,270 Joined: Sun 22nd Jun 2008, 4:41am From: EN61bw Member No.: 6,696 |
QUOTE (Guido den Broeder): I didn't expect it either, but it is clearly observable. Could the cache interfere with the reseeding?

Special page results are uncacheable, so that shouldn't be happening. The only thing that makes sense is that the squids are using session affinity to send you back to the same webserver for each request, and the servers are preserving the random seed instead of reseeding. Wikimedia is running an accelerated version of PHP; it's possible that the engine doesn't reseed at the start of *every* script run, and is instead persisting the random seed between sessions. I looked around the web last night for some information on how the random seed is managed when PHP is embedded in a webserver, but couldn't find anything.

I still don't understand why they don't just periodically reseed from /dev/random. /dev/random is a fairly high quality entropy source, and on a system with a high level of network activity the rate at which entropy is created is fairly high, so there should be enough entropy available to reseed fairly often. |
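The periodic-reseeding idea can be sketched as follows. This is an illustration in Python, not MediaWiki or PHP code: the class name, the reseed interval, and the seed size are hypothetical, and `os.urandom` draws from the operating system's random source, which is not necessarily /dev/random itself.

```python
import os
import random


def fresh_generator(n_bytes=16):
    """Return a random.Random seeded from the operating system's entropy source."""
    seed = int.from_bytes(os.urandom(n_bytes), "big")
    return random.Random(seed)


class ReseedingGenerator:
    """Wrap random.Random and reseed from os.urandom every `interval` draws."""

    def __init__(self, interval=1000):
        self.interval = interval
        self._draws = 0
        self._rng = fresh_generator()

    def random(self):
        # Swap in a freshly seeded generator once the interval is used up.
        if self._draws >= self.interval:
            self._rng = fresh_generator()
            self._draws = 0
        self._draws += 1
        return self._rng.random()


if __name__ == "__main__":
    gen = ReseedingGenerator(interval=100)
    sample = [gen.random() for _ in range(5)]
    print(sample)
```

A long-lived server process that keeps its generator state between requests would go through something like `ReseedingGenerator`; a process that starts fresh for every request only needs the equivalent of `fresh_generator`.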
| radek |
Mon 31st October 2011, 5:26pm
Post
#50
|
|
Über Member Group: Regulars Posts: 699 Joined: Sat 28th Nov 2009, 10:40pm Member No.: 15,651 |
QUOTE (Silver seren): I felt like trying this too with 100 articles. The "Not sure" ones fall in between a stub and not a stub (around 2-3 paragraphs long). I was rather surprised at how many of the biographies were not of contemporary people but of historic figures and such. Stub - 29, Not stub - 46, Disambiguation - 8, List - 6, Not sure - 10, Good article - 1

QUOTE (radek): Ignoring disambigs, lists, and your "not sure" we get 29/47=61.7%, which is again close to 60% stubs.

QUOTE (Silver seren): Um... I think you have it backwards. It would be 40% stubs and 60% not stubs.

QUOTE (radek): You're right, but take it as stubs/not stubs + not sure. Same thing.

Actually you're right. In fact I woke up this morning thinking "Silver seren is right", strangely enough, which is a weird first thought to have upon waking up. Anyway, the difference is that you're taking stubs out of ALL articles while I'm talking about stubs out of non-disambig, non-list articles. So really the % of Wikipedia articles that are stubs should be something like 60%*85% (with the other 15% being disambigs and lists), so 51%. |
| radek |
Mon 31st October 2011, 7:54pm
Post
#51
|
|
Über Member Group: Regulars Posts: 699 Joined: Sat 28th Nov 2009, 10:40pm Member No.: 15,651 |
QUOTE Yeah, ok, I just said that to see if my off-Wiki activities were being stalked, which they apparently are. Some of those pictures are pretty neat-o. |
| Abd |
Mon 31st October 2011, 8:46pm
Post
#52
|
|
Postmaster Group: Regulars Posts: 1,915 Joined: Tue 18th Nov 2008, 10:52pm From: Northampton, MA, USA Member No.: 9,019 |
QUOTE (radek): Yeah, ok, I just said that to see if my off-Wiki activities were being stalked, which they apparently are. Some of those pictures are pretty neat-o.

Will anyone on Wikipedia even notice? Actually, now I notice that Russavia is blocked, he's using his ability to edit his talk page to troll for conflict on-wiki. And he's been joined by Arsenikk. Let me put it this way: he's almost demanding to be indeffed, talk page access removed, and Arsenikk is tossing gasoline on the smouldering embers. Just saying.

This post has been edited by Abd: Mon 31st October 2011, 8:55pm |
| EricBarbour |
Tue 1st November 2011, 2:27am
Post
#53
|
|
blah Group: Regulars Posts: 5,919 Joined: Mon 25th Feb 2008, 2:31am Member No.: 5,066 |
Would anyone like to see my test results of Wikipedia article length?
Or is Russavia's idiocy more entertaining? |