QUOTE(Somey @ Tue 8th March 2011, 10:38pm)
I know J2EE isn't bad, but they could do better than that, couldn't they? And isn't it the database (MySQL) that causes the real bottlenecks, as opposed to PHP, or is it the fact that they use both, or that they use an interpreted server-side language in the first place?
Given that MediaWiki doesn't have any real need for a fully ACID-compliant database, there's no good reason not to use something like HBase (which is what Facebook uses) for the database.

The other big performance win would be to rewrite the parser. Right now it's a crazy mess of regular-expression abuse bolted onto an XML parser, and it ends up being expensive in both time and space. Recoding it using either traditional or more modern parsing techniques would likely be a big win on multiple fronts; however, doing so would likely require making some small changes to the markup language. MediaWiki markup is definitely not in LL(n) for any n, and I think it's not in LR(n) for any n either; on top of that, the parser currently requires database access, because the correct parsing of some constructs depends on database content. The two sketches below show both problems.

Making a few minor changes to the language "specification" (there really isn't one, just a reference implementation) would avoid both of these problems and make writing a proper parser much easier (that is, possible), but there is considerable reluctance to make any change that would "break" Wikipedia.
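To make the lookahead point concrete, here's a toy sketch in Python; this is my own simplification, not MediaWiki's actual apostrophe handling (which is a line-wide balancing pass), and it captures only one corner of the problem. A run of five apostrophes opens bold and italic at once, but whether it has to be emitted as <b><i> or <i><b> depends on which formatting closes first, and that closing run can sit arbitrarily far to the right; no parser with a fixed amount of lookahead can decide that at the opening run.

CODE
import re

def render_five_run(line: str) -> str:
    """Toy renderer for lines shaped like: '''''A(''|''')B('''|'')C.

    The opening ''''' means "bold + italic", but emitting well-nested
    HTML requires knowing which formatting CLOSES first, and that is
    decided by text arbitrarily far to the right of the opening run.
    """
    m = re.match(r"'{5}(.*?)('{2,3})(.*?)('{2,3})(.*)$", line)
    if not m:
        raise ValueError("not in the toy shape this sketch handles")
    a, close1, b, _close2, c = m.groups()
    if close1 == "''":                       # italic closes first
        return f"<b><i>{a}</i>{b}</b>{c}"
    else:                                    # bold closes first
        return f"<i><b>{a}</b>{b}</i>{c}"

print(render_five_run("'''''both'' just bold''' done"))
# <b><i>both</i> just bold</b> done
print(render_five_run("'''''both''' just italic'' done"))
# <i><b>both</b> just italic</i> done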
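And here's the database dependence, again as a hypothetical toy (a dict standing in for the template tables, one level of expansion, no parameters): {{name}} gets replaced by stored wikitext before the page is tokenized, so whether a given character sequence is even table markup depends on what the database holds at parse time.

CODE
import re

# Hypothetical stand-in for the template storage tables.
TEMPLATE_DB = {
    "table-start": "{|\n|",   # half a construct: opens a table
}

def expand(source: str, db: dict) -> str:
    """Replace each {{name}} with its stored body (one level, no args)."""
    return re.sub(r"\{\{(.*?)\}\}", lambda m: db.get(m.group(1), ""), source)

page = "{{table-start}} cell one\n|}"
print(expand(page, TEMPLATE_DB))
# {|
# | cell one
# |}

Swap a different body in under "table-start" and the trailing |} stops being table markup entirely; the grammar of the page is a function of database state, which is why the parser can't be a pure function of the source text.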