Take a look at the server admin log (personally I'm subscribed to the RSS feed).
QUOTE
18:00 brion: things seem at least semi-working.
1. everything hung
2. suda had some kind of kernel crash
3. after reboot, it was found to have a couple flaky disks
4. brion hacked up MW config files to skip the NFS logging
5. mark set up an alternate /home NFS server
QUOTE
15:00 mark: Site down completely. Post-mortem:
1. Rob is untangling power cables in rack B2, and both asw-b2-pmtpa and asw3-pmtpa (in B4) lose power
2. Two racks unreachable, PyBal sees too many hosts down and won't depool more
3. Rob brings power to asw-b2-pmtpa back up, but connectivity loss to B4 is not noticed
4. Mark investigates why LVS isn't working, adjusts PyBal parameters, until PyBal pools not a single server
5. Apaches are unhappy about completely missing ES clusters
6. Connectivity loss to B4 discovered, restored
7. Site back online
There is an oversighted edit, though, that read:
QUOTE
14:45 godwin/gardner: Prepare downtime donation message and take a hammer to a few hard drives.
Just kidding.