This troll's colocation box was physically moved to a new location today. In preparation for this, extensive backups were taken, but this troll did not find time to sort out niceties such as "backup MX". There is now a second instance, on a VPS, which will become usable as a backup, but more work is required.
The machine, "redoubt", is in Amsterdam, NL, where I used to live and where I still have friends; since I moved to the USA it has moved less often than I have, all the while providing continuity of service. It's housed with Coloclue, a vereniging (loosely translates to “cooperative”) for very tech-savvy folks. Today some volunteers moved the machines of members who couldn't get in to move the machines themselves; I owe cognac and whiskey. Bedankt, heren.
I shut things down over an hour beforehand, to give the components time to cool a little before being subjected to the shock of moving. I then went to bed and slept, a lot, in preparation for staying up tonight to catch an early morning flight. “If it doesn't come back, a few more hours of outage won't make much difference.” … it's nice to be able to take that attitude and treat it as a non-essential system. Even if it does get the bulk of my mail. I'll find out which mailing-list providers can't handle a 15-hour downtime without deciding to unsubscribe me. Critical services, such as this blog, are hosted on resilient systems built from redundant components, with dedicated teams of experts available around the clock to jump on any problem. Thanks, Google Blogger.
Perhaps because there are two disks in a mirror configuration, newer than the rest of the system, perhaps because backups were taken and verified, perhaps because of the cooling, perhaps because of the phase of the moon, whatever, the system came back with all disks and data intact.
Alas, a problem I had observed before struck again: the battery backing up the system clock appears to be dead. The time in "redoubt" reset to 2003.
I would expect a number of issues to arise from this. Kerberos failing? Sure, I can see that. I expect it. OpenNTPD failing to set the clock? Regrettable, but I can see it. Ideally, time-keeping systems would periodically write to a journal, "it is now November 2011", "it is now December 2011", "it is now January 2012", so that if on start-up the system time is earlier than the last journal entry, something must be very wrong and it is inherently safe to step the clock forward. This didn't happen. It was easy enough to fix (stop ntpd, run ntpdate(1), restart ntpd).
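The journal idea can be sketched roughly as follows, in Python. This is only an illustration under assumptions: the function names and the one-timestamp-per-file format are made up here, and nothing like this is claimed to exist in OpenNTPD or any other time daemon.

```python
import os
import tempfile
import time


def record_time(journal_path, now=None):
    """Atomically write the current Unix time to the (hypothetical) journal file."""
    now = time.time() if now is None else now
    directory = os.path.dirname(os.path.abspath(journal_path))
    # Write to a temporary file first, then rename, so a crash mid-write
    # never leaves a half-written journal behind.
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as tmp:
        tmp.write(f"{int(now)}\n")
    os.replace(tmp_path, journal_path)


def last_recorded_time(journal_path):
    """Return the journalled Unix time, or None if the file is absent or unreadable."""
    try:
        with open(journal_path) as journal:
            return int(journal.read().strip())
    except (OSError, ValueError):
        return None


def clock_went_backwards(journal_path, now=None):
    """True if the system clock reads earlier than the last journal entry."""
    now = time.time() if now is None else now
    last = last_recorded_time(journal_path)
    return last is not None and now < last
```

A daemon would call `record_time` from a periodic job and check `clock_went_backwards` once at start-up; a true result is the signal that stepping the clock forward is safe, since time cannot really have run in reverse.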
What I did not expect to fail was DNS, and every component based upon DNS. More fool me.
Of course, unbound is a validating DNS resolver, which verifies DNSSEC signatures. Each signature carries a validity period, and with a date so far in the past, the signatures upon the root zone looked as though they were not yet valid and so failed to verify. Thus no root zone, thus no DNS. The house of cards fell down.
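The time dependence comes from the validity window that every DNSSEC signature carries: an inception time and an expiration time. A simplified sketch of that check, in Python, is below. It is not unbound's code; the function name is invented, the dates in the usage are made-up examples, and real validators compare these fields with 32-bit serial-number arithmetic, which this ignores.

```python
from datetime import datetime, timezone


def signature_time_status(inception, expiration, now):
    """Classify a signature's validity window against the local clock.

    All three arguments are timezone-aware datetimes.
    """
    if now < inception:
        # The case hit by a clock that has reset to the past.
        return "not-yet-valid"
    if now > expiration:
        return "expired"
    return "valid"


if __name__ == "__main__":
    # Illustrative values only, not taken from any real zone.
    inception = datetime(2012, 1, 1, tzinfo=timezone.utc)
    expiration = datetime(2012, 1, 15, tzinfo=timezone.utc)
    stale_clock = datetime(2003, 1, 1, tzinfo=timezone.utc)
    print(signature_time_status(inception, expiration, stale_clock))
```

A clock stuck in 2003 lands before the inception time of any current signature, so every signature is rejected regardless of whether the cryptography checks out.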
Things came back as soon as time was restored to something which would agree with the consensus view of time shared by the rest of the synchronised world.
This troll ponders whether any server-class production-ready software which is going to be critically dependent upon time should be aware of a "minimum sane time" file and degrade behaviour in documented ways if system time fails to concur. For instance, lose the validation of DNS, but complain loudly to log-files. Any software which depends upon validated DNS will have checked the AD bit in the response and seen a lack of validation. Or does this just move failure modes around, spreading them out into more systems and creating more problems? Should that happen anyway, and have problems be fixed? Or does that move the domain of possible problems beyond the comprehension of mere mortal system administrators, so that it's better to just let things fail hard? There are no right answers here.
Beyond “get around to buying a replacement system for one which was not cutting edge in 2006, and get a working battery-backed clock into the bargain”.
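For the curious, the "minimum sane time" pondering might look something like this in Python. Everything here is hypothetical: the file format (a single Unix timestamp), the function names and the logger name are assumptions for illustration, and no resolver is claimed to behave this way.

```python
import logging
import time

log = logging.getLogger("min-sane-time")


def read_minimum_sane_time(path):
    """Return the Unix timestamp stored in the file, or None if unavailable."""
    try:
        with open(path) as handle:
            return int(handle.read().strip())
    except (OSError, ValueError):
        return None


def should_validate(path, now=None):
    """Decide whether time-dependent validation can be trusted.

    Returns False, and complains loudly, when the system clock reads
    earlier than the recorded minimum sane time.
    """
    now = time.time() if now is None else now
    minimum = read_minimum_sane_time(path)
    if minimum is not None and now < minimum:
        log.error(
            "system time %d is before minimum sane time %d; "
            "validation disabled until the clock is fixed",
            int(now),
            minimum,
        )
        return False
    return True
```

A missing or unreadable file leaves validation switched on, so the degraded mode is only ever entered on positive evidence that the clock is wrong. Whether that is the right default is exactly the open question.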
-The Grumpy Troll