
For those who were not aware, MontrealTechWatch was down from early yesterday morning until 8:00 pm today, the 17th of July.
It might seem normal, and in the order of things, that the server comes back; for most sysadmins, it’s just a matter of opening a ticket and having tech support restart the whole thing somehow. But this time, it was radically different. Just a few hours ago, the server was considered unrecoverable *sweats*, and with it the databases **shivers** plus all the files generated over the past 2 years ***faints***. We tried one last hack, which miraculously worked.
For those curious about the technical details, this server hosts many websites and services. For instance, it hosts a RoR site, free of charge since it’s a friend’s, plus another experiment, using Phusion Passenger. I discovered that mod_rails has a big memory problem and leaves dead processes lying around, which I intended to solve by writing a god-like Ruby script that would kill and clean up processes, even if the parent process was defunct and couldn’t be killed. Fast-forward to yesterday morning: this script launched the system command kill -9 1 … with the script owned by the root user … which is the equivalent of shooting yourself in the head … while jumping from a plane at 30,000 feet. XenServer couldn’t even restart, reinstall snapshot backups, relaunch, or be set up again, and all files & databases were deemed lost and inaccessible.
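For the curious, here is a minimal sketch of the kind of guard such a cleanup script could have used. It is not the original script: the names safe_to_kill? and guarded_kill, and the dry-run default, are hypothetical and only illustrate refusing to signal PID 1, non-positive PIDs, or the script's own process.

```ruby
# Hypothetical guard for a process-cleanup script. Every name here is
# illustrative and not taken from the original script.

# Returns true only for a plain Integer PID greater than 1 that is not
# this script's own process.
def safe_to_kill?(pid)
  return false unless pid.is_a?(Integer)
  return false if pid <= 1            # 1 is init; 0 and negatives address groups of processes
  return false if pid == Process.pid  # never signal ourselves
  true
end

# Sends the signal only when the guard passes. With dry_run (the default)
# it only reports what it would have done.
def guarded_kill(pid, signal: "KILL", dry_run: true)
  unless safe_to_kill?(pid)
    warn "refusing to signal pid #{pid.inspect}"
    return false
  end

  if dry_run
    puts "would send SIG#{signal} to #{pid}"
    return true
  end

  Process.kill(signal, pid)
  true
rescue Errno::ESRCH, Errno::EPERM => e
  # The process is already gone, or we are not allowed to signal it.
  warn "could not signal #{pid}: #{e.class}"
  false
end
```

Calling guarded_kill(1) returns false without sending anything, and nothing is actually killed until dry_run: false is passed explicitly. That default is a design choice for a sandbox-first workflow, not a substitute for testing away from production.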
MTW is taken very seriously and I know how important it is; this should never happen again. There’s one thing to blame here: trying out experimental scripts on a production server. If this were a company, I would have fired the Linux idiot who wrote the script. Oh wait… Anyway, thanks to everyone who was there, it’s much appreciated. I’ll look into getting an additional resource as a sandbox and setting up a bulletproof environment for MTW.




Comments
Montreal Tech Watch July 18, 2008
Back! http://tinyurl.com/5ubqaw
Denis Canuel July 18, 2008
Glad you’re back! That was a close one. I’ll take some time myself to make sure that I have an offline backup as well…
Mark MacLeod July 18, 2008
Heri July 18, 2008
Hopefully will catch up and publish a couple more articles
Ed July 18, 2008
Glad to hear things are back!
Smiling Triton July 18, 2008
SmTt
Mehdi Akiki July 20, 2008
Fred Brunel July 22, 2008