Author Topic: Regarding the down time...  (Read 10115 times)

Offline Happy Cat

Regarding the down time...
« on: April 05, 2013, 10:52:48 pm »
Just copying and pasting this from seanieb over at Sonic Retro for those who are curious.

Quote from: seanieb
Hey there, everyone --

I'm just gonna write up a quick thing about today's chain of events.

For a few days now we've been tracking a dying pair of the original Samsung 500GB SATA drives inside Retro's web server. Around 1:30-2:00 AM Pacific this morning, the server dropped off the face of the earth. I wasn't notified because my Nagios was set to alert me about Retro's status, by e-mail and pager text message, only during work hours.
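
As a rough illustration of the notification gap described above, here is a minimal Python sketch of a time-window check. It is not Nagios configuration and not the actual setup: the 9:00-17:00 weekday window, the function name would_notify, and the sample date are all assumptions made up for the example, since the post does not give the real hours.

```python
from datetime import datetime, time

# Hypothetical work-hours window; the post does not state the real hours.
WORK_HOURS = (time(9, 0), time(17, 0))

def would_notify(ts, window=WORK_HOURS, weekdays_only=True):
    """Return True if an alert raised at `ts` falls inside the window.

    Pass window=None to model 24/7 notification.
    """
    if window is None:
        return True
    if weekdays_only and ts.weekday() >= 5:  # 5 = Saturday, 6 = Sunday
        return False
    start, end = window
    return start <= ts.time() < end

# An illustrative weekday outage at 1:45 AM (the date is made up).
outage = datetime(2013, 4, 1, 1, 45)
print(would_notify(outage))               # False: outside work hours
print(would_notify(outage, window=None))  # True: 24/7 reporting
```

With a work-hours-only window, a failure in the early morning produces no alert at all; switching to 24/7 reporting closes that gap.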

I'm not entirely sure whether it was a coincidence, further disk failure, a software failure, or maybe a hack gone wrong, but nothing about the problem was logged, and the OS on the machine was definitely damaged in some way.

For those familiar with Linux: conventional System V style init uses a series of init scripts to run the boot sequence and to handle run level requests. For a still undiscovered reason, not all of the daemons start on the initial run level. However, if SSH is adjusted to start immediately after networking, manually switching to one of the other, otherwise unused multiuser run levels on Debian starts the remaining daemons and lets logins work as normal from the console.
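
To make the start ordering a little more concrete, here is a small Python sketch of how a sequential System V style rc script picks the order in which to start daemons for a run level: the S-prefixed links in the run level's directory are run in order of their two-digit sequence number. The link names and numbers below are hypothetical, not taken from Retro's server, and start_order is just an illustrative helper. Debian installs that use dependency-based boot ordering may behave differently.

```python
import re

# Links in an /etc/rcN.d directory are named S<NN><name> (start)
# or K<NN><name> (kill), where NN is a two-digit sequence number.
LINK_RE = re.compile(r"^([SK])(\d{2})(.+)$")

def start_order(links):
    """Return service names in the order a sequential rc would start them."""
    starts = []
    for link in links:
        m = LINK_RE.match(link)
        if m and m.group(1) == "S":
            starts.append((int(m.group(2)), m.group(3)))
    # Sort by sequence number, then by name, matching lexical filename order.
    return [name for _, name in sorted(starts)]

# Hypothetical run level directory contents.
links = ["S20apache2", "S16ssh", "S15networking", "K01exim4"]
print(start_order(links))  # ['networking', 'ssh', 'apache2']
```

Giving SSH a sequence number just above networking, as in this made-up listing, is one way to picture "SSH starts immediately after networking".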

I checked the drives with my personal copy of SpinRite, and it did turn up a couple more errors, but no outright failure. RAM tests also passed.

Some of Retro's ancillary public and private services have been disabled until the restore to the new drives and OS is complete. Additional backup plans are being carried out to make sure we're prepared for catastrophe. Expect a lot of slowdown in the next 24 hours. I will be setting my Nagios to report problems 24/7 until Retro goes back to normal.

New drives have been ordered, and tentatively we hope this process can get started by Wednesday night, but that's an estimate. Expect another full day's downtime at that time, but I will give full warning. I am going to fully test the new drives before putting them into service; I think it is prudent to spend some time building confidence in their stability.

As far as I know at this time, no information was lost in this incident. A backup finished very close to the time of failure, and that backup run may itself have been a root cause.

Repo, FTP, the Scans service, and a couple of underutilized user accounts have been disabled for security, and will reappear later this evening or sometime tomorrow.

I'm willing to accept questions in this thread.

Thanks for your understanding.