Friday, December 31, 2010

Multiple Points of Failure

I wrote in an earlier post about how it's sometimes best to rely on well-known technologies rather than on carefully tailored solutions. Last week we got a perfect example of this, and this time from the hardware domain.

So what happened?
One of our systems crashed and wouldn't boot, probably because of a major OS failure, since the mirrored disk seemed to be corrupted as well. Our only guess at the underlying cause is a lack of space on the C: drive, but we have no way of knowing for sure post-mortem.
The whole system lived on this single disk, so we had to fall back on the backup. Bad news: the last backup turned out to be about a month old. By recovering data from the corrupted disks we managed to narrow the damage to two weeks of lost data, but that's not the point. Losing data is something that should never happen, and we try really hard to prevent that kind of thing.

So where did we go wrong? Well, most of our systems use EMC for storage, but this one didn't. Why?
Usually when we configure a new system we ask ourselves two main questions:
1. Is the system gonna need an SRDF/CE configuration?
2. Does the system require more storage than the target server's own disk drives provide?
If the answer to either question is YES, we use EMC; otherwise we usually use it only as a location for storing the system's backup. In our case it was the 'otherwise' option.
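The two questions above boil down to a simple rule, which can be sketched in Python. This is only an illustration: the class, field, and function names are made up for the example and are not part of any EMC tooling.

```python
from dataclasses import dataclass


@dataclass
class SystemRequirements:
    """Hypothetical description of a new system (names are assumptions)."""
    name: str
    needs_srdf_ce: bool          # question 1: is SRDF/CE required?
    required_storage_gb: float   # question 2: storage the system needs
    local_disk_gb: float         # question 2: storage on the server's own disks


def storage_target(req: SystemRequirements) -> str:
    """Return "emc" if either question is answered YES, otherwise "local"."""
    if req.needs_srdf_ce or req.required_storage_gb > req.local_disk_gb:
        return "emc"
    # The 'otherwise' option: data stays on local disks and
    # EMC serves only as the location for the backup.
    return "local"
```

A system that needs no SRDF/CE and fits on its server's disks ends up as "local", which is exactly the branch the crashed system fell into.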
The problem is that when we use EMC for storage, we also use EMC's RM to take the backups, and we already have a system of emails in place that lets us know whether each backup succeeded.
In our case, not only was the data on a local disk rather than on the EMC, but there was also no monitoring of the backups. Of course that's our fault, but it's exactly what I meant when I said it's better to rely on something you're well familiar with.
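For systems that fall outside the EMC setup, even a very small freshness check would have flagged a month-old backup. Here is a minimal sketch of the idea; it is not how EMC's RM or our email notifications work, and the one-day threshold, the function name, and the way the last successful backup time is obtained are all assumptions made for the example.

```python
from datetime import datetime, timedelta
from typing import Optional

# Assumed threshold: complain if the newest good backup is older than this.
MAX_BACKUP_AGE = timedelta(days=1)


def backup_alert(system: str,
                 last_success: Optional[datetime],
                 now: datetime,
                 max_age: timedelta = MAX_BACKUP_AGE) -> Optional[str]:
    """Return an alert message if the backup is missing or stale, else None."""
    if last_success is None:
        return f"{system}: no successful backup on record"
    age = now - last_success
    if age > max_age:
        return f"{system}: last successful backup is {age.days} days old"
    return None  # backup is recent enough, nothing to report
```

Whatever message comes back could then be sent through the same kind of email notification we already rely on for the EMC-backed systems.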

I guess one of the conclusions is to try to use EMC for every system, no matter how low its high-availability requirements are or how little disk space it needs.