Wednesday, February 22, 2012

Report on Service Issues 2.15.2012


This report is intentionally written in layman's terms; it is, after all, being written by a layman with respect to these more complex networking issues. Our technical support team is available to discuss the occurrence by phone with anyone who would like a better understanding.

Here goes.

On Tuesday at approximately 11:30 PM, our servers were performing a re-indexing that helps maintain operating efficiency.

During the roughly two-hour window that this process takes, a NAS (Network Attached Storage) device failed. This device controls multiple network storage drives. It is classified as "enterprise level," a designation reserved for equipment that is as close to bulletproof as possible.

The device failed. Not only did it physically fail, unlikely as that is, it did so at the very moment our servers were re-indexing.

The device had safeguards and a failover that came online. However, the resulting impact on our database server was a corrupted index. The index is, in effect, the Dewey decimal filing system that helps the hard drives find and access data quickly.
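
For the more technically inclined, the re-indexing mentioned at the start of this report rebuilds exactly this filing system. On a Microsoft SQL Server database (an assumption on my part, based on the "emergency repair" terminology described below), a full rebuild boils down to a maintenance command roughly like the sketch that follows; the database name is a placeholder, not our actual system:

    USE OurDb;  -- placeholder name
    -- Rebuild every index on every table. sp_MSforeachtable is an undocumented
    -- but widely used SQL Server convenience procedure.
    EXEC sp_MSforeachtable 'ALTER INDEX ALL ON ? REBUILD';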

This index corruption started an unseen internal loop: the server kept trying to repair itself, dumping huge log files until the drive filled up entirely. That created a second issue, and at around 4 AM on what was now Wednesday, the server shut down.
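
For readers curious what "the drive filled up" looks like from the database side, these are the kinds of standard checks (again assuming SQL Server) that reveal runaway log growth and shrinking disk space. This is purely illustrative, not a record of what our technicians ran:

    DBCC SQLPERF(LOGSPACE);      -- how full each database's transaction log is
    EXEC master..xp_fixeddrives; -- megabytes of free space remaining on each drive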

An emergency repair (an actual term for an internal SQL process) was started. At the same time, our most recent backup, the database as it stood just before the incident, was being restored to a second server. That restore completed first, but we waited for the emergency repair sequence to finish. The primary server was still struggling; it eventually completed the emergency repair, but its index had not been rebuilt. Because the index was completely lost, re-indexing was now going to take more than the typical two hours.
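
For the technically inclined, the two parallel recovery paths described above might look roughly like the sketch below, again assuming SQL Server; the database name, logical file names, and backup path are placeholders rather than our actual configuration:

    -- Plan A: put the damaged primary database into emergency mode and let
    -- SQL Server attempt an in-place repair.
    ALTER DATABASE OurDb SET EMERGENCY;
    ALTER DATABASE OurDb SET SINGLE_USER WITH ROLLBACK IMMEDIATE;
    DBCC CHECKDB ('OurDb', REPAIR_ALLOW_DATA_LOSS);
    ALTER DATABASE OurDb SET MULTI_USER;

    -- Plan B, run in parallel: restore the most recent full backup onto a
    -- second server so we can cut over if plan A drags on.
    RESTORE DATABASE OurDb
      FROM DISK = N'D:\Backups\OurDb_full.bak'
      WITH MOVE 'OurDb_Data' TO N'D:\Data\OurDb.mdf',
           MOVE 'OurDb_Log'  TO N'E:\Logs\OurDb.ldf',
           RECOVERY, REPLACE;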

After waiting perhaps a bit too long, we abandoned that approach and implemented plan B, the cutover to the restored backup. The system operated reasonably well on this backup server, in fact better than anticipated, which only added insult to the decision to wait for plan A to complete.

We decided to run on the reserve servers until the weekend. Operations were slow, but having already missed several hours of operation, we judged that access to slow data was better than no access at all. We limped into the weekend, when the impact of pulling everything completely offline would be far less inconvenient.

As this was unfolding, the hardware side of the issue was being addressed. A new NAS and upgraded NIC (Network Interface Controller) cards were ordered. Obtaining this hardware and the firmware upgrades needed to make it fully compatible with our existing system took about 48 hours, which factored into when we could complete the full hardware replacement, conduct a total re-index, and bring our primary server back online. That work was finished by late Sunday morning, with a couple of tweaks added late Monday night.

The system did not lose data. The backups restored, and the system remained operable to a degree despite the catastrophic hardware failure. The length of time it took to get back online, and the limited performance until the full swap-out on Sunday, were partly hardware related and partly a business decision, as mentioned.

The failed NAS has been replaced, new NIC cards have been installed, and the database has been fully re-indexed.

We realize the negative impact of the downtime. The failure of this type of NAS device is, I am told, very rare. The fact that it failed at such an inopportune moment is, well, just a fact. There is no real explanation, nor excuse. It failed. It has been replaced.

We took every step we could to protect the data, and we immediately took action to replace the failed hardware and complete the firmware upgrades. The decision to allow access to the system on Friday, rather than go fully offline to conduct the needed repairs, was completely ours.

To the best of our knowledge, everything has been done to prevent a recurrence.
We have already scheduled a few calls with some of our sites to discuss this occurrence in greater detail. If you would like to schedule such a call, please feel free to contact me directly.

Again, our apologies for the impact this had on your business operations.