This report is intentionally written in layman's terms. It is, after all, being written by a layman with respect to these more complex networking issues. Our technical support team is available to discuss the events with anyone who would like a call in order to gain a better understanding. Here goes.
On Tuesday at approximately 11:30 PM, our servers were completing a re-indexing run that helps maintain operating efficiency. During the roughly two-hour window this process takes, a NAS (Network Attached Storage) device failed. This device controls multiple network storage devices. It is rated "enterprise level," a designation reserved for devices that are as close to bulletproof as possible.
The device failed. Not only did it physically fail, as unlikely as that is, but it did so at the very moment our servers were re-indexing.
The device had safeguards and a failover that came online. However, the resulting impact on our db server was a corruption of the index. The index is like the Dewey Decimal filing system of a library: it helps the hard drives find and access data quickly.
This corruption to our index started an unseen internal loop in which the server kept trying to repair itself, dumping huge log files until the drive actually filled up. This created a second issue, and at around 4 AM on what was now Wednesday, the server shut down.
An emergency repair, an actual term for an internal SQL process, was started. At the same time, our most recent backup, the copy of the db taken just before the incident, was being restored to a second server. The restore of our backup to the second server completed first, but we waited for the emergency repair sequence to finish. The primary server was still struggling; it eventually completed the emergency repair but was still not re-indexed, and the re-index would now take far longer than the typical two hours because the index was completely lost.
After waiting perhaps a bit too long, that approach was abandoned and plan B, the cutover to the restored backup, was implemented. The system operated reasonably well on the backup server, actually better than anticipated, which only added insult to the decision to wait for plan A to complete.
A decision was made to run on the reserve servers until the weekend. Operations were slow, but having already missed several hours of operation, we decided that access to slow data was better than no access at all. We limped into the weekend, when we felt the impact of taking everything completely offline would be far less inconvenient.
As this was unfolding, the technical side of the issue was being addressed. A new NAS and upgraded NIC (Network Interface Controller) cards were ordered. Obtaining this hardware and the firmware upgrades needed to make it fully compatible with our existing system took about 48 hours. This was one factor in when we could complete the full hardware replacement, conduct a total re-index, and bring our primary server back online. The work was completed by late Sunday morning, with a couple of tweaks added late Monday night.
The system did not lose data. The backups did restore, and the system remained operable, to a degree, despite the catastrophic hardware failure. The time it took to get back online and the limited performance until the full swap-out on Sunday were partly hardware related and partly a business decision, as mentioned.
The failed NAS has been replaced, new NIC cards have been installed, and the db has been fully re-indexed.
We realize the negative impact of the downtime. The failure of this NAS device is, I am told, very rare. The fact that it failed at such an inopportune moment is, well, just a fact. There is no real explanation, nor excuse. It failed. It has been replaced.
We took every step we could to protect the data, and we immediately took action to replace the bad hardware and complete the firmware upgrades. The decision to allow access to the system on Friday, rather than going fully offline to conduct the needed repairs, was completely ours.
To the best of our knowledge, everything has been done to prevent a recurrence.
We have already scheduled a few calls with some of our sites to discuss this occurrence in greater detail. If you would like to schedule such a call, please feel free to contact me directly.
Again, our apologies for the impact this had on your business operations.