I am about to start the final step in the recovery process: combining the mail that was restored from tape with the mail in your account. The mail server will be offline from about 2AM to 6AM while this process is completed.
UPDATE: All of the mail has been recovered and the mail server is back online.
What else can I say, besides that I am sorry and that I understand your frustration? I believe one of the bloggers already said this, but we do what we can with the hand we’re dealt. There are other issues to consider, besides our intellectual capacity (or lack thereof, as some of you claim). There are budgetary issues. There are visibility issues. There’s always the “oh, this will never happen again” factor that seems to plague upper management at times.
Choices to be made
The current situation boils down to lack of funds. Some limited funds are available, but where do you apply them? I say limited because currently my budget is basically used up paying for yearly hardware and software contracts. When the end of FY 05/06 came along in June we scraped up what was left and purchased some much-needed machines to handle the huge amount of mail we receive, between 400 and 3,000 messages per minute. Keep in mind that these are ONLY the incoming messages, not the outgoing. Incoming messages are also scanned for viruses and spam. You can check out the stats here. Without those new machines we would probably be in a slightly modified version of where we are right now. The old boxes simply could not handle the load, and would crash.
I am sure someone is thinking “well, they should plan ahead and budget for end-of-life after 3 years.” You don’t understand: there is NO EOL budget, because there is no surplus available. All the “special projects” money we receive is one-time stuff. Is it right? No, it is not, but that’s the way it is and that’s what we have to work with.
There are also other places to spend the end of year money, plenty of places. UPS’s need to be replaced. Generators to handle extended power failures. Additional power to handle new machines. Air conditioning units need to be revamped. And on average, at the end of the year, after all the bills are paid, we’re left with about $5k to $10k.
Bad timing
The unfortunate thing was the timing. About 2-3 weeks ago we ordered a new storage solution that would replace the current one. I had to cancel some of the software licenses, because I did not have enough funds to renew them, and the difference freed up some funds for the purchase of a new server/backend storage solution. Not redundancy yet, mind you. That will have to wait until the end of the fiscal year. Since last year we have not entirely trusted the mail storage we’ve had (for obvious reasons), and we were all eager to replace it. Unfortunately, the new unit did not arrive in time.
Open thread
Ultimately, I am responsible for what is happening. The final decision on where to spend the money, what to do, and when to do it is mine. I will keep this thread open and will be checking on it through the next couple of days, answering any questions or comments on how this was handled. Also, feel free to email the Director of Academic Computing, Dr. Llewellyn, tony@usf.edu, with any issues as to the termination of my employment, but please know that both Chance and Eric ARE doing all they can and working very hard to get things back up as soon as possible. I have been working on campus for 12 years and I can honestly say there are few people I know who can match their expertise and dedication. Everything that can be done will be done.
And now, fire away.
Updates
- What does the storage look like? We should have more redundancy.
Well, here is a cut and paste of the description of the StorEdge 3500:
“Sun StorEdge 3500 RAID systems rank among the fastest and highest-reliability RAID systems in existence. Sun StorEdge 3500 are compact, rack-optimized RAID storage solutions with end-to-end 2Gbit Fibre Channel technology supporting both storage area network (SAN) and direct-attached storage (DAS) architectures. Features: single/dual RAID controllers; 12 disk drive bays; 5 or 12 disk drives; dual power supplies; 6 2Gbit host ports; 2U (3.5 inches high) enclosure. Sun StorEdge 3500 delivers up to 160,000 transactions per second from cache. It provides industry-leading 99.9998+ percent uptime and redundant, hot-swappable components including drives, RAID controllers, power supplies, fans event monitoring units, and battery-backed cache memory to prevent data loss.”
Sounds great on paper, doesn’t it? And, in all fairness, it did work fine for 2-3 years, until last winter.
- I don’t understand why all the user accounts go out at once, are they all stored on a single very large hard drive?
They look like one drive (one partition) to the two servers attached to it, but the StorEdge is a hardware RAID device. That much has worked flawlessly: drives on the array have died before and you guys never noticed. The array warned us, we called Sun and got the dead drive replaced without impact to the users, besides a slight performance degradation.
- And is this the same drive that died only 1 year ago over winter break?
The StorEdge has a few hard drives, redundant power supplies, and redundant controllers. What failed (both times) were the controllers, not the drives. When one of the controllers fails, the appliance is “supposed” to notify us and start operating from the other controller. Both back in December and now, controllers failed and we received no warning beyond the machine totally crashing on us.
The controllers “control” the data going into and out of the hard drives themselves. Not only did the controllers die, but a good chunk of the inbound and outbound data corrupted the info on the disks. As a result, when we brought the servers back up, all the data was poisoned and thus unusable. The details will come out in the investigation we will do ‘after’ we are done retrieving everything from tape.
- Also seems strange that the outages always occur on holidays, why is that?
A well-known principle called “Murphy’s Law.” I wish I knew why.
- Even if there is an outage, it should not take this long to restore backups, they must be using some primitive slow equipment. 40,000 accounts at 50 MB each should be about 2 TB of data. I ran a backup system before that could back up or restore that much data in 24 hrs but we were using external hard drives, not tapes. They really need to upgrade their equipment.
Correct. Think of it as listening to a cassette tape (I know, many of you may never have done that). We basically have to listen to every song, with no way to speed things up. We cancelled all backup jobs since Tuesday and are using 4 drives to read tapes.
First, the drive has to read what is known as the “Full Backup.” After that is done, it starts reading the “Incremental,” the files that have changed since the “Full” was run. All that takes time.
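For the curious, here is a rough sketch of why the order matters. It is not our backup software; the names (`restore`, the mailbox paths, the version labels) are made up for illustration, and it assumes incrementals are applied oldest first and only ever add or change files.

```python
from typing import Dict, List


def restore(full: Dict[str, str], incrementals: List[Dict[str, str]]) -> Dict[str, str]:
    """Rebuild the latest state from one full backup plus ordered incrementals.

    Hypothetical model: each backup maps a file path to its contents.
    """
    state = dict(full)        # start from a copy of the full backup
    for inc in incrementals:  # oldest incremental first
        state.update(inc)     # changed or new files replace older copies
    return state


# Made-up example data: one full backup followed by three incrementals.
full = {"alice/inbox": "v1", "bob/inbox": "v1"}
incrementals = [
    {"alice/inbox": "v2"},
    {"bob/inbox": "v2", "carol/inbox": "v1"},
    {"alice/inbox": "v3"},
]

restored = restore(full, incrementals)
print(restored)
```

Every backup in the chain has to be read before the final state is known, which is why the incrementals add to the restore time instead of shortening it. This sketch ignores deleted files, which a real backup system also has to track.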
Yes, in the past 5 years drives have become cheaper and they are definitely faster to use. A solution to replace our existing system has been proposed for the past, hmmm, 2 years, but funding is yet to be allocated. We are all painfully aware of the consequences of not being able to update our backup system.
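The reader's back-of-the-envelope numbers can be checked with a few lines. This is only the arithmetic from the comment (40,000 accounts, 50 MB each, 24 hours), using decimal units; the variable names are mine, and the result says nothing about what our tape drives actually deliver.

```python
# Figures taken from the reader's comment; decimal units (1 TB = 1,000,000 MB).
accounts = 40_000
mb_per_account = 50
restore_window_hours = 24

total_mb = accounts * mb_per_account
total_tb = total_mb / 1_000_000

# Sustained rate needed to move everything within the window.
seconds = restore_window_hours * 3600
required_mb_per_s = total_mb / seconds

print(f"Total data: {total_tb:.1f} TB")
print(f"Rate needed for a {restore_window_hours} h restore: {required_mb_per_s:.1f} MB/s")
```

Keep in mind this is a single streaming pass over the data. It does not account for reading the full backup and then the incrementals, or for changing tapes.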
We’ve recovered about 1/3 of the data from tape, so we expect to have everything back by Monday morning. The last backup took place on Tuesday morning, so messages received between about 4AM and 4:30PM on Tuesday (11/21) will not be recoverable. If you were expecting an important message during this period, you should contact the sender, if possible, to have them resend it. Again, we apologize for the inconvenience and we appreciate your patience.
- NOTE: The re-creation of email accounts has been postponed until 7:00PM today.
Due to hardware errors mail services have been disrupted since 4:30PM yesterday. The recovery plan includes the following:
- Re-Create all of the mail accounts (this should be completed by 7:00PM today and will allow us to receive new incoming mail)
- Restore previously existing mail from tape (this will take several days and has already begun)
- Integrate old email with the newly re-created email accounts (this should be completed by Thursday of next week)
We will post updates as they occur.
We are experiencing a problem with the mail server for mail.usf.edu accounts. Sun engineers are investigating the problem, but we do not have an estimated time for restoration of services yet. We will post more information as it becomes available.
-Eric
UPDATE: 11:30PM We’re waiting on a part from Sun, but we are trying to recover as much as possible in the meantime. It looks like we’re going to have to restore from backups, but we hope that at least some of the data that was on the drive is recoverable. We will continue to post updates here.