(Dec 11 Updates)
Well, we decided to move ahead of schedule and introduce the new server, since all the test data we had pointed to much better performance. The servers were switched at around midnight and the new server was brought into operation. Of course, Murphy being Murphy, the disk performance fell through the floor.
Right now, at 1:53pm, we have reached our plateau of about 2,500 concurrent users at any time. We have been raising that limit from our old 1,000 in order to reduce the number of dropped IMAP connections. This seems to be holding up OK. The difficulty is that, by increasing that value, something has to give. Users are now able to check their email without being dropped and having to click on links again, but at the expense of the speed at which emails are delivered to their mailboxes.
We have already contacted Communigate (our mail server developer) and are still working on the performance issue.
Meanwhile, we have a working solution to all the accounts with corrupted mailboxes and are working through those issues right now.
I will post follow-ups as I have them.
————————-
I’ll post a more detailed entry later, but here’s what’s going on, some of the errors folks are seeing, and what we’re doing about it:
New Storage on the new server
As many of you know, we received the new hardware on Wednesday afternoon. Yesterday during the day we installed the Operating System, configured the Email Server software and ran stress tests on the machine. Last night we started transferring accounts to the storage on the new system.
As the transfer process continues and more and more accounts are transferred we should see some relief in the load of the system as a whole and many of the problems should disappear.
(As of 9:30am, we have moved 5,300 accounts out of a total of 76,000 accounts housed on the server)
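To put those numbers in perspective, here is a quick back-of-the-envelope calculation using only the two figures quoted above. The variable names are just for illustration.

```python
# Figures quoted in the 9:30am update above.
moved = 5_300
total = 76_000

percent_done = moved / total * 100   # share of accounts already migrated
remaining = total - moved            # accounts still on the old storage

print(f"{percent_done:.1f}% moved, {remaining:,} accounts to go")
```

In other words, roughly 7% of the accounts had been moved at that point, so most of the relief was still to come.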
Monitoring errors
When you see an error pop up on your screen, an error message is logged to a file on the server. We are monitoring this file constantly, trying to pick up problems and resolve them before or as they are being reported. Spot checks of the error log are normal operating procedure for us when the server is healthy, but we have stepped up the monitoring during this period.
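For the curious, this kind of monitoring can be as simple as tallying how often each known error appears in the log. The sketch below is only an illustration, not our actual tooling: the log line layout and the `count_errors` helper are made up for the example, and the only real details are the two error phrases, which are quoted from this post.

```python
from collections import Counter

# Error phrases quoted in this post. The real log format is not shown here,
# so matching on plain substrings is an assumption.
KNOWN_ERRORS = (
    "failed to parse the mailbox",
    "Connection dropped by IMAP server",
)

def count_errors(lines):
    """Count how many log lines mention each known error phrase."""
    counts = Counter()
    for line in lines:
        lowered = line.lower()
        for phrase in KNOWN_ERRORS:
            if phrase.lower() in lowered:
                counts[phrase] += 1
    return counts

# Hypothetical sample lines, invented for illustration only.
sample = [
    "10:01 user1 ERROR: Could not complete request. Reason Given: failed to parse the mailbox",
    "10:02 user2 Connection dropped by IMAP server",
    "10:03 user3 Connection dropped by IMAP server",
    "10:04 user4 login ok",
]

for phrase, n in count_errors(sample).items():
    print(f"{n:3d}  {phrase}")
```

A tally like this makes it easy to see which error is dominating at any given moment.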
Load Issues
These problems seem to show up at odd times of the day. There are two factors contributing to the increase in load:
- Number of users checking their email
- Amount of messages being received
Because of the second factor, it is hard to predict when these problems will occur, but the situation should start easing up as we transfer more and more accounts to the storage on the new server.
Failed to Parse INBOX error
We’re trying to track this down, but so far we are unable to reproduce the error. It is also possible that it is load-related, but we’re still investigating. The exact error message is this:
ERROR: Could not complete request.
Query: SELECT “INBOX”
Reason Given: failed to parse the mailbox
Last night we went through every single reported complaint, logging in through WebMail, and could not reproduce the error. This error does show up in our error log files, but we’ve only seen it happen once this morning so far.
(Update, 10:15am): This error seems to be generic and to have multiple causes. We found one of the causes, affecting potentially 630 accounts. I say ‘potentially’ because mailboxes seem to have healed themselves. We are going through the list of accounts and fixing the ones we find with this problem.
Connection dropped by IMAP server
This is the most common load-related error we’ve seen so far. The IMAP server is what mail reader software (Outlook, WebMail, Eudora) uses to read messages from the mail server. There’s a limit of 1,000 concurrent connections on the server. During peak mail-reading hours we easily reach this limit.
Again, this is an error that should go away once accounts are migrated.
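To illustrate why a hard limit produces this error, here is a toy model of a connection cap. This is not how the mail server implements its limit (we have no view into that code); `ConnectionLimiter` and its methods are invented for the example, and only the 1,000 figure comes from this post.

```python
import threading

class ConnectionLimiter:
    """Toy model of a server that accepts at most `limit` connections at once."""

    def __init__(self, limit):
        self._slots = threading.BoundedSemaphore(limit)

    def try_connect(self):
        # Non-blocking: returns False right away when every slot is taken,
        # which is the moment a user would see a dropped connection.
        return self._slots.acquire(blocking=False)

    def disconnect(self):
        # Free a slot so another user can connect.
        self._slots.release()

limiter = ConnectionLimiter(limit=1000)

# Simulate 1,200 users trying to connect during a peak hour.
accepted = sum(limiter.try_connect() for _ in range(1200))
print(accepted)                 # 1000: the other 200 attempts are refused

# Once one user disconnects, the next attempt succeeds again.
limiter.disconnect()
print(limiter.try_connect())    # True
```

The point of the model is simply that once every slot is in use, each new attempt fails until someone else disconnects, which is why the error clusters around peak mail-reading hours.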
Crashes on certain accounts using Outlook
Last night we also went through and fixed several accounts that were corrupted in a way which would cause the user’s connection to crash if they were reading their messages from Outlook (Error number 0x800CCC0F in Outlook/Outlook Express). Users reading messages with WebMail would not crash, but they would see an additional blank “message” (Unknown Sender) that could not be deleted from their INBOX.
Other errors and reporting
We are also seeing people reporting errors that are not related to the current state of the server. Nevertheless, if you see an error you can either post to the blogs or send us an email at usg@lists.acomp.usf.edu. All reports are being investigated individually.