December 2006


The mail server has been behaving pretty well since last Wednesday. We are still working with the software developer to figure out the cause of the apparently random crashes happening here and there:

  • Friday, 12/15, 10:23 am
  • Friday, 12/15, 2:14 pm
  • Sunday, 12/17, 12:40 pm
  • Monday, 12/18, 12:55 am
  • Tuesday, 12/19, 12:49 am

We now have an automated script that will detect when the server application goes down and restart it.
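
For the curious, a watchdog of this kind can be sketched in a few lines. This is an illustration only, not the script running on our server: the host, the port check, and the restart command are all assumptions made for the example.

```python
import socket
import subprocess
import time

HOST = "127.0.0.1"   # assumption: the check runs on the mail server itself
PORT = 143           # standard IMAP port
RESTART_CMD = ["/etc/init.d/mailserver", "restart"]  # hypothetical command


def is_up(host, port, timeout=5.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def check_once(probe=is_up, restart=None):
    """Probe the server once and restart it if the probe fails.

    Returns True if a restart was triggered, False otherwise.
    """
    if restart is None:
        restart = lambda: subprocess.run(RESTART_CMD, check=False)
    if probe(HOST, PORT):
        return False
    restart()
    return True


def watch(interval=60):
    """Run check_once forever, sleeping interval seconds between checks."""
    while True:
        check_once()
        time.sleep(interval)
```

A plain TCP connect only tells you that something is listening on the port. A more careful check would also wait for the IMAP greeting before deciding the server is healthy.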

Some of the machines that handle the IMAP/POP caching have been acting up in the past few days. This did not affect most users (IMAP users) reading email, but it seems to have caused intermittent problems for folks using POP to check their messages, usually through clients such as Outlook. We are looking into fixes today.

Overnight we transferred the remaining local accounts back to a shared drive. This will allow us to switch servers temporarily and apply firmware updates to the new Dell server without interruption of service. You may see a blip here and there but we are not expecting any extensive downtime at this point.

I think that’s it for now. More updates as they are needed.

The system has been operating well since Wednesday. There are two known issues as of this morning.

We are trying to track down a problem with the help of the software vendor. Last Wednesday we had a crash of the software that lasted about 5 minutes, until we logged in and restarted the server application. We sent some diagnostics to the vendor back then. The same type of crash happened on Friday, Saturday, and overnight last night. We have tracked down the problem to, again, mailboxes corrupted when we were forced to use the shared storage during the first days of the disaster back during Thanksgiving. We are now working to find a way to pinpoint and eliminate the corruption of the mailboxes. Again, no messages are lost during this downtime, but users are not able to log in to check their messages until we restart the server.

The second minor issue that cropped up this weekend is the expiration of the security certificate. The certificate is issued for a year and it is a simple matter for us to renew it once it has expired. Your login is still being encrypted, but we must pay Verisign a bit more money for them to be able to vouch for us. That should happen some time today or tomorrow.

(Update, 11 am)

The server is holding without problems.

If you cannot see your folders, try to re-subscribe to them by using the Folders option on WebMail. If that does not solve your problem, post it here and we’ll look into it.

————————

The mail server held pretty steady throughout the day yesterday (Wednesday). We had about half an hour of downtime at night around 8pm, but performance during the day remained better than pre-Thanksgiving.

In the error logs we did not see any additional NULL-padded files, which for you guys usually show up as the ‘cannot parse mailbox’ error. We did find some corrupted indexes on individual mailboxes. Some are automatically fixed by the server; others have to be fixed by hand. Please note that in either situation no mail is being lost; the user just cannot read it.
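
For anyone wondering what checking for a NULL-padded file might look like, here is a rough sketch. It is not the tool we use. It assumes that "NULL-padded" means a long run of zero bytes inside the mailbox file, and the 512-byte threshold is an arbitrary number picked for illustration.

```python
import re

# One or more consecutive NUL (zero) bytes.
NUL_RUN = re.compile(rb"\x00+")


def longest_nul_run(data: bytes) -> int:
    """Return the length of the longest run of consecutive NUL bytes in data."""
    return max((len(m.group()) for m in NUL_RUN.finditer(data)), default=0)


def looks_null_padded(path, min_run=512):
    """Return True if the file holds a NUL run of at least min_run bytes.

    min_run is an illustrative threshold, not a value taken from the server.
    The whole file is read into memory, which is fine for a sketch but not
    for very large mailboxes.
    """
    with open(path, "rb") as fh:
        return longest_nul_run(fh.read()) >= min_run
```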

Again, we will continue monitoring the logs aggressively for any errors that may appear.

(Update, 3:30 pm)

Well, we faced the rush hour and the system survived. We will continue to monitor things aggressively over the next couple of days. There are no system-wide problems that we are aware of, just individual issues here and there, but nothing outside the day-to-day norm.

Please make sure to post if you are having issues, or contact the Help Desk at 974-1222. The HD has extended hours this week, I believe, until 2am.

(Update, 11am)

The system is holding as it should. No wait on delivery on WebMail. By the time I send myself a note, hit the send button, and get back to the INBOX listing, the message is already listed. By this time yesterday we were already bogged down on the queue and on IMAP connections.

(9am)

Last night starting at 8:30 pm we worked on the server yet one more time. We installed a new revision of the driver, created a separate partition for the incoming mail queue, and brought the server back up. We did not apply the firmware revision because those are usually dangerous and we did not want to risk losing the data. There was no noticeable improvement in the performance.

At that point we had 150,000 messages in the queue, waiting to be delivered. At 11pm we started copying some of the mailboxes to a shared drive we borrowed from Blackboard. We were not able to direct-attach the drives because of the way they were set up (attaching our server would crash Blackboard), so we are using the shared drive approach again.

Half of the accounts were copied over by 4am and we switched the mail server software back on. Of the 150,000, 80,000 were spam. The remaining messages were all delivered within 40 minutes!!!

Right now, 9 am, we’re at the same point we were yesterday. Performance is stunning, no messages in the queue, no errors in the log file. Messages are delivered in 3 seconds.

We are waiting for the 10 am mark to see if the server performance will start deteriorating. If it does, we may restart the machine again, since it seems to resolve the problem short-term. The reboot takes about 10 minutes.

(Update, 4 pm)

We decided to move the driver update to 8pm tonight. We will hold on the firmware update. We also borrowed some storage from Blackboard and will be trying to use that instead, just in case the driver update does not work. The storage will also be direct-attached to the mail server tonight. The direct attach will keep mailboxes from getting corrupted over the network.

(Update, 3:22 pm)

We will be installing a driver and a firmware update early tomorrow morning hoping that it will solve the performance issues we are seeing.

(Update, 1:45 pm, workaround)

We’re still pegged at 2,300+ concurrent connections, with the only problem being the copy to sent-folder issue after the email is sent. Here’s a suggestion that may also help us decrease the load on the system.

If you don’t use the sent-folder, ever, remove the copy to sent-folder feature, which is set up by default:

  1. Select Options on WebMail
  2. Select Folder Preferences
  3. Under Sent Folder, drop down the menu options and select [Do Not Use Sent]
  4. Save your changes

If you are like me you like to have a copy of your sent messages around for reference later. Two options:

  1. If you have an outside account such as Verizon or Brighthouse, CC yourself on the messages you send out using the outside account. You will receive a copy of your sent message on that account right away.
  2. If you don’t have an outside account, CC yourself on the email using your @mail.usf.edu account. You will not receive a copy right away, but you will as soon as the queue is run.

Removing the sent-mail folder from the equation will get you back from the Compose window to the INBOX in about 5 to 10 seconds. 

(Update, 1:00 pm)

The only error we’ve been seeing in the logs is the reset after sending a message, when the server is trying to copy the message to sent-mail. The actual message is going out for delivery within a few seconds, but the copy process is sitting and waiting until it times out.

(Update, 12:00)

Messages are being sent out, no problem, but the system is failing when it tries to get a copy of the message to the sent-mail folder. It is just the sheer number of connections it is having to handle. As a whole we’re still doing a lot better today than we were at the same time yesterday. I know that may be hard to believe.

(Update, 11:30 am)

Today has been a roller coaster so far, and now we’re on the downhill part. We hit 2,300 concurrent connections and are investigating why people are not able to send out messages through WebMail. This is traditionally the worst part of the day for the mail server.

One thing is for certain: the old, temporary setup we had before Sunday would not be able to handle this increase in load. We’d be seeing corrupted mailboxes popping up everywhere.

(Update, 11:15am)

  •   Average time to load INDEX page on WebMail: 30 seconds
  •   Concurrent connections (clicks at once): 1,600
  •   Queue size: 31,000 messages
  •   Queue wait (inbound, mail.usf.edu): 80 minutes

(Update, 10:45am)

As of right now, we see no problems with users being dropped, which is good news.

I confirmed with Chance that his script fixing the ‘IMAP parse’ error on corrupted mailboxes ran last night and cleared up about 460 accounts. We gathered those account names from the error logs on the server. It is possible that, if you haven’t logged in during the past couple of days, your account is still corrupted. We are still monitoring the error log for this error and any other that may show up.

We had an issue after the reboot this morning with services competing for the mail delivery on mail. That was resolved around 9:20am and mail started flowing in. Because of this problem, mail gathered on the receiving servers, waiting to be accepted by mail.usf.edu. Right now, the delivery queue to mail.usf.edu accounts is at a 60-minute wait, with about 20,000 messages in the queue. Mail sent from mail.usf.edu to another server is not affected. All messages to mail.usf.edu are being delivered, but the wait right now is 60 minutes. That time will decrease in the next couple of hours, once the mail server catches up with the backlog of messages.
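
To put those queue numbers in perspective, here is a back-of-the-envelope sketch. It assumes the queue is worked in order, so the quoted wait is roughly the time needed to deliver everything already in the queue. The arrival rate parameter is hypothetical, since we have not published a figure for it.

```python
def implied_rate(queue_size, wait_minutes):
    """Messages delivered per minute, assuming the newest queued message
    waits wait_minutes before it is delivered."""
    return queue_size / wait_minutes


def minutes_to_drain(queue_size, delivery_rate, arrival_rate=0.0):
    """Minutes until the backlog is gone.

    arrival_rate (new messages per minute) is a hypothetical input; if mail
    arrives at least as fast as it is delivered, the queue never drains.
    """
    net = delivery_rate - arrival_rate
    if net <= 0:
        return float("inf")
    return queue_size / net


# 20,000 messages with a 60-minute wait works out to about 333 per minute.
rate = implied_rate(20_000, 60)
print(round(rate))
```

Under these assumptions the backlog only shrinks while mail is delivered faster than it arrives, which is why the wait should come down once the incoming rush eases.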

Finally, we found a group of about 20 accounts, all starting with an ‘a,’ that were not transferred to the new server. These are being transferred right now.

(Initial posting)

We implemented two modifications to improve the disk response on the server. The slow response was causing connections to time out, thus all the “IMAP dropped connection” errors everyone was observing Monday morning. Another side effect was the delivery queue: as of 5pm yesterday there were 19,000 messages waiting to be delivered to the server.

Last night we performed some modifications on the ‘read’ side of the server, making checking email a bit faster. That resulted in only marginal improvements to the disk situation. This morning we restarted the server twice and did some modifications on the actual firmware of the new server.

As a result, the number of connections from people reading messages is back down to normal. In addition, the number of messages waiting to be delivered is down to zero; messages are being delivered as soon as they are received.

We will continue monitoring the server closely, especially to see how it reacts during peak usage hours at lunchtime.

(Dec 11 Updates)

Well, we had decided to move ahead of schedule and introduce the new server, since all test data we had pointed to a much better performance. The servers were switched at around midnight and the new server brought into operation. Of course, Murphy being Murphy, the disk performance fell through the floor.

Right now, 1:53pm, we have reached our plateau of about 2,500 concurrent users at any time. We have been upping that value from our old 1,000 in order to reduce the number of dropped IMAP connections. This seems to be holding up ok. The difficulty is that, by increasing that value, something has to give. So, users are able to check their email without being dropped and having to click on links again, at the expense of the speed at which emails are delivered to their mailboxes.

We have already contacted Communigate (our mail server developer) and are still working on the performance issue.

Meanwhile, we have a working solution to all the accounts with corrupted mailboxes and are working through those issues right now.

I will post follow ups as I have them.

————————-

I’ll post a more detailed entry later, but here’s what’s going on, some of the errors folks are seeing, and what we’re doing about it:

New Storage on the new server

As many of you know, we received the new hardware on Wednesday afternoon. Yesterday during the day we installed the Operating System, configured the Email Server software and ran stress tests on the machine. Last night we started transferring accounts to the storage on the new system.

As the transfer process continues and more and more accounts are transferred we should see some relief in the load of the system as a whole and many of the problems should disappear.

(As of 9:30am, we have moved 5,300 accounts out of a total of 76,000 accounts housed on the server)

Monitoring errors

When you see an error pop up on your screen, an error message is logged to a file on the server. We are monitoring this file constantly, trying to pick up problems and resolve them before or as they are being reported. Spot checks of the error log are normal operating procedure for us when the server is healthy, but we have stepped up the monitoring during this period of time.

Load Issues

These problems seem to show up at odd times of the day. There are two contributing factors to the increase in load:

  1. Number of users checking their email
  2. Number of messages being received

Because of (2) it is hard to predict when these problems will occur, but the situation should start easing up as we transfer more and more accounts to the storage on the new server.

Failed to Parse INBOX error

We’re trying to track this down, but so far we are unable to reproduce the error. It is also possible that it is load-related, but we’re still investigating. The exact error message is this:

ERROR: Could not complete request.
Query: SELECT "INBOX"
Reason Given: failed to parse the mailbox

Last night we went through every single reported complaint, logging in through WebMail, and could not repeat the error. This error does show up on our error log files, but we’ve only seen it happening once this morning so far.

(Update, 10:15am): This error seems to be generic and have multiple causes. We found one of the causes, affecting potentially 630 accounts. I say ‘potentially’ because mailboxes seem to have healed themselves. We are going through the list of accounts and fixing the ones we find with this problem.

Connection dropped by IMAP server

This is the most common load-related error we’ve seen so far. The IMAP server is what mail reader software (Outlook, WebMail, Eudora) uses to read messages from the mail server. There’s a limit of 1,000 concurrent connections on the server. During peak mail-reading hours we easily reach this limit.

Again, this is an error that should go away once accounts are migrated.

Crashes on certain accounts using Outlook

Last night we also went through and fixed several accounts that were corrupted in a way which would cause the user’s connection to crash if they were reading their messages from Outlook (Error number 0x800CCC0F in Outlook/Outlook Express). Users reading messages with WebMail would not crash, but they would see an additional blank “message” (Unknown Sender) that could not be deleted from their INBOX.

Other errors and reporting

We are also seeing people reporting errors that are not related to the current state of the server. Nevertheless, if you see an error you can either post to the blogs or send us an email, usg@lists.acomp.usf.edu. All reports are being investigated individually.

We’re in the final stages of the recovery process. We’re almost done with everything, but there are a few remaining issues:

Someone just sent me a message and it hasn’t been delivered, why is mail delivery taking so long?

We are waiting on a new server to replace the existing mail server and until the new hardware is in place, there will be some slowdowns. The problems will be at their worst from about 10AM-3PM during the week, as that is when most users are checking their mail. We should have the new system up and running in a few weeks.

I have a message from “unknown sender” that I can’t delete!

This will be fixed soon. The problem is an extra message separator in your INBOX that was added by the restore process. We’ve located all of the accounts with the extra line and we are now testing a program to repair the INBOXes. The testing process will be completed today and the mailboxes should all be repaired this weekend.
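
To give an idea of what such a repair program has to look for, here is a small sketch. It is not our actual repair tool. It assumes the mailbox is a plain-text, mbox-style file in which every message starts with a line beginning with "From ", and it only reports suspicious separators instead of changing anything.

```python
def find_empty_messages(lines):
    """Return 0-based indexes of "From " separator lines that are followed
    by no content before the next separator or the end of the file.

    Assumes an mbox-style layout; the real mailbox format may differ.
    """
    suspects = []
    last_sep = None       # index of the most recent separator line
    has_content = False   # True once a non-blank line follows that separator
    for i, line in enumerate(lines):
        if line.startswith("From "):
            if last_sep is not None and not has_content:
                suspects.append(last_sep)
            last_sep = i
            has_content = False
        elif line.strip():
            has_content = True
    # A separator at the very end of the file also has no message after it.
    if last_sep is not None and not has_content:
        suspects.append(last_sep)
    return suspects
```

A separator with nothing after it would show up to a mail reader as a blank message with no sender, which matches the symptom described above.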

I get Error number 0x800CCC0F in Outlook/Outlook Express!

This is caused by the mailbox problem mentioned above. Outlook and Outlook Express cannot handle the extra message separator and will not read the mailbox. This will be fixed this weekend, but you can use WebMail as a workaround in the meantime.

If you are having any other problems, please post them here or send them to usg@mailman.acomp.usf.edu