2008-04-03

Mind the gap - Date Slash Time bugs

Beware of time bugs. We humans have developed systems for keeping track of time that are positively arcane when you look at them from an algorithmic point of view, and there are many subtle and unsubtle problems applications can run into, even if you use libraries as much as possible. In this case, the culprit was Daylight Saving Time. But I'm getting ahead of myself, and the story of how we get there is quite interesting as well.

Last Saturday evening I received a mail from work. A server application was eating up 100% CPU, an important customer was reporting throughput issues, and did I have any idea what could be the matter? Some mailed logs provided no insight into the problem. The application was still processing data just fine, except obviously for the customer with problems. This particular application had been running well for years, so what could be going wrong now? Time to log in remotely and find out.

Well, it was eating up 100% CPU all right. In case of trouble with your Windows application, break glass and take out Process Explorer. I make sure to leave a copy on every server I touch at work, for the simple reason that it's strictly better than Task Manager. The only niggle is that you'll want to switch back to Task Manager to see what regular users see.

Process Explorer showed that three threads were eating up the CPU, with tens of thousands of context switches each. After a while, though, these threads would disappear and the CPU usage would drop. This would last for only a short time before the CPU toasting began anew.

Now, tens of thousands of context switches is a hint. There's no way the scheduler would switch individual threads in and out that frequently unless it was asked to. So what we're looking at in such a case is threads that are spin waiting: they're in a tight loop checking for some condition and yielding the processor between checks. The theory behind spin waiting, however, is that the wait will be very short, so you save the cost of a context switch and reduce latency. If it looks like the wait is going to take longer, you're supposed to enter a proper wait state instead, so as not to waste CPU cycles on busywork.
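
To make that concrete, here's a minimal Java sketch of the difference. This is not the server's actual code; the class and all names are made up for illustration:

    public class SpinVersusWait {
        private static volatile boolean done = false;
        private static final Object lock = new Object();

        // The anti-pattern we were seeing: poll a flag in a tight loop,
        // yielding the processor between checks. Every yield() invites a
        // context switch, which is where the tens of thousands came from.
        static void spinUntilDone() {
            while (!done) {
                Thread.yield();
            }
        }

        // What a thread should do once the wait turns out to be long:
        // block on a monitor and burn no CPU until it is woken up.
        static void waitUntilDone() throws InterruptedException {
            synchronized (lock) {
                while (!done) {
                    lock.wait();
                }
            }
        }

        public static void main(String[] args) throws InterruptedException {
            Thread waiter = new Thread(new Runnable() {
                public void run() {
                    try {
                        waitUntilDone();
                    } catch (InterruptedException ignored) {
                    }
                }
            });
            waiter.start();
            Thread.sleep(100); // simulate work before the condition flips
            synchronized (lock) {
                done = true;
                lock.notifyAll(); // wake the blocked thread
            }
            waiter.join();
        }
    }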

At this point I could cheat a bit on the debugging process, because I happened to have witnessed this behavior before, though only in the testing environment. If a client on the same computer disconnected rudely, the server would briefly spike the CPU while the threads that were supposed to handle the disconnect failed to do so. In short, the server was busy-waiting for the receiver logic to time out and notice that the connection was gone. Yes, this was bad, but the problem never manifested itself operationally and fixing it looked like a daunting task, so I saved it for "later". As you guessed, later never came. Until it knocked on the door.

So now we're one step further in identifying the problem: there were clients who were rudely disconnecting. No wonder the customer wasn't getting throughput: they kept dropping their own connections just after making them! But why?

At this point, I reconfigured the application's log level to "Ludicrous", to see if there was anything that could point to a problem. And sure enough, there was:

2008-03-29 21:41:12,656 DEBUG - Exception parsing absolute date 080330020050000+ ParseException: Unparseable date: "080330020050"

(Logging output changed slightly to protect the guilty, but not much.)

Now, I already gave away the crucial insight at the beginning, but would you have guessed the problem from this? That's right: the time that's being parsed, 2008-03-30 02:00:50, doesn't exist. It falls between 2 AM and 3 AM on March 30, when the Netherlands switches to Summer Time. I actually reproduced the code that was failing on this and still didn't see the problem; the colleague who had originally called me in finally spotted it. The idea of discontinuous time was so foreign to me as a programmer that I hadn't even considered it, despite being well aware of DST.
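
I won't reproduce the server's exact validation code here, but a common strict-parsing idiom fails in exactly this way: parse the date leniently, format it back, and reject the request if the round trip doesn't reproduce the input. A sketch of that idiom, assuming a hypothetical yyMMddHHmmss pattern for the date part:

    import java.text.ParseException;
    import java.text.SimpleDateFormat;
    import java.util.Date;
    import java.util.TimeZone;

    public class DstGapDemo {
        public static void main(String[] args) throws ParseException {
            SimpleDateFormat fmt = new SimpleDateFormat("yyMMddHHmmss");
            fmt.setTimeZone(TimeZone.getTimeZone("Europe/Amsterdam"));

            String input = "080330020050"; // 2008-03-30 02:00:50, in the gap

            // A lenient parse doesn't fail: the calendar resolves the
            // nonexistent wall-clock time by pushing it past the gap.
            Date parsed = fmt.parse(input);

            // ...which is exactly why a round-trip check rejects the input:
            // the instant formats back as 03:00:50, not 02:00:50.
            String roundTrip = fmt.format(parsed);
            System.out.println(roundTrip);               // 080330030050
            System.out.println(roundTrip.equals(input)); // false
        }
    }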

But why would this cause disconnects? Because the error handling was poorly written: the server considered the request containing this invalid date to be malformed, so it dropped it without logging and, more importantly, without informing the client that it had sent an invalid request. So the client would send a request, see no reply, time out, and break off the connection. And because the client's requests never got processed, they kept queueing up, hence the complaints about throughput.
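
The lesson generalizes: a server should answer even the requests it rejects. A minimal sketch of the shape the handler should have had; all types and names here are made up, and the real protocol is obviously more involved:

    public class ReplyOnReject {
        interface Connection {
            void send(String reply);
        }

        // Rejecting a request is fine; rejecting it *silently* is not.
        // Without the error reply, the client stalls until its own timeout,
        // tears down the connection, and its queue starts to grow.
        static void handle(String requestId, String dateField, Connection conn) {
            try {
                parseDateField(dateField);
                conn.send("OK " + requestId);
            } catch (IllegalArgumentException e) {
                conn.send("ERR " + requestId + ": " + e.getMessage());
            }
        }

        // Stand-in for the real date validation.
        static void parseDateField(String s) {
            throw new IllegalArgumentException("unparseable date: " + s);
        }

        public static void main(String[] args) {
            handle("42", "080330020050000+", new Connection() {
                public void send(String reply) {
                    System.out.println(reply);
                }
            });
        }
    }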

The resolution? As it turned out, the parsing itself was in error. The time indicated by the client was in absolute time, complete with time zone offset of "00+", and 2008-03-30 02:00:50 UTC is just fine. So I hotfixed the server to skip this particular check (since any actual errors with the date would still be caught at a later stage) and the client's queues drained in minutes.
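
And indeed, interpreted against UTC as the offset dictates, the very same timestamp parses and round-trips cleanly (same hypothetical pattern as above):

    import java.text.ParseException;
    import java.text.SimpleDateFormat;
    import java.util.Date;
    import java.util.TimeZone;

    public class UtcParsesFine {
        public static void main(String[] args) throws ParseException {
            SimpleDateFormat fmt = new SimpleDateFormat("yyMMddHHmmss");
            fmt.setTimeZone(TimeZone.getTimeZone("UTC")); // honour the offset

            // UTC has no DST, hence no gap: this time exists.
            Date parsed = fmt.parse("080330020050");
            System.out.println(fmt.format(parsed)); // 080330020050, round-trips
        }
    }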

In the postmortem, it's interesting to see how things transpired. The server contained a bug in parsing times, causing it to reject valid times as invalid because they fell in the DST gap. It contained another bug in that it didn't inform the client that it had rejected requests containing such times. It contained yet another bug where it couldn't handle rude disconnects from the client, spiking the CPU instead. And all this in an application that had been running fine for years, handling all sorts of traffic! Speaking of that: how did it run fine for years, given that DST returns every year? Well, this remains speculation, but I'd bet good money that our clients simply never used the field containing this particular time format before (there's only one).

In the end, I'm left with the conclusion that DST robbed me twice this year: first the hour lost to Summer Time, then another five for this obscure bug...
