Thursday, 5 September 2013

Microsleeps and operating systems

As an anaesthetist, I spend a lot of time and effort dealing with and understanding sleep. So it's mildly amusing that I also spend a lot of time dealing with sleeps in various forms in code. Previously it was the effect sleep has on scheduling while developing and working on the linux kernel scheduler, and now it's writing drivers for various mining hardware for cgminer.

What I'm predominantly interested in is dealing with microsleeps - which, ironically, is also the term the airline safety industry uses for when a pilot nods off momentarily while they're meant to be awake. I'm sure you've all had that experience at some stage, hopefully not while driving.

Anyway, the mining hardware scene for bitcoin has moved to the first generation of ASIC devices, and in that scene the faster you can market and sell your product, the greater the potential profit for the manufacturers. Thus, often not a lot of care goes into the interface between the ASIC chips and the operating system, leading to some really poorly designed MCUs and FPGA firmware. The task is simple enough - send the device some work, get one or more responses back, load more work, rinse and repeat. Some devices don't have buffers for queueing work, and some don't have buffers for responses, leading to scenarios where the time to get the response and load more work becomes more and more critical. Some are so badly designed that they have response codes 64 bytes long and send them out on a device with a buffer that only fits 62 bytes. Basically most of them expect you to repeatedly poll the device for results and retrieve them, followed by sending the device more work.

Now, a controlled, small amount of polling has its place in certain circumstances, and busy waiting on a condition is potentially going to be faster than waiting in a sleep, due to scheduling wake-up delays and so on. However this only holds for microsecond timeframes, and only provided you don't need to be doing it continuously, the small amount of extra power used over that period is insignificant, and you don't get yourself into a priority inversion somehow.
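For concreteness, this is roughly what bounded busy waiting looks like. It's a minimal sketch only - device_ready() is a hypothetical stand-in for whatever check a real driver would do - and the spin is capped with CLOCK_MONOTONIC so it can never run away:

```c
#include <stdbool.h>
#include <stdint.h>
#include <time.h>

/* Hypothetical predicate - in reality this would poll the device's
 * registers or USB endpoint to see whether a result is waiting. */
extern bool device_ready(void);

/* Busy wait for the device, but give up after max_us microseconds so we
 * never spin indefinitely. Returns true if the device became ready. */
static bool spin_for_device(int64_t max_us)
{
    struct timespec start, now;

    clock_gettime(CLOCK_MONOTONIC, &start);
    do {
        if (device_ready())
            return true;
        clock_gettime(CLOCK_MONOTONIC, &now);
    } while ((now.tv_sec - start.tv_sec) * 1000000LL +
             (now.tv_nsec - start.tv_nsec) / 1000 < max_us);

    return false;
}
```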

None of the hardware I am dealing with really works in those kinds of timeframes, repeatedly polling would be incredibly wasteful of CPU, and I have a horrible aversion to wasting CPU cycles just asking a device whether it's ready or not in an infinite loop. However, because of the lousy designs of some of this hardware, we are dealing with sleeps in the order of 1-40ms. So it's just outside microsecond-resolution timeframes, but only just in the worst case scenario. Those of you who've coded sleeps of this size would know that the jitter in the kernel timers alone is often in the order of 2ms, and scheduling delays can be much larger under load. Some hardware is much worse, and some operating systems (e.g. windows) by default have only 15ms timer granularity unless you force them to operate at higher resolution.
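On windows, the usual way of forcing that higher resolution is the multimedia timer API. A sketch only, and it should be undone with timeEndPeriod(1) when the program exits:

```c
#ifdef WIN32
#include <windows.h>    /* timeBeginPeriod()/timeEndPeriod(), link with -lwinmm */

/* Ask windows for 1ms timer granularity instead of the default ~15ms. */
static void raise_timer_resolution(void)
{
    timeBeginPeriod(1);
}
#endif
```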

Lots of hardware has asynchronous mechanisms so it can get far more complicated, but we can demonstrate the issues even with the simplest of designs, so let's assume a simple 1 work item, 1 result synchronous design with say a minimum of 10ms between work and results (contrived example for demonstration purposes) on a uniprocessor machine.

So here lies the problem:

1. Send work.
2. Sleep 10ms.
3. Check for results.
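In code, the naive version is little more than this sketch, where send_work() and check_results() are hypothetical stand-ins for the real driver calls:

```c
#include <unistd.h>

/* Hypothetical driver helpers standing in for the real USB/serial calls. */
extern void send_work(void);
extern void check_results(void);

static void mine_one_item(void)
{
    send_work();            /* 1. Send work. */
    usleep(10 * 1000);      /* 2. Sleep 10ms. */
    check_results();        /* 3. Check for results. */
}
```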

Looks simple enough. Even assuming 2ms jitter, the worst thing that can happen is we wait an extra 2ms which is not a profound delay. Let's ignore the jitter for this discussion.

However the following could happen on the uniprocessor machine:

1. Send work.
1a. The operating system kernel deschedules the application for 10ms so another process can take its place.
2. Sleep 10ms.
2a. Application gets descheduled again for another 10ms.
3. Check for results.

Admittedly this is the worst case scenario, but our 10ms wait becomes 30ms, so this is no longer insignificant. Assuming the common scenario is that we only sleep 10ms, and that occasionally 1a (more likely) or 2a happens, with the worst case scenario almost never happening, we can mitigate the disaster by making the intermediate sleep smaller - say, half of what it was. Now we have something that sleeps somewhere between 5 and 15ms. Not too bad, except that in the common case we are now polling twice as often, and the worst case is still a lot longer than we'd like.

Even if we accept this, we encounter a new problem with sleep, assuming we use a readily available accurate timer such as nanosleep(). Nanosleep does not guarantee we will sleep the amount we asked for and happily gets interrupted by signals, returning early with the time remaining instead of sleeping for the full amount we asked. Therefore we have to handle nanosleep() returning having slept less than asked for, retrieve how much sleep is left, and then run nanosleep() again with the remainder (sketched in code after the list below).

2a. Ask to sleep 5ms
2a1. Signal interrupts sleep
2a2. Returns after 2ms
2b. Calculate we need to sleep another 3ms
2c. Ask to sleep 3ms
...and so on.
3. Check for results
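A minimal sketch of that dance, using the second argument that nanosleep() fills in with the remaining time when a signal cuts the sleep short:

```c
#include <errno.h>
#include <time.h>

/* Sleep for the requested number of milliseconds, restarting the sleep
 * with whatever time remains whenever a signal interrupts it. */
static void sleep_ms_uninterrupted(long ms)
{
    struct timespec req, rem;

    req.tv_sec = ms / 1000;
    req.tv_nsec = (ms % 1000) * 1000000L;

    while (nanosleep(&req, &rem) == -1 && errno == EINTR)
        req = rem;  /* go around again for the time left over */
}
```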

Now you'd be really unlucky for this to happen multiple times over, but even with this happening once we now have quite a few more potential places where the application can get descheduled, making it possible that we reach step 3 much later than intended.

So what do we do? Do we halve the sleep even further, to 2ms? That makes it far less likely we'll get hit by a signal, but we're now down close to the resolution of the kernel sleep timers themselves, and we run into a new problem should we run on multiprocessor systems (which, for all intents and purposes, pretty much everything is these days). The clock on each CPU is transiently slightly different, and the kernel cheats by picking the smallest difference that can guarantee it's not going to sleep too long. When I instrumented this, I found that most calls to nanosleep() only slept for about 2/3 of the time I asked for on a quad core CPU.
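The instrumentation itself is trivial - time the nanosleep() call with a monotonic clock and compare requested against actual, something like this sketch:

```c
#include <stdio.h>
#include <time.h>

/* Request a 2ms nanosleep and report how long it actually took. */
static void measure_sleep(void)
{
    struct timespec req = { .tv_sec = 0, .tv_nsec = 2 * 1000000L };
    struct timespec before, after;
    long slept_us;

    clock_gettime(CLOCK_MONOTONIC, &before);
    nanosleep(&req, NULL);
    clock_gettime(CLOCK_MONOTONIC, &after);

    slept_us = (after.tv_sec - before.tv_sec) * 1000000L +
               (after.tv_nsec - before.tv_nsec) / 1000;
    printf("asked for 2000us, slept %ldus\n", slept_us);
}
```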

At 1-2ms sleep times we are getting dangerously close to just busy waiting anyway, and we substantially increase the risk of our application getting descheduled because it is now CPU bound.

So after much angst it was clear that the only way to minimise this problem was to move to absolute timers, leaving it up to the operating system to figure out how much it should (or should not!) sleep for. To guarantee we never slept too much, and to allow ourselves some freedom to poll after the sleep period, I initially chose the following (sketched in code after the list):

1. Get time
2. Send work.
3. clock_nanosleep() to an absolute time 10ms after that retrieved in 1.
4. Check for results.
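On linux this maps directly onto clock_nanosleep() with the TIMER_ABSTIME flag. A sketch, reusing the same hypothetical send_work()/check_results() helpers as before:

```c
#include <errno.h>
#include <time.h>

extern void send_work(void);
extern void check_results(void);

static void mine_one_item_abs(void)
{
    struct timespec deadline;

    /* 1. Get the time before we start sending work. */
    clock_gettime(CLOCK_MONOTONIC, &deadline);

    /* 2. Send work - however long this takes counts towards the 10ms. */
    send_work();

    /* 3. Sleep until 10ms after the timestamp taken in step 1. */
    deadline.tv_nsec += 10 * 1000000L;
    if (deadline.tv_nsec >= 1000000000L) {
        deadline.tv_sec++;
        deadline.tv_nsec -= 1000000000L;
    }
    while (clock_nanosleep(CLOCK_MONOTONIC, TIMER_ABSTIME, &deadline, NULL) == EINTR)
        ;   /* a signal just means we sleep to the same absolute deadline again */

    /* 4. Check for results. */
    check_results();
}
```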

We still have the potential for descheduling and extra delays to occur after 3 and before 4, but most operating system kernels will give precedence to a process that has slept and is waking up, so this is actually likely to be relatively rare. It also means that should we get descheduled somewhere between 1 and 3 and the absolute deadline has already passed, the operating system won't put our process to sleep at all.

Only...

clock_nanosleep() being POSIX.1-2001 doesn't guarantee it will work everywhere, of course. And indeed, this time only linux supported it. Unlike the anonymous semaphore functions I mentioned in my previous post, which were present as stub functions that simply returned ENOSYS, these functions did not exist at all on OSX. Nor were they present in mingw32 on windows. (This time I will not pass judgement on this...)

Now I am aware there are other timers on the other operating systems, but I needed a workaround that would perform no worse than just calling nanosleep() until I could tackle them one operating system at a time, since I don't really know those operating systems intimately. So for the time being, what I do on windows and OSX is the following (sketched in code after the list):

1. Get time.
2. Send work
3. Calculate how long to sleep so that we wake 10ms after the time retrieved in 1.
4. Nanosleep that duration.
5. Check for results.
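A minimal sketch of that fallback, under the assumption that gettimeofday() and nanosleep() (or equivalents) are available on the platform, and again with the hypothetical driver helpers standing in for the real work submission and result retrieval:

```c
#include <sys/time.h>
#include <time.h>

extern void send_work(void);
extern void check_results(void);

static void mine_one_item_fallback(void)
{
    struct timeval start, now;
    long elapsed_us, remaining_us;

    /* 1. Get time. */
    gettimeofday(&start, NULL);

    /* 2. Send work. */
    send_work();

    /* 3. Calculate how much of the 10ms remains relative to step 1. */
    gettimeofday(&now, NULL);
    elapsed_us = (now.tv_sec - start.tv_sec) * 1000000L +
                 (now.tv_usec - start.tv_usec);
    remaining_us = 10 * 1000L - elapsed_us;

    /* 4. Nanosleep that duration - or skip the sleep entirely if sending
     * work already took longer than 10ms. */
    if (remaining_us > 0) {
        struct timespec req = { .tv_sec = 0,
                                .tv_nsec = remaining_us * 1000L };
        nanosleep(&req, NULL);
    }

    /* 5. Check for results. */
    check_results();
}
```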

So far this is performing better than ordinary nanosleep did. Why? It just looks like a more complicated way of doing a relative nanosleep and, if anything, should perform worse. It turns out that step 2 - sending work - itself takes a variable amount of time, and we should start timing from when we first started the function call to send work, not from when it returned.

Well, I was going to talk about clocks as well, but this post ended up being much longer than I anticipated, so I'll save that for next time (no pun intended).

2 comments:

  1. Do these kinds of traps appear with RT kernels?
    (I guess not, because it's somehow "hard-wired"?)

    I never tried RT myself, but I have always wondered why RT users claim to have a much better experience doing audio tasks. Why the hell would such a cheap task as audio, which shouldn't need more than 5 percent of the resources of a modern multi-core system, not be better served by BFS, which could dynamically provide more resources if needed, instead of some "hard-wired" RT schedule?

    But I am not sure if I understood your issue at all ...

    Ralph Ulrich, Hamburg

  2. Funny, no comment from the OSX people this time. I guess my post didn't look as offensive.
