Thursday, 5 September 2013

Microsleeps and operating systems

As an anaesthetist, I spend a lot of time and effort dealing with and understanding sleep. So it's mildly amusing that I also spend a lot of time dealing with sleeps of various forms in code. Previously it was the effect sleep has on scheduling while I was working on the Linux kernel scheduler, and now it's writing drivers for various pieces of mining hardware in cgminer.

What I'm predominantly interested in is dealing with microsleeps - which, ironically, is also the term the airline safety industry uses for when a pilot nods off momentarily while meant to be awake. I'm sure you've all had that experience at some stage, hopefully not while driving.

Anyway, the bitcoin mining hardware scene has moved to the first generation of ASIC devices, and in that scene, the faster you can market and sell your product, the greater the potential profit for the manufacturer. Thus, not a lot of care goes into the interface between the ASIC chips and the operating system, leading to some really poorly designed MCUs and FPGA firmware. The task is simple enough - send the device some work, get one or more responses back, load more work, rinse and repeat. Some devices don't have buffers for queueing work, and some don't have buffers for responses, leading to scenarios where the time to retrieve the response and load more work becomes more and more critical. Some are so badly designed that they have response codes 64 bytes long and send them out on a device whose buffer only fits 62 bytes. Basically, most of them expect you to repeatedly poll the device for results and retrieve them, followed by sending the device more work.

Now, controlled small amounts of polling have their place in certain circumstances, and busy waiting on a condition is potentially going to be faster than sleeping, due to scheduling wake-up delays and so on. However, this only applies in microsecond timeframes, and provided you don't need to do it continuously, the small amount of extra power used over that period is insignificant, and you don't somehow get yourself into a priority inversion.

None of the hardware I am dealing with really works in those kinds of timeframes, repeatedly polling would be incredibly wasteful of CPU, and I have a horrible aversion to wasting CPU cycles just asking a device in an infinite loop whether it's ready or not. However, because of the lousy designs of some of this hardware, we are dealing with sleeps in the order of 1-40ms. So it's just outside the microsecond-resolution timeframes, but only just in the worst case. Those of you who've coded sleeps of these sizes will know that the jitter in the kernel timers alone is often in the order of 2ms, and scheduling delays can be much larger under load. Some hardware is much worse, and some operating systems (e.g. Windows) by default have only 15ms granularity unless you force them to operate at higher resolution.

Lots of hardware has asynchronous mechanisms, so it can get far more complicated, but we can demonstrate the issues even with the simplest of designs. So let's assume a simple 1-work-item, 1-result synchronous design with, say, a minimum of 10ms between work and results (a contrived example for demonstration purposes) on a uniprocessor machine.

So here lies the problem:

1. Send work.
2. Sleep 10ms.
3. Check for results.

Looks simple enough. Even assuming 2ms jitter, the worst that can happen is we wait an extra 2ms, which is not a profound delay. Let's ignore the jitter for this discussion.

However the following could happen on the uniprocessor machine:

1. Send work.
1a. The operating system kernel deschedules the application for 10ms so another process can take its place.
2. Sleep 10ms.
2a. Application gets descheduled again for another 10ms.
3. Check for results.

Admittedly this is the worst case scenario, but our 10ms wait becomes 30ms, so it is no longer insignificant. Assuming the common scenario is that we only sleep 10ms, and occasionally 1a (more likely) or 2a happens, with the worst case almost never happening, we can mitigate the disaster by making the intermediate sleep smaller - say, half of what it was. Now we have something that sleeps somewhere between 5 and 15ms. Not too bad, except that in the common case we are now polling twice as often, and the worst case is still a lot longer than we'd like.

Even if we accept this, we encounter a new problem with sleep, assuming we use a readily available accurate timer such as nanosleep(). Nanosleep does not guarantee we will sleep the amount we asked for: it happily gets interrupted by signals, returning early and reporting how much sleep time remained rather than sleeping the full amount. Therefore we have to handle nanosleep() returning having slept less than asked for, retrieve how much more we need to sleep, and then run nanosleep() again with the remainder.

2a. Ask for sleep 5ms
2a1. Signal interrupts sleep
2a2. Returns after 2ms
2b. Calculate we need to sleep another 3ms
2c. Ask to sleep 3ms
etc.
3. Check for results

Now you'd be really unlucky for this to happen multiple times over, but even with this happening once we have quite a few more potential places where the application can get descheduled, making it possible that we reach step 3 much later than intended.

So what do we do? Do we halve the sleep even further, to 2ms? That makes it far less likely we'll be hit by a signal, but we're now down close to the resolution of the kernel sleep timers themselves, and we run into a new problem should we run on a multiprocessor system (which for all intents and purposes is pretty much everything these days). The clock on each CPU is transiently slightly different, and the kernel cheats by picking the smallest difference that can guarantee it will not sleep too long. In my instrumenting of this, I found that most calls to nanosleep only slept for 2/3 of the time I asked for on a quad core CPU.

At 1-2ms sleep times we are getting dangerously close to just busy waiting anyway, and we substantially increase the risk of the application being descheduled because it is now CPU bound.

So after much angst, it was clear that the only way to minimise this problem was to move to absolute timers, leaving it up to the operating system to figure out how much it should (or should not!) sleep. To guarantee we never slept too much, while allowing ourselves some freedom to poll after the sleep period, I initially chose the following:

1. Get time
2. Send work.
3. clock_nanosleep() to an absolute time 10ms after that retrieved in 1.
4. Check for results.

We still have the potential for descheduling and extra delays to occur after 3 and before 4, but most operating system kernels give precedence to a process that has slept and is waking up, so this is actually likely to be relatively rare. It also means that should we get descheduled for long enough somewhere between 1 and 3, the absolute deadline will already have passed and the operating system won't put our process to sleep at all.

Only...

clock_nanosleep(), being POSIX.1-2001, doesn't guarantee it will work everywhere, of course. And indeed, this time only Linux supported it. Unlike the anonymous semaphore calls I mentioned in my previous post, which were present as blank functions that returned ENOSYS, these functions did not exist at all on OSX. Nor were they present in mingw32 on Windows. (This time I will not pass judgement on this...)

Now, I am aware there are other timers on the other operating systems, but I needed a workaround that would perform no worse than just calling nanosleep() until I could tackle them one operating system at a time, since I don't really know those operating systems intimately. So for the time being, what I do on Windows and OSX is:

1. Get time.
2. Send work
3. Calculate how long to sleep so that we wake 10ms after the time retrieved in 1.
4. Nanosleep that duration.
5. Check for results.

So far this is performing better than ordinary nanosleep was. Why? It just looks like a more complicated way of doing a relative nanosleep and should perform worse. It turns out that step 2 - sending work - itself takes a variable amount of time, and we should start timing from when we first started the function call to send work, so the variable send time is subtracted from the sleep rather than added on top of it.

Well, I was going to talk about clocks as well, but this post ended up being much longer than I anticipated, so I'll save that for next time (no pun intended).

Monday, 2 September 2013

Unnamed semaphores and POSOSX

During the development of my bitcoin mining software, cgminer, I've used just about every synchronisation primitive, due to it being heavily multithreaded. A few months back I needed some semaphores, and the first thing I reached for was the much more useful unnamed semaphores commonly in use today. Classic SYSV IPC semaphores are limited in number, require allocating shared memory, stay in use till destroyed or the system is rebooted, etc. etc., which makes them really awkward to use and far less flexible, so I never even considered using them. For some reason, though, I had a vague memory of trying to use unnamed semaphores on lrzip years ago and deciding not to. Well, that memory came back to bite me.

Cgminer is cross platform, working reasonably well on various architectures, mainly with Linux, Windows (via mingw32) and OSX, though other brave souls have used it on all sorts of things. I've often heard OSX described as the "Fisher-Price" unix, AKA "My first unix", because of the restricted subset of unix capabilities it has, although I'm led to believe it claimed POSIX compliance at some stage - though I never really investigated, nor does it really matter, since Linux is only POSIXy at best.

So the interesting thing was that I had written some code for cgminer which used unnamed semaphores, and it compiled fine across the 3 main platforms, but failed miserably when it came to working on OSX. Of note, the unnamed semaphore functions conform to POSIX.1-2001. All of the functions compiled perfectly fine, but the application refused to run properly, and finally, when I got some of the OSX users to investigate further, every single unnamed semaphore function, such as sem_init, sem_post, sem_wait etc., would return a unique OSX error which, when deciphered, was actually "Unimplemented feature". Quite amusing that to get POSIX compliance it only had to implement the functions, but not the actual features of those functions... You may go wild with speculation as to why this may be. This is why I coined the term POSOSX.

After toying with the idea of using SYSV semaphores and being disgusted at the thought, I finally decided that I should just implement really basic fake unnamed semaphores using pipes on OSX to imitate their behaviour.

Simplified code from cgminer for OSX follows (real code checks return values etc.):

#include <fcntl.h>
#include <unistd.h>

struct cgsem {
    int pipefd[2];
};

typedef struct cgsem cgsem_t;

void cgsem_init(cgsem_t *cgsem)
{
    int flags, fd, i;

    pipe(cgsem->pipefd);

    /* Make the pipes FD_CLOEXEC to allow them to close should we call
     * execv on restart. */
    for (i = 0; i < 2; i++) {
        fd = cgsem->pipefd[i];
        flags = fcntl(fd, F_GETFD, 0);
        flags |= FD_CLOEXEC;
        fcntl(fd, F_SETFD, flags);
    }
}

void cgsem_post(cgsem_t *cgsem)
{
    const char buf = 1;

    write(cgsem->pipefd[1], &buf, 1);
}

void cgsem_wait(cgsem_t *cgsem)
{
    char buf;

    read(cgsem->pipefd[0], &buf, 1);
}

void cgsem_destroy(cgsem_t *cgsem)
{
    close(cgsem->pipefd[1]);
    close(cgsem->pipefd[0]);
}


Lrzip version 0.615

So finally I found some time to give my other pet project, lrzip, some of the love and attention it sorely deserved. Over the past year since the last release, a number of bugs have shown up, and some generous souls have contributed bugfixes where possible, or at least pinpointed where the bugs come from.

This latest release addresses all known bugs that were reported in that time, and includes a few performance micro-optimisations.


The project page is here (0.615 will appear there shortly):

https://freecode.com/projects/long-range-zip

The news update follows:

Fixed -O not working on lrztar.
Made it less likely to run out of ram when working with STDIN/OUT.
Fixed running out of ram when using -U on huge files.
Fixed corrupt archives being generated from incompressible data.
Fixed corrupt archives being generated from very small files.
Fixed endianness on various platforms for MD5 calculation to work.
Fixed rare corruption when compressing with lzma from STDIN.
Fixed all blank data being generated when compressing from STDIN on OSX.
Performance micro-optimisations.
Fixed corrupt archive being generated when all the same non-zero bytes exist on large files.

Enjoy!
お楽しみください

Wednesday, 10 July 2013

BFS 0.440, -ck1 for linux-3.10

I finally managed to set up some 3G wireless internet in this remote mountain village I'm staying in (probably the first person ever to do so). After a few revisions I was able to bring BFS into line with mainline. There are no significant changes to the design itself, but hopefully a few minor fixes have come along as a result of the resync, as I also carved out bits of code not relevant to BFS and tinkered with the shutdown mechanism a bit more. As for the new tickless-on-busy-CPU feature from mainline, it is not being offered in BFS, as it is quite orthogonal to a design that so easily moves tasks from one CPU to another, and it provides no advantage for the desktop/laptop/tablet/PDA/mobile device/phone/router etc. that BFS is targeted towards.

Some of the configuration code was also changed, since the last version allowed you to generate an invalid configuration. You might get some strange warnings about the IRQ TIME ACCOUNTING configuration option, but they should be harmless.

Get BFS by itself for 3.10.0 here:
3.10-sched-bfs-440.patch

After careful consideration, I've decided to remove the remaining -ck patches and just make the -ck patchset BFS with some extra default config options and the -ck tag. As I've said previously, those other patches were from long ago, the kernel has changed a lot since then, and I've been unable to confirm they do anything useful any more, whereas there have been reports of regressions with them.

Get the -ck tagged patchset here:
3.10-ck1

Enjoy!
お楽しみください