Saturday 26 March 2011

BFS and optimising a global scheduler for per-cpu frequency scaling

TL;DR BFS now faster with cpu frequency scaling

The last update to BFS brought some performance enhancements to threaded CPUs like the i5/i7 etc. One thing that has troubled me for some time with BFS has been that CPU frequency scaling in the form of ondemand and friends, is based on the "local load" on a CPU, and scales one CPU up in the presence of load on that CPU alone, yet BFS has almost no notion of "local load" since it's a global scheduler. While there is some rudimentary code in BFS to say whether one CPU is currently slightly more busy than the total load, the fact that load is more a total system concept rather than per-CPU in BFS means that the CPU frequency scaling is not going to work as well. So I set out to try and tackle that exact issue with this latest test patch.

bfs363-test2.patch

In this patch, a special deadline selection process is invoked only when a "scaling" type cpu frequency governor is invoked on a per-cpu basis. What this special deadline selection process does is has a kind of soft affinity for CPU bound tasks (i.e. ones that don't ever sleep) and tries to keep them on one CPU over and above the normal amount in BFS, to the point of allowing a CPU to go idle instead of pulling that task away from its last CPU. It does this by using a simple counter for the number of times the task would otherwise have been selected by another CPU and keeps it under the total number of other CPUs. If it goes over this counter it will allow it to move. It still moves very very easily compared to the mainline scheduler which has a very strong notion of locality, and works the other way around (moving tasks only if things look really badly unbalanced). Freshly waking tasks also completely ignore this and just work as per usual BFS, thereby maintaining BFS' very low latency, and if a non-scaling governor is used (currently performance or powersave), BFS reverts to the regular deadline selection. This can increase the latency slightly of CPU bound tasks, but because the CPU they're bound to will now more easily increase its frequency with (for example) the ondemand governor, this latency is offset by faster completion of its work.

All in all it addresses a weakness in the fact that the CPU frequency scaling in mainline is designed around per-CPU scheduling and not global scheduling which BFS does. The most extreme example of this in testing on a heavily threaded multicore multithread CPU (i7) shows a 30% increase in throughput on a single threaded task when run on a 4 thread CPU. That's a massive change. Not only will the workload complete faster, but almost certainly battery usage will be reduced. In practice most workloads are more mixed than this, so don't expect a sudden 30% overall boost in performance. It also has no effect without cpu frequency scaling, and no effect on single CPU machines, but the more cores and threads you have, the greater the benefit. Since some new Linux distributions now no longer even allow you to change the governor and just set it to ondemand by default, this change is something that will be essential. After a bit of testing, and possibly more tweaking, it will constitute the changes for the next BFS release.

Please test and report back how this patch performs for you if you have more than one core and run CPU frequency scaling.

9 comments:

  1. Hey ck, I'm super excited about this patch. I too have noticed this issue. I get %30-%50 performance differences depending on which governor I select, and on CPU bound tasks. This looks like the right approach to fix the issue. Again, I'm excited!

    ReplyDelete
  2. this is so awesome. can't wait for the final version. again, good job ck. always impressed by your work.

    ReplyDelete
  3. although, i can't download it for some reason. problem loading page it says.

    ReplyDelete
  4. Not sure what the problem is, as it downloads fine here. If you're also wondering about the safety of this patch, I've been unable to break it so far but I thought it best to make it available as a test patch before releasing it as the new BFS. Likely I'll just change some comments in the patch and leave the actual code unchanged.

    ReplyDelete
  5. nvm, i just used the cached version by searching the link on google. now, to try out this patch.

    ReplyDelete
  6. so wait, i can only use ondemand, conservative, and userspace governors? those are the only ones i see in the patch.

    ReplyDelete
  7. sweet, i can now watch 1080p videos smooth with stress -c 2 and make -j2 on my core 2 duo machine. i would say thats about 10 - 15 % performance gain for me. i also noticed that firefox' opening time is reduced by a second. impressive so far. its a noticeable improvement.

    ReplyDelete
  8. You can use whatever governor you like. The other non-scaling governors don't suffer from a performance penalty with a global queue scheduler (like BFS). This patch addresses a regression that occurs when you plug in one of the 3 scaling governors. Thanks for feedback.

    ReplyDelete
  9. you've released 0.370 for 2.6.35. what's the diff with 0.363 ?

    ReplyDelete