Saturday, 14 May 2011

BFS 0.403 test for 2.6.39-rc7

The BFS 0.402 test has proven very stable on 2.6.39-rc7, but a minor issue came up with the new accurate IRQ accounting: some CPU time was not being accounted. So I went in and revised the way it works to be cheaper and more accurate. There has also been a problem in the accounting where the total CPU did not always add up to 100%. The reason is that the small inaccuracies in each respective CPU usage figure (user, system, wait, etc.) were exacerbated when added together. I've put in a total CPU percentage counter that checks whether the total adds up to 100 and, if not, rounds the values up so that they add up to 100%.
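To illustrate the idea of forcing the per-component figures to total 100, here is a minimal sketch in C. It is not code from the BFS patch: the function name scale_to_100 and the cumulative-rounding method are assumptions chosen for the example, and the real accounting code may do this differently.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical helper, not taken from the BFS patch.
 * Rescales integer percentages so that they sum to exactly 100.
 * Rounding is applied to the running (cumulative) total rather than
 * to each value, so the per-component rounding errors cannot pile up.
 */
static void scale_to_100(int *pct, size_t n)
{
    long total = 0, running = 0, prev = 0;
    size_t i;

    for (i = 0; i < n; i++) {
        assert(pct[i] >= 0);    /* usage figures are never negative */
        total += pct[i];
    }
    if (total == 0)
        return;                 /* nothing accounted, nothing to scale */

    for (i = 0; i < n; i++) {
        long cum;

        running += pct[i];
        /* cumulative share of 100, rounded to nearest */
        cum = (running * 100 + total / 2) / total;
        pct[i] = (int)(cum - prev);
        prev = cum;
    }
}
```

For example, inputs of 61, 30 and 7 (98 in total) come out as 62, 31 and 7.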

There was also a change I had been considering for the sticky flag, which is used to minimise task movement between CPUs, and I've committed it to the 403 test. Instead of it being a binary on/off flag, I made it a stepped flag going from CACHE_COLD through CACHE_WARM to CACHE_HOT. Basically, any task that is knocked off a CPU but is still waiting for more CPU is immediately labelled hot. Only one task is considered hot, and previously, as soon as a new cache hot task appeared, the sticky flag was cleared. Now, instead of being cleared, it is set to warm, and it is only cleared to cold when the task sleeps. Forked child processes are now also labelled cache warm, since they share many structures with their parent process. Any task that is cache warm or cache hot is biased against moving to another CPU by offsetting its relative deadline. Any task that is cache hot will not move to a different CPU if that CPU is scaled down in speed (for example, when the ondemand CPU frequency governor slows it down). This change should improve throughput further in the overloaded case (when jobs > CPUs), but that's only a general expectation as I haven't benchmarked it yet.
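The stepped flag described above can be sketched roughly as follows. Only the names CACHE_COLD, CACHE_WARM and CACHE_HOT come from the patch; struct task, the helper functions, the single "hot slot" pointer and the fixed STICKY_DEADLINE_OFFSET are all assumptions made up for this illustration, not the actual BFS code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

enum cache_state { CACHE_COLD, CACHE_WARM, CACHE_HOT };

/* Hypothetical penalty added to the deadline; the real offset is not given. */
#define STICKY_DEADLINE_OFFSET UINT64_C(500)

/* Hypothetical, stripped-down task structure for the sketch. */
struct task {
    enum cache_state sticky;
    uint64_t deadline;   /* earlier deadline runs first */
    int last_cpu;        /* CPU the task last ran on */
};

/* A task knocked off its CPU while still wanting CPU becomes hot;
 * the previously hot task is demoted to warm rather than cleared. */
static void mark_hot(struct task **hot_slot, struct task *t)
{
    assert(hot_slot != NULL && t != NULL);
    if (*hot_slot != NULL && *hot_slot != t)
        (*hot_slot)->sticky = CACHE_WARM;
    t->sticky = CACHE_HOT;
    *hot_slot = t;
}

/* Only sleeping takes a task all the way back to cold. */
static void task_sleeps(struct task **hot_slot, struct task *t)
{
    t->sticky = CACHE_COLD;
    if (*hot_slot == t)
        *hot_slot = NULL;
}

/* Forked children start warm: they share structures with the parent. */
static void task_forked(struct task *child)
{
    child->sticky = CACHE_WARM;
}

/* Warm and hot tasks look less urgent to any CPU other than their own. */
static uint64_t effective_deadline(const struct task *t, int cpu)
{
    if (cpu == t->last_cpu || t->sticky == CACHE_COLD)
        return t->deadline;
    return t->deadline + STICKY_DEADLINE_OFFSET;
}

/* A hot task refuses to move to another CPU that is scaled down in speed. */
static bool may_migrate(const struct task *t, int cpu, bool cpu_scaled_down)
{
    return !(t->sticky == CACHE_HOT && cpu != t->last_cpu && cpu_scaled_down);
}
```

The point of the sketch is the state transitions: hot is demoted to warm when another task becomes hot, and cold is only reached through sleep.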

Anyway give the new BFS a try. Everything appears to be running nice and stable, and as a bonus, my feel-good-o-meter is reading quite high with the upcoming 2.6.39! The magnitude of changes going into it seemed a lot less than previous kernels and I've had no issues with the -rc7 version so far.

As before, I've compressed the patch with lrzip as part of my evil plot to force you all to use it. Get it here:
2.6.39-rc7-sched-bfs-403-test.patch.lrz

Enjoy, and please report back if you try it!

17 comments:

  1. Here in my Core 2 Duo system with only 1GB ram, everything is working fine until now!

  2. some gcc 4.6 messages:
    In file included from kernel/sched.c:2:0:
    kernel/sched_bfs.c: In function ‘sched_getaffinity’:
    kernel/sched_bfs.c:4371:13: warning: variable ‘rq’ set but not used [-Wunused-but-set-variable]
    kernel/sched_bfs.c: In function ‘sys_sched_yield’:
    kernel/sched_bfs.c:4441:13: warning: variable ‘rq’ set but not used [-Wunused-but-set-variable]
    kernel/sched_bfs.c: In function ‘sys_sched_rr_get_interval’:
    kernel/sched_bfs.c:4686:13: warning: variable ‘rq’ set but not used [-Wunused-but-set-variable]

  3. (Not sure if this is a BFS vs mainline thing.) I noticed that 'top' in my system reports CPU% as an integer number (1, 2, etc) while on a system running mainline it reports 0.3, 1.5, etc. Is this normal?

  4. @RealNC : I don't see what you're reporting. I get floats on my machine?

  5. It seems this one causes a regression with the whole cache warm thing. Hang in there and prepare for a 404 that actually exists.

  6. Ralph Ulrich, 16 May 2011 00:14

    403 works here without issues:
    2.6.39-rc7-git8-bfs403

  7. Ralph Ulrich, 16 May 2011 00:31

    Can't find your patch 404. Probably your lrzip compressor is too slow :/

  8. Lol, I haven't even started making the 404. You'll likely hit a 404 if you try clicking on the non-existent 404. The 404 is just a concept for now, and is nothing but a 404 till then.

  9. Ralph Ulrich, 16 May 2011 00:41

    Obviously my German translator couldn't cope with your sentence:
    "and prepare for a 404 that actually exists."

    Con, another question if you like to talk:
    You mentioned in your blog post above the slowing down of Linux kernel development. Do you think Linux has reached a point of maturity where nothing exciting will happen any more?

    Perhaps, if we get some new hardware like quantum processors ...

  10. The "404 that actually exists" was a joke about the 404 that is normally "this page doesn't exist", sorry.

    No, I don't think the Linux kernel has matured; we are just in a relatively quiet period and I expect the usual frantic pace of development in the near future.

  11. Ralph Ulrich, 16 May 2011 00:53

    I don't think so. It is all done: BKL, drm, CFS, etc.

    But you could try again to get a scheduler plugin infrastructure included in mainline. Despite your bad feelings about the matter, now is the time to do it.

  12. @ck
    > I don't see what you're reporting. I
    > get floats on my machine?

    Weird. Here, I get integers :-P

    http://i56.tinypic.com/b7fcj8.png

    Since you get floats, I guess it's a userland configuration thing. I'll need to build a mainline kernel and boot with it to verify.

  13. Try pressing H to enable thread view.

  14. @ck
    > the total cpu did not always add up to 100%.

    I'm suspicious that you are hiding a bug. As long as you are not brushing it under the rug:

    wait = 100% - system - user - etc, by definition. No?

    I mean, do you really need to count total time?
    Can't you just look at the clock, so to speak?

  15. Mainline certainly works by adding up what's left to get the rest of the CPU on each tick. What BFS does, however, is add CPU time to each respective component every time some accumulates. Then on every tick it looks at the current running total. Thus, due to rounding down and slight discrepancies between the total and when the actual tick fires, it often adds up to 1 or 2% less than 100% of that tick.

  16. Still, scaling is a brute force approach. It will always "work" to hide any bug.

    Are you against adding the unaccounted time to the largest task or the idle time? A task that is already using a lot will not suffer much from 1-2% on top. It could be simpler to implement too.

  17. I am just adding it.

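As a rough illustration of the exchange in comments 15 to 17, the sketch below shows how rounding each component down can leave the tick total a percent or two short, and how the shortfall could be credited to the largest component. The function names, the nanosecond units and the tick length used in the example are assumptions for the sake of illustration and are not taken from BFS.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper: converts accumulated nanoseconds per component
 * into integer percentages of one tick, rounding down, and returns
 * the sum of those percentages (which may be less than 100).
 */
static int to_percent(const uint64_t *ns, int *pct, size_t n, uint64_t tick_ns)
{
    int total = 0;
    size_t i;

    assert(tick_ns > 0);
    for (i = 0; i < n; i++) {
        pct[i] = (int)(ns[i] * 100 / tick_ns);   /* integer division rounds down */
        total += pct[i];
    }
    return total;
}

/*
 * Hypothetical helper: adds whatever is missing from 100 to the
 * largest component, as suggested in comment 16.
 */
static void credit_largest(int *pct, size_t n, int total)
{
    size_t i, largest = 0;

    if (n == 0)
        return;
    for (i = 1; i < n; i++)
        if (pct[i] > pct[largest])
            largest = i;
    pct[largest] += 100 - total;
}
```

With a 1,000,000 ns tick and components of 615,000, 308,000 and 72,000 ns, the rounded-down shares are 61, 30 and 7 (98 in total), and crediting the largest gives 63, 30 and 7.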