Wednesday, 6 October 2010

Hierarchical tree-based penalty.

Further investigation at last reveals why the latest code affects program groups as well as threads, when threads alone were my original intention. The number of running tasks was being inherited across fork, meaning it was really acting as a branch penalty. That is, the more times a task has forked from init (the further along the branching), the greater the penalty applied to its deadline. So it's a hierarchical tree penalty. While there may be a real case for such a feature, it is not program grouping at all! Now I have to think hard about where to go from here. The thread grouping idea is still valid, and will work with this code corrected. However the tree penalty may be worth pursuing as a separate concept since it had such dramatic effects...
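To illustrate the accident in userspace terms, here is a toy sketch (not the actual BFS code; the field names, the base deadline and the scaling factor are all made up for illustration) of how a counter inherited across fork ends up stretching a task's deadline with its distance from init:

/* toy_fork_depth.c - illustrative only, not the real scheduler code. */
#include <stdio.h>

struct task {
    unsigned int fork_depth;        /* forks away from init */
    unsigned long long deadline_ns; /* base virtual deadline */
};

/* The child inherits the parent's depth, plus one. */
static struct task toy_fork(const struct task *parent)
{
    struct task child = *parent;
    child.fork_depth = parent->fork_depth + 1;
    return child;
}

/* Hypothetical penalty: stretch the deadline in proportion to depth. */
static unsigned long long effective_deadline(const struct task *t)
{
    return t->deadline_ns + t->deadline_ns * t->fork_depth / 10;
}

int main(void)
{
    struct task t = { .fork_depth = 0, .deadline_ns = 6000000ULL };

    for (int i = 0; i <= 5; i++) {
        printf("depth %u -> effective deadline %llu ns\n",
               t.fork_depth, effective_deadline(&t));
        t = toy_fork(&t);
    }
    return 0;
}

The deeper the fork chain, the later the effective deadline, which is exactly the "further along the branching" effect described above.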

Somehow I found myself doing a lot more hacking than I had intended, and that's also why there's so much blogging going on from someone who hates blogs.

EDIT: Enough talk. Here's some code. Patch to apply to a BFS 357 patched kernel, with two knobs in /proc/sys/kernel/
group_thread_accounting - groups CPU accounting by threads (default off)
fork_depth_penalty - penalises according to depth of forking from init (default on)
(Updated version):
bfs357-penalise_fork_depth_account_threads.patch
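For anyone who wants to flip the knobs at runtime, a minimal C sketch follows (it assumes the patch is applied so the files exist, and that you're running as root; it's equivalent to sysctl -w kernel.group_thread_accounting=1 and sysctl -w kernel.fork_depth_penalty=0):

/* toggle_knobs.c - minimal sketch for flipping the two /proc knobs. */
#include <stdio.h>

static int write_knob(const char *path, const char *value)
{
    FILE *f = fopen(path, "w");

    if (!f) {
        perror(path);  /* likely ENOENT without the patch, or EACCES without root */
        return -1;
    }
    fputs(value, f);
    fclose(f);
    return 0;
}

int main(void)
{
    /* Example: turn thread group accounting on, fork depth penalty off. */
    write_knob("/proc/sys/kernel/group_thread_accounting", "1");
    write_knob("/proc/sys/kernel/fork_depth_penalty", "0");
    return 0;
}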

9 comments:

  1. Hey, I just tested out the bfs357-test patch you just posted, and even though I barely have time to write this right now (I also haven't had time to run your benchmark tests yet), I wanted to let you know that with fork_depth_penalty enabled everything was fine for a time, but all of a sudden all video content started to stutter hardcore, and all graphical content on my desktop slowed to a crawl. The only error I see without digging is the usual fglrx backtrace (ATI card) in dmesg that happens whenever I use any preemptive kernel (it usually doesn't affect actual performance though). As soon as I set kernel.fork_depth_penalty = 0 everything went back to being smooth and normal immediately. The other option didn't seem to affect what was going on with this bug when I toggled it on and off. I also noticed Xorg's load was very high and rising. I'm a huge fan of your patches by the way, so thanks for all your hard work, and hopefully later tonight I'll have the time to get you some actual quantitative information that I'm guessing would be much more useful to you.

  2. Thanks very much chronniff. There's nothing better than a clear big fat good/bad with an off/on test for me. It was always going to be possible that something important which forked enough would suffer with this approach, and a multi-process multimedia application is a perfect example of it. Mplayer in all likelihood works better in this environment because it's not multithreaded. That's not to say mplayer is a better application. Note that applications like amarok are threaded as opposed to having lots of processes (threads vs separate processes; a quick way to check is sketched at the end of this comment), and it doesn't suffer with fork_depth_penalty, but does suffer with group_thread_accounting. It would be interesting to know what CPU you were using, and whether it was UP or SMP.

    All of this just goes to show that as soon as a heuristic comes into play, it may improve things most of the time but then cause some nasty regression when you least expect it. I've always been opposed to heuristics; BFS just hands out CPU fairly and with low latency.

    All of this points to this fork penalty, unfortunately, being a big failure in the end.
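    Incidentally, a quick way to tell whether an application is one heavily threaded process or a collection of separate processes is to look at the Threads: line in /proc/<pid>/status. A rough C snippet, purely illustrative and nothing to do with the patch itself:

    /* count_threads.c - print the Threads: line for a given pid. */
    #include <stdio.h>
    #include <string.h>

    int main(int argc, char **argv)
    {
        char path[64], line[256];
        FILE *f;

        if (argc < 2) {
            fprintf(stderr, "usage: %s <pid>\n", argv[0]);
            return 1;
        }
        snprintf(path, sizeof(path), "/proc/%s/status", argv[1]);
        f = fopen(path, "r");
        if (!f) {
            perror(path);
            return 1;
        }
        while (fgets(line, sizeof(line), f)) {
            if (!strncmp(line, "Threads:", 8)) {
                printf("%s", line);  /* e.g. "Threads:   5" */
                break;
            }
        }
        fclose(f);
        return 0;
    }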

  3. Hey ck, just taking another look now. I'm actually not so sure it's BFS causing the behavior I described earlier, although turning the fork penalty off definitely did stop it. I'm running an Intel Core i7 960 @ 3.20GHz (turbo and hyper-threading enabled); not sure if it's relevant (although it usually is when dealing with bugs), but my video card is an ATI 5770 running the proprietary fglrx.
    I recently started enabling the new intel_idle in the kernel to handle the C-state management instead of the BIOS implementation. I used to disable C3 and C6 to avoid latency, but intel_idle uses all the C-states and I really don't notice any penalties; meanwhile my CPU stays much, much cooler, and turbo can even overclock the cores higher than before under certain circumstances. I only mention this because I was wondering if intel_idle could somehow be clashing with BFS, and that could be the cause of some issues.
    When I tested it earlier it was Flash video that started whatever was happening (surprise surprise), but it then affected video playing in mplayer, all the compiz animations, and essentially all graphical interfaces, which all became extremely choppy and extremely slow.
    On the other hand, other than that bug I definitely notice the benefits of grouping the threads to their parent. I had Windows 7 and CentOS booting up in VMware while I was opening up Vuze, which normally mysteriously makes my system behave both slowly and somewhat strangely while running. But with the grouping of threads, even when Vuze was under heavy load, while its throughput may slow down some, the rest of my desktop feels completely unaffected, other than the disk I/O.

  4. In fact I have seen bugs related to intel_idle reported on the linux kernel mailing list, so perhaps you're just being bitten by that?

    In any case this code is too untested to make it into the next major -ck release, but I'd love to hear more feedback from others, and from yourself if you find it's intel_idle at fault.

  5. There is a similar problem.
    I'm running mplayer on an AMD Athlon(tm) XP 2500+ with an nVIDIA GeForce2 MX400 (binary driver).

    The video was not smooth but the audio was fine. If I switched fork_depth_penalty off, the video went back to being smooth.
    When I switched fork_depth_penalty on again, the video became choppy again.
    After switching fork_depth_penalty many times, I noticed that the CPU usage of mplayer appeared as 0.0%.
    Once I noticed the CPU usage was 0.0%, I tried switching fork_depth_penalty a few more times.
    The situation was still the same as I described.

    Then I tried restarting mplayer and playing the same video content. The issue was still there.

    At the same time, I was running Deluge.

    I disabled Deluge and tested again. Whether fork_depth_penalty was on or off, I couldn't reproduce the problem.

    So I started Deluge and tested again. I still couldn't reproduce it.

    Note: I use the latest patch "bfs357-penalise_fork_depth_account_threads.patch".

  6. Very interesting. I wonder if deluge is one of those heavily threaded things. Try the other knob, enabling thread group accounting. Thanks for testing!

  7. According to this command: ps aux -T | grep deluge

    there are 5 'threads'.
    I will see whether I can reproduce it.

  8. I observed one thing.

    If I used the default settings (fork_depth_penalty: 1, group_thread_accounting: 0), then when Deluge had I/O operations its CPU usage could increase to 40-60%, and thus mplayer could be affected.

    Then I tried setting group_thread_accounting to 1 (fork_depth_penalty was still 1), and it improved the situation. However, I still saw the CPU usage of Deluge increase to 40-60% several times (over a 3-hour observation).

    Finally, I set both group_thread_accounting and fork_depth_penalty to 0. So far, the CPU usage of Deluge never goes high, and mplayer plays video smoothly.

  9. Thanks very much for that testing too. I'm beginning to suspect that special treatment for threads may have been inappropriate, and that a patch which just treats all threads and processes equally may be better. Not only that, but it would be a much smaller patch with less overhead. Not that the current patch is much overhead, mind you. I'll experiment further and might try just such a patch as well.
