Saturday 1 October 2016

MuQSS - The Multiple Queue Skiplist Scheduler v0.105


Announcing a multiple runqueue variant of BFS, with the more mundane name of MuQSS (pronounced mux), for Linux 4.7:

Full patch for linux-4.7
4.7-sched-MuQSS_105.patch

Keep watching this blog for newer versions!

Incremental patch to upgrade bfs502 to MuQSS:
bfs502-MuQSS_103.patch

It was inevitable that one day I would find myself tackling the two major scalability limitations in BFS, and this is the result. Those two issues were:
  1. The single runqueue, which means all CPUs fight over lock contention for the one runqueue, and
  2. The O(n) lookup, which means a linear increase in overhead for task lookups as the number of processes increases.
As you're all aware by now, skiplists were recently introduced into BFS to tackle number 2, with a modest improvement in throughput at high loads.
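For anyone unfamiliar with the structure, here is a rough illustration of why a skiplist search is O(log n) on average where a plain list walk is O(n). This is a minimal sketch of the idea only, with invented names, not the actual BFS/MuQSS code:

/* Minimal skiplist search sketch -- illustrative only.
 * Each node carries an array of forward pointers; higher levels skip
 * over many nodes at a time, so a search descends level by level and
 * visits O(log n) nodes on average instead of walking all n of them. */
#define MAX_LEVEL 8

struct slnode {
    unsigned long long key;         /* e.g. a task's virtual deadline */
    void *value;                    /* e.g. the task itself */
    struct slnode *next[MAX_LEVEL]; /* one forward pointer per level */
};

struct skiplist {
    struct slnode head;             /* sentinel that precedes all keys */
    int level;                      /* highest level currently in use */
};

/* Return the first node with key >= target, or NULL if none exists. */
static struct slnode *skiplist_search(struct skiplist *sl,
                                      unsigned long long target)
{
    struct slnode *node = &sl->head;
    int lvl;

    for (lvl = sl->level - 1; lvl >= 0; lvl--) {
        /* run right while the next node at this level is still too
         * small, then drop down a level and repeat */
        while (node->next[lvl] && node->next[lvl]->key < target)
            node = node->next[lvl];
    }
    return node->next[0];
}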

Until now I had neither the energy nor the time to try to find a solution for number 1 that maintained BFS's scheduling decision algorithm, since the single runqueue was actually the reason latency remains bounded and deterministic on BFS, capitalising on more CPUs instead of fighting against them for scalability.

This scheduler variant is an evolution of BFS, which will hopefully be mature enough to replace BFS one day, once stability is assured. It still uses the same scheduling algorithm as BFS, meaning latency and responsiveness remain as good as ever, but with per-CPU runqueues and discrete locking it will also scale to any number of CPUs, as the mainline scheduler does.
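Structurally, the change looks roughly like the sketch below: rather than every CPU taking the one global runqueue lock for every scheduling decision, each CPU has its own queue guarded by its own lock. This is hypothetical illustration code with invented names, not an excerpt from the patch:

#include <linux/percpu.h>
#include <linux/sched.h>
#include <linux/spinlock.h>

/* One runqueue per CPU, each with its own discrete lock, so lock
 * contention no longer grows with the number of CPUs. */
struct muqss_rq_sketch {
    raw_spinlock_t lock;    /* this CPU's runqueue lock */
    struct skiplist *sl;    /* runnable tasks, ordered by deadline */
    /* ... clock, load and idle bookkeeping would live here ... */
};

static DEFINE_PER_CPU(struct muqss_rq_sketch, sketch_rq);

/* The common case only touches the local CPU's lock. */
static void enqueue_local(struct task_struct *p)
{
    struct muqss_rq_sketch *rq = this_cpu_ptr(&sketch_rq);

    raw_spin_lock(&rq->lock);
    /* insert p into rq->sl, keyed by its virtual deadline */
    raw_spin_unlock(&rq->lock);
}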

It does NOT guarantee the best possible throughput, as there is still virtually no complex balancing mechanism whatsoever: tasks are selected primarily according to deadline, with only CPU cache distances used to decide which idle CPU a task should go to, or, in non-interactive mode, which overloaded CPU to pull from to fill an idle one.
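To make that concrete, the two decisions reduce to something like the following: the next task to run is simply the one with the earliest virtual deadline, and a waking task goes to the cache-closest idle CPU. Again a hypothetical sketch with invented names and an abstracted topology lookup, not code from the patch:

/* Hypothetical sketch of the selection criteria -- invented names.
 * 1. Which task runs next: the earliest virtual deadline wins.
 * 2. Where a waking task goes: the idle CPU with the smallest cache
 *    distance from where the task last ran. */
enum cache_locality {                   /* smaller means cache-closer */
    LOCALITY_THIS_CPU,
    LOCALITY_SMT_SIBLING,
    LOCALITY_SHARED_CACHE,
    LOCALITY_SAME_NODE,
    LOCALITY_REMOTE,
};

/* Pick the idle CPU that is cache-closest to the task's previous CPU.
 * 'distance' abstracts the topology lookup; returns -1 if none idle. */
static int best_idle_cpu(int prev_cpu, const int *idle_cpus, int nr_idle,
                         int (*distance)(int from, int to))
{
    int i, best = -1, best_dist = LOCALITY_REMOTE + 1;

    for (i = 0; i < nr_idle; i++) {
        int d = distance(prev_cpu, idle_cpus[i]);

        if (d < best_dist) {
            best_dist = d;
            best = idle_cpus[i];
        }
    }
    return best;
}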

It would be possible, with a lot of effort, to wedge the entire balancing algorithm for scalability from mainline into this, though it will probably offset the deterministic latency that makes it special.

This is a massive rewrite, and consequently there are bound to be race conditions and hidden bugs still, though I have been running it for a while now with reasonable stability. I'm putting this out there for the braver people to test. There's a lot more to document about it, but for now let's just say: give it a try.

Please don't use any lock debugging as it will light up every possible complaint for the time being!

Regarding 4.8: for the time being I will still be releasing BFS for it and incorporating it into -ck.

EDIT: Updated to version 0.105 with significant bugfixes.

Enjoy!
-ck

35 comments:

  1. Unfortunately, 101 already fails for me at kernel startup, without leaving a note, and without it being reproducible at which step it stops (within several trials).
    It's a dual-core Intel mobile CPU without hyperthreading capability, so that and SMT_NICE are disabled in .config.

    Oh, and I got the following compile-time warning:
    CC arch/x86/kernel/cpu/intel.o
    LD kernel/rcu/built-in.o
    CC arch/x86/kernel/cpu/mcheck/mce.o
    CC kernel/sched/bfs.o
    kernel/sched/bfs.c: In function ‘resched_task’:
    kernel/sched/bfs.c:1233:13: warning: unused variable ‘rq’ [-Wunused-variable]
    struct rq *rq = task_rq(p);
    ^
    CC arch/x86/kernel/cpu/mcheck/mce-severity.o
    CC arch/x86/kernel/cpu/mcheck/mce-genpool.o
    CC arch/x86/kernel/cpu/mcheck/mce_intel.o
    CC arch/x86/kernel/cpu/mcheck/threshold.o

    Con, it's very nice to see your innovative work in action. Keep up the good work; all users really appreciate it, and hopefully many are willing to help with debugging.

    BR, Manuel Krause

  2. Well, I said it's very new code. I think 100 booted more reliably than 101... but at this stage it's more a demonstration of the code than something for actual usage.

    Replies
    1. Of course, I do understand these facts. I just wanted to leave my results here in the hope they'd help in some way.
      Well-booting code is also highly appreciated ;-) Mmmh, you'd know that already. :-)

      BR, Manuel Krause

    2. I couldn't get linux 4.7.6 + 4.7-sched-MuQSS_101 to boot with my day-to-day bfs config either.
      So I built another kernel, starting from the default Arch Linux config with SMT_NICE and CGROUP_SCHED left enabled, and it booted fine.
      Unfortunately, those aren't the only changes between the two kernel configurations, but they are the only bfs-related ones.
      You might want to try with SMT_NICE enabled if 4.7-sched-MuQSS_102 doesn't work for you.

      Pedro

  3. I've uploaded version 102, which has a lot of bugfixes and makes it boot properly again for me.

  4. From your outline at the top, as a non-native speaker I completely fail to understand one sentence. It's:
    "It would be possible, with a lot of effort, to wedge the entire balancing algorithm for scalability from mainline into this, though it will probably offset the deterministic latency that makes it special."
    Can you describe it in other words, please, if you find the time?
    I really don't understand what "offsetting the special deterministic latency" would lead to, and what the culprit is: the wedging, the mainline code, or the offsetting.
    I looked up the related vocabulary, but I don't get a clue.

    Thank you, Manuel Krause

    Replies
    1. If I understood it correctly, he means that implementing that balancing algorithm would "remove" one of bfs's current features ("deterministic latency"), which is what makes bfs so special.

    2. Really? I wouldn't take that as the only possible reading of his sentence about the new branding.

  5. I have packaged this up for Gentoo in my overlay.

    # layman -a hnaparst
    # USE=muqss emerge ck-sources

    It seems stable for me.

  6. Thank you for your work, Con! Glad to see you are still hacking on bfs with new ideas.

    MuQSS102 has been running fine here for a couple of hours. I had a sort of lockup with MuQSS101, but no problem so far with 102. Suspend/resume works.

    I've run my little tests on muqss and bfs502.
    They don't show how scalable MuQSS is, as the CPU is only 2 cores + hyperthreading, but they give an overview of throughput.
    The performance is on par with bfs502, and there is still a small regression on the 50% workload (make -j2).

    A word of caution: the tests are not the same as the ones I ran on older bfs releases, so do not compare these results with the older ones.

    https://docs.google.com/spreadsheets/d/1ZfXUfcP2fBpQA6LLb-DP6xyDgPdFYZMwJdE0SQ6y3Xg/edit?usp=sharing

    Pedro

    Replies
    1. Thanks very much for those, Pedro. Indeed it should perform identically to BFS for the most part with lower numbers of CPUs. It will really start to make a difference when CPU counts get very high (16+) and/or the loads run are extreme (more than 4x the number of CPUs). My aim was to make BFS work better everywhere.

  7. Updated to version 0.103

    Replies
    1. Due to the fast development I skipped 102 and went straight to 103, which is booting fine again (without a changed .config -- Pedro, thank you for your workaround suggestion anyway).
      It's also working without issues or regressions so far.

      Thank you very much for all your work, Con,
      BR, Manuel Krause

    2. I've tested MuQSS103 with the acpi-cpufreq+performance governor. Surprisingly, there is no difference in throughput between interactive and non-interactive mode.

      I wonder which test is more relevant at this stage:
      - a test with acpi-cpufreq+performance, which locks the CPU to its max frequency
      - a test with intel_pstate or acpi-cpufreq+ondemand, to test "real usage conditions"
      Anyway, CPU frequency signalling seems to work fine given the results with MuQSS102.

      Pedro

    3. I'm finding similar results, and am still considering just what non-interactive mode should be at this stage. It's clear what interactive mode should do, but not so clear what the scheduler should do without it to get more throughput.

  8. Updated to version 0.104 with more throughput improvements. Will be releasing a new BFS for 4.8 shortly.

  9. Thanks Con! You're putting so much work into bfs these days.
    I've added the results for MuQSS104.
    I'll now try linux 4.8.

    On a side note, some may have noticed that several results were messed up when I changed the layout of my spreadsheet (mainly on the bfs 497, 490 and 467 sheets). I didn't notice this at first, and I don't know what happened, as I was careful with my copy/paste. Anyway, I had kept a backup in the old spreadsheet, so I was able to put the correct results back. Apologies for the inconvenience.

    Pedro

  10. Con, thanks very much for MuQSS104!

    I couldn't stress-test it yet, but there seems to be a regression compared to former BFS behavior:

    taskset seems broken?

    Usually I update the "cached repo crawler" (eix)

    via

    time taskset 1 eix-update && time taskset 1 eix-diff

    (since it's single-threaded only, pinning it to one virtual CPU means turbo boost can be used optimally)

    and that command just hangs, with no message in dmesg.
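    (For reference, a rough C sketch of what "taskset 1" does under the hood: it restricts the process's affinity mask to CPU 0 via sched_setaffinity() before exec'ing the command. This is a generic illustration only, nothing MuQSS-specific:)

    /* pin0: run a command with its CPU affinity restricted to CPU 0,
     * roughly equivalent to "taskset 1 command". */
    #define _GNU_SOURCE
    #include <sched.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <unistd.h>

    int main(int argc, char **argv)
    {
        cpu_set_t set;

        if (argc < 2) {
            fprintf(stderr, "usage: %s command [args...]\n", argv[0]);
            return EXIT_FAILURE;
        }

        CPU_ZERO(&set);
        CPU_SET(0, &set);            /* affinity mask 0x1 == CPU 0 only */

        if (sched_setaffinity(0, sizeof(set), &set) != 0) {
            perror("sched_setaffinity");
            return EXIT_FAILURE;
        }

        execvp(argv[1], argv + 1);   /* run the pinned workload */
        perror("execvp");
        return EXIT_FAILURE;
    }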

    Incidentally, I also added uksm to this newly built kernel, but I'm pretty sure it can't be the reason for the hang.


    Thanks

    kernelOfTruth

    Replies
    1. Forgot to add that it cannot be interrupted/killed via

      CTRL + c

    2. I can't reproduce that problem here with any other applications started via taskset (though I don't have eix).

      Note also that I've added 2 patches to the pending directory for upcoming fixes:
      http://ck.kolivas.org/patches/muqss/4.0/4.8/Pending/
      Not enough has accumulated yet for a 105, but it won't be long.

    3. I'm appending the following options to the kernel command line (posting only the subset that seems relevant):

      intel_iommu=on slub_nomerge memory_corruption_check=1 threadirqs nomodeset cgroup_disable=memory

      threadirqs (forced IRQ threading) has exhibited strange behavior before on not-entirely-stable kernels, so it might make this easier to trigger.

      Thanks

    4. Just got the following messages in dmesg:

      [ 1433.540818] NOHZ: local_softirq_pending 02
      [ 1433.541105] NOHZ: local_softirq_pending 02
      [ 1581.686378] NOHZ: local_softirq_pending 02
      [ 1581.687366] NOHZ: local_softirq_pending 02
      [ 1581.688365] NOHZ: local_softirq_pending 02
      [ 1592.690111] NOHZ: local_softirq_pending 202
      [ 1592.690291] NOHZ: local_softirq_pending 202
      [ 1608.833740] NOHZ: local_softirq_pending 02
      [ 1620.212481] NOHZ: local_softirq_pending 02
      [ 1621.462443] NOHZ: local_softirq_pending 02

      (probably coinciding with the compilation of firefox right now)

      I never saw those messages before.

      Not sure if muqss104-deref_safe.patch addresses these

    5. I saw them many years ago with certain RCU options. They seem to have come back with 4.8, and I'm not sure it's related to my code at all (though given my code is so different and new, it could be doing anything).

    6. After a bit more playing I was able to reproduce your hanging affinity-based problem. Will look into it further, thanks.

  11. @kernelOfTruth, FWIW, until Con sorts out your issue, eix-sync seems to work fine with your taskset command, and I think it accomplishes the same thing.

    Replies
    1. Thanks, I'll take a look into it :)

  12. @kernelOfTruth:
    I've been using threadirqs too, since you recommended it many months ago. Is it still of benefit, or has it been overtaken by current development? (Maybe that's why I sometimes suffer from failed boots.)

    BR, Manuel Krause

    Replies
    1. It's useful if you depend on lower-latency audio playback, or e.g. for gaming.


      for example:


      # raise threaded IRQ handlers (threadirqs) to SCHED_FIFO realtime priority
      psnd=$(pgrep "irq/.*-snd_")
      [[ -n $psnd ]] && chrt -f -p 85 $psnd

      pi915=$(pgrep "irq/.*-i915")
      [[ -n $pi915 ]] && chrt -f -p 84 $pi915

      pnvidia=$(pgrep "irq/.*-nvidia")
      [[ -n $pnvidia ]] && chrt -f -p 84 $pnvidia

    2. @kernelOfTruth:
      Thank you again for sharing your knowledge. Does threadirqs by itself, without making use of it like this, impose any overhead that you know of?
      Unfortunately I seem to have lost the links on this topic from back then. Can you tell me how I can verify that the above commands do what they should?

      Thank you in advance and
      BR, Manuel Krause

    3. Manuel,

      here's the modified script I'm using right now:

      https://gist.github.com/kernelOfTruth/856f41450bdd8cbfcd5c55760e4b750e

      as far as I know it all started with "blebs" (on the Arch Linux forums; known under a different name on the Gentoo forums), who posted the basis of the script

      Cheers

    4. Running either of the two commented-out ps commands in that script should get you what you want to know.

  13. Have you considered using a lock-free queue for the runqueue? It could eliminate some overhead. For example, there is a C++ implementation of a lock-free concurrent (multiple producer, multiple consumer) queue in Facebook's Folly library: https://github.com/facebook/folly/blob/master/folly/docs/Overview.md

    Replies
    1. Thanks for the suggestion. I've used lockless components as much as possible already, but the scheduler doesn't lend itself to lockless storage for the most crucial data, because all CPUs have to share a system-wide notion of the current "truth" of that data at exactly the same time.

    2. Interesting. I'm currently doing research on automated concurrency and think there may be some data structures that could lend themselves to this kind of scheduling. Perhaps we can talk more offline about it. I think in a few weeks when I get some free time I will take a serious look at the MuQSS code.

  14. Hmmm... “mux”, eh? All I can think of is “mucusss”, heh. :]

    Been using BFS exclusively here, for most of my journey with Linux, and it's quite exciting to see development such as this take place.

    Will need to give this a go very soon.

    Many thanks to you and your work!
