Announcing a multiple runqueue variant of BFS, with the more mundane name of MuQSS (pronounced mux) for linux 4.7:
Full patch for linux-4.7
4.7-sched-MuQSS_105.patch
Keep watching this blog for newer versions!
Incremental to patch bfs502 to MuQSS 0.1:
bfs502-MuQSS_103.patch
It was inevitable that one day I would find myself tackling the 2 major scalability limitations in BFS and this is the result of it. These two issues were
- The single runqueue which means all CPUs would fight for lock contention over the one runqueue, and
- The O(n) look up which means linear increase in overhead for task lookups as number of processes increases.
Till now I did not have the energy nor time to try and find a solution for number 1. that maintained BFS' scheduling decision algorithm as the single runqueue was actually the reason latency remains bound and deterministic on BFS, capitalising with more CPUs instead of fighting against them for scalability.
This scheduler variant is an evolution of BFS, which hopefully will be mature enough to replace BFS one day when stability is assured. It is able to still use the same scheduling algorithm as BFS meaning latency and responsiveness remains as good as always, but with the per-CPU runqueue and discrete locking, it also means it will scale to any number of CPUs, as the mainline scheduler does.
It does NOT guarantee the best possible throughput as there still is virtually no complex balancing mechanism whatsoever, selecting tasks according to deadline primarily with only CPU cache distances being used to determine which idle CPU to go to, or in non-interactive mode, which overloaded CPU to pull from to fill an idle CPU.
It would be possible, with a lot of effort, to wedge the entire balancing algorithm for scalability from mainline into this, though it will probably offset the deterministic latency that makes it special.
This is a massive rewrite and consequently there are bound to still be race conditions and hidden bugs though I have been running it for a while now with reasonable stability. I'm putting this out there for the braver people to test. There's a lot more to document about it but for now let's just say, give it a try.
Please don't use any lock debugging as it will light up every possible complaint for the time being!
Regarding 4.8, for the time being I will still be releasing BFS for it and incorporate it into -ck
EDIT: Updated to version 0.105 with significant bugfixes.
Enjoy!
お楽しみ下さい
-ck
Unfortunately, the 101 fails for me already at kernel starting up, without leaving a note, and without reproducibility at which step it stops (within several trials).
ReplyDeleteIt's a dual core intel mobile cpu without hyperthreading capability, so this stuff and SMT_NICE are disabled in .config.
Oh, and I got the following compile time warning:
CC arch/x86/kernel/cpu/intel.o
LD kernel/rcu/built-in.o
CC arch/x86/kernel/cpu/mcheck/mce.o
CC kernel/sched/bfs.o
kernel/sched/bfs.c: In function ‘resched_task’:
kernel/sched/bfs.c:1233:13: warning: unused variable ‘rq’ [-Wunused-variable]
struct rq *rq = task_rq(p);
^
CC arch/x86/kernel/cpu/mcheck/mce-severity.o
CC arch/x86/kernel/cpu/mcheck/mce-genpool.o
CC arch/x86/kernel/cpu/mcheck/mce_intel.o
CC arch/x86/kernel/cpu/mcheck/threshold.o
Con, it's very nice to see your actual innovative activity. Keep up your good work, all users would really appreciate it, and hopefully many are willing to help debugging.
BR, Manuel Krause
Well I said it's very new code. I think 100 booted more reliably than 101... but it's more for demonstration of the code than actual usage at this stage.
ReplyDeleteOf course, I do understood these facts. Just wanted to leave my results on here in the hope they'd help in any way.
DeleteWell booting code is also highly appreciated ;-) Mmmh, you'd know already. :-)
BR, Manuel Krause
I couldn't get linux 4.7.6+4.7-sched-MuqSS_101 to boot with my day to day bfs config either.
DeleteSo I built another kernel starting with the default Archlinux config and leave SMT_NICE and CGROUP_SCHED enabled, and it booted fine.
Unfortunately, these aren't the only changes beetween the kernel configurations, but these are the only bfs related changes.
You might want to try with SMT_NICE enable if 4.7-sched-MuqSS_102 doesn't work for you.
Pedro
I've uploaded version 102 which has a a lot of bugfixes and makes it boot properly again for me.
ReplyDeleteFrom your outline on the top, I completely don't understand one sentence, as non native speaker. It's:
ReplyDelete"It would be possible, with a lot of effort, to wedge the entire balancing algorithm for scalability from mainline into this, though it will probably offset the deterministic latency that makes it special."
Can you describe it in other words, please, if you find time?
I really don't understand, what "offsetting the special deterministic latency" would lead to and who's the culprit: The wedging one, mainline, or the offsetting.
Looked for the related vocabulary, but don't get a clue.
Thank you, Manuel Krause
If I understood that correctly, he means that having implemented the algorithm to do the balancing, will "remove" one of bfs's current feature ("deterministic latency") which makes bfs so special.
DeleteReally? Don't assume this for his new branding as only truth of this sentence.
DeleteI have packaged this up for Gentoo in my overlay.
ReplyDelete# layman -a hnaparst
# USE=muqss emerge ck-sources
It seems stable for me
Thank you for your work Con! Glad to see you are still hacking bfs with new ideas.
ReplyDeleteMuQSS102 is running fine here for a couple of hours. I had sort of a lockup with MuQSS101, but no problem so far with 102. Suspend/resume works.
I've run my little tests on muqss and bfs502.
They don't show how MuQSS is scalable as the CPU is only 2 core + hyperthreading, but they give an overview of throughput.
The performance is on par with bfs502, and there is still a little regression on 50% workload (make -j2).
A word of caution: the tests are not the same as the one I ran on older bfs releases, so do not compare the results with the older.
https://docs.google.com/spreadsheets/d/1ZfXUfcP2fBpQA6LLb-DP6xyDgPdFYZMwJdE0SQ6y3Xg/edit?usp=sharing
Pedro
Thanks very much for those Pedro. Indeed it should perform identically with BFS for the most part with lower numbers of CPUs. It will really start to make a difference when CPU numbers get very high (16+) and/or loads run are extreme (more than 4xCPUs). My aim was to make BFS work better everywhere.
DeleteUpdated to version 0.103
ReplyDeleteDue to the fast development I've omitted 102 and went over to 103, what is booting fine again (without changed .config -- Pedro, thank you for your workaround suggestion anyways).
DeleteIt's also working without issues/ regressions so far.
Thank you very much for all your work, Con,
BR, Manuel Krause
I've tested MuQSS103 with acpi-cpufreq+performance governor. Surprisingly, there is no difference in throughput between interactive and non-interactive mode.
DeleteI wonder what test is more relevant at this stage:
-test with acpi-cpufreq+performance, which lock the cpu to its max frequency
-test with intel_pstate or acpi-cpufreq+ondemand to test "real usage conditions"
Anyway, cpu frequency signaling seems to work fine given the results with MuQSS102.
Pedro
I'm finding similar results and am still considering just what non-interactive mode should be at this stage. It's clear what interactive mode should do but not so much for without it to get more throughput.
DeleteUpdated to version 0.104 with more throughput improvements. Will be releasing a new BFS for 4.8 shortly.
ReplyDeleteThanks Con ! You're putting so much work in bfs these days.
ReplyDeleteI've added the results for MuQSS104.
I'll now try linux 4.8.
On a side note, some may have noticed that several results were messed up when I changed the layout of my spreadsheet (mainly on bfs 497, 490, 467 sheets). I didn't notice this at first, and I don't know what happened as I was carefull with my copy/paste. Anyway, I had kept a backup on the old spreadsheet so I was able to put the correct results back. Apologies for the inconvenience.
Pedro
Con, thanks very much for MuQSS104 !
ReplyDeleteI couldn't stress-test it yet but there seems to be a regression compared to former BFS behavior:
taskset seems broken ?
Usually I update the "cached repo crawler (eix)"
via
time taskset 1 eix-update && time taskset 1 eix-diff
(since it's single-threaded only, pinning it to a virtual CPU turbo-boost can be used optimally)
and that command just hangs without any output message in dmesg.
Incidentally I also added uksm with this new built kernel but I'm pretty sure it cannot be the reason for that hang
Thanks
kernelOfTruth
Forgot to add that it cannot be interrupted/killed via
DeleteCTRL + c
I can't reproduce that problem here with any other applications being started with taskset (though I don't have eix.)
DeleteNote also I've added 2 patches into the pending directory for upcoming fixes:
http://ck.kolivas.org/patches/muqss/4.0/4.8/Pending/
Not enough has accumulated yet for a 105 but it won't be long.
I'm appending the following options (posting only a subset that seems relevant):
Deleteintel_iommu=on slub_nomerge memory_corruption_check=1 threadirqs nomodeset cgroup_disable=memory
threadirqs (forced threading) exhibited strange behavior before on not entirely stable kernels so it might facilitate triggering this
Thanks
Just got the following messages in dmesg:
Delete[ 1433.540818] NOHZ: local_softirq_pending 02
[ 1433.541105] NOHZ: local_softirq_pending 02
[ 1581.686378] NOHZ: local_softirq_pending 02
[ 1581.687366] NOHZ: local_softirq_pending 02
[ 1581.688365] NOHZ: local_softirq_pending 02
[ 1592.690111] NOHZ: local_softirq_pending 202
[ 1592.690291] NOHZ: local_softirq_pending 202
[ 1608.833740] NOHZ: local_softirq_pending 02
[ 1620.212481] NOHZ: local_softirq_pending 02
[ 1621.462443] NOHZ: local_softirq_pending 02
(probably coinciding with the compilation of firefox right now)
never saw those messages before
Not sure if muqss104-deref_safe.patch addresses these
I've seen them many years ago with certain RCU options. They seem to have come back with 4.8, and I'm not sure it's anything related to my code (though given my code is so different and new it could be doing anything.)
DeleteAfter a bit more playing I was able to reproduce your hanging affinity based problem. Will look into it further, thanks.
Delete@kernelOfTruth, FWIW, until Con sorts out your issue, eix-sync seems to be OK used with your taskset command, and I think it accomplishes the same thing.
ReplyDeleteThanks, I'll take a look into it :)
Delete@kernelOfTruth:
ReplyDeleteI'm using threadirqs, too, since you've advertised it many months ago. Is it still of benefit or passed over by actual development? (Maybe I suffer from not-booting sometimes.)
BR, Manuel Krause
It's useful if you're dependent on lower latency audio playback or e.g. gaming
Deletefor example:
psnd=$(pgrep "irq/.*-snd_")
[[ -n $psnd ]] && chrt -f -p 85 $psnd
pnvkm=$(pgrep "irq/.*-i915")
[[ -n $pi915 ]] && chrt -f -p 84 $pi915
pnvidia=$(pgrep "irq/.*-nvidia")
[[ -n $pnvidia ]] && chrt -f -p 84 $pnvidia
@kernelOfTruth:
DeleteThank you again for sharing your knowledge. Does the threadirqs by itself, without making use of it, impose any overhead of which you know?
Unfortunately I seem to have lost the links regarding this topic from those times. Can you tell me, how I can verify that the above commands do what they should?
Thank you in advance and
BR, Manuel Krause
Manuel,
Deletehere's the modified script I'm using right now:
https://gist.github.com/kernelOfTruth/856f41450bdd8cbfcd5c55760e4b750e
as far as I know all was started with "blebs" (ArchLinux forums), known under different name on Gentoo forums who posted the basis of the script
Cheers
Running any of the commented out two ps commands from above in that script should get you what you want to know
DeleteHave you considered using a lock-free queue for the runqueue? It could eliminate some overhead. For example, there is a C++ implementation of a lock-free concurrent (multiple producer, multiple consumer) queue in Facebook's Folly library: https://github.com/facebook/folly/blob/master/folly/docs/Overview.md
ReplyDeleteThanks for the suggestion. I've used lockless components as much as possible already but the scheduler doesn't lend itself to the use of lockless storage for the most crucial data because they have to have a system-wide notion of the current 'truth' of that data at exactly the same time.
DeleteInteresting. I'm currently doing research on automated concurrency and think there may be some data structures that could lend themselves to this kind of scheduling. Perhaps we can talk more offline about it. I think in a few weeks when I get some free time I will take a serious look at the MuQSS code.
DeleteHmmm... “mux”, eh? All I can think of, is “mucusss”, heh. :]
ReplyDeleteBeen using BFS exclusively here, for most of my journey with Linux, and it's quite exciting to see development such as this take place.
Will need to give this a go very soon.
Many thanks to you and your work!