The results for mainline 4.8.4 and 4.8.4-ck4 on a multithreaded hexcore (init 1) can be found here:
http://ck.kolivas.org/patches/muqss/Benchmarks/20161024/
and are copied below. I do not have swap on this machine, so the "memload" test was not performed. This is a 3.6GHz hexcore with 64GB of RAM and fast Intel SSDs, so showing any difference at all on hardware this fast is significant. To make comparison easier, I've highlighted the results in colours similar to the throughput benchmarks I posted previously: blue means within 1% of each other, red significantly worse, and green significantly better.
Load set to 12 processors
Using 4008580 loops per ms, running every load for 30 seconds
Benchmarking kernel 4.8.4 at datestamp 201610242116
Comment: cfs

--- Benchmarking simulated cpu of Audio in the presence of simulated ---
Load        Latency +/- SD (ms)   Max Latency   % Desired CPU   % Deadlines Met
None          0.1 +/- 0.1             0.1            100              100
Video         0.0 +/- 0.0             0.1            100              100
X             0.1 +/- 0.1             0.1            100              100
Burn          0.0 +/- 0.0             0.0            100              100
Write         0.1 +/- 0.1             0.1            100              100
Read          0.1 +/- 0.1             0.1            100              100
Ring          0.0 +/- 0.0             0.1            100              100
Compile       0.0 +/- 0.0             0.0            100              100

--- Benchmarking simulated cpu of Video in the presence of simulated ---
Load        Latency +/- SD (ms)   Max Latency   % Desired CPU   % Deadlines Met
None          0.1 +/- 0.1             0.1            100              100
X             0.1 +/- 0.1             0.1            100              100
Burn         17.4 +/- 19.5           46.3             87             7.62
Write         0.1 +/- 0.1             0.1            100              100
Read          0.1 +/- 0.1             0.1            100              100
Ring          0.0 +/- 0.0             0.0            100              100
Compile      17.4 +/- 19.1           45.9           89.5             6.07

--- Benchmarking simulated cpu of X in the presence of simulated ---
Load        Latency +/- SD (ms)   Max Latency   % Desired CPU   % Deadlines Met
None          0.0 +/- 0.1             1.0            100             99.3
Video        13.4 +/- 25.8           68.0           36.2             27.3
Burn         94.4 +/- 127.0         334.0           12.9             4.37
Write         0.1 +/- 0.4             4.0           97.4             96.4
Read          0.1 +/- 0.7             4.0           96.2             93.8
Ring          0.5 +/- 1.9             9.0           89.3             84.9
Compile      93.3 +/- 127.7         333.0           12.2              4.2

--- Benchmarking simulated cpu of Gaming in the presence of simulated ---
Load        Latency +/- SD (ms)   Max Latency   % Desired CPU
None          0.0 +/- 0.2             2.2            100
Video         7.9 +/- 21.4           69.3           92.7
X             1.4 +/- 1.6             2.7           98.7
Burn        136.5 +/- 145.3         360.8           42.3
Write         1.8 +/- 2.0             4.4           98.2
Read         11.2 +/- 20.3           47.8           89.9
Ring          8.1 +/- 8.1             8.2           92.5
Compile     152.3 +/- 166.8         346.1           39.6
Load set to 12 processors
Using 4008580 loops per ms, running every load for 30 seconds
Benchmarking kernel 4.8.4-ck4+ at datestamp 201610242047
Comment: muqss116-int1

--- Benchmarking simulated cpu of Audio in the presence of simulated ---
Load        Latency +/- SD (ms)   Max Latency   % Desired CPU   % Deadlines Met
None          0.0 +/- 0.0             0.0            100              100
Video         0.0 +/- 0.0             0.0            100              100
X             0.0 +/- 0.0             0.0            100              100
Burn          0.0 +/- 0.0             0.0            100              100
Write         0.0 +/- 0.0             0.1            100              100
Read          0.0 +/- 0.0             0.0            100              100
Ring          0.0 +/- 0.0             0.0            100              100
Compile       0.0 +/- 0.1             0.8            100              100

--- Benchmarking simulated cpu of Video in the presence of simulated ---
Load        Latency +/- SD (ms)   Max Latency   % Desired CPU   % Deadlines Met
None          0.0 +/- 0.0             0.0            100              100
X             0.0 +/- 0.0             0.0            100              100
Burn          3.1 +/- 7.2            17.7            100             81.6
Write         0.0 +/- 0.0             0.5            100              100
Read          0.0 +/- 0.0             0.0            100              100
Ring          0.0 +/- 0.0             0.0            100              100
Compile      10.5 +/- 13.3           19.7            100             37.3

--- Benchmarking simulated cpu of X in the presence of simulated ---
Load        Latency +/- SD (ms)   Max Latency   % Desired CPU   % Deadlines Met
None          0.0 +/- 0.1             1.0            100             99.3
Video         3.7 +/- 12.1           56.0             89             82.6
Burn         47.2 +/- 66.5          142.0           16.7             7.58
Write         0.1 +/- 0.5             5.0           97.7             95.7
Read          0.1 +/- 0.7             4.0           95.6             93.5
Ring          0.5 +/- 1.9            12.0           89.8               86
Compile      55.9 +/- 77.6          196.0           18.6             8.12

--- Benchmarking simulated cpu of Gaming in the presence of simulated ---
Load        Latency +/- SD (ms)   Max Latency   % Desired CPU
None          0.0 +/- 0.1             0.5            100
Video         1.2 +/- 1.2             1.8           98.8
X             1.4 +/- 1.6             2.9           98.7
Burn        130.9 +/- 132.1         160.3           43.3
Write         2.4 +/- 2.5             7.0           97.7
Read          3.2 +/- 3.2             3.6           96.9
Ring          5.9 +/- 6.2            10.3           94.4
Compile     146.5 +/- 149.3         209.2           40.6
As you can see, in the few places where mainline comes out ahead, the difference is less than 1%, which is within the noise margin. MuQSS meets more deadlines, gives the benchmarked task more of its desired CPU, and has substantially lower maximum latencies.
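For anyone unfamiliar with those columns, the headline numbers are just simple statistics over per-event wakeup latencies. Here is a minimal C sketch of how such figures can be derived; the names and the 16.7ms (60fps) deadline are illustrative assumptions, not the benchmark's actual code:

/* Minimal sketch of how the headline metrics above can fall out of raw
 * per-event wakeup latencies. Illustrative only, NOT the benchmark source. */
#include <stdio.h>

struct metrics {
	double mean_ms, max_ms;    /* average and worst-case latency */
	double deadlines_met_pct;  /* events serviced before their deadline */
};

static struct metrics summarise(const double *lat_ms, int n, double deadline_ms)
{
	struct metrics m = { 0 };
	int met = 0;

	for (int i = 0; i < n; i++) {
		m.mean_ms += lat_ms[i];
		if (lat_ms[i] > m.max_ms)
			m.max_ms = lat_ms[i];
		if (lat_ms[i] <= deadline_ms)
			met++;
	}
	m.mean_ms /= n;
	m.deadlines_met_pct = 100.0 * met / n;
	return m;
}

int main(void)
{
	/* e.g. five simulated video frames against a 16.7ms (60fps) deadline */
	double samples[] = { 0.1, 0.0, 17.4, 46.3, 0.1 };
	struct metrics m = summarise(samples, 5, 16.7);

	printf("%.1f avg, %.1f max, %.1f%% deadlines met\n",
	       m.mean_ms, m.max_ms, m.deadlines_met_pct);
	return 0;
}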
I'm reasonably confident that I've maintained the interactivity people have come to expect from BFS in the transition to MuQSS, and the data above supports it.
Enjoy!
-ck
Thanks for these benchmark results, very interesting. One thing I've been curious about is the inherent context switch overhead, and it seems BFS is still slightly better.
For example with v116 and switching isolated to core #4:
$ taskset -c 4 perf bench sched pipe -T
# Running 'sched/pipe' benchmark:
# Executed 1000000 pipe operations between two threads
Total time: 1.773 [sec]
1.773453 usecs/op
563871 ops/sec
With BFS-v512+ I (repeatably) see ~1.55µs/op and correspondingly a higher absolute number of context switches/sec. Any idea whether this is an inherent consequence of multiple runqueues, skiplist overhead or something else?
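For context, perf's sched pipe benchmark is essentially two tasks bouncing a token through a pair of pipes, so every operation forces context switches. A minimal standalone sketch of the same idea, using processes rather than the -T threaded mode; the loop count and timing details are assumptions, not perf's actual source:

/* Rough sketch of what 'perf bench sched pipe' exercises: two tasks
 * bouncing one byte through a pair of pipes, forcing context switches
 * on every hop. Not perf's implementation. */
#include <stdio.h>
#include <unistd.h>
#include <sys/wait.h>
#include <time.h>

#define LOOPS 1000000

int main(void)
{
	int p1[2], p2[2];
	char buf = 0;
	struct timespec t0, t1;

	if (pipe(p1) || pipe(p2)) {
		perror("pipe");
		return 1;
	}

	clock_gettime(CLOCK_MONOTONIC, &t0);
	if (fork() == 0) {
		/* child: read from p1, echo back on p2 */
		for (int i = 0; i < LOOPS; i++) {
			read(p1[0], &buf, 1);
			write(p2[1], &buf, 1);
		}
		_exit(0);
	}
	/* parent: write to p1, wait for the echo on p2 */
	for (int i = 0; i < LOOPS; i++) {
		write(p1[1], &buf, 1);
		read(p2[0], &buf, 1);
	}
	wait(NULL);
	clock_gettime(CLOCK_MONOTONIC, &t1);

	double secs = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
	printf("Total time: %.3f [sec], %.6f usecs/op\n",
	       secs, secs * 1e6 / LOOPS);
	return 0;
}

Since each round trip is pure wakeup/switch work, usecs/op tracks scheduler switch overhead fairly directly.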
It's the extra overhead of the multiple runqueue examination that happens on every switch. It should be slightly less with interactive disabled, but will still be more than BFS, which only needs to examine one runqueue.
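To make that concrete, here is a toy C sketch of the two lookup patterns; the types and names are invented for illustration, and the real schedulers add locking, skiplist traversal and cache-locality ordering that are all omitted here:

/* Toy illustration (not kernel code): a BFS-style pick inspects one
 * shared queue, while a MuQSS-style pick peeks at every CPU's queue
 * and keeps the earliest deadline seen. */
#include <stdio.h>

struct task { unsigned long long deadline; struct task *next; };
struct runqueue { struct task *head; };

/* BFS-style: one global queue, a single head inspection. */
static struct task *pick_bfs(struct runqueue *global)
{
	return global->head; /* queue kept deadline-sorted on insert */
}

/* MuQSS-style: one queue per CPU; every switch scans all of them. */
static struct task *pick_muqss(struct runqueue *rq, int ncpus)
{
	struct task *best = NULL;

	for (int cpu = 0; cpu < ncpus; cpu++) {
		struct task *t = rq[cpu].head;
		if (t && (!best || t->deadline < best->deadline))
			best = t;
	}
	return best; /* O(ncpus) head inspections instead of one */
}

int main(void)
{
	struct task a = { 30, NULL }, b = { 10, NULL }, c = { 20, NULL };
	struct runqueue global = { &b };               /* b sorted to the front */
	struct runqueue percpu[3] = { { &a }, { &b }, { &c } };

	printf("bfs picks deadline %llu, muqss picks deadline %llu\n",
	       pick_bfs(&global)->deadline,
	       pick_muqss(percpu, 3)->deadline);
	return 0;
}

The per-switch cost grows with the number of queues examined, which is the overhead showing up in the pipe benchmark above.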
Thanks, that's what I suspected (I understand the design tradeoff). Out of curiosity: have you considered a single per-physical-core runqueue shared between that core's vCPUs? I'm not sure if (or how much) that would just make scheduling more complicated, but it could be a nice compromise between a global single runqueue and one for every vCPU: sort of per-core BFS with "local SMT", if that makes sense.
As far as my admittedly limited (not an EE) understanding of SMT implementation internals goes, this would probably be friendlier for them as well.
Just a crazy idea.
It is not crazy at all; many of us have discussed it as a valid option for years. I had considered it numerous times before, but there are two things against it. One is that it ends up being more complex than one runqueue per CPU, and the second is that the number of cores and threads per CPU keeps rising and is now beyond the cutoff where BFS ran into contention problems (16+).
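For what it's worth, the grouping proposed above can at least be expressed with the topology the kernel already exports in sysfs. A hypothetical sketch of the mapping only; on multi-socket machines physical_package_id would also be needed to make core ids unique, and MuQSS does not actually do this, per the reply above:

/* Hypothetical: map each logical CPU to a runqueue shared by its
 * physical core, using the core_id the kernel exports in sysfs. */
#include <stdio.h>

int main(void)
{
	char path[128];
	int core_id;

	for (int cpu = 0; ; cpu++) {
		snprintf(path, sizeof(path),
			 "/sys/devices/system/cpu/cpu%d/topology/core_id", cpu);
		FILE *f = fopen(path, "r");
		if (!f)
			break; /* no more CPUs */
		if (fscanf(f, "%d", &core_id) == 1)
			printf("logical cpu%d -> shared runqueue for core %d\n",
			       cpu, core_id);
		fclose(f);
	}
	return 0;
}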
@ck:
Don't call me dumb, but how many runqueues do I have running with current MuQSS vs. the former BFS? And determined by what means?
I must have misunderstood your step to "multiple queues" if it didn't mean attaching runqueues to CPU cores / real entities.
BR, Manuel Krause
BFS had one queue. MuQSS has one queue per CPU.
So, if I have 2 cores on 1 CPU die, and these 2 co-exist without any SMT/HT capability, do I still have only one runqueue? More or less just like with the former BFS?
Please don't take this added question as a complaint, just as a user-wants-to-understand one.
Thx & BR, Manuel Krause
That's 2 runqueues: one for each logical CPU.
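So the runqueue count is simply the number of logical CPUs the kernel sees. A trivial sketch to check that count on any box, equivalent to running nproc:

/* One runqueue per logical CPU means the count is just the number of
 * CPUs online. sysconf() here is the POSIX/glibc way to read it. */
#include <stdio.h>
#include <unistd.h>

int main(void)
{
	long n = sysconf(_SC_NPROCESSORS_ONLN);
	printf("%ld logical CPUs -> %ld MuQSS runqueues (BFS: always 1)\n", n, n);
	return 0;
}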
I don't know where to post this so I'm putting it here...
Description:
linux-ck (P4 & CK) hard freezes when using the Google Talk plugin in Firefox.
Additional info:
* package version(s): 4.8.4-1-ck (i686)
* config and/or log files etc.: Manjaro Xfce
Steps to reproduce:
Attempt to make a phone call in Firefox Gmail (Hangouts)... the window opens and FREEZE!
Total system freeze. Only forcing power off works.
Works correctly on other kernels (4.8 & 4.9).
UPDATE: 4.8.4-4 P4 is working now... Good work!