-ck hacking: First MuQSS Throughput Benchmarks

Tuesday, 18 October 2016

First MuQSS Throughput Benchmarks

The short version graphical summary:

Red = MuQSS 112 interactive off
Purple = MuQSS 112 interactive on
Blue = CFS

The detail:
http://ck.kolivas.org/patches/muqss/Benchmarks/20161018/

I went on a journey looking for meaningful benchmarks to conduct to assess the scalability aspect as far as I could on my own 12x machine and was really quite depressed to see what the benchmark situation on linux is like. Only the old and completely invalid benchmarks seem to still be hanging around in public sites and promoted, like Reaim, aim7, dbench, volanomark, etc. and none of those are useful scalability benchmarks. Even more depressing was the only ones with any reputation are actually commercial benchmarks costing hundreds of dollars.

This made me wonder out loud just how the heck mainline is even doing scalability improvements if there are precious few valid benchmarks for linux and no one's using them. The most promising ones, like mosbench, need multiple machines and quite a bit of set up to get them going.

I spent a day wading through the phoronix test suite - a site and its suite not normally known for meaningful high performance computing discussion and benchmarks - looking for benchmarks that could be used for meaningful results for multicore scalability assessment and were not too difficult to deploy and came up with the following collection:

John The Ripper - a CPU bound application that is threaded to the number of CPUs and intermittently drops to one thread making for slightly more interesting behaviour than just a fully CPU bound workload.

7-Zip Compression - a valid real world CPU bound application that is threaded but rarely able to spread out to all CPUs making it an interesting light load benchmark.

ebizzy - This emulates a heavy content delivery server load which scales beyond the number of CPUs and emulates what goes on between a http server and database.

Timed Linux Kernel Compilation - A perennial favourite because it is a real world case and very easy to reproduce. Despite numerous complaints about its validity as a benchmark, it is surprisingly consistent in its results and tests many facets of scalability, though does not scale to use all CPUs at all time either.

C-Ray - A ray tracing benchmark that uses massive threading per CPU and is completely CPU bound but overloads all CPUs.

Primesieve - A prime number generator that is threaded to the number of CPUs exactly, is fully CPU bound and is cache intensive.

PostgreSQL pgbench - A meaningful database benchmark that is done at 3 different levels - single threaded, normal loaded and heavily contended, each testing different aspects of scalability.

And here is a set of results comparing 4.8.2 mainline (labelled CFS), MuQSS 112 in interactive mode (MuQSS-int1) and MuQSS 112 in non-interactive mode (MuQSS-int0):

http://ck.kolivas.org/patches/muqss/Benchmarks/20161018/

It's worth noting that there is quite a bit of variance in these benchmarks and some are bordering on the difference being just noise. However there is a clear pattern here - when the load is light, in terms of throughput, CFS outperforms MuQSS. When load is heavy, the heavier it gets, MuQSS outperforms CFS, especially in non-interactive mode. As a friend noted, for the workloads where you wouldn't be running MuQSS in interactive mode, such as a web server, database etc, non-interactive mode is of clear performance benefit. So at least on the hardware I had available to me, on a 12x machine, MuQSS is scaling better than mainline on these workloads as load increases.

The obvious question people will ask is why MuQSS doesn't perform better at light loads, and in fact I have an explanation. The reason is that mainline tends to cling to processes much more so that if it is hovering at low numbers of active processes, they'll all cluster on one CPU or fewer CPUs than being spread out everywhere. This means the CPU benefits more from the turbo modes virtually all newer CPUs have, but it comes at a cost. The latency to tasks is greater because they're competing for CPU time on fewer busy CPUs rather than spreading out to idle cores or threads. It is a design decision in MuQSS, as taken from BFS, to always spread out to any idle CPUs if they're available, to minimise latency, and that's one of the reasons for the interactivity and responsiveness of MuQSS. Of course I am still investigating ways of closing that gap further.

Hopefully I can get some more benchmarks from someone with even bigger hardware, and preferably with more than one physical package since that's when things really start getting interesting. All in all I'm very pleased with the performance of MuQSS in terms of scalability on these results, especially assuming I'm able to maintain the interactivity of BFS which were my dual goals.

There is MUCH more to benchmarking than pure throughput of CPU - which is almost the only thing these benchmarks is checking - but that's what I'm interested in here. I hope that providing my list of easy to use benchmarks and the reasoning behind them can generate interest in some kind of meaningful standard set of benchmarks. I did start out in kernel development originally after writing and being a benchmarker :P

To aid that, I'll give simple instructions here for how to ~imitate the benchmarks and get results like I've produced above.

Download the phoronix test suite from here:
http://www.phoronix-test-suite.com/

The generic tar.gz is perfectly fine. Then extract it and install the relevant benchmarks like so:


tar xf phoronix-test-suite-6.6.1.tar.gz

cd phoronix-test-suite

./phoronix-test-suite install build-linux-kernel c-ray compress-7zip ebizzy john-the-ripper pgbench primesieve

./phoronix-test-suite default-run build-linux-kernel c-ray compress-7zip ebizzy john-the-ripper pgbench primesieve

Now obviously this is not ideal since you shouldn't run benchmarks on a multiuser login with Xorg and all sorts of other crap running so I actually always run benchmarks at init level 1.

Enjoy!
お楽しみ下さい
-ck

16 comments:

Unknown18 October 2016 at 22:27
Image is 404ing.
ReplyDelete
Replies
ck18 October 2016 at 22:28
I was busy fighting with the svg tags (I suck at html). Should work now though http, not https.
ReplyDelete
Replies
Unknown19 October 2016 at 00:49
Could you compare with bfs as well?
ReplyDelete
Replies
Anonymous19 October 2016 at 06:39
@ck:
Thank you for sharing your knowledge about most useful actual benchmarks. I've also tried the PTS some years ago, but didn't know what special tests lead to useful comparisons.
Can you tell me, how much disk space your three-kernel test setup was using (I'm a little limited regarding this atm.)?

BR, Manuel Krause
ReplyDelete
Replies
Anonymous19 October 2016 at 09:40
Thanks Con for these benchmark.
At last some numbers on a high end desktop !

About the way MuQSS and BFS spread tasks to idle cpu, can we assume these scheduler are less energy efficient than CFS under low workload ? Because that spread might interfere with optimal frequency scaling and power gating.

Pedro
ReplyDelete
Replies
Anonymous19 October 2016 at 17:41
@Pedro,

actually I keep an eye on power consumption because I use mux on my laptop as well as desktop and I can say that it's very good. If I remember correctly since last time I tested, it might be better than CFS. My observations are purely based on laptops battery time, I really notice that half an hour difference :)
If I compare standard Ubuntu kernel which has 250Hz, Voluntary Preempt vs my compilation of mux/bfs with 500Hz and Full Preempt, it's about the same if not better + interactivity is better when using the desktop, which I can't really measure :)

br, Eduardo
ReplyDelete
Replies
jwh720 October 2016 at 02:39
Con, I'm curious on your thoughts of why MuQSS with interactivity off didn't perform closer to CFS with 7-Zip Compression, when it compared well in the other areas? Caching difference?
ReplyDelete
Replies