Tuesday, 18 October 2016

First MuQSS Throughput Benchmarks

The short version, as a graphical summary:



Red = MuQSS 112 interactive off
Purple = MuQSS 112 interactive on
Blue = CFS

The detail:
http://ck.kolivas.org/patches/muqss/Benchmarks/20161018/

I went on a journey looking for meaningful benchmarks to assess the scalability aspect as far as I could on my own 12x machine, and was really quite depressed to see what the benchmark situation on Linux is like. Only the old and completely invalid benchmarks still seem to be hanging around on public sites and being promoted, like Reaim, aim7, dbench, volanomark, etc., and none of those are useful scalability benchmarks. Even more depressing, the only ones with any reputation are actually commercial benchmarks costing hundreds of dollars.

This made me wonder out loud just how the heck mainline is even doing scalability improvements if there are precious few valid benchmarks for Linux and no one's using them. The most promising ones, like mosbench, need multiple machines and quite a bit of setup to get them going.

I spent a day wading through the Phoronix Test Suite (a site and suite not normally known for meaningful high performance computing discussion and benchmarks) looking for benchmarks that could give meaningful results for multicore scalability assessment and were not too difficult to deploy. I came up with the following collection:

John The Ripper - a CPU bound application that is threaded to the number of CPUs and intermittently drops to one thread making for slightly more interesting behaviour than just a fully CPU bound workload.

7-Zip Compression - a valid real world CPU bound application that is threaded but rarely able to spread out to all CPUs making it an interesting light load benchmark.

ebizzy - This emulates a heavy content delivery server load which scales beyond the number of CPUs and emulates what goes on between an HTTP server and a database.

Timed Linux Kernel Compilation - A perennial favourite because it is a real world case and very easy to reproduce. Despite numerous complaints about its validity as a benchmark, it is surprisingly consistent in its results and tests many facets of scalability, though it does not scale to use all CPUs at all times either.

C-Ray - A ray tracing benchmark that uses massive threading per CPU and is completely CPU bound but overloads all CPUs.

Primesieve - A prime number generator that is threaded to the number of CPUs exactly, is fully CPU bound and is cache intensive.

PostgreSQL pgbench - A meaningful database benchmark that is done at 3 different levels - single threaded, normally loaded and heavily contended - each testing different aspects of scalability.

And here is a set of results comparing 4.8.2 mainline (labelled CFS), MuQSS 112 in interactive mode (MuQSS-int1) and MuQSS 112 in non-interactive mode (MuQSS-int0):

http://ck.kolivas.org/patches/muqss/Benchmarks/20161018/

It's worth noting that there is quite a bit of variance in these benchmarks, and for some of them the difference borders on being just noise. However, there is a clear pattern here: when the load is light, CFS outperforms MuQSS in terms of throughput. As the load gets heavier, MuQSS outperforms CFS, especially in non-interactive mode. As a friend noted, for the workloads where you wouldn't be running MuQSS in interactive mode, such as a web server, database etc., non-interactive mode is of clear performance benefit. So at least on the hardware I had available to me, a 12x machine, MuQSS is scaling better than mainline on these workloads as load increases.
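One rough way to judge whether a difference is just noise is to compare the gap between two kernels against the run-to-run spread of each. A minimal sketch, assuming the result of each run has been saved one number per line; the helper name mean_sd is made up for illustration, and the sample numbers are placeholders, not results from the runs above.

```shell
#!/bin/sh
# mean_sd (hypothetical helper): read one number per line on stdin and
# print "mean sd", where sd is the sample standard deviation.
mean_sd() {
    awk '{ n++; s += $1; ss += $1 * $1 }
         END {
             if (n < 2) { print "need at least 2 samples"; exit 1 }
             mean = s / n
             var = (ss - n * mean * mean) / (n - 1)
             if (var < 0) var = 0   # guard against rounding error
             printf "%.2f %.2f\n", mean, sqrt(var)
         }'
}

# Placeholder numbers, not measurements:
printf '10\n12\n14\n' | mean_sd   # prints "12.00 2.00"
```

If the means of two kernels differ by less than roughly one standard deviation, the result is probably not telling you much.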

The obvious question people will ask is why MuQSS doesn't perform better at light loads, and in fact I have an explanation. The reason is that mainline tends to cling to processes much more, so that when it is hovering at low numbers of active processes, they all cluster on one CPU, or on fewer CPUs, rather than being spread out everywhere. This means the CPU benefits more from the turbo modes virtually all newer CPUs have, but it comes at a cost: the latency to tasks is greater because they're competing for CPU time on fewer busy CPUs rather than spreading out to idle cores or threads. It is a design decision in MuQSS, as taken from BFS, to always spread out to any idle CPUs if they're available, to minimise latency, and that's one of the reasons for the interactivity and responsiveness of MuQSS. Of course I am still investigating ways of closing that gap further.
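To watch this clustering versus spreading on a live system, one option is to count running threads per CPU while a light load is active. A minimal sketch, assuming a procps-style ps that supports the psr (processor) and stat columns; the helper name per_cpu_running is made up for illustration.

```shell
#!/bin/sh
# per_cpu_running (hypothetical helper): read "psr stat" pairs on stdin and
# print "cpu count" for threads whose state starts with R (running/runnable).
per_cpu_running() {
    awk '$2 ~ /^R/ { c[$1]++ } END { for (k in c) print k, c[k] }' | sort -n
}

# Live usage (empty headers suppress the header line):
# ps -eL -o psr= -o stat= | per_cpu_running
```

Sampling this a few times under each kernel should show whether the running threads pile up on a few CPUs or land on as many as possible. It is only a snapshot, so treat it as a rough indication.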

Hopefully I can get some more benchmarks from someone with even bigger hardware, preferably with more than one physical package, since that's when things really start getting interesting. All in all I'm very pleased with the performance of MuQSS in terms of scalability on these results, especially assuming I'm able to maintain the interactivity of BFS; those were my dual goals.

There is MUCH more to benchmarking than pure throughput of CPU - which is almost the only thing these benchmarks are checking - but that's what I'm interested in here. I hope that providing my list of easy to use benchmarks and the reasoning behind them can generate interest in some kind of meaningful standard set of benchmarks. I did start out in kernel development originally after writing and being a benchmarker :P

To aid that, I'll give simple instructions here for how to roughly imitate the benchmarks and get results like the ones I've produced above.

Download the Phoronix Test Suite from here:
http://www.phoronix-test-suite.com/

The generic tar.gz is perfectly fine. Then extract it and install the relevant benchmarks like so:

tar xf phoronix-test-suite-6.6.1.tar.gz
cd phoronix-test-suite
./phoronix-test-suite install build-linux-kernel c-ray compress-7zip ebizzy john-the-ripper pgbench primesieve
./phoronix-test-suite default-run build-linux-kernel c-ray compress-7zip ebizzy john-the-ripper pgbench primesieve
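For repeating the same run on several kernels, the two commands above can be wrapped in a small script. A minimal sketch, assuming it is run from inside the extracted phoronix-test-suite directory as above; the PTS variable and the pts_commands helper are made up for illustration. It only prints the commands, so they can be inspected before anything is executed.

```shell
#!/bin/sh
# Tests named in this post.
TESTS="build-linux-kernel c-ray compress-7zip ebizzy john-the-ripper pgbench primesieve"
# Path to the launcher script; an assumption, adjust to where the tarball was extracted.
PTS="${PTS:-./phoronix-test-suite}"

# pts_commands (hypothetical helper): print the install and run commands, one per line.
pts_commands() {
    echo "$PTS install $TESTS"
    echo "$PTS default-run $TESTS"
}

pts_commands
```

Piping the output to sh (pts_commands | sh) runs the two steps in order.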


Now obviously this is not ideal, since you shouldn't run benchmarks on a multiuser login with Xorg and all sorts of other crap running, so I actually always run benchmarks at runlevel 1 (init 1).
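A quick sanity check before starting a run can catch a forgotten graphical session. A minimal sketch, assuming the system provides the runlevel command, whose output has the form "previous current" (for example "N 5"); both helper names are made up for illustration.

```shell
#!/bin/sh
# current_runlevel (hypothetical helper): take the output of `runlevel`
# ("<previous> <current>") and print only the current level.
current_runlevel() {
    echo "${1##* }"
}

# warn_if_busy (hypothetical helper): report whether the given runlevel
# output indicates single-user mode.
warn_if_busy() {
    level="$(current_runlevel "$1")"
    case "$level" in
        1|S|s) echo "ok: single-user mode" ;;
        *)     echo "warning: runlevel $level, other services may skew results" ;;
    esac
}

# Live usage:
# warn_if_busy "$(runlevel)"
```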

Enjoy!
Please enjoy
-ck

16 comments:

  1. I was busy fighting with the svg tags (I suck at html). Should work now through http, not https.

  2. @ck:
    Thank you for sharing your knowledge about the most useful current benchmarks. I also tried the PTS some years ago, but didn't know which particular tests lead to useful comparisons.
    Can you tell me how much disk space your three-kernel test setup used? (I'm a little limited in that regard atm.)

    BR, Manuel Krause

    1. That selection of benchmarks added only about 600MB.

    2. @ck:
      Such a low amount, even with all the kernel compilation results? If so, consider the question withdrawn. That much disk space is available on my ext4 partitions; I'd had nightmarish thoughts of several GB.
      Did you use any special settings for the output? I'm mainly interested in meaningful graph output.
      I'm really eager to test the 4.7.8 MuQSS vs. Alfred's latest VRQ for 4.7 with that same setup.

      Another question: During your journey through the current Linux benchmarking world, did you find a "benchmark" that is able, or claims to be able, to "measure" responsiveness/interactivity? Of course, you know why I use so many '"' in this question.

      Thanks in advance and BR,
      Manuel Krause

    3. The only meaningful interactivity benchmark is my interbench, and I'm about the only person in the world who understands its results. It is also unmaintained and does not properly work on newer SMP systems so don't bother.

    4. In fact, maybe it's time I gave interbench some love and attention to make it work properly today.

    5. @ck:
      Nice to read your thinking... :-)
      'interbench', I haven't read anything about it in a long time.
      As it's supposed to be data based, how can you make an algorithm capture interactivity values?
      BR, Manuel Krause

    6. Interbench comparison between plain / bfq and ck kernels back in 2015.
      http://hastebin.com/yitogoweso.sql

      >I'm about the only person in the world who understands its results..

      /me too!

    7. Great! I've finally updated interbench and have uploaded a git tree. There are still some outliers in MuQSS that I'm trying to fix, so while it beats mainline on deadlines met, there is still some bug behind the worst latencies that I'm looking for.

      https://github.com/ckolivas/interbench

  3. Thanks Con for these benchmarks.
    At last some numbers on a high end desktop!

    About the way MuQSS and BFS spread tasks to idle CPUs, can we assume these schedulers are less energy efficient than CFS under low workload? That spreading might interfere with optimal frequency scaling and power gating.

    Pedro

  4. @Pedro,

    actually I keep an eye on power consumption because I use mux on my laptop as well as desktop and I can say that it's very good. If I remember correctly from the last time I tested, it might be better than CFS. My observations are purely based on laptop battery time, I really notice that half an hour difference :)
    If I compare standard Ubuntu kernel which has 250Hz, Voluntary Preempt vs my compilation of mux/bfs with 500Hz and Full Preempt, it's about the same if not better + interactivity is better when using the desktop, which I can't really measure :)

    br, Eduardo

    1. @Eduardo

      Thanks for the info. Such a big difference is surprising.
      From a quick internet search, AMD Phenom II doesn't have power gating on unused cores. So maybe MuQSS's design to spread load to idle cpus has a low/no impact on power on your hardware.
      I'll try to experiment with powertop and turbostat to see how it goes on my ivybridge, but I fear these tools are not precise enough.

      Pedro

    2. @Pedro,

      I keep an eye on power consumption on laptop only, coz I work on battery quite often. Btw I have Ivy on laptop as well ;)
      I have to try recent mainline kernel to check what is the actual difference _now_ with the same exact kernel version, that will be fair. Btw I use ondemand almost all the time, pstate gives me unpredictable/lower battery life every time I try it in hope it's better, which is for every new major kernel version. No luck so far. I haven't tried 4.8 really, I wait for BFQ patches.
      For Phenom, I just pay the electricity bill, no need to measure anything:)

      Br, Eduardo

  5. Con, I'm curious about your thoughts on why MuQSS with interactivity off didn't perform closer to CFS on 7-Zip Compression, when it compared well in the other areas? Caching difference?
