Tuesday 11 October 2016

MuQSS - The Multiple Queue Skiplist Scheduler v0.111

Lots of bugfixes, lots of improvements, build fixes, you name it.

For 4.8:
4.8-sched-MuQSS_111.patch

For 4.7:
4.7-sched-MuQSS_111.patch

And in a complete departure from BFS, there is now a git tree, which suits constant development like this better than BFS's massive stable-release ports:

https://github.com/ckolivas/linux

Look in the pending/ directory to see all the patches that went into this or read the git changelog. In particular numerous warnings were fixed, throughput improved compared to 108, SCHED_ISO was rewritten for multiple queues, potential races/crashes were addressed, and build fixes for different configurations were committed.

I haven't been able to track down the bizarre latency issues reported by runqlat, and when I try to reproduce them myself I get nonsense latency values greater than the age of the earth, so I suspect an interface bug in the BPF reporting of values. It doesn't seem to affect actual latency in any way.

EDIT: Updated to version 0.111 which has a fix for suspend/resume.

Enjoy!
お楽しみ下さい
-ck

48 comments:

  1. Thanks Con.
    I've updated the results with muqss110, both throughput and runqlat.
    https://docs.google.com/spreadsheets/d/1ZfXUfcP2fBpQA6LLb-DP6xyDgPdFYZMwJdE0SQ6y3Xg/edit?usp=sharing
    I've put the old muqss108 results in the sheet 'dev 4.8 muqss' together with all old muqss releases.

    For the runqlat tests, did you apply https://git.kernel.org/cgit/linux/kernel/git/next/linux-next.git/commit/?id=58bfea9532552d422bde7afa207e1a0f08dffa7d
    Because runqlat is broken since 4.8-rc4 (https://github.com/iovisor/bcc/issues/728).

    Pedro

    1. Thanks as always. No I hadn't applied the patch to fix runqlat. The numbers certainly look much better behaved under 110 than 108 but that simply seems to be more accurate use of the sched_info code in the scheduler rather than performance actually improving, and there are still some very weird outliers with high latencies that just shouldn't happen. I suspect there's still some reporting bug rather than actual bad latencies so I'll keep investigating.

    2. Updated the results with interactive=0.
      I also added the results of the sysbench oltp benchmark at several thread counts.
      MuQSS is ~10% better than cfs here when threads > 1!

      Out of curiosity I also tested sysbench threads with the following command:
      sysbench --test=threads --num-threads=$th --thread-yields=10000 --thread-locks=2 run
      The results are not good at all for MuQSS, however these 'bad' results don't translate into bad 'real world' performance, as the other tests show.
      If they can help you improve MuQSS, I'll put them in a separate sheet. I know you do your own tests anyway.
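      A sweep over several thread counts can be scripted as below (a sketch: the helper name and the DRY_RUN guard are illustrative, and by default the commands are only printed, not run):

```shell
#!/bin/sh
# Sketch: sweep the sysbench threads test over several thread counts,
# using the same flags as the command above. DRY_RUN=1 (the default
# here) only prints each command; set DRY_RUN=0 to actually run them.
DRY_RUN=${DRY_RUN:-1}

run_threads_bench() {
    th=$1
    cmd="sysbench --test=threads --num-threads=$th --thread-yields=10000 --thread-locks=2 run"
    if [ "$DRY_RUN" = "1" ]; then
        echo "$cmd"
    else
        $cmd
    fi
}

for th in 1 2 4 8; do
    run_threads_bench "$th"
done
```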

      Pedro

    3. Thanks Pedro. By the way, enabling SMT nice actually lowers throughput slightly if you have a hyperthreaded CPU. It's designed to improve behaviour on a desktop but costs a few CPU cycles to have in there and unless you're using nice levels it will lower throughput. Additionally that second benchmark tests the use of sched_yield and as far as I'm concerned any results using that are meaningless so there's no point trying to optimise it.

    4. Yeah, I tested the impact of enabling SMT_NICE with bfs502. It's quite low.
      And thanks for the info on sysbench threads. Real world benchmarks matter most.

      Pedro

    5. Added the results of sysbench oltp for bfs512 and muqss110+interactive=0.

      And suspend/resume is indeed fixed on my pc with muqss111.

      Pedro

    6. Thanks Pedro. Those interactive=0 numbers for sysbench look very good indeed as the threads increase. Would be interesting to see what happens on much bigger hardware.

    7. After extensive testing I'm quite sure the results being returned by runqlat are bogus, because 1. it attaches kprobes to functions that don't even exist in MuQSS, and 2. if you add -P to make it separate out latencies by PID, none of them show the super high latencies that the full histogram plot returns.

    8. Thanks Con for having looked at this issue. Too bad this tool is broken with MuQSS because it is quite nice.
      Is it also broken with BFS, as I guess it shares code with MuQSS? If so, I'll remove the results from the spreadsheet.

      I also looked at cyclictest to 'quantify' latency with another program than runqlat, but I'm not sure it's the right tool, and it's broken on MuQSS anyway (the thread counter constantly overflows).

      Pedro

    9. I'm still looking into it to see if there's some way for the muqss code to be seen correctly by the tool to make the values more meaningful.

  2. @ck:
    My TOI port fails completely with v0.110, already when writing kernel data during the 1st hibernation attempt. I've bisected it down to muqss108-008-delay_cpu_switch.patch as the change introducing this.
    Maybe you'll find time to have a look at the introduced code.
    It's still with kernel 4.7.7.
    In the meantime, I'll now finally rebuild some BFQ/WBT/TOI-free kernels, stepwise, to see if they spit out messages/traces or such developers' delights. ;-)
    I really don't want to bother you with this, but we need a fix for this anyway.

    BR, Manuel Krause

    1. No, you're not the only one affected. The same bug will affect regular suspend/resume and needs to be fixed, and you've saved me the trouble of bisecting so I can work on it now, thanks.

    2. Thanks for not calling my reports/testing useless.
      Will wait with recompiling MuQSS until your next pending patch is available.

      Btw., even if you may not be too happy with me naming it: my recent 4.7.7 testing with Alfred's last SL VRQ full version and my other patch combo didn't lead to any of the failures that I've reported for MuQSS during the last weeks.

      Keep up your good mind and good work,
      BR, Manuel Krause

    4. @Con,

      I encountered the S3 resume bug as well: 4.7.7+mux110 can't resume.

      br, Eduardo

  3. I also noticed a fail to resume (from S3) on my x64 PC with my latest update (through patch 11 I think), while the previous (through patch 3) was fine. So good to see it's getting sorted out; thanks guys!

  4. @Con,

    I added 4.8.1 mux results to my Unigine benchmark gsheets. Performance is good.

    There is one more thing I wanted to write about.
    See, I use a VRQ kernel (which is Alfred's improvements over BFS) on my host machine, and I compile kernels in an Ubuntu (6 vCPU) VBox VM, which till yesterday had an Ubuntu kernel. After compiling MUX 110 I installed it in the VM just to see how it fares there.
    The results were astonishing: usually compilation of a 4.7.7 kernel takes ~ 2:15; now, with a VRQ host and a MUX guest kernel, compilation time is almost halved, ~ 1:15, which is a huuuuuge improvement for me.
    It seems not to matter whether MUX or VRQ is in either role; the thing is that the scheduler matters! I could see that the compilation used all 6 vCPUs at about 575% CPU usage; usually it does not go higher than 300-350%.
    Thanks!

    Br, Eduardo

    1. Speaking of VRQ, I noticed that Alfred's v4.8.1-vrq0 last listed fix "for smp_processor_id() preempt code usage in smpboot_thread_fn()" (bitbucket.org/alfredchen/linux-gc/commits/ae83d96cec28eea472fca06a2aef9e824373268f) was in the same section as the last MuQSS v108 patch15 reversion.
      Thoughts Con?

    2. Alfred has apparently reworked the patch (by adding the else-path preempt_enable()). Recalling history, the first version solved a failure during bootup that was reported on this blog. Sorry that I don't remember who reported it or which BFS revision it was.
      Maybe this patch is superseded by MuQSS' evolution and so no longer needed, while Alfred's VRQ code base may still require it. Just an idea.

      BR, Manuel Krause

    3. That preempt_enable() patch worked around what ended up being a bug in BFS, and since it introduces new problems it is no longer relevant in MuQSS, which doesn't have the same bug.

  5. Updated to version 0.111 which has a fix for suspend/resume, but otherwise has no performance/behavioural changes.

    1. Resume from S3 suspend working a-ok again; rock on, Con. :) Now to build i686, noSM{P,T}...

      @Manuel, you previously asked about in-kernel hibernation; have you tried S3 suspend/sleep? I've been using it for years on my x64 PC and i686 netbook; though I usually just shut down the latter, as its battery only holds up for a day in standby. It only has a 4 GB SSD, no swap, and 16 GB SD card (but 2 GB mem), so I've never attempted hibernation on it.

    2. @ck:
      Suspend-to-disk's absolute failure is gone. With my old complete setup it's still not reliable: the second suspend already failed, the same as written above.

      With this post I'm coming back to you on a stripped-down kernel, 4.7.7+MuQSS ONLY, which is a really bad experience:
      * disk I/O with cfq impacts the CPUs' responsiveness and video playback
      -> BFQ is known to improve this
      * standard, in-kernel suspend-to-disk is still a complete mess:
      When it takes 13 min until firefox is responsive again, I could just as well power off completely at night. It's a pity. With TOI it was responsive at once.

      With that slowness of in-kernel suspend-to-disk & resume, we won't ever find a culprit. IMHO, the culprit gets too much time to do its 'evil' work.

      BR, Manuel Krause

    3. Thanks Manuel. I'm not asking you to use cfq or in-kernel suspend long term. I only want to rule out either of those as the reason for any bugs that show up - though all the bugs seem to have gone away with 111.

    4. Yes, Con, sorry man! I've definitely understood it.

      As my testing is still quite "young" and may take several days for a realistic result, can you please post a short statement on whether the post-v0.111 commits on github are important fixes that may be of value in my case?
      Yesterday evening I was so disappointed with in-kernel suspend-to-disk's performance that I re-added my TOI to the setup. As it's known not to talk much to the logs, this may be a mistake, but I'll keep testing/watching anyway and maybe revert again.

      BR, and thanks for your patience,
      Manuel Krause

    6. @ck:
      *me shedding tears*
      Seems like I've trashed my TOI port with the transition from 4.7.5-6, where I needed to adjust it last time due to changes in kernel/power/snapshot.c
      Came to this conclusion as 4.7.7+MuQSS111+myTOI also failed, now at the 3rd hibernation :..-(
      Now running the kernel with the above mentioned changes reverted, to check the last-known-good with MuQSS 111.
      Btw., it's not nice that the TOI project has been unmaintained since the end of April, as it saves so much time when it works ;-)

      Too much offtopic stuff, I know,
      BR, Manuel Krause

    7. The extra patches in git after 111 should make no difference to performance so there is no need to add them and re-benchmark.

    8. @ck:
      O.k., me still weeping: the last revert didn't change anything. Failure on the 1st resume.
      After resume I got the same stacktrace as before reverting:
      http://pastebin.com/Ynk9k6X2
      Related TOI port patch (4.7.6+) for reference:
      https://workupload.com/file/MhfLnUf

      With Alfred's 4.7 VRQ I've never run into those.
      Hopefully you'll find some time to have a look.

      BR, Manuel Krause

    9. @ck:
      Also after adding the 12 new commits after 111 from github, failure and trace stay the same. Just FYI.

      BR and thanks in advance,
      Manuel Krause

  6. I have tested this on an older single-core (2.7GHz) machine and it's good, but there's some freezing when playing/changing songs while the CPU is under load.

    Xmorph.

  7. I have the old netbook running 4.8.1 v111 noSM{P,T} without issue (last -ck for it was 4.3.1!); I also now have the x64 PC running with the "clean up bind_zero" from git (fdd879d*700e92a) as well.

  8. Hi Con,

    Compiling a new kernel with 0.110 was also one of the fastest so far (less than 8 minutes),

    tested GRID Autosport and it was really smooth - no stuttering observable,

    video playback with MPV while running compiz (composited desktop) was also reasonably smooth,

    will test now v111 with pending changes from github

    Thanks A LOT !

    1. Con,

      kwin_x11 seems to work reliably and stable with compositing,

      but compiz (0.9.12.2*) is a different story:

      there are short multi-second hangs when switching between apps from time to time (via Alt-Tab) - when the previews of the windows/icons are shown [the content of the entire screen is NOT refreshed while in that state, it also doesn't react to input, then later quickly continues what was not processed]

      this got me worried already since under heavy stress the stalls tend to become longer

      up to the point now that 3 times in a row while cloning and checking out the repositories of an Android ROM (Omni, roughly 38 GB) the PC locked up

      2 times it could be rebooted via the Magic SysRq key: almost 1-2 minutes later it would react and reboot

      the 3rd time however it was stuck so deeply that only a reset via the reset button helped (luckily this case does have one!)


      It might partially be a configuration issue, since compiz via ccsm offers lots of options to tweak its behavior and misconfiguration (such that performance significantly degrades) is possible

      but it never got THAT bad that it would lock up due to scheduler issues (as far as I remember)

      The scheduler is at the state of 4.8 muqss of Oct 14th


      Could any of the changes from Oct 15th improve that situation?

      Does this ring a bell: race issues, scheduler issues or timing problems?


      Thanks

    2. Hi KoT. I have an idea what might be causing this and there's nothing in the git commits to improve it yet. Try setting interactive to 0.
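      For reference, a sketch of how to do that (tunable name/path assumed from the MuQSS documentation; requires root on a MuQSS-patched kernel):

```shell
# Query the assumed MuQSS interactive tunable (1 = interactive mode on):
sysctl kernel.interactive
# Disable it, trading the interactivity heuristics for deterministic throughput:
sysctl -w kernel.interactive=0
# Equivalent via procfs:
echo 0 > /proc/sys/kernel/interactive
```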

  9. Hi,
    It seems that there is some bad interaction between muqss 0.111 and the Nvidia blob driver.
    Today I switched to muqss and my PC froze for 10 seconds while playing Starcraft 2 with wine, then recovered. 2 hours later there was another freeze, not recoverable (the machine was still reachable via ssh).
    dmesg after the first freeze shows this:
    Oct 15 14:36:31 kernel: NVRM: GPU at PCI:0000:04:00: GPU-6b577e9d-dc5f-ad2d-f8a0-c23db512691f
    Oct 15 14:36:31 kernel: NVRM: Xid (PCI:0000:04:00): 8, Channel 0000002b
    Oct 15 14:36:33 kernel: NVRM: os_schedule: Attempted to yield the CPU while in atomic or interrupt context
    Oct 15 14:36:37 kernel: NVRM: Xid (PCI:0000:04:00): 56, CMDre 00000001 00000080 00000000 00000005 00000034

    No messages for the second freeze.
    With BFS, Starcraft 2+wine has always worked flawlessly.
    The only things that changed are: kernel 4.8 ==> 4.8.1 and BFS 0.512 ==> MuQSS 0.111 (no nvidia driver change).
    Now, I've switched back to 4.8.1 and BFS 0.512.

    1. Hi! If you happen to have time, you can try adding the 12 new commits on top of v0.111 (found here: https://github.com/ckolivas/linux/commits/4.8-muqss) and see if they improve things for you.

      BR, Manuel Krause

    2. Thanks for your reply, but it's definitely not BFS/MuQSS related as it happened again just now.
      I updated the nvidia driver 10 days ago, so it is probably the culprit.

    3. Let's listen to Con's reply on this. I only wanted to suggest a possible shortcut.

      BR, Manuel Krause

    4. What did you want me to say about it? He said he gets the bug without muqss or bfs so it's not my problem.

  10. With the last 12 pending commits on top of 111 applied, superuser applications don't get more cpu attention than any normal user program; they seem to get _lower_ attention.

    Please look on that, BR, Manuel Krause

    1. Superuser applications NEVER got more cpu than mainline. Only scheduling policy and nice levels matter.
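      As an illustrative sketch (the exact commands here are mine, not from the post; schedtool is a separate package):

```shell
# Under MuQSS, as in mainline, a task's CPU share comes from its nice
# level and scheduling policy, not from the uid it runs as.
nice -n 10 sh -c 'echo demo'      # lower the priority; no privilege needed
# sudo nice -n -10 make -j8       # raising priority requires privilege
# schedtool -I -e mpv movie.mkv   # SCHED_ISO (low-latency class), if schedtool is installed
```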

    2. What the heck is "no" supposed to mean? No to what?

    3. The possible scope is too wide. That's the reason for "no".
      My "root" login based processes are severely slowed down. Maybe due to the presence of IDLE_PRIO tasks.

      BR, Manuel Krause

  11. Found a nasty behavioural bug when sched_yield, which a lot of GPU drivers and compositing managers use, is called with interactive=1. It would lead to serious stalls and misbehaviour. It has been fixed in git and will be in the next released version, which I'm planning to do today.

  12. @Con,

    I encountered this in version 111: http://pastebin.com/cPzRSxFx
    The same old smp_processor_id bug.
    Is this fixed in version 112 as well?

    br, Eduardo
