Saturday 5 November 2016

linux-4.8-ck6, MuQSS version 0.135

Announcing a new version of MuQSS and a -ck release

4.8-ck6 patchset:

MuQSS by itself for 4.8:

MuQSS by itself for 4.7:

Git tree:

A week has passed since the last major update to BFS and -ck was posted, allowing me to concentrate on receiving and responding to any bug reports. As it turns out, there were very few apart from the recurring local_softirq_pending warning/stalls. This is nice because it means MuQSS is mostly ~stable now. Mainline has even had more "stable" releases in the same time as MuQSS for 4.8, moving to 4.8.6 in the interim.

In this version I've added aggressive handling of pending softirqs in the hope the warnings and stalls all go away. The true reason softirq handling is being dropped still escapes me, but it is likely related to the fact that MuQSS does a lot of lockless rescheduling across CPUs to decrease overhead, which does not give the guarantees that locking would.

Additionally, I've added a number of APIs to the kernel to do schedule timeouts specified in milliseconds. These use the highres timers, which are now mandatory for MuQSS. The reason for doing this is that there are many timeouts in the kernel that specify values below 10ms, and the timer resolution at 100Hz only guarantees timeouts under 20ms.

I've also done a code sweep across the entire kernel looking for timeout calls under 50ms and switched them to the new interface. Additionally, there are numerous places in the kernel where schedule_timeout(1) is used and a "minimum timeout" is expected, yet this is entirely Hz dependent, again being up to 20ms in duration. I've replaced all these with a 1ms timeout, emulating what would happen on a 1000Hz kernel, but without the overhead of running the higher Hz kernel. I'm not entirely sure this will equate to any real world improvements, but the fact it's used in things like audio drivers worries me that it might.
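The "up to 20ms" figure can be reproduced with a little arithmetic. The sketch below is a simplified model, not kernel code: it assumes a tick-based timeout is rounded up to whole ticks and may then wait one further tick, because the request can arrive anywhere inside the current tick. The helper names (ms_to_ticks, tick_timeout_worst_ms, ticks_to_ms) are hypothetical and exist only for illustration.

```c
/*
 * Simplified model (an assumption, not kernel code) of how long a
 * tick-based timeout can take. Only valid for hz values that divide
 * 1000 evenly, such as 100, 250 and 1000.
 */

/* Hypothetical helper: whole ticks needed to cover `ms` milliseconds. */
unsigned int ms_to_ticks(unsigned int ms, unsigned int hz)
{
	unsigned int ms_per_tick = 1000 / hz;

	/* Round up: a 1ms request still costs a full tick. */
	return (ms + ms_per_tick - 1) / ms_per_tick;
}

/*
 * Hypothetical helper: worst-case duration in ms of a tick-based
 * timeout. One extra tick is added because the request may land at
 * any point within the current tick.
 */
unsigned int tick_timeout_worst_ms(unsigned int ms, unsigned int hz)
{
	return (ms_to_ticks(ms, hz) + 1) * (1000 / hz);
}

/* Hypothetical helper: nominal length in ms of a raw N-tick timeout. */
unsigned int ticks_to_ms(unsigned int ticks, unsigned int hz)
{
	return ticks * (1000 / hz);
}
```

Under this model a 1ms request at 100Hz has a worst case of 20ms, against 2ms at 1000Hz, and a raw one-tick timeout is nominally 10ms at 100Hz but 1ms at 1000Hz. That gap is what a fixed 1ms highres timeout is meant to close.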

Finally, I've replaced the standard msleep call from userspace to use highres timers. Some userspace applications may expect msleep to give a sleep that actually resembles what was asked of it, rather than something Hz limited, and that assumption on the userspace coders' part could be leading to slowdowns. Calls to msleep() from userspace now give 100us accuracy at 100Hz instead of 20ms.

All these timing changes add overhead since they're trying to emulate the timing accuracy of running at 1000Hz but in a latency-focused scheduler I believe they're appropriate, and they do not incur the overhead that actually changing Hz would incur. Additionally they add accuracy to timers and timeouts that 1000Hz does not afford.

In the -ck tarball of broken-out patches, I've kept these timer changes separate to allow the muqss scheduler to be applied by itself should they prove problematic, and they will make merging with future kernels easier.



  1. Thanks. Fortunately harmless.

  2. Linux 4.8 MuQSS 0.135

    Get this message on startup:

    [ 242.153019] snd_hda_intel 0000:00:1b.0: IRQ timing workaround is activated for card #0. Suggest a bigger bdl_pos_adj.

    System hangs up whenever I have osu! running in Wine or JACK w Audacious (and may need some heavy CPU app in order to trigger this) on battery

    Good news is the softirq messages are all gone

    1. What do you mean by "hangs up?" Do you mean it hangs completely needing hard reset? Also does JACK run realtime audio? If so, then there is no way to prevent it from hanging up the machine should something go wrong.

    2. Yes, you need a hard reset to get out of it; sysrq won't work for some reason.

      I have configured osu! to interact with the soundcard directly instead of going through JACK (which means JACK has to be closed as well) and the hang still occurs (wine runs under SCHED_NORMAL)

    3. Well that sucks. What nohz config are you using? Tickless idle? Full dynticks? Full on all CPUs? Detect full system idle state?

    4. Here is my CONFIG_NO_HZ

      # CONFIG_NO_HZ_FULL is not set
      # CONFIG_NO_HZ is not set
      # CONFIG_RCU_FAST_NO_HZ is not set

    5. I just restarted the machine and tried again. The osu! audio now appears to be glitched: a part of the audio was repeated until the game was quit. After it was quit, it turned into 'Z' state with a child (which is itself) in 'R' state. Neither the parent nor the child can be killed with SIGKILL

    6. Would the audio be trying to get realtime priority somehow? If so it might be getting SCHED_ISO instead. Try disabling sched iso by setting its percentage to zero to see if that helps?
      echo 0 > /proc/sys/kernel/iso_cpu

    7. I managed to run osu! without a system lockup by luck :P. But after a few seconds the osu! audio starts to repeat again and again. If left like that, osu! will cause the whole system to hang after a few seconds; if quit, osu! turns into a zombie process with a child, which is itself stuck in "Running" state and could not be killed by any means (even SIGKILL).

    8. Please ignore the comment above, turns out internet was playing tricks with me :P

    9. I disabled SCHED_ISO and the problem still occurs

    10. Must be some magic combination of that software and your hardware/drivers that recreate this issue since I can't reproduce it anywhere here... Sorry I have no other leads. Try recording top output in batch mode until it goes wrong to see what is happening there.

    11. I hard crash every time I open osu! now, but I have been able to delay these hangs by changing the sample format, sample rate, resample method, and realtime scheduling with pulseaudio.

      It hard crashes on all settings, without fail, when the sound distorts on note clicks for this song: Some songs play okay on default settings, but crash when the map becomes complex.

      I've had it hard crash watching a Youtube video at some point.

    12. Y U NO post GPU driver, graphics card, browser, acceleration, mainboard, etc. info ?

  3. And another one:

    [ 0.064678] TSC deadline timer enabled
    [ 0.064680] smpboot: CPU0: Intel(R) Xeon(R) CPU E3-1245 v3 @ 3.40GHz (family: 0x6, model: 0x3c, stepping: 0x3)
    [ 0.064689] Performance Events: PEBS fmt2+, Haswell events, 16-deep LBR, full-width counters, Intel PMU driver.
    [ 0.064716] ... version: 3
    [ 0.064722] ... bit width: 48
    [ 0.064728] ... generic registers: 4
    [ 0.064734] ... value mask: 0000ffffffffffff
    [ 0.064740] ... max period: 0000ffffffffffff
    [ 0.064746] ... fixed-purpose events: 3
    [ 0.064751] ... event mask: 000000070000000f
    [ 0.078104] NMI watchdog: Disabling watchdog on nohz_full cores by default
    [ 0.091452] x86: Booting SMP configuration:
    [ 0.091458] .... node #0, CPUs: #1
    [ 0.176792] ------------[ cut here ]------------
    [ 0.176811] WARNING: CPU: 1 PID: 16 at kernel/sched/cputime.c:721 get_vtime_delta+0x87/0xb2
    [ 0.176818] Modules linked in:
    [ 0.176826] CPU: 1 PID: 16 Comm: migration/1 Not tainted 4.8.6_dtop-I.16 #1
    [ 0.176832] Hardware name: ASUS All Series/P9D WS, BIOS 2202 05/14/2015
    [ 0.176840] 0000000000000086 00000000175948c5 ffff9bdffa8e3cf0 ffffffff8956df44
    [ 0.176847] 0000000000000000 0000000000000000 ffff9bdffa8e3d30 ffffffff89121073
    [ 0.176855] 000002d100000000 0032dcd2b6161e57 0000000000000000 ffff9bdffa898000
    [ 0.176862] Call Trace:
    [ 0.176872] [] dump_stack+0x4d/0x63
    [ 0.176879] [] __warn+0xc5/0xe0
    [ 0.176886] [] warn_slowpath_null+0x18/0x1a
    [ 0.176893] [] get_vtime_delta+0x87/0xb2
    [ 0.176900] [] vtime_account_idle+0x9/0x13
    [ 0.176907] [] vtime_common_task_switch+0x16/0x28
    [ 0.176914] [] finish_task_switch+0xbb/0x2da
    [ 0.176922] [] __schedule+0x8c8/0xb90
    [ 0.176928] [] ? preempt_schedule+0x1e/0x20
    [ 0.176937] [] ? ___preempt_schedule+0x16/0x18
    [ 0.176945] [] ? sort_range+0x1d/0x1d
    [ 0.176951] [] schedule+0x86/0xce
    [ 0.176959] [] __kthread_parkme+0x39/0x5c
    [ 0.176965] [] kthread+0xd6/0xe4
    [ 0.176972] [] ret_from_fork+0x1f/0x40
    [ 0.176979] [] ? kthread_create_on_node+0x1ac/0x1ac
    [ 0.176986] ---[ end trace fb7f61e5ef93b6b1 ]---
    [ 0.177115] #2 #3 #4 #5 #6 #7
    [ 0.683611] x86: Booted up 1 node, 8 CPUs
    [ 0.683626] smpboot: Total of 8 processors activated (54301.91 BogoMIPS)

  4. Here is the log you requested:

    Also, the first time I tried to record the log, the system stuttered and the mouse pointer jumped around. dmesg didn't show anything unusual, but rebooting got stuck forever at the remount RO state, so I used sysrq to shut down. Here is the log from that time:

    1. Very weird. The only thing I can see unusual going on in that toplog is a realtek usb card reader that appears to be stuck... which unless you're running off it somehow during this workload is a completely unrelated issue.

    2. Nothing is in the card reader. I will try blacklisting the module.

    3. Here is the log after rtsx has been unloaded:

    4. Unfortunately I can't see anything in that top log that gives me a clue as to what your problem is. It all looks pretty normal.

    5. Only other thing I can suggest is to try muqss without the timer changes and see if the problem is in the main muqss patch or the timer changes. They're split out for -ck and muqss alone is here:

    6. I tried both MuQSS and MuQSS w/o timer changes. Both have the same problem.

      On the other hand, CFS w timer changes and CFS alone don't have this problem.

  5. It seems that the hrtimer changes subtly broke task accounting: with 0.135 I see individual processes (esp. burners like longer-running C++ compilers) accounted with >100% CPU time (usually ~105-110%), which is clearly impossible.

    1. Actually the >100% predates 135. It happens only on tasks that fork and do work along the way, like a compile, since both forked processes/threads are being accounted while they're running concurrently, which does actually add up to more than 100%. You won't be able to reproduce it on long running single processes/threads.

  6. @ck:
    Regarding the 4.7 kernel 135 MuQSS patch and if you're willing to support it for another little while:
    I've needed to change one hunk for kernel/sched/idle.c to eliminate a build failure:
    @@ -213,7 +219,10 @@ static void cpu_idle_loop(void)

    -        tick_nohz_idle_enter();
    +        if (unlikely(softirq_pending(cpu)))
    +                pending = true;
    +        else
    +                tick_nohz_idle_enter();

             while (!need_resched()) {

    where -----
    + if (unlikely(softirq_pending(cpu)))
    should be -----
    + if (unlikely(softirq_pending(smp_processor_id())))

    if I understood this correctly.

    And additionally for kernels from 4.7.7 upwards the hunk for kernel/sched/sched.h should be taken from the 4.8 MuQSS patch.
    Then, kernel compiles and works fine :-) Thank you Con!
    (I hope the above cited patch lines are understandable.)

    BR, Manuel Krause

    1. @ck: Late side note: Above things got introduced by the 4.7 commit dfbb56f. BR, Manuel Krause

    2. @ck:
      Just want to report a positive experience: with this v0.135, my previously reported interactivity problems at 100Hz have gone down below the 1000Hz level. Meaning, it's o.k. now for my machine, on my old 4.7.10 kernel (which I ran at 1000Hz all the time before).
      BR, Manuel Krause

  7. Thanks as always Con.

    I've run the usual benchmarks. I used my desktop this time (Intel Haswell 4770k). The fan on my laptop is starting to fail. So no comparison with older MuQSS, sorry.

    I've put the results in a new spreadsheet:

    Nothing new in terms of throughput. MuQSS is roughly on par with CFS, except under partial load (make j2 and j4).

    I also ran interbench -L 8 on CFS@300Hz, CFS@1000Hz and MuQSS135@100Hz, with intel_pstate+powersave frequency governor.
    It doesn't show differences between all the kernels, so I wonder if I did things right. I used interbench from your git repo.


    1. Thanks. The first time you run interbench it benchmarks the speed of your CPU and uses it every time after that. If you run it the first time without the performance governor, it will create useless results from then after. If you add -b on startup it will perform the benchmark again.

    2. Thanks, I forgot about that.
      I've re-done the tests. The first time I used performance governor with CFS@300Hz, and then for every other test powersave (the default).
      The number of loops per ms goes from ~800,000 to ~4,000,000, that's better.
      The results are more like the ones expected. I can't look at them in detail right now, but they are on the spreadsheet.


    3. Thanks Pedro. Those results look more like it and are consistent in which categories are better on muqss. You shouldn't need to set the load as it will detect the number of logical CPUs automatically.

    4. I had a quick look at the interbench results. I have the impression that overall:
      - CFS@1000Hz is not that bad
      - the difference between Interactive values is rather small
      But you understand these results better than I do. Am I correct?

    5. I also ran the same tests you did with the Phoronix test suite.
      The results are here:

      The results show the same things as the ones you got with MuQSS112.
      I think I'll redo the pgbench test with 'in buffer' scaling instead of the default 'on disk' because it seems the SSD is a bottleneck in the last test.


    6. Updated results here:


  8. 1000 Hz, periodic timer ticks, nice low latency with rr_interval of 3.

    100Hz will make latency worse? yes? no?

    1. Because of the changes I made to make MuQSS independent of ticks, 100Hz will not make latency worse.

    2. Ok, will try.

    3. I can confirm that with my old hardware periodic timer ticks result in lower latency with 100 HZ. Performance is better. I can't confirm that periodic timer ticks are only for reducing power consumption.

    4. Periodic timer ticks use more power than tickless idle. You can save more power than 100Hz alone by going 100Hz with tickless idle.

    5. You might be right, but the fact is that my Core2 Duo machine consumes less CPU (about 10%!) with periodic timers when using vapoursynth video filters. I compared this with different kernels (your ck and liquorix) - it's always the same!

    6. But an old desktop system on a power cable does not need to spare little bit of electric power but wants to have a little more throughput:
      Having an old y2009 Core2Duo I think of
      none CONFIG_NO_HZ_IDLE

      Thanx for your work, Ralph

    7. If you don't care at all about power savings then sure periodic timer tick is fine compared to tickless idle. You should also set your cpu frequency governor to performance as well then since no matter how good the governor is, it will never work as well as simply setting it to performance. The only reason to set it to anything else is to save power, reduce heat, and decrease fan noise. They do still use less power when idle with the performance governor, but not as much as with cpu frequency scaling enabled.

  9. My laptop didn't completely turn off after suspend (keyboard and screen lights still on, screen blanked, laptop gets hot). I was still able to wake the system up using the power button, and was able to retrieve this log:

    1. Thanks for that. I know what would cause that and will have a patch for it soon.

  10. I've been using your patches for a long time to give some life to an old Atom Z520, with great satisfaction.

    Since MuQSS (now I'm using version 0.135 with HZ 100) I'm experiencing random panics at boot.
    When the boot process doesn't hang, the system seems to work very well (except hibernation - but I'm not sure it is related)

    The same kernel without MuQSS works and boots with no problems.

    Here is part of the trace:
    [] __hrtimer_run_queues+0xcb/0x2a0
    [] ? perf_trace_sched_switch+0x180/0x180
    [] hrtimer_interrupt+0x8a/0x180
    [] local_apic_timer_interrupt+0x32/0x60
    [] smp_apic_timer_interrupt+0x34/0x3c
    [] apic_timer_interrupt+0x34/0x3c

    1. Thanks. Probably the result of the aggressive timer changes. Try muqss by itself from the -ck split out patches: 4.8-muqss135

    2. Well, I'm in fact using the pf6 post-factum patch due to the easy integration into Fedora kernels to deploy the same kernel into all the low performance boxes I have to manage.

      Would you suggest to use the patch:

      "Alone" with nothing else?

    3. Yes, and that's the same thing I linked you.

    4. Thank you for your help, I have just prepared a build with your patches 0001,0011,0012,0013 (since I also need BFQ).
      Trying it in a 64bit test box that also has Virtualbox installed, I obtain:

      Nov 10 16:25:07 localhost kernel: usercopy: kernel memory overwrite attempt detected to ffff94c316442dc8 (kmalloc-8) (128 bytes)
      Nov 10 16:25:07 localhost kernel: ------------[ cut here ]------------
      Nov 10 16:25:07 localhost kernel: kernel BUG at mm/usercopy.c:75!
      Nov 10 16:25:07 localhost kernel: invalid opcode: 0000 [#1] SMP
      Nov 10 16:25:07 localhost kernel: Modules linked in: vboxpci(OE) vboxnetadp(OE) vboxnetflt(OE) vboxdrv(OE) bfq_iosched fuse intel_rapl x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel kvm irqbypass crct10dif_pclmul crc32_pclmul ghash_clmulni_intel intel_cstate intel_uncore ppdev intel_rapl_perf snd_hda_codec_hdmi snd_hda_codec_realtek snd_hda_codec_generic snd_hda_intel snd_hda_codec snd_hda_core snd_hwdep iTCO_wdt snd_seq snd_seq_device iTCO_vendor_support snd_pcm snd_timer nuvoton_cir parport_pc snd mei_me i2c_i801 rc_core mei parport soundcore lpc_ich shpchp i2c_smbus tpm_tis tpm_tis_core tpm binfmt_misc i915 i2c_algo_bit drm_kms_helper drm crc32c_intel r8169 serio_raw ata_generic pata_acpi mii fjes video
      Nov 10 16:25:07 localhost kernel: CPU: 0 PID: 6400 Comm: wineserver Tainted: G OE 4.8.6-201.muqss.bfq.fc24.x86_64 #1
      Nov 10 16:25:07 localhost kernel: Hardware name: To Be Filled By O.E.M. To Be Filled By O.E.M./H67M, BIOS P2.10 04/27/2012
      Nov 10 16:25:07 localhost kernel: task: ffff94c283840000 task.stack: ffff94c283980000
      Nov 10 16:25:07 localhost kernel: RIP: 0010:[] [] __check_object_size+0x77/0x1dc
      Nov 10 16:25:07 localhost kernel: RSP: 0018:ffff94c283983ee0 EFLAGS: 00010282
      Nov 10 16:25:07 localhost kernel: RAX: 000000000000005e RBX: ffff94c316442dc8 RCX: 0000000000000000
      Nov 10 16:25:07 localhost kernel: RDX: 0000000000000000 RSI: ffff94c31f20df88 RDI: ffff94c31f20df88
      Nov 10 16:25:07 localhost kernel: RBP: ffff94c283983f00 R08: 00000000000af2bf R09: 0000000000000005
      Nov 10 16:25:07 localhost kernel: R10: 0000000000000008 R11: 000000000000033b R12: 0000000000000080
      Nov 10 16:25:07 localhost kernel: R13: 0000000000000000 R14: ffff94c316442e48 R15: ffff94c316442dc8
      Nov 10 16:25:07 localhost kernel: FS: 00007f16b4130700(0000) GS:ffff94c31f200000(0000) knlGS:0000000000000000
      Nov 10 16:25:07 localhost kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      Nov 10 16:25:07 localhost kernel: CR2: 00007faf2afef870 CR3: 0000000092fc3000 CR4: 00000000000406f0
      Nov 10 16:25:07 localhost kernel: Stack:
      Nov 10 16:25:07 localhost kernel: 0000000000000080 0000000000000080 00007fff7d684db0 0000000000001903
      Nov 10 16:25:07 localhost kernel: ffff94c283983f48 ffffffffaa0ccdb8 ffff94c316442dc8 000000003ec2e714
      Nov 10 16:25:07 localhost kernel: 000000000000000f 00000000013de7a0 00000000013e2410 000000000000001f
      Nov 10 16:25:07 localhost kernel: Call Trace:
      Nov 10 16:25:07 localhost kernel: [] SyS_sched_setaffinity+0x68/0x100
      Nov 10 16:25:07 localhost kernel: [] entry_SYSCALL_64_fastpath+0x1a/0xa4
      Nov 10 16:25:07 localhost kernel: Code: 48 0f 44 d1 48 c7 c6 8e f0 a4 aa 48 c7 c1 02 36 a4 aa 48 0f 44 f1 4d 89 e1 49 89 c0 48 89 d9 48 c7 c7 88 b6 a4 aa e8 04 bb f6 ff <0f> 0b e8 12 af fb ff 85 c0 75 78 48 89 df e8 06 76 e3 ff 84 c0
      Nov 10 16:25:07 localhost kernel: RIP [] __check_object_size+0x77/0x1dc
      Nov 10 16:25:07 localhost kernel: RSP
      Nov 10 16:25:07 localhost kernel: ---[ end trace fcfa1d973eabc43d ]---

    5. Unusual error. There's a change I've committed to git which hopefully will help with this. I'll probably have to wrap up changes into a new release to coincide with 4.8.7 as well.

    6. Thanks for looking into it. During the night I tried the same build with patches 0001,0011,0012,0013 on the laptop with the Atom Z520 and the situation is even worse, since it boots only 1 time in 10, with panics about "update_process_times.. tick_periodic.. tick_handle_periodic.. local_apic_timer.. smp_apic_timer... EIP: scheduler_tick"

      Booting again with 4.7.6 + bfs still works great without any issue.
      I know that looking back seems a bad way to go forward, but... do you plan to prepare a set of bfs patches for kernel 4.8.x as well?

    7. BFS has turned into muqss. There is no more BFS.

  11. These warnings should now be fixed in git.

  12. Does VirtualBox require cgroups or Isochronous scheduling?
    Windows 7, Ubuntu, and Arch guests on Arch host are extremely sluggish to load and run, audio is broken and lags, rendering guests unusable, and the WinXP guest blue screens on launch. Arch Linux 4.8.6-2-ck (piledriver) since MuQSS.

    1. Yes it is a container, not a virtual machine, which means it relies on cgroups entirely which do nothing on muqss.

    2. To run it in muqss, is it: schedtool -I -e virtualbox ?
      VirtualBox worked with bfs for years. I have been following all the blogs from this one to the current one, and read up on stubs, but have found scant information on creating stubs. I have read several wikis on cgroups (incl Arch & Redhat) and scheduling, and looking at the example for amarok is the best I can comprehend of it.
      I'm currently on arch 4.8.11-1 stock since vbox quit working in ck.
      Thank you.


    3. Try building it without cgroups.

  13. Turns out the problem with osu! only appears if you set the FPS limiter to Unlimited. I'm unable to reproduce the problem with the FPS limiter set to 120fps or 240fps.

    You may need a complex map such as this one:

    Play the map on Xtra difficulty with Auto mod (press F2 and select Auto).

    I'm currently using MuQSS 1769b2d on Linux 4.8.
    Intel HD Graphics 5500.

  14. Thank you for the reply ck - I figured it did but could only find a cgroups reference to a *.slice file on pc regarding vbox. Setting cgroups to work with muqss is way over my head for now, but again thanks.