Saturday, 12 November 2016

linux-4.8-ck7, MuQSS version 0.140

Another week has passed, another stable linux release, and to follow, another -ck and MuQSS release.

linux-4.8-ck7 patch:
patch-4.8-ck7.lrz

Split out patches:
http://ck.kolivas.org/patches/4.0/4.8/4.8-ck7/patches/

MuQSS by itself for 4.8:
4.8-sched-MuQSS_140.patch

MuQSS by itself for 4.7:
4.7-sched-MuQSS_140.patch

This release marks a change towards conservative changes only.

I've rolled back the extensive timer changes outside the main scheduler code. There are too many assumptions made about timeouts in the kernel code that are potentially problematic in the real world, and there is code that is poorly prepared for freezer usage (suspend to ram) that breaks. Additionally, not a single user reported a workload that they noticed benefited from the lower latency accurate timeouts. Finally, the added overhead is demonstrable in throughput benchmarks, and when doing comparisons with mainline it is doing MuQSS a disservice to mix in other code that it's not actually responsible for.

There are also a small number of bugfixes for warnings/crashes in the updated MuQSS that showed up after the last release as people are using it on more and varied hardware in the wild now. These may have positive effects on other less defined issues in the wild too.

The -ck release also includes an updated version of BFQ. Along with this updated version, I would like to issue a warning regarding BFQ. I have heard rumours that a number of users have reported filesystem corruption with the combination of BTRFS and BFQ. If you are using this filesystem, I urge you not to compile in BFQ at all, or at the very least not to make BFQ the default, and to use it selectively only on devices where you are running a different filesystem (I still recommend people use ext4). I would like to encourage users who have run into this problem to report it to the BFQ maintainer.

I've cleaned up the patches in the -ck tarball once again to include only the changes in combined related patches. This will ease the burden of porting to the next major linux kernel release and allow users to easily select which patches they wish to use themselves.

As always, make sure to give me your feedback, bug reports, warnings, and bitcoin.

Enjoy!
お楽しみ下さい (Please enjoy)
-ck

130 comments:

  1. I've found out which settings cause the osu! system lockup bug to manifest.

    As long as the FPS Limiter in osu! is set to a value lower than Unlimited, the bug won't show up. If the value is set to Unlimited, playing any map results in a system lockup.

    The bug is easier to trigger when a complex beatmap is used, such as this one (Game Over difficulty): https://osu.ppy.sh/s/332532

    NOTE: You don't have to play the game; just turn the Auto mod on and the map will be played by a bot.

    1. With your hardware you mean. It doesn't happen on anything I have, but thanks for narrowing it down. Some kind of interaction with your Intel GPU driver and MuQSS misbehaving together then.

    2. I've been trying to narrow down the bug's requirements with different kernel configurations.

      CFS@100Hz: No problem
      CFS@1000Hz: No problem

      MuQSS@100Hz: Won't lock up until a beatmap is being played
      MuQSS@1000Hz: Lockup at start screen

    3. Thanks for investigating and reporting. I'm quite sure your problem is related to a flood of soft interrupts that it's trying to service and gets into some kind of endless loop since it started once I modified the scheduler to aggressively handle the softirqs and get rid of the local_softirq_pending warnings. I do believe the GPU driver is partly to blame for getting into that state in the first place though obviously mainline handling it okay means my approach for handling them is not ideal. Not being able to reproduce the issue locally makes it very hard for me to test workarounds so any patch I create is going to be purely generic based on what I assume is going on. I'll see what I can do though.

    4. To that end, could I kindly ask you to try this patch that reverts all the softirq changes? It should bring you back to the state where you got the local_softirq_pending warnings, though a lot of other code has changed in the interim so I need to know if a softirq workaround is even needed any more (likely): muqss140-revert_softirq_handling.patch

    5. I'm back with some results:

      Hz = 1000
      4.8.6 w patch: Lockup when a beatmap is selected, no NOHZ error lines
      4.8.7 w patch: No problem at all

      I will give this another try with 100Hz to see if I can reproduce the bug anymore, also I will give 4.8.7 without the patch a try

    6. I am not the beatmap gamer from above, but I built a linux-4.8.7-ck7 kernel with revert_softirq_handling.patch applied out of curiosity: I cannot see any issue looking through the dmesg output.

      Using an old 2009 MacMini CoreDuo, Ralph

    7. Arghh wait, Using that revert_softirq patch I can see many of:
      ---
      Nov 13 19:44:29 maci systemd-journald[15336]: Missed 170 kernel messages
      Nov 13 19:44:29 maci systemd-journald[15336]: Missed 3 kernel messages
      ---
      looking into journalctl
      Ralph

    8. Well it'll be funny (and not entirely surprising) if the bug goes away with 4.8.7 simply because it was a mainline bug that only manifested with muqss... It wouldn't be the first time the scheduler brought out races much easier than mainline.

    9. The bug can be reproduced again after the charger was unplugged :(

    10. Odd that it depends on the charger being unplugged... I wonder what changes... some power setting or something.

    11. I have TLP installed. Here is the configuration.

      http://pastebin.com/TEUFxgjd

    12. Disable each setting one by one in TLP to see which one triggers it?

    13. Turns out the time it didn't bug out was just pure luck. I can reproduce it regardless of charger being plugged or not. I've managed to capture some logs:

      i2c_designware INT3433:00: controller timed out
      i2c_designware INT3433:00: controller timed out
      i2c_hid i2c-DLL0641:00: failed to change power setting.
      i2c_designware INT3433:00: controller timed out

    14. Turns out the time I managed to avoid the bug was pure luck. I am able to reproduce it every time regardless of charger state. I managed to capture some logs before the lockup:

      i2c_designware INT3433:00: controller timed out
      i2c_designware INT3433:00: controller timed out
      i2c_hid i2c-DLL0641:00: failed to change power setting.
      i2c_designware INT3433:00: controller timed out

    15. ^^systemd is the bug.

  2. Btrfs+BFQ issues? Bullshit. I use that all the time on multiple machines, and had no issues at all.

    1. There had been some (maybe related) issues quite a long time ago, but they appear to have been fixed (also a long time ago, maybe even in the pre-4.6 era, and then from the BTRFS side).
      The bfq-iosched google group, at least, hasn't seen such reports, nor even such rumours, for a long time. Besides, you're shipping the latest bugfixed BFQ code in the -ck7 patchset, which should keep you on the safe side.

      BR, Manuel Krause

    2. Thanks for that. That's why I didn't positively say there was an issue, only that I heard rumours there was a problem with btrfs.

    3. @ck:
      I want to persuade you to remove the BFQ related lines naming "corruption" from the top posting.
      Until you get serious bug reports, you should not even keep any rumours up online.
      The BFQ project has had so many years of successful progress, for now, off-kernel.

      Thanks, Manuel Krause

  3. @ck:
    Regarding the timer changes rollback for non-scheduler code, I assume that you still recommend the HZ_100 config, right?

    BR, Manuel Krause

    1. Yes still Hz 100 for MuQSS.

    2. Finally ran 4.8.7-MUQSS using 100Hz as opposed to my usual 1000Hz. Throughput is much higher while retaining most of the responsiveness.
      A good compromise.

  4. @Pedro

    Could you please benchmark this:
    https://sourceforge.net/projects/xanmod/files/sources/linux-4.8.6-xanmod8_4.8.6-xanmod8.orig.tar.gz/download
    (not the newer one 4.8.7!)

    It has set some of the cfs parameters for low latency by default.
    Feels very snappy for me, especially when running xonotic, very low input latency. I prefer this one for xonotic.

    1. Yes I can. What kind of benchmarks do you want: throughput (make -j1, make -j2, ...), interbench, phoronix-test-suite?
      I'll run them next week.
      I have a xanmod kernel up and running. I kept the default xanmod config however, which is different than Archlinux's stock config of the kernels already benchmarked. Here is a PKGBUILD, since I haven't found one on the AUR : http://pastebin.com/P3ADx0rD

      Pedro

    2. Debian/Ubuntu only? :/

    3. It's very interesting to see how cfs with a low latency setup compares to default-cfs/muqss. So both throughput and latency test are interesting.
      Since I like to play xonotic I'm very interested in low input latency. Xonotic is a fast shooter, so input latency really matters and has a huge impact on the game.

      It's a single threaded application but still the scheduler has a huge impact on it.
      All the schedulers I've tested perform well if I'm only looking at the framerate (always stable 250 fps)
      However there are differences in input latency and how precise/consistent the game reacts to input.
      I think there are a lot of context switches between xorg and xonotic every frame, and also lots of interrupts: the mouse alone produces an interrupt every ms (these games are played with a mouse polling rate of 1000Hz, so we have a USB interrupt every ms), and so on. So the scheduler has a huge impact on the game; cfs (xanmod8_4.8.6) performs best so far.

      @Con: do you have any thoughts about it?

    4. You can find 4.8.7-xanmod9 throughput benchmarks, alongside MuQSS140, here:
      https://docs.google.com/spreadsheets/d/163U3H-gnVeGopMrHiJLeEY1b7XlvND2yoceKbOvQRm4/edit?usp=sharing

      and here:
      http://openbenchmarking.org/result/1611176-LO-CFSVSMUQS44

      Overall performance doesn't seem different than CFS@1000Hz, except for single threaded workload where it may be a bit better.
      Keep in mind that Xanmod's kernel config is not the same as the other kernels' in these tests (500Hz timer, acpi-cpufreq ondemand, ...).
      And for those who are wondering, Xanmod works fine on Archlinux with its default config, but I've not tested it extensively.

      Pedro

  5. ^^tested xanmod kernel on my custom slackware install. while it feels pretty responsive for a cfs kernel (also on xonotic) 4.8.6 muqss runs better on that old core2duo machine. also xanmod modified the makefile to -Ofast which is questionable.

  6. Even though 100Hz or even 300Hz might provide better throughput, there were several issues with 100Hz (tested):

    Upon reboot, the unmounting of the dozens of ZFS subvolumes took ages, whereas with 300 Hz it was faster and with 1000 Hz it was almost instantaneous.

    Running Deus Ex Mankind Divided with 100 Hz wasn't fun: showing augmentation menus (with running videos on them) would regularly slow and clog down the animation/FPS, and there was much more lag when looking left and right or starting to move, compared with 1000 Hz where it runs significantly smoother.

    This might in part be thanks to running with CONFIG_NO_HZ_FULL while on 100 Hz and now CONFIG_NO_HZ_IDLE=y with 1000 Hz, but there's quite a difference.

    # CONFIG_RCU_FAST_NO_HZ is OFF

    1. On second recall, only 300 Hz was tested with Deus Ex MD, but the difference was pretty noticeable and the improvement quite significant when switching to 1000 Hz.

    2. That may be kernelOfTruth, but these are all userspace coding errors if they're Hz dependent. You still have the option of running whatever Hz you want to work around userspace problems; the scheduler should not be responsible for these mistakes.

  7. No problems, old Intel cpu.
    Very fast, love it.

    1. Useless comment, so far! :-(
      Kernel version? Single cpu core? HZ settings? Added patches?

      BR, Manuel Krause

    2. 4.8.7-ck6, 3Ghz dual core, 1000Hz, no patches.

    3. @Anonymous:
      Thank you for adding this :-)
      I'm at 4.8.7, MuQSS 140, BFQv8r5 and WBT7 (which is not quite the same setup as yours: -ck6 ships the older MuQSS 135!) + my reworked humble old TOI port. CPU is 2.26GHz dualcore.
      I've tested both 100Hz and 1000Hz and find them hard to distinguish in behaviour in my (non-benchmarking, non-gaming) use.
      Maybe you want to upgrade to MuQSS 140/ -ck7.

      BR, Manuel Krause

    4. Upgraded to MuQSS 140. Running good.
      Regarding Hz:
      While 1000 Hz feels more responsive it adds overhead and lowers throughput.

    5. You know, when you want more interactivity/responsiveness the throughput may suffer. If I read correctly, Con is working on some configurable improvements.
      Atm. there's one runtime tunable that may be worth testing for you: "echo 0 > /proc/sys/kernel/interactive" (default = 1). Some also experiment with the rr_interval, but that affects the whole system's interactivity/throughput distribution. YMMV.

      BR, Manuel Krause
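
      A minimal Python sketch for inspecting the two tunables mentioned above before changing them. It assumes a MuQSS or BFS kernel that exposes `interactive` and `rr_interval` under /proc/sys/kernel; the helper names `read_tunable` and `write_tunable` are hypothetical, and writing needs root.

```python
from pathlib import Path


def read_tunable(name, base="/proc/sys/kernel"):
    """Return the integer value of a sysctl file, or None if it is unavailable."""
    try:
        return int((Path(base) / name).read_text().strip())
    except (FileNotFoundError, PermissionError, ValueError):
        # Not a MuQSS/BFS kernel, no permission, or unexpected content.
        return None


def write_tunable(name, value, base="/proc/sys/kernel"):
    """Write a new value, equivalent to `echo value > base/name` (needs root on /proc)."""
    (Path(base) / name).write_text(f"{value}\n")


if __name__ == "__main__":
    for tunable in ("interactive", "rr_interval"):
        print(tunable, "=", read_tunable(tunable))
```

      Recording the current values first makes it easy to restore the defaults after an experiment.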

  8. Having input freezes (USB 2.0, mouse, keyboard, 500 Hz polling) when running Xorg + Xonotic with SCHED_ISO on 4.8.7-ck7.
    4.8.7-ck6+0001-Make-freezable-timeouts-not-use-the-highres-timers.patch is fine although when enabling multicore-scheduler support in the kernel config it will tend to input freeze also.
    1000 Hz, periodic timer ticks.
    Input freeze as in game continues normally but accepts no input anymore for maybe 3 to 7 seconds (time varies) on random occasions which can be like a few minutes apart.
    Apart from that 4.8.7-ck7 seems to run better.
    No other patches.

    1. If you're experiencing freezes running applications SCHED_ISO then don't do that. Most applications aren't smart enough to run at realtime priority (which ISO does) and may lead to priority inversion.

    2. Ok.
      Solved the problem by pinning Xorg to the first core, xonotic to the second one. No more hangs :). Thanks for the hint.
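
      The pinning described above can be sketched in Python with the standard `os.sched_setaffinity` call (Linux only). The function name `pin_process` and the PIDs in the usage comment are placeholders, not taken from the comment; changing another user's process needs the appropriate privileges.

```python
import os


def pin_process(pid, cpus):
    """Restrict `pid` to the given CPU numbers and return the resulting affinity set.

    pid 0 means the calling process.
    """
    os.sched_setaffinity(pid, set(cpus))
    return os.sched_getaffinity(pid)


# Hypothetical usage, with PIDs looked up beforehand (e.g. via pidof):
#   pin_process(xorg_pid, {0})      # Xorg on the first core
#   pin_process(xonotic_pid, {1})   # Xonotic on the second core
```

      The same effect is available from the shell with taskset; the sketch only shows what the pinning amounts to.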

    3. It's crazy that we have to do such hacks on linux.

      @ck: Why do we have these input delays on linux?
      The behavior is clearly scheduler dependent.
      I'm on a 4-core cpu and I'm having the same behavior when running xonotic on SCHED_ISO. I don't understand how priority inversion could happen on a 4-core cpu if we have mainly two processes involved (xonotic/xorg).
      I have even modified xonotic to read input directly from /dev/input/... devices so the input doesn't depend on xorg; it's better but still not good enough compared to windows.
      To be clear: I don't run xonotic with the SCHED_ISO policy, I did it just for a quick test. I'm not sure why input on linux doesn't feel as consistent as on windows; is it the graphics stack (xorg)? But as I said, the behavior is (also) scheduler dependent. @ck do you have some thoughts about this?

    4. Although not gaming, I also see little input related delays when hovering over the KDE desktop. And I second your line "It's crazy that we have to do such hacks on linux." I can understand it with my old hardware, but won't understand it on quadcores. A few weeks ago, already with MuQSS testing, I also tried SCHED_ISO for special tasks like Xorg or plugin-container (where the flash-player lives in firefox), but it only showed that related data transfer suffered too much (maybe even affecting i915 gfx).

      BR, Manuel Krause

    5. Why? Because despite 25 years of development, linux is not a desktop operating system. Its time on the desktop NEVER came. It got better and better as a server operating system and then along came the mobile device platform which got a concerted effort and paid development and suddenly became decent. Linux the desktop, however, spent so much time rewriting itself from scratch over and over and over again that it never matured. How many times have I contributed bugs on kde 1 to see kde 2 come out, then bugs on kde 2 when kde 3 came out, then bugs on kde 3 when kde 4 came out and now kde 5 still blows up at regular intervals for me. Gnome did exactly the same thing, then got replaced. Way too much time was spent comparing dicks between the two desktop environments when neither was in fact good nor mature. Precious little paid development ever went to the desktop itself and the applications that go on it. No gaming manufacturer would waste their time developing primarily on linux knowing how small the market is - if you're lucky it's an afterthought at best. Same goes for the linux GPU drivers - you can see they never perform as well as on windows. why do I still use it then? It does everything I need on a desktop, but not for everyone. I can do all the things via the web browser, basic office functionality, play music and watch videos. But outside of that most of what I do is not desktop-related usage. If you want a comprehensive desktop environment, windows is still it. If you want a desktop for someone without a soul, or a fisher price unix, macos is it. However Linux is for EVERYTHING else, and that's why we still use it.

    6. I'm somewhat disappointed with this answer; I was looking for more technical ideas.
      I also don't agree with you. I've installed windows only to test how xonotic runs on it and deleted it afterwards. It runs pretty well and input is consistent, so it's not a hardware issue on my side. Concerning the desktop experience: it was pretty bad on windows. I have a hdd, and the filesystem on windows seems to be very bad. Starting windows takes ages, starting firefox takes ages, and if windows decides to update, the pc isn't usable for a long period because my hdd is busy all the time. Concerning the GPU drivers: I'm using nvidia's blobs and xonotic runs about 10% faster on linux! If you google you'll see that people also complain about input latency on windows. So if somebody has ideas on how to debug the input latency, this would help a lot.

    7. Sorry about the rant then, but I do not believe this has anything to do with the scheduler which has guaranteed deterministic latency.

    8. It seems this contradicts the SCHED_ISO policy observations two people have done.

    9. What are you talking about there? Running something SCHED_ISO is forcing something to run realtime scheduling (effectively SCHED_RR) over and above the normal scheduling policy and if the application is designed with busy waiting in the code it will priority invert relative to X and the desktop environment. It is a soft realtime policy for low latency applications only, designed only to not hang your machine to the point of complete lockup requiring reboot. If applications use too much CPU while running SCHED_ISO they get kicked back to normal scheduling policy to prevent that.

    10. Just to make things clear:
      I'm really happy that we have someone who is able to write a well performing scheduler for linux. This is important since it can show flaws in the default scheduler and, as shown by some benchmarks, it is outperforming cfs in some workloads. So thx ck.

      I should really try to understand the code before asking but well....
      ...."is designed with busy waiting in the code it will priority invert relative to X and the desktop environment." I can't understand how this could happen on a 4-core cpu with 2 running processes?

      Also some more observations (with default scheduler settings):
      1. Xonotic has an option called 'cl_maxfps_alwayssleep - gives up some processing time to other applications each frame, value in milliseconds'. With this set to 1 I'm getting much better input behaviour (much more consistent), for both muqss and cfs. Why is this the case on a 4-core cpu? Shouldn't there be enough free cores to schedule X without delays?

      2. Nvidia's blob has an option '__GL_YIELD'; I'm sure you know about it (yield vs. busy waiting).
      If set to 'NOTHING' I'm getting bad input behavior. To be more precise: input isn't consistent, it varies a lot. I think it depends on how much cpu/gpu time a frame needs, so again, why does this have such a huge impact on a 4-core cpu?

      It seems that input latency depends heavily on the cpu load, and muqss also performs better for me than cfs, so why do you believe this has nothing to do with the scheduler?

    11. I forgot to mention that I have limited the framerate to 250 and they are always stable.

    12. @ck:
      I really like you for your clear words above! Don't call it a rant. It compiles so many good thoughts, which are the reasons for us to be following your BFS/MuQSS project for years now.
      Comparing with windows, we'd need another thread.

      BR, Manuel Krause

    13. @Anonymous:
      You still have no name shown. That's not o.k.
      BR, Manuel Krause

    14. I'll use duud.

    15. Because running an application SCHED_RR (which is effectively what happens when you run SCHED_ISO) means that kernel threads, interrupt handling and even RCU are delayed until it relinquishes control of the CPU it is bound to. Since these workloads are bound to one CPU each, there is no way to migrate them to other CPUs in the scheduler, and eventually the flow-on effect is that it affects the whole system, not just the CPU it's bound to.

    16. duud, are you willing to upload your modified Xonotic somewhere?
      I am in search of low input latency also.

    17. Thx ck, the SCHED_ISO part makes sense to me; I didn't know that interrupt handling, rcu and kernel threads can't be migrated. I'm using threadirqs, so does this mean that threaded irq handlers can't be migrated? I'm somewhat confused about the purpose of threadirqs then.

      What about the 2 observations I was speaking about? They were made with default scheduler settings; I don't use SCHED_ISO for any processes (sorry if I didn't make it clear enough), I did it only once to confirm the observations someone mentioned above.

      duud

    18. Not really sure on the others, but my guess would be GPU resources? The GPU drivers only just started implementing a scheduler of their own.

    19. "duud, are you willing to upload your modified Xonotic somewhere?"....

      Sure, it's a quick and dirty modification of my patch that I'm using for quakelive, so there is unused stuff and I didn't even care about indentation, but it works fine.

      http://pastebin.com/jf5dziQb

      duud

    20. My gpu load is pretty low. Xonotic is based on an old engine and it's cpu heavy. In some situations xonotic uses 100% of one core if I don't use 'cl_maxfps_alwayssleep'. Also, as far as I understand, if I'm using 'Option "RegistryDwords" "OGL_MaxFramesAllowed=0x1"' for the nvidia blob the gpu is at most 1 frame behind the cpu, and this is really noticeable.

      duud

    21. Thank you very much, duud.

    22. @ck:
      I'm really confused what kernel stuff is bound to one cpu only and can't be migrated. I'm seeing this in /proc/interrupts: '57 6993 2297317 389 IO-APIC 19-fasteoi uhci_hcd:usb6' and how does threadirqs affect the distribution of interrupts across the cores and how does the scheduler affect this? CK do you have some time to explain this please?

      duud
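
      To see how one interrupt is spread across cores, a /proc/interrupts line can be split into its per-CPU counters. The sketch below is only an illustration: the function names are made up, and it assumes the full line starts with the IRQ label and a colon (the "19:" prefix in the usage example is an assumption based on "19-fasteoi", since the line quoted above omits it).

```python
def parse_interrupt_line(line, ncpus):
    """Split one /proc/interrupts line into (irq label, per-CPU counts, description)."""
    label, rest = line.split(":", 1)       # only the first colon separates the label
    fields = rest.split()
    counts = [int(value) for value in fields[:ncpus]]
    description = " ".join(fields[ncpus:])
    return label.strip(), counts, description


def cpu_shares(counts):
    """Fraction of the interrupts that each CPU has serviced."""
    total = sum(counts)
    return [count / total if total else 0.0 for count in counts]


if __name__ == "__main__":
    sample = " 19: 57 6993 2297317 389 IO-APIC 19-fasteoi uhci_hcd:usb6"
    irq, counts, description = parse_interrupt_line(sample, ncpus=4)
    for cpu, share in enumerate(cpu_shares(counts)):
        print(f"IRQ {irq} CPU{cpu}: {share:.1%}")
```

      With the counters quoted above, nearly all of the interrupts land on the third CPU.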

    23. I use irqbalance to balance irqs evenly across cores. It works pretty well most of the time:

      https://github.com/Irqbalance/irqbalance

      Also you can pin them to specific cores manually:

      (killall irqbalance)
      echo [c] > /proc/irq/[i]/smp_affinity

      where:
      [c] = hexadecimal CPU bitmask (1 = CPU0, 2 = CPU1, 4 = CPU2, ...)
      [i] = interrupt # (as listed in /proc/interrupts)
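
      Note that smp_affinity expects a hexadecimal CPU bitmask rather than a plain core number (the companion smp_affinity_list file takes CPU numbers directly). A small Python sketch of the conversion, where the function name `cpus_to_affinity_mask` is hypothetical:

```python
def cpus_to_affinity_mask(cpus):
    """Build the hex bitmask string for /proc/irq/<n>/smp_affinity.

    CPU numbers start at 0. Masks wider than 32 bits are written as
    comma-separated 32-bit groups, most significant group first.
    """
    mask = 0
    for cpu in cpus:
        if cpu < 0:
            raise ValueError("CPU numbers must be non-negative")
        mask |= 1 << cpu
    digits = format(mask, "x")
    groups = []
    while digits:
        groups.append(digits[-8:])    # 8 hex digits = 32 bits
        digits = digits[:-8]
    return ",".join(reversed(groups))


if __name__ == "__main__":
    print(cpus_to_affinity_mask([0]))      # first CPU only
    print(cpus_to_affinity_mask([0, 2]))   # first and third CPU
```

      The resulting string is what would be echoed into smp_affinity as root, after stopping irqbalance so it does not overwrite the setting.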

    24. This doesn't make sense to me, since this means that I would also have to pin my processes carefully to cpus. Ideally the kernel would use an idling cpu for interrupt processing; this is what I thought it does. However ck told me it's not the case, so I'm not sure now.

      duud

    25. Most distributions now run irqbalance by default anyway. Once an interrupt is taken by a particular CPU, the same CPU has to service it. There is a limit to how many soft interrupts will be serviced without any scheduling involved and then after that their servicing is deferred to the ksoftirqd thread that is bound to that CPU only and runs as SCHED_NORMAL nice 0 (once upon a time it used to even run at nice 19.)

    26. ^^Fixed some input latency by compiling a 1500Hz MuQSS kernel although throughput suffers.

    27. If a 1500Hz kernel helps you then perhaps the timing changes will too once they're updated and reinstated.

  9. Just for grins I've put generic Ubuntu LTS kernel packages in the ck7 directory as well.

    1. Thank you!
      Just want to mention that the latest Liquorix also features the MuQSS scheduler.
      Regards,
      zt

  10. New benchmarks of MuQSS140 on linux 4.8.7 here:
    http://openbenchmarking.org/result/1611152-LO-CFSVSMUQS66

    MuQSS135 vs MuQSS140 here:
    http://openbenchmarking.org/result/1611150-LO-MUQSS135127

    Nothing new with these results.

    Pedro

  11. >"Additionally, not a single user reported a workload that they noticed benefited from the lower latency accurate timeouts"

    I found one. Probably Valve's at fault here, but still: Team Fortress 2 takes about 20 minutes to start up with ck7 @ 100Hz while it takes "only" 2 minutes with the other patches from the previous release and also only about 2 minutes with the old 1k Hz default.
    (I think it's compiling shaders during that time; only uses 1 cpu core but at about 100% though..)

    And even when I wait the 20 minutes, TF2 is running at an unplayable <15fps while with the 4.8-ck6 patchset it is running at 60 to 120fps.

    This is on a system that takes under 20 seconds to boot, just for perspective.

    OS: Arch Linux, linux-ck-k10-4.8.7-2 from graysky's repo-ck
    CPU: k10, amd phenom II x4 955 black edition (OC at 3.7GHz)
    GPU: amd radeon hd 7870 with fglrx/catalyst 15.12
    board: asus m4a77td (not the pro variant)

    Let me know if I should provide any logs/further info or do some tests.
    Also if this is something that has to be fixed in TF2, I'd be grateful for any hints on what I should include in a bug report over at valve's github.

    $ zcat /proc/config.gz |grep HZ
    CONFIG_NO_HZ_COMMON=y
    # CONFIG_HZ_PERIODIC is not set
    CONFIG_NO_HZ_IDLE=y
    # CONFIG_NO_HZ_FULL is not set
    # CONFIG_NO_HZ is not set
    CONFIG_HZ_100=y
    # CONFIG_HZ_250 is not set
    # CONFIG_HZ_300 is not set
    # CONFIG_HZ_1000 is not set
    CONFIG_HZ=100
    CONFIG_MACHZ_WDT=m
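
    The grep above also matches unrelated options such as CONFIG_MACHZ_WDT. A stricter way to pull out just the tick rate, and the tick length it implies, can be sketched in Python; the function names are made up for illustration, and the config text is assumed to come from /proc/config.gz or a .config file.

```python
import re


def parse_config_hz(config_text):
    """Return the integer value of CONFIG_HZ from kernel config text, or None."""
    match = re.search(r"^CONFIG_HZ=(\d+)$", config_text, flags=re.MULTILINE)
    return int(match.group(1)) if match else None


def tick_ms(hz):
    """Length of one scheduler tick (jiffy) in milliseconds."""
    return 1000.0 / hz


if __name__ == "__main__":
    import gzip
    with gzip.open("/proc/config.gz", "rt") as config:   # needs CONFIG_IKCONFIG_PROC
        hz = parse_config_hz(config.read())
    print("HZ =", hz, "tick =", tick_ms(hz) if hz else None, "ms")
```

    At the CONFIG_HZ=100 shown above, one tick is 10 ms, which is the granularity that jiffy-based timeouts get rounded to.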

    1. Thanks for that one. That's more what I was expecting to see with those changes. You can simply report the difference you see in frame rate and startup time being dependent on what Hz the kernel is configured at. That's bug report enough. There isn't even any need to mention muqss or -ck.

    2. Hi,

      I'll try playing this on my build as well (when I have time). I have quite similar HW, so results should not be far apart.
      My HW is: Phenom II 975BE (@3.6 GHz, not OC), HD7850, but I use mesa. Let's see whether I can reproduce the same.
      Did You run the game using performance governor or ondemand? My testing shows that ondemand can be quite lower on frames compared to performance, at least in games I play.

      Br, Eduardo

    3. Thanks for the replies.

      @Eduardo, yes, I play using performance governor since there's a very noticeable difference in framerate stability and input latency on my system as well. I tested the startup time with both performance and ondemand, though, and it didn't make a noticeable difference here.
      For the sake of completeness: I'm running the game scheduled as SCHED_ISO, via native runtime instead of steam's bundled libs (arch linux has the meta package "steam-native-runtime"), and force preloading of the system's libSDL2 since TF2 comes bundled with an outdated version which has a bug causing all mouse movement to be doubled.
      My startup parameters for TF2 in Steam look like this:
      SDL_VIDEO_MINIMIZE_ON_FOCUS_LOSS=0 LD_PRELOAD="/usr/\$LIB/libSDL2.so:$LD_PRELOAD" schedtool -I -e %command% -novid

    4. I've now filed a bug report against TF2:
      https://github.com/ValveSoftware/Source-1-Games/issues/2011

      In the meantime I guess I'll stick to ck6 as I haven't noticed any issues with it (yet).

      Thanks for your work, Con!

    5. Report looks good. It should be easy for them to reproduce but chances are they won't do anything about it (alas.) As I said elsewhere, I'll probably be reinstating the changes for the next -ck release and make it configurable.

    6. I really would welcome some tuning options. :)

  12. If you post a message and it doesn't show up here, just wait as I eventually go through and unmark things as spam. Blogger is very aggressive at marking comments as spam.

  13. For those that have applications that are definitely faster at higher Hz configs, it would be interesting to see if just replacing the msleep calls from userspace with high resolution ones is enough to fix the slowdown. Try this patch on top of ck7: 0001-Use-hrtimeouts-when-possible-instead-for-msleep.patch
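
    To illustrate why jiffy-based sleeps get so much coarser at low Hz, here is a simplified Python model. It assumes the mainline msleep() behaviour of rounding the request up to whole jiffies and adding one more jiffy, and it ignores timer slack and scheduling latency; the function names mirror the kernel ones but this is not kernel code.

```python
def msecs_to_jiffies(msecs, hz):
    """Round a millisecond timeout up to whole jiffies at the given HZ."""
    return (msecs * hz + 999) // 1000


def msleep_bounds_ms(msecs, hz):
    """Approximate (shortest, longest) sleep in ms for a jiffy-based msleep(msecs).

    The extra jiffy guarantees at least the requested time; because the
    current jiffy is already partly elapsed, the real sleep falls somewhere
    between (jiffies - 1) and jiffies tick periods.
    """
    jiffies = msecs_to_jiffies(msecs, hz) + 1
    tick = 1000.0 / hz
    return (jiffies - 1) * tick, jiffies * tick


if __name__ == "__main__":
    for hz in (100, 1000):
        print(f"HZ={hz}: msleep(1) sleeps roughly {msleep_bounds_ms(1, hz)} ms")
```

    Under this model a 1 ms request takes 10 to 20 ms at 100Hz but only 1 to 2 ms at 1000Hz, which is the kind of gap a high resolution timeout would close.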

    1. @ck:
      Currently testing above patch with recommended 100HZ and with the muqss140-revert_softirq_handling.patch. On my system it solves the mentioned firefox-tabs-reloading long delay (compared with using 1000HZ). Flash in firefox still can lag.

      Yesterday, just out of curiosity, I've tested another combination on 4.8.7 & MuQSS140:
      1000HZ, muqss140-revert_softirq_handling.patch applied and reverted commit a312cfc -- meaning, your timer work patched-in again. This resulted in a really very-very responsive KDE desktop without ff/flash lags. Please, don't blame me for using this old software (and even using it as some kind of benchmark). Just want to let you know that your kernel-wide timer related work wasn't useless at all. Hopefully you'll keep it in mind.

      BR, Manuel Krause

    2. I don't blame you at all for anything Manuel. I wanted those changes there in the first place myself but they introduced bugs for other architectures and overhead. I was considering reintroducing them myself anyway, but make them somehow configurable instead of mandatory.

    3. @ck:
      Making those variants a kernel config time decision would make sense. But then, who explains each of them vs. their use cases?
      BR, Manuel Krause

    4. I also prefer the reintroduction of your hrtimer patches. Since many of us use precompiled packages I wouldn't prefer a compile time option. A kernel parameter would be fine if it can't be done at runtime.

      duud

    5. @ck:
      Yes, duud is right. If it doesn't work as runtime tunable (like interactive is) a kernel cmdline parameter would be nice.
      Another idea: Maybe you could make the unique MuQSS parameter represent different levels of "interactive"ness: 0 - none, 1 - actual standard, 2 - improved timers. Don't know how difficult it is to implement. I only suggest this idea, as several users seem to be quite confident with v140/-ck7 without modifications.

      BR, Manuel Krause

    6. I agree, more options to improve certain situations would be great.

  14. I've definitely noticed that MuQSS leads to a less responsive desktop when compiling very large projects, like Unreal Engine. Could this just be because it's more efficient at using all available resources than BFS?

    Replies
    1. On all measurable interactivity and responsiveness figures muqss performs identically to BFS which is not surprising since it uses the identical scheduling algorithm, just with separate runqueues, so I'm not sure why you'd be finding it less responsive. The overhead of MuQSS with more CPUs gets exponentially less, but otherwise it should be pretty much the same as BFS in terms of using available resources on regular sized desktop CPUs.

    2. Yeah, it's more responsive most of the time. I'll have to run some tests. I believe BFS wasn't pushing my 5960X as much to full utilization. At least, I didn't have any noticeable slowdown when pushing my CPU with a resource-demanding compile. (It only happens during certain parts of the compile.) Watching CPU usage, I do notice my cores staying at 100% for longer.

    3. The CPU usage reported is different on MuQSS though so you can't compare it just based on that. If it happens at certain stages only during the compile, it could actually be disk I/O or memory that's the issue. Have you compared it using the same I/O scheduler settings?

    4. Yeah, I'm using bfq with the same settings as I always have. I'm going to try dialing in my XMP settings and see if that has any effect. Probably won't have that done until tomorrow.

  15. Running 4.8.8 with ck7 results in a much lower idle CPU frequency than without it on my i7-4790K (Haswell). I wrote a script that samples the CPU frequency once per second and then uses a Python script to plot a histogram of the data. I ran this script under 4.8.8 and 4.8.8+ck7 and the differences were striking: much lower idle frequency with the ck patchset/MuQSS.

    Median frequency without patch = 3.72 GHz
    Median frequency with patch = 1.13 GHz

    Link to script: https://github.com/graysky2/bin/blob/master/cpufreq_histogram.sh

    Histogram without patch:
    # NumSamples = 180; Min = 799.80; Max = 4400.80
    # Mean = 3286.151667; Variance = 1422892.793164; SD = 1192.850700; Median 3718.850000
    # each ∎ represents a count of 1
    799.8000 - 1159.9000 [ 17]: ∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎ (9.44%)
    1159.9000 - 1520.0000 [ 5]: ∎∎∎∎∎ (2.78%)
    1520.0000 - 1880.1000 [ 7]: ∎∎∎∎∎∎∎ (3.89%)
    1880.1000 - 2240.2000 [ 6]: ∎∎∎∎∎∎ (3.33%)
    2240.2000 - 2600.3000 [ 8]: ∎∎∎∎∎∎∎∎ (4.44%)
    2600.3000 - 2960.4000 [ 32]: ∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎ (17.78%)
    2960.4000 - 3320.5000 [ 7]: ∎∎∎∎∎∎∎ (3.89%)
    3320.5000 - 3680.6000 [ 6]: ∎∎∎∎∎∎ (3.33%)
    3680.6000 - 4040.7000 [ 11]: ∎∎∎∎∎∎∎∎∎∎∎ (6.11%)
    4040.7000 - 4400.8000 [ 81]: ∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎ (45.00%)

    Histogram with patch:
    # NumSamples = 180; Min = 799.30; Max = 4400.10
    # Mean = 1612.930556; Variance = 1172476.469566; SD = 1082.809526; Median 1127.550000
    # each ∎ represents a count of 1
    799.3000 - 1159.3800 [ 95]: ∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎ (52.78%)
    1159.3800 - 1519.4600 [ 27]: ∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎ (15.00%)
    1519.4600 - 1879.5400 [ 16]: ∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎ (8.89%)
    1879.5400 - 2239.6200 [ 4]: ∎∎∎∎ (2.22%)
    2239.6200 - 2599.7000 [ 7]: ∎∎∎∎∎∎∎ (3.89%)
    2599.7000 - 2959.7800 [ 6]: ∎∎∎∎∎∎ (3.33%)
    2959.7800 - 3319.8600 [ 5]: ∎∎∎∎∎ (2.78%)
    3319.8600 - 3679.9400 [ 0]: (0.00%)
    3679.9400 - 4040.0200 [ 4]: ∎∎∎∎ (2.22%)
    4040.0200 - 4400.1000 [ 16]: ∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎ (8.89%)
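The linked script is not reproduced here. As a minimal sketch of the same idea, the Python below samples a cpufreq sysfs file (which reports kHz) and bins the readings into equal-width buckets. The choice of ten bins, the function names and the output layout are assumptions made for illustration, not taken from the original script.

```python
import statistics
import time

# Per-CPU cpufreq file; the value is reported in kHz.
FREQ_PATH = "/sys/devices/system/cpu/cpu0/cpufreq/scaling_cur_freq"


def read_freq_mhz(path=FREQ_PATH):
    """Read the current frequency from sysfs and convert kHz to MHz."""
    with open(path) as f:
        return int(f.read().strip()) / 1000.0


def sample(count, interval=1.0, reader=read_freq_mhz):
    """Collect `count` readings, sleeping `interval` seconds after each one."""
    samples = []
    for _ in range(count):
        samples.append(reader())
        time.sleep(interval)
    return samples


def histogram(samples, bins=10):
    """Return (low, high, count) tuples for equal-width bins over the data range."""
    lo, hi = min(samples), max(samples)
    width = (hi - lo) / bins or 1.0  # avoid zero width when all samples are equal
    counts = [0] * bins
    for s in samples:
        idx = min(int((s - lo) / width), bins - 1)  # the maximum lands in the last bin
        counts[idx] += 1
    return [(lo + i * width, lo + (i + 1) * width, c) for i, c in enumerate(counts)]


def render(samples, bins=10):
    """Format the histogram as text, one line per bin."""
    lines = [f"# NumSamples = {len(samples)}; Median = {statistics.median(samples):.2f}"]
    for lo, hi, count in histogram(samples, bins):
        pct = 100.0 * count / len(samples)
        lines.append(f"{lo:10.4f} - {hi:10.4f} [{count:4d}]: {'∎' * count} ({pct:.2f}%)")
    return "\n".join(lines)


# Example use on a machine that exposes cpufreq: print(render(sample(180)))
```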

    Replies
    1. Very interesting.
      MuQSS saves a lot of energy it seems.

    2. Thanks Graysky for the results and for the script.

      @Anonymous
      I'm not sure we can infer from this that MuQSS uses less power than CFS when idle.
      The script logs the value from '/sys/devices/system/cpu/cpu0/cpufreq/scaling_cur_freq' (this value is the same for all cpu* because on Intel processors all cores share the same frequency).
      But this doesn't take into account how long the cores have actually spent in the active C0 state as opposed to idle.


      For instance, when you run a single threaded load with 'taskset -c 0' :

      while true; do cat /sys/devices/system/cpu/cpu*/cpufreq/scaling_cur_freq; sleep 1; done
      will show
      frequency for all cpu ~ max frequency

      while turbostat will show
      frequency for all cpu (Bzy_MHz) ~ max frequency
      %time in C0 state (Busy%) ~ 100% for one core and ~ 0 for the others

      This is what I see on my system and what I understand from this post:
      https://plus.google.com/+ArjanvandeVen/posts/dLn9T4ehywL
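As a small sketch of the point above: turbostat reports both the clock rate while busy (Bzy_MHz) and the share of time spent in C0 (Busy%), and their product approximates the average clock rate over the whole interval. The per-core numbers below are invented purely for illustration, and the function name is hypothetical.

```python
def avg_mhz(bzy_mhz, busy_pct):
    """Approximate average MHz over an interval from busy clock rate and C0 residency."""
    return bzy_mhz * busy_pct / 100.0


# Made-up example: one core fully busy, three nearly idle, all reporting
# the same busy frequency, as in the single-threaded taskset case.
cores = [(4400.0, 100.0), (4400.0, 0.5), (4400.0, 0.5), (4400.0, 0.5)]
for cpu, (bzy, busy) in enumerate(cores):
    print(f"cpu{cpu}: Bzy_MHz={bzy:.0f} Busy%={busy:.1f} average MHz={avg_mhz(bzy, busy):.0f}")
```

A frequency reading alone would look identical for all four cores here, while the average differs by orders of magnitude.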

      Pedro

    3. Thanks, Pedro.
      Found a very nice tool (xfreq) following your link.
      https://github.com/cyring/xfreq

    4. Hi graysky,

      these are nice values, but with your modern CPU and hopefully the pstate driver they are nearly meaningless on their own. Modern CPUs power cores off completely: they go into the C6 or C7 state rather than changing frequency. You can check this with powertop or powerstat.
      Maybe this is also the cause of the meaningless load average values in top.

      Regards sysitos

  16. I've added updated and improved versions of all the timer patches in ck6 back into the -ck git branch. They'll still be kept separate from the muqss code but assuming these latest patches improve behaviour without the bugs they previously introduced, I can roll them into another -ck release (muqss has no pending changes.)

    Replies
    1. Best so far, latency-wise.
      No need for 1500Hz kernel anymore.
      Thank you very much.

    2. Thanks for that. Assuming your applications are Hz limited, and my code addresses the workloads affected, it should be the equivalent of a 10,000Hz kernel.

    3. Yes, it is very responsive.
      I hope those patches will be maintained somehow or included in future releases.

    4. My plan was to reintroduce them into the combined -ck patch as I said, so likely they're here to stay.

    5. @ck:
      I'm testing these 7 commits atm. upon MuQSS140, kernel 4.8.8, @100HZ. Still I get a period of massive forking activity when firefox loads my ~150 tabs, coming along with high sys cpu contention (~33% each of two cores). I can only stop it by randomly switching to other KDE windows. I've never left it going as I considered it possibly dangerous. A comparable 'forking activity' is known to me only from kernel compilation, but that ends processes properly.
      This FF forking symptom does occur on a 1000HZ kernel, too, but only for a small negligible duration <1s.

      If you have an idea, how I can debug it for you, please let me know. Regarding FF, I use the flashblock addon, so it's likely to not be its fault.

      BR, and thank you in advance,
      Manuel Krause

    6. @ck: Addon: Usage of the above system after "intervention" is fine @100HZ. BR, Manuel Krause

    7. Firefox recently changed to separate threads for all tabs and disabled tab autoloading because they were aware of performance issues when tabs autoload together with the thread change. You must have manually re-enabled the autoloading config option because it's off by default. Exactly what they've done in the code to make it misbehave more at different Hz is beyond me, but it's once again a userspace issue, not a kernel one... On the other hand, are you sure it's not just a reporting difference and it's actually behaving exactly the same? If the system is responding exactly the same, it may just be reporting CPU differently at different Hz.

    8. @ck So, is it actually good to temporarily use 1000HZ instead of 100HZ, because it seems (according to the reports here in the comments) that it has the highest "compatibility" with different applications (firefox, games, ...)? I am currently back to using the dead BFS 512 branch, which is working very well, but I also tried MuQSS 140 with linux 4.8.8, which resulted in random "lags"/"freezes" in applications such as firefox (normal browsing, nothing intensive).

    9. Use whatever works for you. The other comments are all without the extra patches. My recommendation would be muqss with the extra patches in the -ck branch at 100Hz, but if you have userspace that depends on 1000Hz even beyond the extra patches then why make it hard for yourself?

    10. Hi, probably dumb question, but how to get those new ck patches into the new 4.8.9 kernel?

    11. There will likely be another -ck release tomorrow and the various packagers will provide you with a 4.8.9-ck8

    12. Ok. Thank you.

    13. @ck:
      Although you consider it a userspace issue, I want to add some info to the firefox "phenomenon". First, I'm using the ESR variant of FF, where the multithread feature hasn't landed yet, second, I use the tabmixplus addon, which lets me start a blank ff and load all the saved tabs, when I tell it to do.
      The set of loaded tabs in my comparison of 100Hz vs. 1000Hz remained the same, and the difference in behaviour is reproducible since the MuQSS release, where you first suggested 100Hz as default. The FF issue @100Hz comes along with it not responding at all, no window refresh happens, while other windows are unaffected.
      If you consider it to be not scheduler related and don't want to bother with it, it's o.k for me anyways, as I have no problem with configuring 1000Hz and just use this one. :-)

      BR, Manuel Krause

  17. blogspot seems to have a problem with replying/ displaying at 99th+ posts. Please have a look, thank you.
    BR, Manuel Krause

    Replies
    1. Seems to be sufficient to have a dumb page breaker for 99+
      BR, Manuel Krause

  18. Hi, I tested the ck-ivybridge kernel on the Manjaro distro yesterday and it didn't perform well for me...

    Just an example of what was going on...
    I have about 5-6 autostart applications when I log in...
    Clementine was playing music and I opened Steam; Clementine stopped playing music while Steam was loading and started playing again once Steam had loaded...
    I was so shocked by that simple task that I didn't perform any additional tests...

    The above behavior does not happen with the Manjaro kernels.

    Replies
    1. Try the default I/O scheduler; switch from bfq to deadline and try again.

    2. Just to be clear, Arch != Manjaro. I do not know what (if any) Manjaro specific kernel config options are used. If you're pulling from [repo-ck], you are getting the Arch default, not necessarily the Manjaro ones. Unknown how this might affect the performance of your system.

    3. I am using [repo-ck], but it is a bit messy to install the ck kernel on Manjaro since you have to remove the Manjaro kernel, nvidia utils and nvidia drivers and then install the ck kernel with the ck nvidia drivers.
      I think I will wait a bit to test it again until it gets the Nvidia 375.20 drivers that have already been released.
      Some other people said that they had problems with some games freezing with ck kernel 4.8.3 but I think that this is fixed by now...

  19. I tested the kernel again today, but the same thing happens.
    Maybe it is because I am using mdadm RAID 0 with 2x SSD?

  20. I don't have time to release anything today. Maybe Tuesday; 4.8.10 will probably be out by then.

  21. Before Con prepares a new patch publishing run... a question to the people with the softirq error messages: Have these gone away by now? And if so, by what means? With his included workaround patch?
    Thank you, BR Manuel Krause

    Replies
    1. Normally an interrupt is picked up by a CPU and the work added to the soft interrupt list of work to be done and is then serviced when appropriate. That warning that was coming up was the CPU going idle inappropriately before all the soft interrupt work was completed. MuQSS 140 does not have that warning issue any more as it aggressively handles any pending soft interrupt work when a CPU is about to go idle. The patch in the Test/ directory is not going to be part of the next release as it disables this code and it proves to still be required.

    2. @ck:
      Mmmh, "dodgy" can have a really different meaning between American and British English. My misunderstanding of 'dodgy workaround' resulted in me reverting it by using the patch from the /Test directory for my most recent kernels. I'll recompile tomorrow and report back.
      BR, Manuel Krause

    3. No, it has the same meaning, and I thought the way I tackled it in the code was a kludge. Basically, though, this was another case of mainline code making certain assumptions about how things would happen that didn't play nicely with the way muqss works, so it is needed as is.

  22. Finally I got time to test this. Before, I was still using BFS 0.512 with the backport patches (so basically the 4.8-bfs branch). The thing I noticed was that MuQSS couldn't deal with high CPU loads anymore. For example, with BFS I had a very smooth desktop experience where everything worked well (no lags) while compiling big projects like LLVM (plus extra workloads like running Spotify, Firefox with 50 tabs, 2x Windows VMs, Chromium with 8 tabs, PyCharm). With MuQSS (from linux-ck 4.8.9) everything was lagging while compiling LLVM, literally everything; even scrolling in Spotify led to a delay of some milliseconds, which did not happen with BFS.

    I know I am quite late to the party, as I didn't want to test out older MuQSS releases and couldn't report this earlier. Also, I am not sure how I can provide more information. The easiest way for me to reproduce this is to compile LLVM + Clang (ninja -j8/make -j8) while scrolling a large Spotify playlist up and down. You will notice lags while doing it, which again did not happen with BFS.

    Replies
    1. Well, the scheduling algorithm is identical between BFS and MuQSS so if anything behaves differently then either there's a bug or something else on a configuration level is responsible. Are you comparing the same Hz configurations and running the same I/O schedulers? While I am recommending 100Hz for MuQSS, it appears userspace is still Hz fixated and there are patches that help that in the -ck git branch but ultimately setting the same config options is the only way to tell.

    2. I tested both configurations, 1kHz and 100Hz (+BFQ on BFS and MuQSS). Both behaved much the same, as the slowdown (or lag) appeared on both kernel builds. With 100Hz I also noticed a startup slowdown, as it took 5 more seconds to boot up the kernel (according to systemd-analyze time).

      linux-ck-4.8.9-3 seems to use ck7 where you have reinstated the hrtimer patches (1b1990a-b07d63a).
      I will try to revert these patches and test again both configurations with 1kHz and 100Hz.

    3. Well, I guess I was wrong, ck7 didn't include the hrtimer patches. Do you think it's worth trying those out then?

    4. If you still have a slowdown at 1kHz then it's something else. As I've said elsewhere, on measurable benchmarks under load the latency figures are basically identical between muqss and bfs so I'm not really sure why you're affected. Is the rest of your config the same with respect to I/O scheduler for example?

    5. I've just recompiled linux-ck with 1kHz and tested it again just to make sure but the delay/lag is still there.
      It seems the configurations are completely the same, except that linux-ck (MuQSS+1kHz) uses CONFIG_TICK_CPU_ACCOUNTING=y and BFS does not.
      Is this relevant?

    6. Nope, that would not make a difference so you have a problem. Email me the output of top running in batch mode for a minute or so during the workload.
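For anyone unsure how to capture that, one possible way is sketched below in Python, assuming the procps version of top with its -b (batch), -d (delay in seconds) and -n (iterations) options. The helper names and the log file name are placeholders chosen for this example.

```python
import subprocess


def top_batch_command(seconds=60, delay=1):
    """Build a top command line that runs in batch mode for roughly `seconds`."""
    iterations = max(1, seconds // delay)
    return ["top", "-b", "-d", str(delay), "-n", str(iterations)]


def capture_top(path="top-batch.log", seconds=60, delay=1):
    """Run top in batch mode and write its output to `path` (placeholder name)."""
    with open(path, "w") as out:
        subprocess.run(top_batch_command(seconds, delay), stdout=out, check=True)
    return path


# Start the workload first, then call capture_top() and attach the resulting file.
```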

    7. Another couple of data points - are you running your compilation niced and do you have a hyperthreaded CPU with SMT nice configured in?

    8. Please check your inbox, you should have my top results :). And no, I am not "nicing" anything at all (It's more like that I have never used it), so I am just running it "plain". The kernel seems to be compiled with CONFIG_SMT_NICE though. Also I am using a hyperthreaded cpu (i7-skylake).
      Hope this helps.
