Another week has passed, another stable linux release, and to follow, another -ck and MuQSS release.
linux-4.8-ck7 patch:
patch-4.8-ck7.lrz
Split out patches:
http://ck.kolivas.org/patches/4.0/4.8/4.8-ck7/patches/
MuQSS by itself for 4.8:
4.8-sched-MuQSS_140.patch
MuQSS by itself for 4.7:
4.7-sched-MuQSS_140.patch
This release marks a change towards conservative changes only.
I've rolled back the extensive timer changes outside the main scheduler code. There are too many assumptions made about timeouts in the kernel code that are potentially problematic in the real world, and there is code that is poorly prepared for freezer usage (suspend to ram) that breaks. Additionally, not a single user reported a workload that they noticed benefited from the lower latency accurate timeouts. Finally, the added overhead is demonstrable in throughput benchmarks, and when doing comparisons with mainline it is doing MuQSS a disservice to mix in other code that it's not actually responsible for.
There are also a small number of bugfixes for warnings/crashes in the updated MuQSS that showed up after the last release as people are using it on more and varied hardware in the wild now. These may have positive effects on other less defined issues in the wild too.
The -ck release also includes an updated version of BFQ. Along with this updated version, I would like to issue a warning regarding BFQ. I have heard rumours that a number of users have reported filesystem corruption with the combination of BTRFS and BFQ. If you are using this filesystem, I urge you to not compile in BFQ at all, or at the very least not make it the default, using it selectively on devices where you are running a different filesystem (I still recommend people use ext4.) I would like to encourage users who have run into this problem to report it to the BFQ maintainer.
I've cleaned up the patches in the -ck tarball once again to include only the changes in combined related patches. This will ease the burden of porting to the next major linux kernel release and allow users to easily select which patches they wish to use themselves.
As always, make sure to give me your feedback, bug reports, warnings, and bitcoin.
Enjoy!
(Please enjoy)
-ck
I've found out which settings cause the osu! system lockup bug to manifest.
As long as the FPS Limiter in osu! is set to a value lower than Unlimited, the bug won't show up. If the value is set to Unlimited, playing any map results in a system lockup.
The bug is easier to trigger when a complex beatmap is used, such as this one (Game Over difficulty): https://osu.ppy.sh/s/332532
NOTE: You don't have to play the game, just turn the Auto mod on and the map will be played by a bot
With your hardware you mean. It doesn't happen on anything I have, but thanks for narrowing it down. Some kind of interaction with your Intel GPU driver and MuQSS misbehaving together then.
I've been trying to narrow down the bug's requirements with different kernel configurations.
CFS@100Hz: No problem
CFS@1000Hz: No problem
MuQSS@100Hz: Won't lockup until a beatmap is being played
MuQSS@1000Hz: Lockup at start screen
Thanks for investigating and reporting. I'm quite sure your problem is related to a flood of soft interrupts that it's trying to service and gets into some kind of endless loop since it started once I modified the scheduler to aggressively handle the softirqs and get rid of the local_softirq_pending warnings. I do believe the GPU driver is partly to blame for getting into that state in the first place though obviously mainline handling it okay means my approach for handling them is not ideal. Not being able to reproduce the issue locally makes it very hard for me to test workarounds so any patch I create is going to be purely generic based on what I assume is going on. I'll see what I can do though.
To that end, could I kindly ask you to try this patch that reverts all the softirq changes? It should bring you back to the state where you got the local_softirq_pending warnings, though a lot of other code has changed in the interim so I need to know if a softirq workaround is even needed any more (likely): muqss140-revert_softirq_handling.patch
I'm back with some results:
Hz = 1000
4.8.6 w patch: Lockup when a beatmap is selected, no NOHZ error lines
4.8.7 w patch: No problem at all
I will give this another try with 100Hz to see if I can reproduce the bug anymore, also I will give 4.8.7 without the patch a try
I am not the beatmap gamer from above, but I built a linux-4.8.7-ck7 kernel with revert_softirq_handling.patch applied out of curiosity: I cannot see any issue looking through the dmesg output.
Using an old 2009 MacMini CoreDuo, Ralph
Arghh wait, using that revert_softirq patch I can see many of:
---
Nov 13 19:44:29 maci systemd-journald[15336]: Missed 170 kernel messages
Nov 13 19:44:29 maci systemd-journald[15336]: Missed 3 kernel messages
---
looking into journalctl
Ralph
Well it'll be funny (and not entirely surprising) if the bug goes away with 4.8.7 simply because it was a mainline bug that only manifested with muqss... It wouldn't be the first time the scheduler brought out races much more easily than mainline.
The bug can be reproduced again after the charger was unplugged :(
Odd that it depends on the charger being unplugged... I wonder what changes... some power setting or something.
I have TLP installed. Here is the configuration.
http://pastebin.com/TEUFxgjd
Disable each setting one by one in TLP to see which one triggers it?
Turns out the time it didn't bug out was just pure luck. I can reproduce it regardless of charger being plugged or not. I've managed to capture some logs:
i2c_designware INT3433:00: controller timed out
i2c_designware INT3433:00: controller timed out
i2c_hid i2c-DLL0641:00: failed to change power setting.
i2c_designware INT3433:00: controller timed out
Turns out the time I managed to avoid the bug was pure luck. I am able to reproduce it every time regardless of charger state. I managed to capture some logs before the lockup:
i2c_designware INT3433:00: controller timed out
i2c_designware INT3433:00: controller timed out
i2c_hid i2c-DLL0641:00: failed to change power setting.
i2c_designware INT3433:00: controller timed out
^^systemd is the bug.
Btrfs+BFQ issues? Bullshit. I use that all the time on multiple machines, and had no issues at all.
There had been some (maybe related) issues quite a long time ago, but they appear to have been fixed (also a long time ago, maybe even in the pre-4.6 era, and then from the BTRFS side).
The bfq-iosched google group, at least, hasn't seen such reports, nor even such rumours, for a long time. And then, you're shipping the latest bugfixed BFQ code in the -ck7 patchset, which should keep you on the safe side.
BR, Manuel Krause
Thanks for that. That's why I didn't positively say there was an issue, only that I heard rumours there was a problem with btrfs.
@ck:
I want to persuade you to remove the BFQ related lines naming "corruption" from the top posting.
Until you get serious bug reports, you should not even keep any rumours up online.
The BFQ project has had so many years of successful progress, for now, off-kernel.
Thanks, Manuel Krause
@ck:
Regarding the timer changes rollback for non-scheduler code, I assume that you still recommend the HZ_100 config, right?
BR, Manuel Krause
Yes still Hz 100 for MuQSS.
Finally ran 4.8.7-MUQSS using 100Hz as opposed to my usual 1000Hz. Throughput is much higher while retaining most of the responsiveness.
A good compromise.
@Pedro
Could you please benchmark this:
https://sourceforge.net/projects/xanmod/files/sources/linux-4.8.6-xanmod8_4.8.6-xanmod8.orig.tar.gz/download
(not the newer one 4.8.7!)
It has some of the cfs parameters set for low latency by default.
Feels very snappy for me, especially when running xonotic, very low input latency. I prefer this one for xonotic.
Yes I can. What kind of benchmarks do you want: throughput (make -j1, make -j2, ...), interbench, phoronix-test-suite?
I'll run them next week.
I have a xanmod kernel up and running. I kept the default xanmod config however, which is different than Archlinux's stock config of the kernels already benchmarked. Here is a PKGBUILD, since I haven't found one on the AUR : http://pastebin.com/P3ADx0rD
Pedro
Debian/Ubuntu only? :/
It's very interesting to see how cfs with a low latency setup compares to default-cfs/muqss. So both throughput and latency tests are interesting.
Since I like to play xonotic I'm very interested in low input latency. Xonotic is a fast shooter, so input latency really matters; it has a huge impact on the game.
It's a single threaded application but still the scheduler has a huge impact on it.
All the schedulers I've tested perform well if I'm only looking at the framerate (always stable 250 fps)
However there are differences in input latency and how precise/consistent the game reacts to input.
I think there are a lot of context switches between xorg and xonotic every frame, and also lots of interrupts: the mouse alone produces an interrupt every ms (these games are played with a mouse polling rate of 1000Hz, so we have a USB interrupt every ms) and so on. So the scheduler has a huge impact on the game; cfs (xanmod8_4.8.6) performs best so far.
@Con: do you have any thoughts about it?
You can find 4.8.7-xanmod9 throughput benchmarks, alongside MuQSS140, here:
https://docs.google.com/spreadsheets/d/163U3H-gnVeGopMrHiJLeEY1b7XlvND2yoceKbOvQRm4/edit?usp=sharing
and here:
http://openbenchmarking.org/result/1611176-LO-CFSVSMUQS44
Overall performance doesn't seem different from CFS@1000Hz, except for single threaded workloads where it may be a bit better.
Keep in mind that Xanmod's kernel config is not the same as that of the other kernels in these tests (500Hz timer, acpi-cpufreq ondemand, ...).
And for those who are wondering, Xanmod works fine on Archlinux with its default config, but I've not tested it extensively.
Pedro
^^tested xanmod kernel on my custom slackware install. while it feels pretty responsive for a cfs kernel (also on xonotic) 4.8.6 muqss runs better on that old core2duo machine. also xanmod modified the makefile to -Ofast which is questionable.
Even though 100Hz or even 300Hz might provide better throughput there were several issues with 100Hz (tested):
upon reboot the unmounting of the dozens of ZFS subvolumes took ages - whereas with 300 Hz it was faster and 1000 HZ it was almost instantaneous
Running Deus Ex Mankind Divided with 100 Hz wasn't fun - showing augmentations menus (with running videos on them) would regularly slow and clog down the animation/FPS, also while looking left and right or getting moving
there was much more lag involved compared with 1000 Hz where it runs significantly smoother
This might in part be thanks to running with CONFIG_NO_HZ_FULL while on 100 Hz and
now CONFIG_NO_HZ_IDLE=y with 1000 Hz but there's quite a difference
# CONFIG_RCU_FAST_NO_HZ is OFF
On second recall - only 300 Hz was tested with Deus Ex MD but the difference was pretty noticeable
and the improvement quite significant when switching to 1000 Hz
That may be kernelOfTruth, but these are all userspace coding errors if they're Hz dependent. You still have the option of running whatever Hz you want to work around userspace problems; the scheduler should not be responsible for these mistakes.
No problems, old Intel cpu.
Very fast, love it.
Useless comment, so far! :-(
Kernel version? Single cpu core? HZ settings? Added patches?
BR, Manuel Krause
4.8.7-ck6, 3GHz dual core, 1000Hz, no patches.
@Anonymous:
Thank you for adding this :-)
I'm at 4.8.7, MuQSS 140, BFQv8r5 and WBT7 (which is not quite the same setup as yours -- -ck6 ships the older MuQSS 135!) + my reworked humble old TOI port. CPU is 2.26GHz dualcore.
I've tested both 100Hz and 1000Hz and find them hard to distinguish in behaviour in my (non-benchmarking, non-gaming) use.
Maybe you want to upgrade to MuQSS 140/ -ck7.
BR, Manuel Krause
Upgraded to MuQSS 140. Running good.
Regarding Hz:
While 1000 Hz feels more responsive it adds overhead and lowers throughput.
You know, when you want more interactivity/responsiveness the throughput may suffer. If I read correctly, Con is working on some configurable improvements.
At the moment there's one runtime tunable that may be worth testing for you: "echo 0 > /proc/sys/kernel/interactive" (default = 1). Some also experiment with the rr_interval, but that affects the whole system's interactivity/throughput distribution. YMMV.
BR, Manuel Krause
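For anyone who would rather script this than type the echo by hand, here is a minimal sketch of reading and writing such a tunable. The `interactive` file is the one named above; that `rr_interval` sits in the same `/proc/sys/kernel` directory is an assumption, and the helper names are made up for illustration. Writing needs root, and on a kernel without MuQSS the files simply won't exist.

```python
from pathlib import Path

# "interactive" is the tunable named in the comment above; "rr_interval" is
# ASSUMED to live in the same directory on a MuQSS kernel.
SYSCTL_DIR = Path("/proc/sys/kernel")


def read_tunable(name, base=SYSCTL_DIR):
    """Return the integer value of a tunable, or None if the file is absent."""
    path = Path(base) / name
    try:
        return int(path.read_text().strip())
    except FileNotFoundError:
        return None


def write_tunable(name, value, base=SYSCTL_DIR):
    """Write an integer value to a tunable (needs root on a real system)."""
    (Path(base) / name).write_text(f"{int(value)}\n")
```

Calling `read_tunable("interactive")` before and after a change is an easy way to confirm that the write actually took effect.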
Having input freezes (USB 2.0, mouse, keyboard, 500 Hz polling) when running Xorg + Xonotic with SCHED_ISO on 4.8.7-ck7.
4.8.7-ck6+0001-Make-freezable-timeouts-not-use-the-highres-timers.patch is fine, although when enabling multicore-scheduler support in the kernel config it also tends to freeze input.
1000 Hz, periodic timer ticks.
Input freeze as in game continues normally but accepts no input anymore for maybe 3 to 7 seconds (time varies) on random occasions which can be like a few minutes apart.
Apart from that 4.8.7-ck7 seems to run better.
No other patches.
If you're experiencing freezes running applications SCHED_ISO then don't do that. Most applications aren't smart enough to run at realtime priority (which ISO does) and may lead to priority inversion.
Ok.
Solved the problem by pinning Xorg to the first core, xonotic to the second one. No more hangs :). Thanks for the hint.
It's crazy that we have to do such hacks on linux.
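The comment doesn't say which tool was used for the pinning. As one possible way to do it, here is a minimal sketch using Python's `os.sched_setaffinity`; the helper name `pin_to_cpu` is made up for illustration, and you would have to look up the real PIDs of Xorg and xonotic yourself.

```python
import os


def pin_to_cpu(pid, cpu):
    """Restrict process `pid` (0 means the calling process) to one CPU.

    Returns the affinity set the kernel reports afterwards.
    """
    os.sched_setaffinity(pid, {cpu})
    return os.sched_getaffinity(pid)
```

With hypothetical PIDs this would be `pin_to_cpu(xorg_pid, 0)` and `pin_to_cpu(xonotic_pid, 1)`. CPU numbers are 0-based here, so "first core" is CPU 0.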
@ck: Why do we have these input delays on linux?
The behavior is clearly scheduler dependent.
I'm on a 4-core cpu and I'm having the same behavior when running xonotic on SCHED_ISO. I don't understand how priority inversion could happen on a 4-core cpu if we have mainly two processes involved (xonotic/xorg).
I have even modified xonotic to read input directly from /dev/input/... devices so the input doesn't depend on xorg, it's better but still not good enough compared to windows.
To be clear: I don't run xonotic with SCHED_ISO policy, I did it just for a quick test. I'm not sure why input on linux doesn't feel as consistent as on windows, is it the graphics stack (xorg)? But as I said the behavior is (also) scheduler dependent. @ck do you have some thoughts about this?
Although not gaming, I also see little input related delays when hovering over the KDE desktop. And I second your line "It's crazy that we have to do such hacks on linux." I can understand it with my old hardware, but won't understand it on quadcores. A few weeks ago, already with MuQSS testing, I also tried SCHED_ISO for special tasks like Xorg or plugin-container (where the flash-player lives in firefox), but it only showed that the related data transfer suffered too much (maybe even affecting i915 gfx).
BR, Manuel Krause
Why? Because despite 25 years of development, linux is not a desktop operating system. Its time on the desktop NEVER came. It got better and better as a server operating system and then along came the mobile device platform which got a concerted effort and paid development and suddenly became decent. Linux the desktop, however, spent so much time rewriting itself from scratch over and over and over again that it never matured. How many times have I contributed bugs on kde 1 to see kde 2 come out, then bugs on kde 2 when kde 3 came out, then bugs on kde 3 when kde 4 came out and now kde 5 still blows up at regular intervals for me. Gnome did exactly the same thing, then got replaced. Way too much time was spent comparing dicks between the two desktop environments when neither was in fact good nor mature. Precious little paid development ever went to the desktop itself and the applications that go on it. No gaming manufacturer would waste their time developing primarily on linux knowing how small the market is - if you're lucky it's an afterthought at best. Same goes for the linux GPU drivers - you can see they never perform as well as on windows. why do I still use it then? It does everything I need on a desktop, but not for everyone. I can do all the things via the web browser, basic office functionality, play music and watch videos. But outside of that most of what I do is not desktop-related usage. If you want a comprehensive desktop environment, windows is still it. If you want a desktop for someone without a soul, or a fisher price unix, macos is it. However Linux is for EVERYTHING else, and that's why we still use it.
I'm somewhat disappointed with this answer, I was looking for more technical ideas.
I also don't agree with you. I've installed windows only to test how xonotic runs on it and deleted it afterwards. It runs pretty well and input is consistent, so it's not a hardware issue on my side. Concerning the desktop experience: it was pretty bad on windows. I have a hdd, and the filesystem on windows seems to be very bad. Starting windows takes ages, starting firefox takes ages, and if windows decides to update, the pc isn't usable for a long period because my hdd is busy all the time. Concerning the GPU drivers: I'm using nvidia's blobs and xonotic runs about 10% faster on linux! If you google you'll see that people complain about input latency on windows too. So if somebody has ideas on how to debug the input latency, this would help a lot.
Sorry about the rant then, but I do not believe this has anything to do with the scheduler which has guaranteed deterministic latency.
It seems this contradicts the SCHED_ISO policy observations two people have made.
What are you talking about there? Running something SCHED_ISO is forcing something to run realtime scheduling (effectively SCHED_RR) over and above the normal scheduling policy and if the application is designed with busy waiting in the code it will priority invert relative to X and the desktop environment. It is a soft realtime policy for low latency applications only, designed only to not hang your machine to the point of complete lockup requiring reboot. If applications use too much CPU while running SCHED_ISO they get kicked back to normal scheduling policy to prevent that.
Just to make things clear:
I'm really happy that we have someone who is able to write a well performing scheduler for linux. This is important since it can show flaws in the default scheduler and, as shown by some benchmarks, it outperforms cfs in some workloads. So thx ck.
I should really try to understand the code before asking but well....
...."is designed with busy waiting in the code it will priority invert relative to X and the desktop environment." I can't understand how this could happen on a 4-core cpu with 2 running processes?
Also some more observations (with default scheduler settings):
1. Xonotic has an option called 'cl_maxfps_alwayssleep - gives up some processing time to other applications each frame, value in milliseconds'. With this set to 1 I'm getting much better input behaviour (much more consistent), for both muqss and cfs. Why is this the case on a 4-core cpu? Shouldn't there be enough free cores to schedule X without delays?
2. Nvidia's blob has an option '__GL_YIELD'; I'm sure you know about it (yield vs. busy waiting).
If set to 'NOTHING' I'm getting bad input behaviour; to be more precise, input isn't consistent, it varies a lot. I think it depends on how much cpu/gpu time a frame needs, so again, why does this have such a huge impact on a 4-core cpu?
It seems that input latency depends heavily on the cpu load, and muqss performs better for me than cfs, so why do you believe this has nothing to do with the scheduler?
I forgot to mention that I have limited the framerate to 250 and they are always stable.
@ck:
I really like you for your clear words above! Don't call it a rant. It gathers so many good thoughts, which are the reasons we have been following your BFS/MuQSS project for years now.
Comparing with windows, we'd need another thread.
BR, Manuel Krause
@Anonymous:
You still have no name shown. That's not o.k.
BR, Manuel Krause
I'll use duud.
Because running an application SCHED_RR (which is effectively what happens when you run SCHED_ISO) means that kernel threads, interrupt handling and RCU are all delayed until it relinquishes control of the CPU it is bound to. Since these workloads are bound to one CPU each, there is no way to migrate them to other CPUs in the scheduler, and eventually the flow-on effect is that it affects the whole system, not just the CPU it's bound to.
duud, are you willing to upload your modified Xonotic somewhere?
I am in search of low input latency also.
Thx ck, the SCHED_ISO part makes sense to me, I didn't know that interrupt handling, rcu and kernel threads can't be migrated. I'm using threadirqs, so does this mean that threaded irq-handlers can't be migrated? I'm somewhat confused about the purpose of threadirqs then.
What about the 2 observations I was speaking about? They were made with default scheduler settings, I don't use SCHED_ISO for any processes (sorry if I didn't make it clear enough), I did it only once to confirm the observations someone mentioned above.
duud
Not really sure on the others, but my guess would be GPU resources? The GPU drivers only just started implementing a scheduler of their own.
"duud, are you willing to upload your modified Xonotic somewhere?"....
Sure, it's a quick and dirty modification of my patch that I'm using for quakelive, so there is unused stuff and I didn't even care about indentation, but it works fine.
http://pastebin.com/jf5dziQb
duud
My gpu load is pretty low. Xonotic is based on an old engine, it's cpu heavy. In some situations xonotic uses 100% of one core if I don't use 'cl_maxfps_alwayssleep'. Also, as far as I understand, if I'm using 'Option "RegistryDwords" "OGL_MaxFramesAllowed=0x1"' for the nvidia blob the gpu is at most 1 frame behind the cpu, and this is really noticeable.
duud
Thank you very much, duud.
@ck:
I'm really confused what kernel stuff is bound to one cpu only and can't be migrated. I'm seeing this in /proc/interrupts: '57 6993 2297317 389 IO-APIC 19-fasteoi uhci_hcd:usb6' and how does threadirqs affect the distribution of interrupts across the cores and how does the scheduler affect this? CK do you have some time to explain this please?
duud
I use irqbalance to balance irqs evenly across cores. It works pretty well most of the time:
https://github.com/Irqbalance/irqbalance
Also you can pin them to specific cores manually:
(killall irqbalance)
echo [c] > /proc/irq/[i]/smp_affinity
where:
[c] = CPU bitmask in hex (1 = CPU0, 2 = CPU1, 4 = CPU2, ...)
[i] = interrupt # (1, 2, 3, ...)
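For reference, the kernel reads the value written to smp_affinity as a hexadecimal bitmask of CPUs, with bit 0 standing for CPU0, so writing 3 selects CPU0 and CPU1 rather than a third core. A minimal sketch of building that mask from 0-based CPU numbers (the helper name `affinity_mask` is made up for illustration):

```python
def affinity_mask(cpus):
    """Return the hex bitmask string selecting the given 0-based CPU numbers."""
    mask = 0
    for cpu in cpus:
        if cpu < 0:
            raise ValueError("CPU numbers start at 0")
        mask |= 1 << cpu  # set the bit that stands for this CPU
    return format(mask, "x")
```

So `affinity_mask([2])` gives the string to echo for pinning an interrupt to the third core only, and `affinity_mask([0, 1, 2, 3])` allows all four cores of a quad-core machine.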
This doesn't make sense to me, since this means that I would also have to pin my processes carefully to cpus. Ideally the kernel would use an idling cpu for interrupt processing, this is what I thought it does. However ck told me it's not the case, so I'm not sure now.
duud
Most distributions now run irqbalance by default anyway. Once an interrupt is taken by a particular CPU, the same CPU has to service it. There is a limit to how many soft interrupts will be serviced without any scheduling involved and then after that their servicing is deferred to the ksoftirqd thread that is bound to that CPU only and runs as SCHED_NORMAL nice 0 (once upon a time it used to even run at nice 19.)
^^Fixed some input latency by compiling a 1500Hz MuQSS kernel although throughput suffers.
If a 1500Hz kernel helps you then perhaps the timing changes will too once they're updated and reinstated.
Just for grins I've put generic Ubuntu LTS kernel packages in the ck7 directory as well.
LOL
Thank you!
Just want to mention that the latest Liquorix also features the MuQSS scheduler.
Regards,
zt
New benchmarks of MuQSS140 on linux 4.8.7 here:
http://openbenchmarking.org/result/1611152-LO-CFSVSMUQS66
MuQSS135 vs MuQSS140 here:
http://openbenchmarking.org/result/1611150-LO-MUQSS135127
Nothing new with these results.
Pedro
>"Additionally, not a single user reported a workload that they noticed benefited from the lower latency accurate timeouts"
I found one. Probably Valve's at fault here, but still: Team Fortress 2 takes about 20 minutes to start up with ck7 @ 100Hz while it takes "only" 2 minutes with the other patches from the previous release and also only about 2 minutes with the old 1k Hz default.
(I think it's compiling shaders during that time; only uses 1 cpu core but at about 100% though..)
And even when I wait the 20 minutes, TF2 is running at an unplayable <15fps while with the 4.8-ck6 patchset it is running at 60 to 120fps.
This is on a system that takes under 20 seconds to boot, just for perspective.
OS: Arch Linux, linux-ck-k10-4.8.7-2 from graysky's repo-ck
CPU: k10, amd phenom II x4 955 black edition (OC at 3.7GHz)
GPU: amd radeon hd 7870 with fglrx/catalyst 15.12
board: asus m4a77td (not the pro variant)
Let me know if I should provide any logs/further info or do some tests.
Also if this is something that has to be fixed in TF2, I'd be grateful for any hints on what I should include in a bug report over at valve's github.
$ zcat /proc/config.gz |grep HZ
CONFIG_NO_HZ_COMMON=y
# CONFIG_HZ_PERIODIC is not set
CONFIG_NO_HZ_IDLE=y
# CONFIG_NO_HZ_FULL is not set
# CONFIG_NO_HZ is not set
CONFIG_HZ_100=y
# CONFIG_HZ_250 is not set
# CONFIG_HZ_300 is not set
# CONFIG_HZ_1000 is not set
CONFIG_HZ=100
CONFIG_MACHZ_WDT=m
Thanks for that one. That's more what I was expecting to see with those changes. You can simply report the difference you see in frame rate and startup time being dependent on what Hz the kernel is configured at. That's bug report enough. There isn't even any need to mention muqss or -ck.
Hi,
I'll try playing this on my build as well (when I have time). I have quite similar HW, so the results should not be far apart.
My HW is: Phenom II 975BE (@3.6 GHz, not OC), HD7850, but I use mesa. Let's see whether I can reproduce the same.
Did you run the game using the performance governor or ondemand? My testing shows that ondemand can be quite a bit lower on frames compared to performance, at least in the games I play.
Br, Eduardo
Thanks for the replies.
@Eduardo, yes, I play using performance governor since there's a very noticeable difference in framerate stability and input latency on my system as well. I tested the startup time with both performance and ondemand, though, and it didn't make a noticeable difference here.
For the sake of completeness: I'm running the game scheduled as SCHED_ISO, via native runtime instead of steam's bundled libs (arch linux has the meta package "steam-native-runtime"), and force preloading of the system's libSDL2 since TF2 comes bundled with an outdated version which has a bug causing all mouse movement to be doubled.
My startup parameters for TF2 in Steam look like this:
SDL_VIDEO_MINIMIZE_ON_FOCUS_LOSS=0 LD_PRELOAD="/usr/\$LIB/libSDL2.so:$LD_PRELOAD" schedtool -I -e %command% -novid
I've now filed a bug report against TF2:
https://github.com/ValveSoftware/Source-1-Games/issues/2011
In the meantime I guess I'll stick to ck6 as I haven't noticed any issues with it (yet).
Thanks for your work, Con!
Report looks good. It should be easy for them to reproduce but chances are they won't do anything about it (alas.) As I said elsewhere, I'll probably be reinstating the changes for the next -ck release and make it configurable.
I really would welcome some tuning options. :)
If you post a message and it doesn't show up here, just wait as I eventually go through and unmark things as spam. Blogger is very aggressive at marking comments as spam.
For those that have applications that are definitely faster at higher Hz configs, it would be interesting to see if just replacing the msleep calls from userspace with high resolution ones is enough to fix the slowdown. Try this patch on top of ck7: 0001-Use-hrtimeouts-when-possible-instead-for-msleep.patch
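As a rough illustration of why a tick-based sleep is sensitive to the Hz setting: a timeout counted in whole ticks cannot be shorter than one tick, which is 10 ms at 100Hz and 1 ms at 1000Hz. The sketch below is a simplified model of that rounding, not the patch's code; the function names are made up, and the real kernel may add a further tick of slack on top.

```python
def tick_ms(hz):
    """Length of one scheduler tick in milliseconds."""
    return 1000.0 / hz


def tick_based_sleep_ms(requested_ms, hz):
    """Requested delay (whole milliseconds) rounded up to whole ticks.

    Simplified model: always at least one tick, and never less than requested.
    """
    ticks = max(1, (requested_ms * hz + 999) // 1000)  # integer ceiling
    return ticks * tick_ms(hz)
```

Under this model a program that asks for a 1 ms sleep in a loop waits roughly ten times longer per iteration at 100Hz than at 1000Hz, which is the kind of slowdown a high resolution timeout would avoid.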
@ck:
Currently testing above patch with recommended 100HZ and with the muqss140-revert_softirq_handling.patch. On my system it solves the mentioned firefox-tabs-reloading long delay (compared with using 1000HZ). Flash in firefox still can lag.
Yesterday, just out of curiosity, I've tested another combination on 4.8.7 & MuQSS140:
1000HZ, muqss140-revert_softirq_handling.patch applied and reverted commit a312cfc -- meaning, your timer work patched-in again. This resulted in a really very-very responsive KDE desktop without ff/flash lags. Please, don't blame me for using this old software (and me even using it as somekind of benchmarking). Just want to let you know, that your kernel-wide timer related work wasn't useless at all. Hopefully you'd keep it in mind.
BR, Manuel Krause
I don't blame you at all for anything Manuel. I wanted those changes there in the first place myself but they introduced bugs for other architectures and overhead. I was considering reintroducing them myself anyway, but make them somehow configurable instead of mandatory.
@ck:
Making those variants a kernel config time decision would make sense. But then, who explains each of them vs. their use cases?
BR, Manuel Krause
I also prefer the reintroduction of your hrtimer patches. Since many of us use precompiled packages I wouldn't prefer a compile time option. A kernel parameter would be fine if it can't be done at runtime.
duud
@ck:
Yes, duud is right. If it doesn't work as runtime tunable (like interactive is) a kernel cmdline parameter would be nice.
Another idea: Maybe you could make the unique MuQSS parameter represent different levels of "interactive"ness: 0 - none, 1 - actual standard, 2 - improved timers. Don't know how difficult it is to implement. I only suggest this idea, as several users seem to be quite confident with v140/-ck7 without modifications.
BR, Manuel Krause
I agree, more options to improve certain situations would be great.
I've definitely noticed that MuQSS leads to a less responsive desktop when compiling very large projects, like Unreal Engine. Could this just be since it's more efficient at using all available resources than BFS?
On all measurable interactivity and responsiveness figures muqss performs identically to BFS which is not surprising since it uses the identical scheduling algorithm, just with separate runqueues, so I'm not sure why you'd be finding it less responsive. The overhead of MuQSS with more CPUs gets exponentially less, but otherwise it should be pretty much the same as BFS in terms of using available resources on regular sized desktop CPUs.
Yeah, it's more responsive most of the time. I'll have to run some tests. I believe BFS wasn't pushing my 5960X as much to full utilization. At least, I didn't have any noticeable slowdown when pushing my cpu for a resource demanding compile. (It only happens during certain parts of the compile) Watching CPU usage I do notice my cores staying at 100% for longer.
The CPU usage reported is different on MuQSS though so you can't compare it just based on that. If it happens at certain stages only during the compile, it could actually be disk I/O or memory that's the issue. Have you compared it using the same I/O scheduler settings?
Yeah, I'm using bfq with the same settings as I always have. I'm going to try dialing in my XMP settings and see if that has any effect. Probably won't have that done until tomorrow.
Running 4.8.8 with ck7 results in much lower idle CPU frequency than without it on my i7-4790K (Haswell). I wrote a script that samples the CPU frequency once per sec then uses a python script to plot a histogram of the data. I ran this script under 4.8.8 and 4.8.8+ck7 and the differences were striking. Much lower idle with the ck patchset/MuQSS.
Median frequency without patch = 3.72 GHz
Median frequency with patch = 1.13 GHz
Link to script: https://github.com/graysky2/bin/blob/master/cpufreq_histogram.sh
Histogram without patch:
# NumSamples = 180; Min = 799.80; Max = 4400.80
# Mean = 3286.151667; Variance = 1422892.793164; SD = 1192.850700; Median 3718.850000
# each ∎ represents a count of 1
799.8000 - 1159.9000 [ 17]: ∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎ (9.44%)
1159.9000 - 1520.0000 [ 5]: ∎∎∎∎∎ (2.78%)
1520.0000 - 1880.1000 [ 7]: ∎∎∎∎∎∎∎ (3.89%)
1880.1000 - 2240.2000 [ 6]: ∎∎∎∎∎∎ (3.33%)
2240.2000 - 2600.3000 [ 8]: ∎∎∎∎∎∎∎∎ (4.44%)
2600.3000 - 2960.4000 [ 32]: ∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎ (17.78%)
2960.4000 - 3320.5000 [ 7]: ∎∎∎∎∎∎∎ (3.89%)
3320.5000 - 3680.6000 [ 6]: ∎∎∎∎∎∎ (3.33%)
3680.6000 - 4040.7000 [ 11]: ∎∎∎∎∎∎∎∎∎∎∎ (6.11%)
4040.7000 - 4400.8000 [ 81]: ∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎ (45.00%)
Histogram with patch:
# NumSamples = 180; Min = 799.30; Max = 4400.10
# Mean = 1612.930556; Variance = 1172476.469566; SD = 1082.809526; Median 1127.550000
# each ∎ represents a count of 1
799.3000 - 1159.3800 [ 95]: ∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎ (52.78%)
1159.3800 - 1519.4600 [ 27]: ∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎ (15.00%)
1519.4600 - 1879.5400 [ 16]: ∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎ (8.89%)
1879.5400 - 2239.6200 [ 4]: ∎∎∎∎ (2.22%)
2239.6200 - 2599.7000 [ 7]: ∎∎∎∎∎∎∎ (3.89%)
2599.7000 - 2959.7800 [ 6]: ∎∎∎∎∎∎ (3.33%)
2959.7800 - 3319.8600 [ 5]: ∎∎∎∎∎ (2.78%)
3319.8600 - 3679.9400 [ 0]: (0.00%)
3679.9400 - 4040.0200 [ 4]: ∎∎∎∎ (2.22%)
4040.0200 - 4400.1000 [ 16]: ∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎ (8.89%)
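For readers who want to reproduce this kind of measurement without the linked script, here is a minimal Python sketch of the same idea: sample the frequency once per second and bin the samples into equal-width buckets. It is not graysky's script; the function names, the `#` bar character, and the default of 10 bins are assumptions made for illustration. The sysfs path is the one mentioned elsewhere in this thread, and the file reports its value in kHz.

```python
import time
from pathlib import Path

# Path taken from the discussion; it only exists on Linux systems with cpufreq.
FREQ_PATH = Path("/sys/devices/system/cpu/cpu0/cpufreq/scaling_cur_freq")


def read_freq_mhz(path=FREQ_PATH):
    """Return the current frequency in MHz (the sysfs file reports kHz)."""
    return int(Path(path).read_text().strip()) / 1000.0


def sample(count, interval=1.0, path=FREQ_PATH):
    """Collect `count` frequency samples, one every `interval` seconds."""
    samples = []
    for _ in range(count):
        samples.append(read_freq_mhz(path))
        time.sleep(interval)
    return samples


def histogram(samples, bins=10):
    """Split [min, max] into equal-width bins and count the samples per bin."""
    lo, hi = min(samples), max(samples)
    width = (hi - lo) / bins or 1.0  # avoid a zero width if all samples match
    counts = [0] * bins
    for s in samples:
        idx = min(int((s - lo) / width), bins - 1)  # max value lands in last bin
        counts[idx] += 1
    return [(lo + i * width, lo + (i + 1) * width, c) for i, c in enumerate(counts)]


def render(samples, bins=10):
    """Format the histogram as text rows, one row per bin."""
    total = len(samples)
    rows = []
    for lo, hi, c in histogram(samples, bins):
        share = 100 * c / total
        rows.append(f"{lo:10.4f} - {hi:10.4f} [{c:4d}]: {'#' * c} ({share:.2f}%)")
    return "\n".join(rows)
```

Calling `print(render(sample(180)))` would take about three minutes and print a 180-sample histogram comparable in shape to the ones above.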
Very interesting.
MuQSS saves a lot of energy it seems.
Thanks Graysky for the results and for the script.
@Anonymous
I'm not sure we can infer from this that MuQSS uses less power than CFS when idle.
The script logs the value from '/sys/devices/system/cpu/cpu0/cpufreq/scaling_cur_freq' (this value is the same for all cpu* because on Intel processors, all cores share the same frequency).
But this doesn't take into account how long the cores have actually spent in the active C0 state as opposed to idle.
For instance, when you run a single-threaded load with 'taskset -c 0':
while true; do cat /sys/devices/system/cpu/cpu*/cpufreq/scaling_cur_freq; sleep 1; done
will show
frequency for all cpu ~ max frequency
while turbostat will show
frequency for all cpu (Bzy_MHz) ~ max frequency
%time in C0 state (Busy%) ~ 100% for one core and ~ 0 for the others
This is what I see on my system and what I understand from this post:
https://plus.google.com/+ArjanvandeVen/posts/dLn9T4ehywL
Pedro
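To illustrate Pedro's point that frequency alone says nothing about how long a core was actually busy, here is a rough Python sketch that estimates idle-state residency from the kernel's cpuidle counters in sysfs (`cpuidle/state*/name` and `cpuidle/state*/time`, the latter in microseconds). This is not how turbostat works; the function names and the "busy (approx)" label are made up for this example, and the remainder it reports is only an approximation of C0 time.

```python
import time
from pathlib import Path


def read_idle_times(cpu=0, root="/sys/devices/system/cpu"):
    """Return {state_name: cumulative_idle_microseconds} for one CPU."""
    times = {}
    for state in sorted(Path(root, f"cpu{cpu}", "cpuidle").glob("state*")):
        name = (state / "name").read_text().strip()
        times[name] = int((state / "time").read_text())
    return times


def residency(before, after, elapsed_us):
    """Fraction of the interval spent in each idle state, plus the remainder.

    The remainder is time not accounted to any idle state, which roughly
    corresponds to the core being in C0 (hypothetical label "busy (approx)").
    """
    shares = {name: (after[name] - before[name]) / elapsed_us for name in after}
    shares["busy (approx)"] = max(0.0, 1.0 - sum(shares.values()))
    return shares


def measure(cpu=0, seconds=1.0):
    """Read the counters twice and report residency over the interval."""
    start = time.monotonic()
    before = read_idle_times(cpu)
    time.sleep(seconds)
    after = read_idle_times(cpu)
    elapsed_us = (time.monotonic() - start) * 1_000_000
    return residency(before, after, elapsed_us)
```

On a machine where cpu0 runs a pinned busy loop, `measure(0)` should show a large "busy (approx)" share for that core while `measure(1)` shows mostly deep idle states, even though both report a similar frequency.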
Thanks, Pedro.
Found a very nice tool (xfreq) following your link.
https://github.com/cyring/xfreq
Hi graysky,
these are nice values, but with your modern CPU and hopefully the pstate driver these alone are nearly meaningless. Modern CPUs do completely power off; they go to C7 or C6 state and don't change the frequency. You can check this with powertop or powerstat.
Maybe this is the cause for the meaningless load avg top values too.
Regards sysitos
I've added updated and improved versions of all the timer patches in ck6 back into the -ck git branch. They'll still be kept separate from the muqss code but assuming these latest patches improve behaviour without the bugs they previously introduced, I can roll them into another -ck release (muqss has no pending changes.)
Best so far, latency-wise.
No need for 1500Hz kernel anymore.
Thank you very much.
Thanks for that. Assuming your applications are Hz limited, and my code addresses the workloads affected, it should be the equivalent of a 10,000Hz kernel.
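To put the Hz figures in this thread in perspective, here is a small Python sketch of the arithmetic only. It assumes a deliberately simplified model in which a tick-based timeout is rounded up to a whole number of ticks; the real kernel's jiffy handling has additional details, and this says nothing about how the timer patches are implemented. The function names are invented for the example.

```python
import math


def tick_period_ms(hz):
    """Length of one timer tick in milliseconds for a given Hz setting."""
    return 1000.0 / hz


def tick_rounded_timeout_ms(requested_ms, hz):
    """Round a requested timeout up to a whole number of ticks (simplified)."""
    period = tick_period_ms(hz)
    return math.ceil(requested_ms / period) * period


# Compare how a 1 ms timeout request fares at the Hz values discussed here.
for hz in (100, 1000, 10000):
    print(hz, tick_period_ms(hz), tick_rounded_timeout_ms(1.0, hz))
```

Under this model a 1 ms request costs a full 10 ms tick at 100Hz, which is the kind of Hz limitation the comment above refers to.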
Yes, it is very responsive.
I hope those patches will be maintained somehow or included in future releases.
My plan was to reintroduce them into the combined -ck patch as I said, so likely they're here to stay.
@ck:
I'm testing these 7 commits atm. on top of MuQSS 140, kernel 4.8.8, @100Hz. I still get a period of massive forking activity when Firefox loads my ~150 tabs, coming along with high sys CPU contention (~33% on each of two cores). I can only stop it by randomly switching to other KDE windows. I've never left it going as I considered it possibly dangerous. A comparable 'forking activity' is known to me only from kernel compilation, but that ends processes properly.
This FF forking symptom does occur on a 1000Hz kernel, too, but only for a small negligible duration <1s.
If you have an idea how I can debug it for you, please let me know. Regarding FF, I use the flashblock addon, so it's likely to not be its fault.
BR, and thank you in advance,
Manuel Krause
@ck: Addendum: Usage of the above system after "intervention" is fine @100Hz. BR, Manuel Krause
Firefox recently changed to separate threads for all tabs and disabled them autoloading because they were aware of performance issues with them autoloading together with the thread change. You must have manually re-enabled the autoloading config option because it's off by default. What exactly they've done in the code to make it misbehave more at different Hz is beyond me, but it's once again a userspace issue, not a kernel one... On the other hand, are you sure it's not just a reporting difference and it's actually behaving exactly the same? If the system is responding exactly the same it may just be reporting CPU differently at different Hz.
@ck So, is it actually good to temporarily use 1000Hz instead of 100Hz because it seems (according to the reports here in the comments) that it has the highest "compatibility" with different applications (Firefox, games, ...)? I am currently back at using the dead BFS 512 branch, which is working very well, but I also tried MuQSS 140 with linux 4.8.8, which resulted in random "lags"/"freezes" in applications such as Firefox (normal browsing, nothing intensive).
Use whatever works for you. The other comments are all without the extra patches. My recommendation would be muqss with the extra patches in the -ck branch at 100Hz, but if you have userspace that depends on 1000Hz even beyond the extra patches then why make it hard for yourself?
Hi, probably a dumb question, but how do I get those new ck patches into the new 4.8.9 kernel?
There will likely be another -ck release tomorrow and the various packagers will provide you with a 4.8.9-ck8.
Ok. Thank you.
@ck:
Although you consider it a userspace issue, I want to add some info to the Firefox "phenomenon". First, I'm using the ESR variant of FF, where the multithread feature hasn't landed yet; second, I use the tabmixplus addon, which lets me start a blank FF and load all the saved tabs when I tell it to.
The set of loaded tabs in my comparison of 100Hz vs. 1000Hz remained the same, and the difference in behaviour is reproducible since the MuQSS release, where you first suggested 100Hz as default. The FF issue @100Hz comes along with it not responding at all, no window refresh happens, while other windows are unaffected.
If you consider it to be not scheduler related and don't want to bother with it, it's o.k for me anyways, as I have no problem with configuring 1000Hz and just use this one. :-)
BR, Manuel Krause
blogspot seems to have a problem with replying/ displaying at 99th+ posts. Please have a look, thank you.
BR, Manuel Krause
Seems to be sufficient to have a dumb page breaker for 99+
BR, Manuel Krause
Hi, I tested the ck-ivybridge kernel on the Manjaro distro yesterday and it didn't perform quite well for me...
Just an example of what was going on...
I have like 5-6 autostart applications when I log in...
Clementine was playing music, I opened Steam, Clementine stopped playing music while Steam was loading, and after Steam had loaded it started playing music again...
I was so shocked by such a simple task that I didn't perform any additional tests...
The above behavior is not happening with Manjaro Kernels
Try the default I/O scheduler; switch from bfq to deadline and try again.
Just to be clear, Arch != Manjaro. I do not know what (if any) Manjaro specific kernel config options are used. If you're pulling from [repo-ck], you are getting the Arch default, not necessarily the Manjaro ones. Unknown how this might affect the performance of your system.
I am using [repo-ck], but it is a bit messy to install the ck kernel on Manjaro since you have to remove the Manjaro kernel, nvidia utils and nvidia drivers and then install the ck kernel with the ck nvidia drivers.
I think I will wait a bit to test it again until it gets the Nvidia 375.20 drivers that have already been released.
Some other people said that they had problems with some games freezing with ck kernel 4.8.3 but I think that this is fixed by now...
I tested the kernel again today, but got the same result.
Maybe it is because I am using mdadm raid 0 with 2x SSD?
I don't have time to release anything today. Maybe Tuesday; 4.8.10 will probably be out by then.
Before Con prepares a new patch publishing run... a question to the people with the softirq error messages: Have these gone away by now? And if so, by what means? With his included workaround patch?
Thank you, BR Manuel Krause
Normally an interrupt is picked up by a CPU and the work added to the soft interrupt list of work to be done and is then serviced when appropriate. That warning that was coming up was the CPU going idle inappropriately before all the soft interrupt work was completed. MuQSS 140 does not have that warning issue any more as it aggressively handles any pending soft interrupt work when a CPU is about to go idle. The patch in the Test/ directory is not going to be part of the next release as it disables this code and it proves to still be required.
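The behaviour described above can be pictured with a toy Python model. This is purely illustrative and bears no relation to the actual kernel or MuQSS source; the class and method names are invented. It shows the difference between a CPU that goes idle with soft interrupt work still queued (the situation that produced the warning) and one that services everything pending first.

```python
from collections import deque


class ToyCPU:
    """Toy model of one CPU's pending soft interrupt work (not kernel code)."""

    def __init__(self):
        self.pending = deque()  # deferred work queued by interrupts
        self.handled = []       # work that has been serviced, in order
        self.idle = False

    def raise_softirq(self, work):
        """An interrupt arrived: queue its deferred work for later."""
        self.pending.append(work)

    def run_pending(self):
        """Service all queued soft interrupt work in arrival order."""
        while self.pending:
            self.handled.append(self.pending.popleft())

    def go_idle(self, flush_first=True):
        """Enter idle; return True if work was left pending (the warning case)."""
        if flush_first:
            self.run_pending()
        self.idle = True
        return bool(self.pending)
```

With `flush_first=False` and queued work, `go_idle()` returns `True`, standing in for the warning; with the default it drains the queue first and returns `False`.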
@ck:
Mmmh, "dodgy" can have a really different meaning between American and British English. My misunderstanding of 'dodgy workaround' resulted in reverting it by using the patch from the /Test directory for my most recent kernels. I'll recompile tomorrow and report back.
BR, Manuel Krause
No, it has the same meaning, and I thought the way I tackled it in the code was a kludge. But basically this was another case of mainline code making certain assumptions about how things would happen, and the way muqss works didn't play nicely with it, so it is needed as is.
Finally I got time to test this. Before, I was still using BFS 0.512 with the backport patches (so basically the 4.8-bfs branch). The thing I noticed was that it couldn't deal with high CPU loads anymore. For example, with BFS I had a very smooth desktop experience where everything worked well (no lags) while compiling big projects like LLVM (plus extra workloads like running Spotify, Firefox with 50 tabs, 2x Windows VMs, Chromium with 8 tabs, PyCharm). With MuQSS (from linux-ck 4.8.9) everything was lagging while compiling LLVM, literally everything; even scrolling in Spotify led to a delay of some milliseconds, which did not happen with BFS.
I know I am quite late to the party as I didn't want to test out older MuQSS releases and couldn't report this earlier. Also I am not sure how I can provide more information. The easiest way for me to reproduce this is to compile LLVM + Clang (ninja -j8/make -j8) while scrolling a large Spotify playlist up and down. You will notice lags while doing it, which again did not happen with BFS.
Well, the scheduling algorithm is identical between BFS and MuQSS so if anything behaves differently then either there's a bug or something else on a configuration level is responsible. Are you comparing the same Hz configurations and running the same I/O schedulers? While I am recommending 100Hz for MuQSS, it appears userspace is still Hz fixated and there are patches that help that in the -ck git branch but ultimately setting the same config options is the only way to tell.
I tested both configurations, 1kHz and 100Hz (+BFQ on BFS and MuQSS). Both were behaving quite the same, as the slowdown (or lag) appeared on both kernel builds. With 100Hz I also noticed a startup slowdown, as it took 5 more seconds to boot up the kernel (according to systemd-analyze time).
linux-ck-4.8.9-3 seems to use ck7 where you have reinstated the hrtimer patches (1b1990a-b07d63a).
I will try to revert these patches and test again both configurations with 1kHz and 100Hz.
Well, I guess I was wrong, ck7 didn't include the hrtimer patches. Do you think it's worth trying those out then?
If you still have a slowdown at 1kHz then it's something else. As I've said elsewhere, on measurable benchmarks under load the latency figures are basically identical between muqss and bfs so I'm not really sure why you're affected. Is the rest of your config the same with respect to I/O scheduler for example?
I've just recompiled linux-ck with 1kHz and tested it again just to make sure, but the delay/lag is still there.
It seems the configurations are completely the same, except that linux-ck (MuQSS+1kHz) uses CONFIG_TICK_CPU_ACCOUNTING=y and BFS does not.
Is this relevant?
Nope, that would not make a difference so you have a problem. Email me the output of top running in batch mode for a minute or so during the workload.
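For anyone asked to provide the same data, a minute of `top` output in batch mode can be captured with a few lines of Python. This is only a sketch: it assumes the procps version of `top` (whose `-b`, `-d` and `-n` options select batch mode, the delay between updates, and the number of iterations), and the output filename and function names are hypothetical.

```python
import subprocess


def build_top_command(seconds=60, interval=1):
    """Build a procps `top` batch-mode command covering roughly `seconds`."""
    iterations = max(1, seconds // interval)
    return ["top", "-b", "-d", str(interval), "-n", str(iterations)]


def capture_top(outfile="top-batch.log", seconds=60, interval=1):
    """Run top in batch mode and write its output to `outfile`."""
    with open(outfile, "w") as fh:
        subprocess.run(build_top_command(seconds, interval), stdout=fh, check=True)
    return outfile
```

Start the problematic workload first, then call `capture_top()` and attach the resulting file to the email.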
Another couple of data points - are you running your compilation niced and do you have a hyperthreaded CPU with SMT nice configured in?
Please check your inbox, you should have my top results :). And no, I am not "nicing" anything at all (in fact, I have never used it), so I am just running it "plain". The kernel seems to be compiled with CONFIG_SMT_NICE though. Also I am using a hyperthreaded CPU (i7 Skylake).
Hope this helps.
thank you
you can also read Why Linux is Free?
I'm hitting a BUG when trying to create a QEMU/KVM VM, any ideas? I saw a similar BUG in previous user comments regarding BFS for 4.8, where a person hit the same BUG when he tried using VirtualBox.
Apr 18 20:08:35 ROG audit[5962]: AVC apparmor="STATUS" operation="profile_replace" profile="unconfined" name="libvirt-f02a7ff8-d128-4db2-
Apr 18 20:08:35 ROG kernel: audit: type=1400 audit(1492564115.688:55): apparmor="STATUS" operation="profile_replace" profile="unconfined"
Apr 18 20:08:35 ROG kernel: usercopy: kernel memory overwrite attempt detected to ffff9b05d3ece708 (kmalloc-8) (128 bytes)
Apr 18 20:08:35 ROG kernel: ------------[ cut here ]------------
Apr 18 20:08:35 ROG kernel: kernel BUG at /usr/src/linux-4.10.0/mm/usercopy.c:75!
Apr 18 20:08:35 ROG kernel: invalid opcode: 0000 [#1] SMP
Apr 18 20:08:35 ROG kernel: Modules linked in: xt_CHECKSUM iptable_mangle ipt_MASQUERADE nf_nat_masquerade_ipv4 iptable_nat nf_nat_ipv4 b
Apr 18 20:08:35 ROG kernel: cryptd snd_hwdep snd_pcm intel_cstate nvidia(POE) intel_rapl_perf snd_seq_midi saa7164 snd_seq_midi_event sn
Apr 18 20:08:35 ROG kernel: multipath linear uas usb_storage hid_generic usbhid hid raid0 i915 i2c_algo_bit drm_kms_helper syscopyarea s
Apr 18 20:08:35 ROG kernel: CPU: 0 PID: 4052 Comm: libvirtd Tainted: P OE 4.10.0-19+my-generic #21
Apr 18 20:08:35 ROG kernel: Hardware name: ASUS All Series/MAXIMUS VII GENE, BIOS 3003 10/28/2015
Apr 18 20:08:35 ROG kernel: task: ffff9b05aa5c5300 task.stack: ffffaaf342f30000
Apr 18 20:08:35 ROG kernel: RIP: 0010:__check_object_size+0x77/0x1d6
Apr 18 20:08:35 ROG kernel: RSP: 0018:ffffaaf342f33ee0 EFLAGS: 00010282
Apr 18 20:08:35 ROG kernel: RAX: 000000000000005e RBX: ffff9b05d3ece708 RCX: 0000000000000000
Apr 18 20:08:35 ROG kernel: RDX: 0000000000000000 RSI: ffff9b05efa0dbc8 RDI: ffff9b05efa0dbc8
Apr 18 20:08:35 ROG kernel: RBP: ffffaaf342f33f00 R08: 0000000000000005 R09: 0000000000000551
Apr 18 20:08:35 ROG kernel: R10: 0000000000000008 R11: ffffffffa84469cd R12: 0000000000000080
Apr 18 20:08:35 ROG kernel: R13: 0000000000000000 R14: ffff9b05d3ece788 R15: ffff9b05d3ece708
Apr 18 20:08:35 ROG kernel: FS: 00007f410c5d1700(0000) GS:ffff9b05efa00000(0000) knlGS:0000000000000000
Apr 18 20:08:35 ROG kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Apr 18 20:08:35 ROG kernel: CR2: 00007f41196eaaa0 CR3: 00000003f0da5000 CR4: 00000000001406f0
Apr 18 20:08:35 ROG kernel: Call Trace:
Apr 18 20:08:35 ROG kernel: SyS_sched_setaffinity+0x6b/0xe0
Apr 18 20:08:35 ROG kernel: entry_SYSCALL_64_fastpath+0x1e/0xad
Apr 18 20:08:35 ROG kernel: RIP: 0033:0x7f41188425dc
Apr 18 20:08:35 ROG kernel: RSP: 002b:00007f410c5d0798 EFLAGS: 00000246 ORIG_RAX: 00000000000000cb
Apr 18 20:08:35 ROG kernel: RAX: ffffffffffffffda RBX: 00007f41193e271c RCX: 00007f41188425dc
Apr 18 20:08:35 ROG kernel: RDX: 00007f40e81211e0 RSI: 0000000000000080 RDI: 000000000000174c
Apr 18 20:08:35 ROG kernel: RBP: 00007f40e83155d0 R08: 00007f40e81de0e0 R09: 0000000000000000
Apr 18 20:08:35 ROG kernel: R10: 00007f40e81211e0 R11: 0000000000000246 R12: 00007f40e83155d0
Apr 18 20:08:35 ROG kernel: R13: 00007f41196eaa90 R14: 0000000000000001 R15: 00007f410c5d1698
Apr 18 20:08:35 ROG kernel: Code: c7 c2 13 4f ed a7 48 c7 c6 d1 da e9 a7 48 c7 c7 60 a5 e9 a7 48 0f 44 d1 48 c7 c1 8a 2e e9 a7 48 0f 44 f
Apr 18 20:08:35 ROG kernel: RIP: __check_object_size+0x77/0x1d6 RSP: ffffaaf342f33ee0
Apr 18 20:08:35 ROG kernel: ---[ end trace 7f5e3e96a69c8802 ]---