Here's a new release to go along with and commemorate the 4.8.10 stable release (they're releasing stable releases faster than my development code now.)
linux-4.8-ck8 patch:
patch-4.8-ck8.lrz
MuQSS by itself:
4.8-sched-MuQSS_144.patch
There are a small number of updates to MuQSS itself.
Notably there's an improvement in interactive mode when SMT nice is enabled and/or realtime tasks are running, or there are users of CPU affinity. Tasks previously would not schedule on CPUs when they were stuck behind those as the highest priority task and it would refuse to schedule them transiently.
The old hacks for CPU frequency changes from BFS have been removed, leaving the tunables to default as per mainline.
The default of 100Hz has been removed, but in its place a new and recommended 128Hz has been implemented - this is just a silly microoptimisation to take advantage of the fast shifts that /128 has on CPUs compared to /100, and is close enough to 100Hz to behave otherwise the same.
For the -ck patch only, I've reinstated updated and improved versions of the high resolution timeouts to improve the behaviour of userspace that is inappropriately Hz dependent, allowing low Hz choices to not affect latency.
Additionally by request I've added a couple of tunables to adjust the behaviour of the high res timers and timeouts.
/proc/sys/kernel/hrtimer_granularity_us
and
/proc/sys/kernel/hrtimeout_min_us
Both of these are in microseconds and can be set from 1-10,000. The first is how accurate high res timers will be in the kernel and is set to 100us by default (on mainline it is Hz accuracy).
The second is how small to make a request for a "minimum timeout" generically in all kernel code. It is set to 1000us by default (on mainline it is one tick).
I doubt you'll find anything useful by tuning these but feel free to go nuts. Decreasing the second tunable much further risks breaking some driver behaviour.
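For anyone who does want to experiment, here is a minimal sketch of reading and setting the two tunables from Python. The paths and the 1-10,000 range are the ones given above; the helper names (`validate_us`, `read_tunable`, `write_tunable`) are made up for illustration, and writing requires root on a kernel that carries the -ck patch.

```python
from pathlib import Path

# Paths as given in the post; they only exist on kernels carrying the -ck patch.
HRTIMER_GRANULARITY = Path("/proc/sys/kernel/hrtimer_granularity_us")
HRTIMEOUT_MIN = Path("/proc/sys/kernel/hrtimeout_min_us")

MIN_US, MAX_US = 1, 10_000  # allowed range stated in the post


def validate_us(value: int) -> int:
    """Return value unchanged if it lies in the documented 1-10,000 us range."""
    if not MIN_US <= value <= MAX_US:
        raise ValueError(f"{value} us is outside {MIN_US}-{MAX_US} us")
    return value


def read_tunable(path: Path) -> int:
    """Read the current value of a tunable, in microseconds."""
    return int(path.read_text().strip())


def write_tunable(path: Path, value_us: int) -> None:
    """Write a new value (needs root); the range is checked first."""
    path.write_text(f"{validate_us(value_us)}\n")


if __name__ == "__main__":
    for tunable in (HRTIMER_GRANULARITY, HRTIMEOUT_MIN):
        if tunable.exists():
            print(tunable.name, read_tunable(tunable), "us")
        else:
            print(tunable.name, "not present (kernel without the -ck patch?)")
```

Given the warning above, it is probably wise to leave `hrtimeout_min_us` at or near its default unless you are prepared for driver misbehaviour.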
Enjoy!
Please enjoy.
-ck
cyclictest (cyclictest -N -S -p 80) avg times increased by a factor of 10 with this version.
duud
Try it with interactive disabled.
Same results with interactive disabled.
Numbers are in ns.
T: 0 ( 491) P:80 I:1000 C: 2045 Min: 3989 Act: 71984 Avg: 66114 Max: 80756
T: 1 ( 492) P:80 I:1500 C: 1363 Min: 4347 Act: 103537 Avg: 95088 Max: 113803
T: 2 ( 493) P:80 I:2000 C: 1022 Min: 3128 Act: 134320 Avg: 119873 Max: 146422
T: 3 ( 494) P:80 I:2500 C: 818 Min: 4053 Act: 165569 Avg: 130382 Max: 176960
On CFS:
T: 0 ( 545) P:80 I:1000 C: 817 Min: 2209 Act: 9607 Avg: 9385 Max: 16567
T: 1 ( 546) P:80 I:1500 C: 545 Min: 2293 Act: 2635 Avg: 7862 Max: 16549
T: 2 ( 547) P:80 I:2000 C: 408 Min: 3506 Act: 9911 Avg: 9822 Max: 10986
T: 3 ( 548) P:80 I:2500 C: 327 Min: 2180 Act: 3270 Avg: 9018 Max: 17319
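To put a number on the "factor of 10", the two result sets above can be compared with a small parser. This is only a sketch: the regular expression is my own reading of the summary-line layout shown here, not something taken from cyclictest itself, and the names are illustrative.

```python
import re
from statistics import mean

# One cyclictest summary line looks like:
# T: 0 ( 491) P:80 I:1000 C: 2045 Min: 3989 Act: 71984 Avg: 66114 Max: 80756
LINE_RE = re.compile(
    r"T:\s*(?P<thread>\d+)\s*\(\s*(?P<tid>\d+)\)\s*"
    r"P:\s*(?P<prio>\d+)\s*I:\s*(?P<interval>\d+)\s*C:\s*(?P<count>\d+)\s*"
    r"Min:\s*(?P<min>\d+)\s*Act:\s*(?P<act>\d+)\s*"
    r"Avg:\s*(?P<avg>\d+)\s*Max:\s*(?P<max>\d+)"
)


def parse_line(line: str) -> dict:
    """Turn one summary line into a dict of integer fields (latencies in ns here)."""
    match = LINE_RE.search(line)
    if match is None:
        raise ValueError(f"not a cyclictest summary line: {line!r}")
    return {name: int(value) for name, value in match.groupdict().items()}


def mean_avg_latency(lines: list) -> float:
    """Mean of the per-thread Avg columns."""
    return mean(parse_line(line)["avg"] for line in lines)


# The two runs posted above, copied verbatim.
MUQSS_RUN = [
    "T: 0 ( 491) P:80 I:1000 C: 2045 Min: 3989 Act: 71984 Avg: 66114 Max: 80756",
    "T: 1 ( 492) P:80 I:1500 C: 1363 Min: 4347 Act: 103537 Avg: 95088 Max: 113803",
    "T: 2 ( 493) P:80 I:2000 C: 1022 Min: 3128 Act: 134320 Avg: 119873 Max: 146422",
    "T: 3 ( 494) P:80 I:2500 C: 818 Min: 4053 Act: 165569 Avg: 130382 Max: 176960",
]
CFS_RUN = [
    "T: 0 ( 545) P:80 I:1000 C: 817 Min: 2209 Act: 9607 Avg: 9385 Max: 16567",
    "T: 1 ( 546) P:80 I:1500 C: 545 Min: 2293 Act: 2635 Avg: 7862 Max: 16549",
    "T: 2 ( 547) P:80 I:2000 C: 408 Min: 3506 Act: 9911 Avg: 9822 Max: 10986",
    "T: 3 ( 548) P:80 I:2500 C: 327 Min: 2180 Act: 3270 Avg: 9018 Max: 17319",
]

ratio = mean_avg_latency(MUQSS_RUN) / mean_avg_latency(CFS_RUN)
print(f"MuQSS/CFS ratio of mean Avg latency: {ratio:.1f}")
```

On these eight lines the ratio comes out at a little over 11, so "a factor of 10" is a fair summary of this particular pair of runs. Note the cycle counts (C) differ between the runs, so they were not the same length.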
Con just puts the userspace programmers' hardcoded 'dodgy' things back on the table. Although it seems buggy, the factor of 10 is nicely significant. Maybe, cyclictest isn't MuQSS-ready(tm) ;-)
@duud: Any issues or slowdowns?
BR, Manuel Krause
Okay let's take a step back. What's cyclic test and where is it? Need to know if it's benchmarking something valid first before we consider if this is relevant. Also what muqss exactly are you benchmarking? Muqss by itself, with -ck, what Hz config?
I tried cyclictest, but it was broken with older MuQSS releases.
When run with the histogram option, the latency counters were always overflowing. But this seems fixed with MuQSS144.
This presentation gives useful info on cyclictest:
http://events.linuxfoundation.org/sites/events/files/slides/cyclictest.pdf
Pedro
I'd have to see the actual code to pass judgement, but the fact that it was even possible for the latency counters to overflow, regardless of the CPU scheduler it was run on, makes it hard to put any great value on it. It's not unusual that these latency measurement tools depend on kernel design/APIs that may or may not exist in muqss or have a totally different meaning. The same thing happened with lattest.
^^ That first cyclictest looks really horrible.
Must be some bad driver or bad code in userspace or something.
Latencies should be spread almost even across cores.
Something clearly isn't right. It also shows on CFS but not that extreme though.
Cyclictest running fine as it should here on Core 2 3Ghz. (MuQSS)
T: 0 ( 1510) P:80 I:1000 C: 11657 Min: 1041 Act: 1223 Avg: 1319 Max: 5217
T: 1 ( 1511) P:80 I:1500 C: 7771 Min: 1167 Act: 1390 Avg: 1478 Max: 6242
1.3-1.5µs on Core 2 :)
Factor 2-3 lower than CFS, like it should be. But then again cyclictest isn't saying too much also.
I use a custom kernel though with "everything" ripped out except hardware drivers for the machine I am compiling on.
Cyclictest on W3690 hexacore, latest ck, kernel 4.8.10.
T: 0 ( 2036) P:80 I:1000 C: 9970 Min: 1123 Act: 1357 Avg: 1455 Max: 4912
T: 1 ( 2037) P:80 I:1500 C: 6647 Min: 1113 Act: 1467 Avg: 1504 Max: 5313
T: 2 ( 2038) P:80 I:2000 C: 4985 Min: 1189 Act: 1367 Avg: 1453 Max: 4933
T: 3 ( 2039) P:80 I:2500 C: 3988 Min: 1195 Act: 1345 Avg: 1449 Max: 4778
T: 4 ( 2040) P:80 I:3000 C: 3323 Min: 1224 Act: 1384 Avg: 1397 Max: 3064
T: 5 ( 2041) P:80 I:3500 C: 2848 Min: 1170 Act: 1312 Avg: 1370 Max: 8951
I usually run a quick cyclictest to check for scheduling overhead.
Now I did some tests. It's the new 128HZ option - nobody tested it? Setting back to 100HZ yields low numbers again.
duud
That reminds me...
When I was experimenting with higher Hz than 1000 I noticed the same regarding uneven Hz numbers.
They must be divisible by 2 and 10 and such (no idea exactly), otherwise performance suffers somehow (maybe drivers?).
As I use 1000 Hz normally I didn't notice that 128 Hz thing.
What a fascinating discovery. Now the real question is - is there some in-kernel requirement of it being that multiple that makes it actually misbehave, OR is it just a sampling error that occurs as a result of that multiple and it's reporting bad where in fact it's performing fine?
All I can say is that performance really suffered.
Like above in that cyclictest CFS/MuQSS (128Hz) comparison.
Some code seems to depend on Hz being a multiple of something.
Maybe in the kernel, maybe in the drivers, maybe in userspace...
... no idea...
Well I'm saying that I'm not sure that performance actually is suffering and that there may be a reporting error from the clock with unusual Hz values. Either way, Hz 128 isn't proving to be beneficial so that gets the boot too next release...
From testing performance really suffered, I didn't measure anything though since it was obvious.
I think it was 864 Hz or something...
and figured maybe it was a rounding or accuracy error or something and checking the next value if it was "even".
Eventually I came up with 1/1250 which is 0.0008.
I changed Hz to 1250 then and went straight to the next compilation.
No problems anymore.
Probably just your hardware - maybe it makes the tsc unstable. It behaves fine on mine. Don't get anything like your results now that I've tried cyclictest (was getting empty directory when trying to git clone it before.)
Also bear in mind that cyclictest should be run as sudo to run realtime. It might think it's running realtime when running on muqss without sudo but it's running sched iso which is nothing like running sched_fifo.
The hardware argument makes sense since I need to enable "Enable PCI quirk workarounds" in the kernel also to get low latency.
So, summarising several of the reports of this blog entry, choosing one of the more traditional HZ values (and NOT 128) would be more reasonable (more compatible, safer, etc. for drivers, kernel and userspace) ?
Thanks and best regards, Manuel Krause
Here is something that I'm getting only on 128HZ
APIC calibration not consistent with PM-Timer: 93ms instead of 100ms
APIC delta adjusted to PM-Timer: 1312496 (1230459)
I took a very quick look into the code but I can't see how this might be HZ related. Ideas?
duud
This might be interesting:
#define LAPIC_CAL_LOOPS (HZ/10)
duud
^ Good find. There might be more...
@duud:
I also got these with 128HZ, second line differed a bit, by "... 1662498 (1558584)". Core2duo cpu, Florian reported this first on this blog page.
After recompiling same setup with 100HZ no such messages occur. But then, my FF forking issue is appearing again (vs. 128HZ).
I haven't seen slowdowns with 128HZ on my system. Maybe Con is on the right way with this path, but not touching all corner cases so far.
What is the "#define LAPIC_CAL_LOOPS (HZ/10)" thing about? The possibly needed fraction of 10?
Manuel Krause
@Manuel:
Yes, it's an integer division
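A quick worked example of what that truncation does. This is a sketch under one assumption that is not confirmed in this thread: that the APIC calibration window is meant to be 100ms and is counted as `LAPIC_CAL_LOOPS` ticks of 1000/HZ ms each. Python's `//` stands in for C integer division on these positive values.

```python
def lapic_cal_window_ms(hz: int) -> float:
    """Length of a window of HZ/10 ticks, with HZ/10 truncated as in C."""
    loops = hz // 10      # LAPIC_CAL_LOOPS (HZ/10): the remainder is dropped
    tick_ms = 1000 / hz   # duration of one tick in milliseconds
    return loops * tick_ms


if __name__ == "__main__":
    for hz in (100, 128, 160, 200, 250, 256, 512, 1000):
        print(f"HZ={hz:4d}  HZ/10={hz // 10:3d}  window={lapic_cal_window_ms(hz):.2f} ms")
```

Under that assumption, HZ=128 gives 12 ticks of 7.8125ms, which is 93.75ms and lines up with the "93ms instead of 100ms" message, while 256 gives 97.66ms, matching the "97ms" reported further down. Values such as 100, 160, 200, 250 and 1000 land on exactly 100ms.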
Short excerpt of a single file (not complete)
/arch/x86/include/asm/apb_timer.h
Line 44 : (loops_per_jiffy * (HZ/4)));
Line 78 : #define MIN_SPU_TIMESLICE max(5 * HZ / (1000 * SPUSCHED_TICK), 1)
Line 79 : #define DEF_SPU_TIMESLICE (100 * HZ / (1000 * SPUSCHED_TICK))
Line 133: timer64_config(TIMER64_RATE / HZ);
Line 192: bogosum/(500000/HZ), bogosum/(5000/HZ) % 100);
Line 470: (loops_per_jiffy/(500000/HZ)),
Line 539: schedule_timeout(HZ/10);
Line 553: mod_timer(&timer_virt_cntr, jiffies + HZ / 10);
Line 667: mod_timer(&timer_spu_event_swap, jiffies + HZ / 25);
on 4.8.10 kernel, 4.8-ck8 patchset
patch -p1 < ../4.8-ck8/patches/0006-Implement-min-and-msec-hrtimeout-un-interruptible-sc.patch
patching file include/linux/sched.h
Hunk #1 FAILED at 437.
1 out of 1 hunk FAILED -- saving rejects to file include/linux/sched.h.rej
patching file kernel/time/hrtimer.c
Hunk #1 FAILED at 1788.
1 out of 1 hunk FAILED -- saving rejects to file kernel/time/hrtimer.c.rej
^^ Also there are two 0001-... patches, maybe this is the problem?
Yes that was it. Uploading new tarballs. The second 0001 was not meant to be there.
Works now and running great on a W3690 hexacore (4.8.10 + 4.8-ck8 patchset).
Didn't notice any major increase regarding cyclictest^^.
Also I like the implementation of tunables although I didn't have time to play around with them yet.
Best regards,
Anonymouse :)
Hi,
I just updated to MuQSS v0.144 and for the first time I saw this kernel message in my Arch syslog during boot:
kernel: APIC calibration not consistent with PM-Timer: 93ms instead of 100ms
My HZ-config on my Core2 Duo:
CONFIG_HZ_PERIODIC=y
# CONFIG_NO_HZ_IDLE is not set
# CONFIG_NO_HZ_FULL is not set
# CONFIG_HZ_100 is not set
CONFIG_HZ_128=y
Does this have anything to do with HZ_128=y? Do I have to optimize one of the following options:
/proc/sys/kernel/hrtimer_granularity_us is defaulting to 100 and
/proc/sys/kernel/hrtimeout_min_us defaults to 1000.
Thanks,
Florian.
Neither the Hz value nor those tunables should cause or have any effect on that so I don't really know why, except that not all hardware has stable TSC to be able to use them for the apic timers. I don't believe you need to do anything as the kernel will automatically pick the fastest stable clock it can.
@Florian & @ck:
I get the same message (same cpu type), but it seems to be only informative. Second line after this is on my system:
APIC delta adjusted to PM-Timer: 1662498 (1558584)
So, kernel seems to know how to deal with it. I don't observe anomalies on my system.
BR, Manuel Krause
Hello.
Here come the usual benchmarks. The kernel configuration is Archlinux's 4.8.7 one. Intel-pstate+powersave frequency governor is used.
CFS vs MuQSS144
http://openbenchmarking.org/result/1611224-LO-CFSVSMUQS05
MuQSS140 vs MuQSS144
http://openbenchmarking.org/result/1611224-LO-MUQSS140164
There is some small improvement with MuQSS144+interactive=1, notably on ebizzy.
Pedro
Nice, but too bad those benchmarks don't show the major gain in responsiveness compared to CFS.
There are some more benchmarks, for those who are interested: https://docs.google.com/spreadsheets/d/1EayezAsGlJdXjZbS3b9m7YtvtRF-DJ3xrT3hYCvfymQ/edit?usp=sharing
Br, Eduardo
Hey,
Just wanted to let you know that I still have issues with the spotify scrolling even with this new release (Same workload as I described in my email).
I have tested this with 1kHz so far. I will test the new 128Hz in a moment.
Is there anything else I can do to help tracking down this problem?
Seems that 128Hz didn't make a difference.
As I said to you in the email, I did not specifically address your issue and was hoping the fixes in 144 helped but I guess not. I'm still scratching my head on that one for the reasons I've mentioned. Perhaps send me a fresh 'top' output again with the affected workload and make sure it shows threads please.
Hey,
Sorry I couldn't use my email because I am currently out of town, so obviously I am not able to test the nicing stuff (rcu_preempt) until Sunday.
Despite that, I can at least give you my grep RCU .config result:
# RCU Subsystem
CONFIG_PREEMPT_RCU=y
# CONFIG_RCU_EXPERT is not set
CONFIG_SRCU=y
# CONFIG_TASKS_RCU is not set
CONFIG_RCU_STALL_COMMON=y
# CONFIG_TREE_RCU_TRACE is not set
# CONFIG_RCU_EXPEDITE_BOOT is not set
# RCU Debugging
# CONFIG_PROVE_RCU is not set
# CONFIG_SPARSE_RCU_POINTER is not set
# CONFIG_RCU_PERF_TEST is not set
# CONFIG_RCU_TORTURE_TEST is not set
CONFIG_RCU_CPU_STALL_TIMEOUT=60
# CONFIG_RCU_TRACE is not set
# CONFIG_RCU_EQS_DEBUG is not set
@ck:
With my first quick test I see progress between 140 and 144, regarding my FF "issue".
Running 4.8.10 with the -ck8 timer patches at default 128Hz. And very low system base load. Very nice :-)
BR, Manuel Krause
Hi Con.
I've some questions on interbench.
I ran it several times and I found that sometimes there are big variations with CFS in 'Max Latency', '% Desired CPU' and '% Deadlines Met'. Average latencies are more consistent, but still with variations.
I tried with both intel-pstate performance and powersave. Doesn't make a difference.
I tried running interbench longer (-t 90), and it is better.
You can see the results here:
https://docs.google.com/spreadsheets/d/163U3H-gnVeGopMrHiJLeEY1b7XlvND2yoceKbOvQRm4/edit?usp=sharing
The colors mean:
blue = within +- 10% of the reference
green = better
red = worse
The reference is the first run on the left.
I wonder if such variations are expected.
If it is so, how to do a fair comparison between schedulers ?
Interestingly, there are fewer variations with MuQSS (maybe due to the heuristics in CFS ?).
Pedro
Yes, CFS is a big state machine and that's what happens when there are heuristics to predict how to be interactive - there is wild variation in behaviour. MuQSS is deterministic in its behaviour and has no heuristics so the results should be mostly consistent apart from system load events (kernel threads, interrupts etc.)
Thanks a lot, this release seems to alleviate TF2's startup time & fps problems. I also haven't run into any crashes or issues in an hour of testing.
Subjectively, my whole system seems to be more responsive and also boot up a little quicker but that could be placebo as I haven't done scientific testing. Although I could run a youtube video in the background while gaming without noticing any changes in input responsiveness (which I usually do in that case) so I guess something in this release does improve that.
~ kiwii, the anon who filed the TF2 bug report
Thanks for the feedback. Yes the timer changes in -ck are specifically designed to work around userspace coding errors which make it inappropriately Hz dependent. Additionally I've noticed that the boot process is Hz dependent too so you're right that the boot is quicker from both kernel code and system(d)/init.
Yep, the boot process got a lot faster for me as well. Also, I think there's something about nvidia-drivers that is HZ dependent as well. GDM on my laptop using Intel drivers was much, much faster than my desktop with a nvidia card, and my desktop is A LOT faster than my laptop.
So is the GDM slowdown fixed for you now then?
Yes, it seems to be as fast as a kernel previously compiled with 1000hz. I'll make some tests with a 1000hz kernel to confirm later.
With the concerns regarding security and privacy, distributions like Debian have started to release grsecurity in their repos. SID only for now.
Any plans to make CK compatible with grsec (in the [near] future)?
Not really, no.
Thanks a lot. Very responsive on a i7-870 quadcore (oc) even more than CFS. I also used "KBUILD_CFLAGS += -falign-functions=1 -falign-jumps=1 -falign-loops=1 -falign-labels=1 -fno-builtin -pipe" in Line 200 of /arch/x86/Makefile which seems to give a nice boost.
Any source from where you got it? And, may there be a whitespace typo in it before "-pipe"?
Do you use full -ck8 or plain MuQSS 0.144, and which HZ value have you chosen?
Thanks in advance, BR, Manuel Krause
Full ck8, 1000Hz. I was testing a lot of compiler options, those proved to increase performance significantly.
I got those alignment options from here: http://stackoverflow.com/questions/19470873/why-does-gcc-generate-15-20-faster-code-if-i-optimize-for-size-instead-of-speed
-fno-builtin is a recommendation by Agner Fog.
" -pipe" is extra. It pipes output rather than using temp files which speeds up the compilation.
Is that supposed to affect the kernel at runtime or only the compile-time? My Makefile already has -falign-jumps=1 and -falign-loops=1 set for the 64bits architecture before line 200, but none of the others.
Thanks for your reply! Forget my last posting for the moment. Why still 1000Hz?
And while we're so off-topic about gcc compiler options: Wasn't there a Makefile option for it to compile for more performance? I don't remember any more.
^^ It affects runtime of the kernel, it will be faster (on Intel, can't speak for AMD since I don't own any AMD box).
Out of laziness I always use the whole "block" and copy-paste it anywhere. I didn't pay any attention to whether it's already there.
1000 Hz feels more responsive.
Yes, but it is plain -O2 as opposed to -Os which is for small binary size.
^^^ These add some options on top of -O2.
Okay, thank you for your detailed info so far. I'm atm. at 250Hz, which doesn't cause trouble like 100 or 128Hz. Reboot is pending. :-)) I'll come back to this in a few hours.
BR, Manuel Krause
I gave it a little more time to prove it's no Fata Morgana. I'm quite excited about the scope of the effects: interactivity increased while not slowing down other subsystems (display/window refresh, disk i/o, eth transmission) -- these even seem to benefit too. No benchmarks done, but subjectively with these options the 250Hz kernel feels superior to a 1000Hz one without. And, no errors occurred.
Many thanks for sharing and BR,
Manuel Krause
You're welcome.
Nice it worked for you as well.
You said you've tested many compiler options. If you keep up to date in this area in the future, please don't hesitate to publish your findings on here. Although a bit off-topic, I can imagine many people on here who may want to benefit from your time-consuming testing work.
I can also only speak for my intel core2duo, hopefully other brands' testers prove the positive effects as well.
BR, Manuel Krause
If you're finding 250Hz works best that's almost certainly because mainline's default is 250 and no doubt there is code which was developed and optimised at that value and no one ever bothered to test other values to see if they're problematic. Many of the /10 divisions in the code that have been pointed out are still harmless even if they round down.
I came back to
CONFIG_NO_HZ_COMMON=y
# CONFIG_HZ_PERIODIC is not set
CONFIG_NO_HZ_IDLE=y
and
CONFIG_HZ_250=y
after I had some problems while compiling palemoon browser a week ago with 100% CPU usage and my system nearly got frozen, no responsiveness, windows weren't redrawn etc.
No problems with the current 4.8.11-1-ck but I still don't understand one thing: now WITHOUT periodic timers I have about 10% MORE CPU usage than before. This is shown by htop and other utilities. Is this really the case or some kind of different measurement (periodic timer vs. tickless idle)? I have no problems with nonfluid programs, but see higher CPU usage even when the system is idle (4-5 % instead of 0-1 %).
That's the kind of tip I was searching for! Thanks and how Manuel Krause wrote: these little off-topic-tips are great! I read from Agner Fog:
"The first thing that you can do to improve the performance is to drop the builtin versions of memory and string functions. The speed can be improved by up to a factor 5 in some cases by compiling with -fno-builtin. The builtin version is never optimal, except for memcpy in cases where the count is a small compile-time constant so that it can be replaced by simple mov instructions."
Two weeks ago I compiled the actual gcc 6.2.1-1 version by myself (took about 4 hours but I wanted to test if this can improve my system without any necessary cross compiling). As described I edited [...]/arch/x86/Makefile and except always the same one message of declared but unused variable had no other compiler warnings and messages during kernel compile. I don't know if the reason is kernel code or the magic -fno-builtin option. I never had only this one message when compiling my kernels (normally there are few more warnings).
I have no benchmarks or other proving data but it seems to be an improvement as my Core2 Duo Arch Linux System runs fluid and I didn't observe any difficulties with actual 4.8.11-1-ck today. Great! :-)
@ck & @Florian:
The reason to test the 250Hz again, was the imagined promise of higher throughput vs. 1000Hz, with the now obtained extra interactivity by the above-mentioned compiler configuration addons. Without the latter, I'd still prefer 1000Hz on my system when aiming at interactiveness.
Over the weekend I've started a round of comparative tests with 250Hz, 200Hz and 160Hz, inspired by the division (by 2/4/10) talks on here, mainly to try to pin down my Firefox forking issue privately on my own. Unfortunately, regarding this goal, my results are inconsistent (meaning: unexpected) and may need more rounds.
What I can say: All three down to the 160Hz version, lowest tested atm., don't throw out the APIC timer confusion+ correction message (reported above).
For the moment I'm quite confident with the 160Hz version (but it can also be a 'one-boot-wonder' ;-)).
Con, can you please tell, maybe again, what reason led you to choose 128Hz?
BR, Manuel Krause
Sure. Division on a CPU is a relatively expensive process in terms of how many cycles are used to perform it. In the kernel code the code X/HZ is used quite a lot as a macro which would be converted to an actual division for all the normal values of HZ used in kernel configuration. The value 128, on the other hand, means the macro X/HZ can be converted into a logical shift operation of X right shift 7 bits (X >> 7) which is an extremely fast operation in a CPU by comparison. I chose the lowest value that is still in the 100-1000 range since values outside this range are known to break code. However this is truly a micro-optimisation and if code expects values to be a multiple of 10 for other reasons it would cause macro-breakage that greatly offsets any micro-improvement.
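A small sketch of the equivalence described here, written in Python rather than kernel C. The helper names are made up for illustration, and the check is limited to non-negative values, where integer division by 128 and a right shift by 7 give the same result.

```python
def is_power_of_two(n: int) -> bool:
    """True for 1, 2, 4, 8, ...: exactly one bit set."""
    return n > 0 and (n & (n - 1)) == 0


def shift_for(hz: int) -> int:
    """Shift amount k such that x / hz == x >> k for non-negative x."""
    if not is_power_of_two(hz):
        raise ValueError(f"HZ={hz} is not a power of two, no shift is possible")
    return hz.bit_length() - 1


def div_by_hz(x: int, hz: int) -> int:
    """Divide a non-negative x by a power-of-two hz using a shift only."""
    return x >> shift_for(hz)


if __name__ == "__main__":
    print("shift for HZ=128:", shift_for(128))
    print("1000000 // 128 =", 1000000 // 128, " 1000000 >> 7 =", div_by_hz(1000000, 128))
```

With HZ=100 or HZ=250 `shift_for` raises, which is the point: no plain shift exists for those values.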
@ck:
Thank you, Con. Your explanation was quite programmer-oriented, but after reading it twice... ;-) there still remain questions:
Will using 256Hz take advantage of shift operation too, meaning here X >> 8, etc. with power of 2 (512, 1024), and would also be beneficial? What in fact happens to the divisions /10: do the above mentioned binary shift ops do their job in place anyways, and only the /10 parts suffer when called?
I've now a 256Hz kernel running, and the aforementioned APIC messages came up again, just for info (not with 100, 160, 200, 240, 250, 1000 -- but with 128 and this 256 Hz one):
[ 0.058593] APIC calibration not consistent with PM-Timer: 97ms instead of 100ms
[ 0.058593] APIC delta adjusted to PM-Timer: 1662496 (1623516)
But no issues observed.
Best regards, and thanks in advance, if you find time to answer some of my questions,
Manuel Krause
Yes 256 is also a fast shift instead of a slow division. However the kernel code is using something divided by 10, and you'll always have rounding down unless you use a multiple of 10. There is no power of 2 that is also a multiple of 10 anywhere between 100 and 1000 so you cannot get both.
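The "cannot get both" claim is easy to check by brute force. The sketch below simply enumerates the powers of 2 in the 100-1000 range and shows what an integer HZ/10 drops for each; the variable names are illustrative only.

```python
# Powers of two within the 100-1000 range discussed above.
powers_of_two = [1 << k for k in range(11) if 100 <= (1 << k) <= 1000]

# Multiples of 10 in the same range, for comparison.
multiples_of_ten = set(range(100, 1001, 10))

# Values that would allow both a shift and an exact HZ/10.
both = [hz for hz in powers_of_two if hz in multiples_of_ten]

if __name__ == "__main__":
    print("powers of two:", powers_of_two)
    print("also multiples of 10:", both)
    for hz in powers_of_two:
        # hz // 10 truncates, so (hz // 10) * 10 is what the code effectively sees.
        print(f"HZ={hz}: HZ/10={hz // 10}, effective {hz // 10 * 10}, lost {hz % 10}")
```

The candidates are 128, 256 and 512, and none is a multiple of 10. That holds outside this range too, since a multiple of 10 needs a factor of 5 and a power of 2 has none.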
@ck:
Does the "rounding down", in some /10 cases, eat up the advantage of fast shifts? Above, you've written about macro-breakage for /10 cases, and I've not fully understood the circumstances.
I've taken the 256Hz for this test, as I assumed it to round down to 250 as next lower decimal for related /10 division ops, and given your words that 250Hz is a(n old and so far unquestioned mainline) kernel "icon", that many drivers may rely on.
In my everyday usage experience this now advances to my best choice.
BR, Manuel Krause
Thanks for the tip concerning
-falign-functions=1 -falign-jumps=1 -falign-loops=1 -falign-labels=1 -fno-builtin -pipe
I used to do "KBUILD_CFLAGS += -march=native -mtune=native -pipe" before and am now testing if I realize some speed boosting on my Core2 Duo. Compiling and starting without any issues, perhaps I discover some milliseconds of speed boosting. ;-)
-O3 -march=native -mtune=generic -falign-functions=1 -falign-jumps=1 -falign-loops=1 -falign-labels=1 -fno-builtin -pipe works well here. Thanks.
Did some more "experiments" and came up with:
KBUILD_CFLAGS += -O3 -march=native -mtune=generic -falign-functions=1 -falign-jumps=1 -falign-loops=1 -falign-labels=1 -mno-mmx -mno-sse -mno-sse2 -mno-sse3 -mno-ssse3 -mno-sse4.1 -mno-sse4.2 -mno-sse4 -mno-avx -mno-aes -mno-sse4a -mno-3dnow -fno-builtin -pipe
Atm. I don't understand why you disable cpu specific enhancements (-mno-mmx -mno-sse...). Can you explain the reason, please?
BR, Manuel Krause
From my test runs it speeds up the kernel considerably. Maybe those enhancements cause latency.
What system do you run? More details, please.
Unfortunately, I've had to take a break from this Makefile testing. The "-O3" was prone to runtime errors in former compiler/kernel days, but that was months or years ago.
I needed to make sure that none of these optimisations were the reason my firefox doesn't play back any flash videos anymore. It's a pity that I can't roll back the recent MESA updates, which I'd hold responsible for my current issue.
BR, Manuel Krause
The CFLAGS chosen in the kernel makefile are often carefully selected based on regression testing and low level assembly errors discovered with higher levels of optimisation so they appear to be relatively conservative intentionally. Using custom CFLAGS is likely to lead to subtle low level bugs which is why I've never included any in my kernels nor included the option to do so.
Lenovo Thinkstation S20, Xeon W3520 2.66 GHz quadcore, 6GB RAM, 250GB SAMSUNG 850 EVO, 500GB Seagate HDD, NVIDIA Quadro 4000, Intel EXPI9301 ethernet, NEC USB 3.0, no issues.
Regarding conservative cflags, even gcc will sometimes fail to build a proper kernel (microbugs) even when using just the standard -O2.
But yes -O2 is more safe than -O3.
I've reinstalled a very very old MESA known-good backup now. Same problem with flash.
Either the source server has been misbehaving for some days or the recent firefox-esr update is trash.
The CFLAGS changes are not relevant for my problems, cross tested, but maybe for this subthread's original poster.
@ck & @Florian:
Atm. I'm using a 512Hz kernel, which allows the virtualbox modules to compile and, astonishingly, doesn't cause the APIC timer messages (unlike 128 and 256).
BR, Manuel Krause
The "-O3" option still remains erratic. Like Con wrote about low level bugs above.
On my system reboots or powerdowns get stuck before doing so.
BR, Manuel Krause
Xeon W3520 quadcore 2.66 GHz, 6 GB 1066 MHz RAM, nvidia quadro 4000, samsung evo 850 250GB SSD, seagate 500GB HDD. Old Lenovo Thinkstation S20 but still good (enough).
I suggest switching from O3 to O2 and all will be fine, well most of the time.
I had gcc produce microbugs even with plain O2 in rare cases.
The O3 adventure went fine here, no problems (and faster in some cases) although I reverted to the above with O2 instead of O3 on the server for reliability.
Thanks for the patches.
Hello Linux Desktop ;).
Had to revert to 4.8(.0) though since 4.8.7-4.8.11 were too slow for my taste.
4.8.0 with ck8?
yes
Did some testing using full ck8 patchset.
Kernel 4.8 got gradually slower starting from 4.8.0. Slight slowdown from 4.8.0 to 4.8.1. Major slowdown from 4.8.1 to 4.8.2. At that point it was already too slow and I stopped.
more details, please
That's too vague :/
In what use-cases
Low latency desktop, gaming, input lag, ...
4.8.0-ck8 is the best?
The best... I don't know. But the fastest and most responsive.
Con, is the scheduler responsible for interaction with workqueues ?
Just got the 53 second lockup while browsing with chromium, having compiz active
afaik I got more of these in the past few days, X was frozen but it could be rebooted via Magic SYSRQ Key,
didn't know that it would take a minute or longer for it to "pass", otherwise I would have waited longer and reported here earlier ...
http://pastebin.com/tdeKZ9ai
[more than 4096 chars]
okay, screw that - it's regressions galore with the nvidia proprietary driver ;)
https://devtalk.nvidia.com/default/topic/977518/linux/problems-with-multiple-opengl-applications-running-simultaneously-with-375-20-on-a-gtx970/1
https://devtalk.nvidia.com/default/topic/977518/linux/problems-with-multiple-opengl-applications-running-simultaneously-with-375-20-on-a-gtx970/post/5024978/#5024978
@kernelOfTTruth:
Thank you for finding the culprit. I was already getting anxious.
BR, Manuel Krause
Had to downgrade to 370.28,
so yeah, it seemingly was the proprietary nvidia-drivers,
I'm suspicious however that it also could be the suggested optimization flags ...
so far it's stable
Your first scheduler related and your very last question regarding the compiler flags were my main concern since I don't use the nvidia drivers.
Have you found good results with the suggested compiler options on your system too? For my system I'm still convinced of their usefulness.
BR, Manuel Krause
375.xx drivers are riddled with bugs, for the last month or so. I wouldn't get anxious until I saw the next major update. Even Folding@Home is crippled in the newer drivers, so we have to stay with 343.xx.
err...make that 373.xx, the last driver without major bugs.
Is there a way to configure the kernel with CONFIG_SCHED_BFS_AUTOISO but then for MUQSS?
There was never such an option even for BFS in any of my kernels.
I see, I assumed that https://github.com/zen-kernel/zen-kernel/blob/4.7/master/init/Kconfig#L75 was your work.
Having said that, is there a way to use an automatic sched_iso policy for X using MuQSS?
just add schedtool -I `pidof Xorg` to rc.local
^ sry, I mean autostart of desktop environment.
Delete@ck:
Quite a nice one from Virtualbox after trying 1024Hz:
/tmp/vbox.1/r0drv/linux/the-linux-kernel.h:332:3: error: #error "HZ is not a multiple of 1000, the GIP stuff won't work right!"
# error "HZ is not a multiple of 1000, the GIP stuff won't work right!"
BR, Manuel Krause
@ck:
Addon: none of the values tested/written above led to this msg. 1000 remains a border. I find it funny that virtualbox complains for values above 1000.
BR, Manuel Krause
Hi ck, a while back you offered a Ubuntu 4.8.7-ck7 kernel. That is running ever so smoothly that I took to building a more recent 4.8.12 kernel, incl. your latest MuQSS patches. Builds fine, but I must be missing an important part of the puzzle, as I can't get it to boot. Would you consider posting your Ubuntu kernel build script here (if you use one..)?
osu! still crashes, hangs, and locks up (the entire system) for me even with the workaround mentioned in earlier post comments. After several tries with ck and ck-ivybridge, I did get a different result in dmesg:
ReplyDeletesnd_hda_intel 0000:00:1b.0: IRQ timing workaround is activated for card #0. Suggest a bigger bdl_pos_adj
CPU: Intel i5-3317u
RAM: 7853MiB
GPU: Intel HD4000/NVIDIA 640M LE
WM (tested): bspwm/i3
Dist: Arch Linux (Reinstalled twice)
Device: Dell 3421
For GPU testing, I ran through using intel only, nvidia through bumblebee and with nvidia only configured through xorg.conf.
For runtime testing with osu on wine (staging), notable environment changes were primusrun with bumblebee and rt priority (STAGING_RT_PRIORITY_SERVER=90 STAGING_RT_PRIORITY_BASE=90) in staging. Either randomly freezes, crashes, or locks up the system. Lock ups always happen after setting staging rt priority environments.
I've had similar results with vanilla wine and wine-rt, but not thoroughly tested in this case.
Did you try muqss by itself?
Not yet. I will compile it now and see how it goes.
I have removed the ck and commented out the gcc optimization patch and added MuQSS: https://gitlab.com/tom81094/pkgbuild-edits/raw/master/linux-ck-MuQSS.
Still locks up the system. I did notice that setting a higher buffer size in Cadence for jackd doesn't make it crash until much later or when auto-playing very complex maps. Turning it off and using Pulseaudio locks up on launch.
A few things to note off the top of my head:
- I get very measurable (200+) xruns for jackd (-S/hpet) at a buffer size of 128 frames, 3 periods per buffer, on linux-rt and linux stable. It doesn't happen on linux-ck.
When playing osu! with these settings, it crashes or locks up pretty quickly during selection or at the beginning of songs. Raising it to a buffer size of 256 frames, 3 periods per buffer, delays this lock up significantly.
- Regardless of whether I set wine-staging RT priorities, from htop specifically only osu!.exe gets RT priority.
--> osu!.exe doesn't crash on linux-rt and linux stable on any Pulseaudio/Cadence setting ... is what I forgot to clearly specify.
Ah if you're getting realtime priority then it's highly likely that it's related to rt capabilities and the CPU caps imposed on rt in mainline kernels that aren't there in muqss. Try sysrq-N when the machine locks up to see if it unlocks it. Make sure you have built support for it in your kernel config (under kernel hacking) and that it's enabled by setting the value of /proc/sys/kernel/sysrq to 1. Then when it hangs try the sysrq-n combination which converts real time priority tasks to sched normal.
I never knew how cool SysRq was until now.
Unfortunately, most of the time osu! locks up the system entirely and none of the shortcuts work. When the time comes when it hangs and only hangs, iamready.
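If the keyboard combination is not reachable (for example over SSH), the same two steps can be done through procfs. This is a sketch, not a tested recovery tool: `/proc/sys/kernel/sysrq` is the path named above, `/proc/sysrq-trigger` is the standard kernel interface for issuing a SysRq command character, and the function names are made up. Both writes need root.

```python
from pathlib import Path

SYSRQ_ENABLE = Path("/proc/sys/kernel/sysrq")   # path mentioned in the comment above
SYSRQ_TRIGGER = Path("/proc/sysrq-trigger")     # accepts one SysRq command character


def enable_sysrq(path: Path = SYSRQ_ENABLE) -> None:
    """Enable SysRq functions by writing 1, as suggested above (root only)."""
    path.write_text("1\n")


def demote_rt_tasks(path: Path = SYSRQ_TRIGGER) -> None:
    """Issue SysRq 'n', which makes realtime tasks nice-able again (root only)."""
    path.write_text("n")


if __name__ == "__main__":
    # Nothing is written here on purpose; call the functions explicitly as root.
    for proc_file in (SYSRQ_ENABLE, SYSRQ_TRIGGER):
        state = "present" if proc_file.exists() else "missing"
        print(f"{proc_file}: {state}")
```

The path parameters exist only so the helpers can be tried against ordinary files first. None of this helps if the machine is locked up so hard that no shell can run.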
Thanks for 4.8-ck8.
How is 4.9 coming along? :)
Just how impatient can someone be???
Sorry,
obviously very impatient :/