Announcing a new -ck release, 5.2-ck1, with the latest version of the Multiple Queue Skiplist Scheduler, version 0.193. These are patches designed to improve system responsiveness and interactivity, with specific emphasis on the desktop, but configurable for any workload.
linux-5.2-ck1:
-ck1 patches:
Git tree:
MuQSS only:
Download:
Git tree:
Web: http://kernel.kolivas.org
This is mostly a resync from 5.1-ck1. A reminder if you're new to using my patches: MuQSS performs best in combination with the full -ck patchset, as they're all complementary changes.
Enjoy!
お楽しみ下さい (Enjoy!)
-ck
Thank you!!! The broken-out set applied cleanly to the bare kernel, as did the 5.2.1 incremental kernel patch.
Can you explain the difference between the broken-out patchset and the regular patchset?
Broken out is just all the incremental patches that make up the whole -ck1 patch, in case people want to audit or select individual parts of the patchset.
@ck:
BTW, your last blog entry/thread "linux-5.1-ck1, MuQSS version 0.192 for linux-5.1" completely disappeared with your new announcement for kernel 5.2.
Any chance to get it back?
Best regards,
Manuel
Runs great!
Thanks.
Thank you
With regards to reducing the timer frequency (HZ) to 100, would it be unwise to manually patch it down to 10 Hz? I've already brought CONFIG_RCU_BOOST_DELAY under 100.
Probably wouldn't make any demonstrable improvement, but it might break code that expects things to be 100+.
Runs amazing on a Pentium N3700.
Thank you.
Can you please check if statistics for PSI (especially memory) are collected correctly?
HZ=1000, idle dynticks.
cat /proc/pressure/*
[cpu]some avg10=99.00 avg60=99.00 avg300=98.94 total=2059480257
[io]some avg10=13.37 avg60=14.76 avg300=9.24 total=231227064
[io]full avg10=0.00 avg60=0.00 avg300=0.00 total=1985
[memory]some avg10=49.00 avg60=49.00 avg300=48.94 total=933291484
[memory]full avg10=0.00 avg60=0.00 avg300=0.00 total=325
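For reference, a minimal user-space reader for one of these files, matching the line format shown above (a sketch, not part of the patchset):

    #include <stdio.h>

    /* Print the "some" and "full" lines of one PSI file. */
    int main(void)
    {
        char line[256];
        FILE *f = fopen("/proc/pressure/memory", "r");
        if (!f) {
            perror("fopen");
            return 1;
        }
        while (fgets(line, sizeof(line), f)) {
            float avg10, avg60, avg300;
            unsigned long long total;
            char kind[8];
            /* line format: "some avg10=0.00 avg60=0.00 avg300=0.00 total=0" */
            if (sscanf(line, "%7s avg10=%f avg60=%f avg300=%f total=%llu",
                       kind, &avg10, &avg60, &avg300, &total) == 5)
                printf("%s: avg10=%.2f total=%llu\n", kind, avg10, total);
        }
        fclose(f);
        return 0;
    }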
Con, I would like to ask Your opinion on MuQSS + LLC / runqueues.
I have checked the code (which I don't understand much of), but You're checking whether CPUs share cache: when the CPU is the same, locality is set to 0; when CPUs are SMT siblings, locality is set to 1; when CPUs share cache, locality is set to 2. It seems the rest is unimportant for desktop CPUs.
On Intel desktop CPUs there is one big L3 or LLC (Last Level Cache) available to all CPUs, so the CPU topology is effectively 2 levels. Ryzen, however, has multiple LLCs, each shared by a core complex (a group of cores) (like on Xeon CPUs?), which makes the topology 3 levels. So to me it seems (and I could be wrong here) that this is not taken into account.
I assume it's debatable whether considering multiple LLCs actually helps; I would guess it may help more on Ryzen (due to Infinity Fabric latency) than on Intel, but again, I could be wrong here as well.
So the questions are: could it be beneficial to take multiple LLCs into account when finding the best CPU to migrate (schedule) a task to? Is the 3-level topology the reason why on Ryzen (or Xeon as well? I can not verify), even with rqshare=all (which in the code says "/* This should only ever read 1 */"), we have an odd number of runqueues? Is the odd runqueue count just a cosmetic issue? Would it be hard to try out multiple-LLC support / fix the runqueue count?
Thanks.
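For readers following along, a sketch of the locality ranking being described; the helper below is illustrative, not MuQSS's actual code (topology_sibling_cpumask() is the generic kernel helper, cpu_llc_shared_mask() is x86-specific):

    /* Rank how "close" cpu2 is to cpu1 for migration purposes:
     * 0 = same CPU, 1 = SMT sibling, 2 = shares last-level cache,
     * 3 = crosses an LLC boundary (e.g. another Ryzen CCX).
     */
    static int cpu_locality_rank(int cpu1, int cpu2)
    {
        if (cpu1 == cpu2)
            return 0;
        if (cpumask_test_cpu(cpu2, topology_sibling_cpumask(cpu1)))
            return 1;
        if (cpumask_test_cpu(cpu2, cpu_llc_shared_mask(cpu1)))
            return 2;
        return 3;
    }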
Doubt it will make a demonstrable difference on MuQSS. No idea why there's an uneven number; I suspect something fundamental is actually broken, as it really shouldn't happen. I just don't have the time and hardware to investigate.
Sounds like it might be beneficial to start bringing more people on board, to diversify the test hardware and accelerate the investigative process.
And where exactly do you propose I get these people? I certainly wouldn't turn down patches, but it's not like anyone's offering to help.
Everyone I know that was holding out on a new PC bought one recently with a 3900X, since it genuinely is revolutionary for price to number of cores. Maybe you'll get the help you need in the next 12 months.
I have debugged the code (a lot of reboots), but now I know where the problem with the runqueue count on Ryzen in MuQSS is :)
It was a somewhat long path until I got a little familiar with the code and understood the issue, but to verify it I created a module which checks which CPUs are possible/online/present.
On my Intel system (i7 quad core + SMT), possible/online/present are 8 each; however on Ryzen (octa core + SMT), online and present are 16 but possible is 32. How come!? :)
In MuQSS, sharing is set up over online CPUs, but the runqueue count is calculated over possible CPUs, hence the difference: 1 (shared queue for CPUs 0-15) + 16 (non-shared, for CPUs 16-31) = 17.
Non-existent CPUs (>= 16) are counted because their CPU runqueues are equal to all the leaders (smp/mc/smt) as set up initially.
As I understand it, possible must be how many CPUs are physically possible, present are those which are actually installed(?), and online are those which are available to the scheduler? Can You please explain the difference in a couple of sentences? I couldn't find out by quickly googling it :(
I don't have much time at the moment to dig deeper and check whether it's OK to set up sharing over possible CPUs (it kinda looks plausible, but I can be wrong here), but I would appreciate Your answer and opinion on this.
Thanks.
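For anyone wanting to reproduce the check described above, a minimal module sketch using the standard cpumask counters (the meanings in the comments follow the kernel's cpumask documentation):

    #include <linux/module.h>
    #include <linux/cpumask.h>

    static int __init cpucount_init(void)
    {
        /* possible: CPU IDs the kernel reserved space for at boot (fixed);
         * present:  CPUs physically detected in the system;
         * online:   CPUs currently available to the scheduler.
         */
        pr_info("possible=%u present=%u online=%u\n",
                num_possible_cpus(), num_present_cpus(), num_online_cpus());
        return 0;
    }

    static void __exit cpucount_exit(void)
    {
    }

    module_init(cpucount_init);
    module_exit(cpucount_exit);
    MODULE_LICENSE("GPL");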
Thanks. That sounds like the culprit then. It should be safe to simply replace possible with online to fix the problem.
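Illustratively, with the standard iterators (the counting logic around them is paraphrased here, not MuQSS's actual code):

    /* before: counts leader runqueues even for CPUs that never come up */
    for_each_possible_cpu(cpu)
        if (rq_is_leader(cpu))      /* rq_is_leader(): illustrative helper */
            total_runqueues++;

    /* after: count only CPUs the sharing setup actually covered */
    for_each_online_cpu(cpu)
        if (rq_is_leader(cpu))
            total_runqueues++;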
DeleteCon, I had some time available and I have made a patch that does the following:
* fixes the runqueue count on Ryzen, along with additional changes from possible to online CPUs (where it may matter, to the best of my knowledge)
* introduces LLC bits into the scheduler (may be of little help on Ryzen, and should be no change in performance on Intel)
* there is now LLC locality, which helps order CPUs and runqueues according to LLC cache locality (for Intel there is no change)
Here are the results of my testing (which took most of the time), mostly on a Ryzen 8-core/16-thread CPU.
For the games I tested there is no real change in performance, or it's hard to notice; they appear to be GPU-bound, which is most likely why.
Unigine Valley (3D test) shows a small but consistent improvement of 2-3% using mc sharing.
Diablo3 is a very interesting story: there is a small improvement using mc or no sharing (both show a rather consistent 137-141 fps). Using smt, however, the improvement seems quite big, from a consistent 60-90 fps to a consistent 95-130 fps. The strange thing is that fps is all over the place with smt sharing, with or without my patch.
For kernel compilation, mc or none shows about the same performance, one second apart in an 11-minute job. smt sharing shows an 8-9 second improvement over the 11-minute job.
So to the best of my knowledge, I would say my patch didn't break anything and made a small improvement (not counting the runqueue count fix).
Since I'm not sure whether the patch is fine, I posted it here: https://drive.google.com/file/d/1fcA1g4LNQTYOokmO8SPVuePVPWNxaDff/view?usp=sharing . Can You please look at it and give Your opinion on it (edzis aaatttt inbox doooot lv)?
If there are brave ppl around, please try the patch (apply on top of MuQSS patch) and report back.
Thanks.
Thanks! It would be nice if you left your name for attribution purposes, and normally you should split the changes into multiple patches each doing only one thing, but I'm happy for any help I can get. On a brief look it looks good, but I'll probably only get a chance to give it a thorough look-over and test when the next major kernel release comes out.
I'll split up the patches when my time allows; that will be before the next kernel is released :)
My name is in the patch.
Eduardo
But before that: can someone who has at least a Threadripper CPU apply the patch above, compile the kernel, and send me the output of: dmesg | grep -i muqss
I have put in more informative logging, with LLC bits as well; it would be interesting to see how that CPU looks from the scheduler's perspective.
Eduardo
All patches are available here: https://drive.google.com/drive/folders/1MxUcptaOgPbPgJoUdeq0GkEuoeyaRHdG
The patches are now split, starting from 0001-...; the previous patch still remains there, and it includes the 0001-0003 changes.
So, I'm having an issue mounting an image to a directory with the mount command. You can reproduce the bug by using the Archiso releng profile default settings. What happens is that on the Arch kernel an image can be mounted to a directory, but using the CK kernel with the default Arch settings it will not mount. Any idea why this may be occurring, Con Kolivas?
I have not had this issue at all, albeit with a custom compiled kernel based on the Arch config.
Please check error messages and syslog after a failed attempt for scheduler-related issues; they're not that easy to guess, and it's hard to obtain a crystal ball these days :(
The kernel used is graysky's linux-ck AUR repo: https://aur.archlinux.org/packages/linux-ck
There are also no relevant error messages in dmesg or recorded by journalctl after running "sudo mount". I can post the full log if you want, but I do not believe there is anything in there that would indicate the cause of the problem.
Running 5.2.11 with the -ck patchset, configured with CONFIG_RQ_SMT=y, I get this:
cat /sys/devices/system/cpu/vulnerabilities/mds
Mitigation: Clear CPU buffers; SMT disabled
However, running lscpu I clearly have HT enabled:
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 12
On-line CPU(s) list: 0-11
Thread(s) per core: 2
Core(s) per socket: 6
Socket(s): 1
NUMA node(s): 1
Vendor ID: GenuineIntel
CPU family: 6
Model: 158
Model name: Intel(R) Core(TM) i7-8700K CPU @ 3.70GHz
Is this because MuQSS kind of "fools" the kernel into thinking I have 12 cores vs. 6 cores/12 threads when running with CONFIG_RQ_SMT=y?
Default Ubuntu 5.0.0-26-generic kernel reveals:
cat /sys/devices/system/cpu/vulnerabilities/mds
Mitigation: Clear CPU buffers; SMT vulnerable
I have 12 threads either way, and have not set any custom boot options or disabled HT in the BIOS.
Thoughts?
I have the same situation on an Intel laptop CPU; CONFIG_RQ_SMT is not set in my config, so it seems that particular option is not causing this.
I may look at this if I have some time.
Eduardo
Sveinar, I think I have fixed this behaviour in MuQSS, but I'm not sure because I can not test it. On Ryzen it boots fine, but it shows "Not affected" :)
I don't have any Intel machine where I can test it, except a laptop, which I won't be able to restart for some time.
If You would like to test the patch, please shoot me an e-mail (I have it in a post above) and I'll give You the patch; if tests are successful, I'll post it together with revised patches for MuQSS.
Eduardo
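For context on the symptom: mainline reports the SMT part of the MDS state via sched_smt_active(), which reads the sched_smt_present static key; the stock scheduler flips that key when a CPU with an online SMT sibling comes up (kernel/sched/core.c, sched_cpu_activate()). A replacement scheduler that never enables the key would report "SMT disabled" exactly as seen above. Roughly, mainline does:

    #ifdef CONFIG_SCHED_SMT
        /* When going up, count cores that now have SMT present. */
        if (cpumask_weight(cpu_smt_mask(cpu)) == 2)
            static_branch_inc_cpuslocked(&sched_smt_present);
    #endif

Whether Eduardo's patch takes this exact approach is an assumption here.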
I am interested in the patch, Eduardo, to see what happens with my Intel CPU. Is it a different one than the one you posted above? https://drive.google.com/file/d/1fcA1g4LNQTYOokmO8SPVuePVPWNxaDff/view
I know from earlier experience that CONFIG_RQ_MC gives a wee bit higher performance when gaming/steam/wine, but it may come at a slight "smoothness" cost otherwise, and that is why I currently use RQ_SMT.
I'll take a look at it and report back. (Probably no time tonight, I am afraid.)
Patches are here: https://drive.google.com/drive/folders/1MxUcptaOgPbPgJoUdeq0GkEuoeyaRHdG
You can apply all 4 patches or just 0004; it should apply fine, with some offsets.
Eduardo
Thanks Eduardo!
Added all 4 patches, and recompiled.
[ 0.201482] MDS: Mitigation: Clear CPU buffers
[ 0.405676] MDS CPU bug present and SMT on, data leak possible. See https://www.kernel.org/doc/html/latest/admin-guide/hw-vuln/mds.html for more details.
cat /sys/devices/system/cpu/vulnerabilities/mds
Mitigation: Clear CPU buffers; SMT vulnerable
This is then in line with the default kernel (and BMQ) for my Intel 8700K. Did a couple of runs of Unigine Valley and the Monster Hunter Online benchmark, and results were a wee bit up (but within random variance)... i.e. at least there does not seem to be a performance problem with this.
Worth a review from Con Kolivas. Perhaps more ppl want to give this a test too?
I will keep doing some tests and report back if I suddenly stumble upon some issues tho, but from initial testing it seems to be OK.
Which runqueue sharing are You using? My best was the default mc, but none was very close behind; smt was very strange and slowest, as I already wrote.
Can You please post dmesg | grep -i muqss to pastebin or send me an email with the results?
I'll now try to micro-optimize CPU selection a little and try to debug the smt behavior. Dunno when that will happen.
Eduardo
https://pastebin.com/raw/Uf8Duar8
I am using CONFIG_RQ_SMT=y, and I agree that CONFIG_RQ_MC gives a wee bit better performance when benchmarking/gaming. Perhaps it is just wishful thinking or my imagination, but the general desktop experience SEEMS a wee bit smoother when doing stuff like make -j12 while surfing and whatnot.
Maybe it is some difference between Intel and Ryzen that creates this?
Oh, and when I DO bench/game, I also set /proc/sys/kernel/interactive=0, as this seems to help too.
Not sure how to interpret this, Eduardo, but I did a comparison benchmark between CONFIG_RQ_SMT=y and CONFIG_RQ_MC=y (used your patches for both kernels).
CONFIG_RQ_SMT=y:
Unigine Valley: Score: 6262, Fps: 149.7, Min: 50.8, Max:245
Monster Hunter Online Benchmark: Score: 27917, max: 127.5, Min: 54.6
CONFIG_RQ_MC=y:
Unigine Valley: Score: 6262, Fps: 149.7, Min: 50.6, Max: 247.3
Monster Hunter Online Benchmark: Score: 25591, Max: 116.3, Min: 54.1
Unigine Valley seemed to be within the margin of error, while the Monster Hunter Online benchmark was down almost 10%!
Tests were done with wine-proton-4.11, DXVK and FSYNC kernel patches.
Sveinar, before trying to interpret the results, I'll try to run the same benchmark on my Ryzen system. I have a mediocre video card, a Sapphire Pulse RX570 8GB. Did You adjust the benchmark settings in the Monster thing or just install and run? Did You overclock Your CPU? What is Your video card?
I'll benchmark later today; the first half of the day is quite busy :)
I see You always refer to the CONFIG_RQ... options, which leads me to think that You're recompiling the kernel for mc or smt separately. I'm using the kernel parameter rqshare and setting it to smt, mc or none, e.g. rqshare=none (the config options just set the default runqueue sharing). This way I'm always using the same binary; the only change is runqueue sharing. Can You try running Monster on the same kernel, just changing rqshare?
Eduardo
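For anyone following along, the runtime switch mentioned above is just a kernel boot parameter, e.g. appended to the kernel command line in the bootloader:

    linux /boot/vmlinuz-5.2-ck ... rqshare=mc

The stock MuQSS choices are, if memory serves, none, smt, mc, smp and all; the CONFIG_RQ_* options only pick the default.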
Additionally, is Your kernel compiled with O2 or O3?
Eduardo
Eduardo: You are absolutely right! I tend to compile various kernels, and find it easy to switch between them when booting vs. the "edit grub" approach... but minor compile-time optimizations can create binary variances that probably should be weeded out when comparing.
DeleteSo, here is my config:
Kernel: Custom 5.2.13 (compiled with -O3 and -march=native)
Os: Ubuntu 18.04 LTS
CPU: Intel i7 8700K (default speed no OC)
Gfx: nVidia RTX2070 8GB (no OC)
Ram: DDR4 @ 3600 (XMP profile, so technically it is factory OC'ed)
Wine: wine-proton-4.11 from STEAM git (WINEFSYNC=1)
DXVK: @b055275
nVidia driver: 435.19.03 (Vulkan BETA)
nVAPI: custom nVAPI (nvidia library) for DXVK.
I run mho_benchmark with the following settings:
Resolution: 1920x1080
Antialiasing: Off
Fullscreen: On
Results:
"rqshare=smt"
Score: 26585 - Max: 120.8 - Min: 53.8
"rqshare=mc"
Score: 25724 - Max: 117.5 - Min: 53.5
"rqshare=none"
Score: 26664 - Max: 121.8 - Min: 53.4
I did not bother with Unigine Valley, as it is mostly GPU-limited, so changes in kernel options tend to make very little difference. Driver optimizations/wine/dxvk are of course another thing.
Conclusion: rqshare=none scores the best, with rqshare=smt a close 2nd. rqshare=mc is almost 4% lower.
I think I have made some more improvements to MuQSS :)
All numbers (compilations, Valley) seem to be up a little, 1%, maybe a little more, but on Ryzen this makes things better for sure; compilations, for example, are 3 seconds faster than my previous results.
The patch is at the same place https://drive.google.com/drive/folders/1MxUcptaOgPbPgJoUdeq0GkEuoeyaRHdG , it's 0005-...
Sveinar, if You can, give it a go, but beware: I have not booted this on Intel. It should not change much there, but the RQ and CPU order now seems better than previously; if the RQs were luckily matched well on Intel there should be no change, if not, small improvements should be observable. If You test it, please pastebin dmesg | grep -i muqss so I can take a look at whether the output is OK.
As for testing, I have compared mc, smt and none with the new patch; results (compared to my previous patch) are as follows:
* kernel compilation: mc and smt show comparable results, faster than previously by a couple of seconds; none improved a little but is 2 seconds behind mc/smt, although it saw a nice additional 3-4 second improvement by itself
* valley: results are about the same; none improved (it's surprisingly the fastest now), mc went down a little and smt lingers in between, but they are all within 1% of each other
* mho: I installed it and ran it (nothing specific was selected, just proton 4.11 + dxvk; not sure whether fsync worked or not, I'm not that interested, I just wanted the comparison). The fastest is smt, mc comes next and none is slowest, but again they are within 0.7% of each other from fastest to slowest
* diablo: mc is the clear leader, none is second, smt is slowest, with a bit less fps variance than previously (it's still somewhat all over the place compared to mc and none, but now more stable towards the high numbers)
So again, it seems a small improvement and nothing broke, at least in my testing; please give it a go and let's see what comes up :)
Eduardo
Update: I had a chance to boot Intel with the latest patch. I can not say anything about performance yet, but it works fine so far.
Eduardo
Thanks Eduardo. I will test this patch with 5.2.14 and see what results I get :)
As to "FSYNC", you need some kernel patches from the STEAM kernel git. I have snipped the patchset from TKGlitch, and have it here: https://github.com/SveSop/kernel_cybmod/blob/master/0027-v5.2-fsync.patch
It is a "testing" thing tho, so it's not everything that works; ref. https://steamcommunity.com/app/221410/discussions/0/3158631000006906163/
Eduardo: Another evening of benchies... I did a few tests.
rqshare=none:
Valley: 150.7 (6303) 50.4/245.1
MHO: 26691 121.9/54.1
rqshare=smt:
Valley: 148.6 (6218) 49.2/242.0
MHO: 26636 121.1/55
rqshare=mc:
Valley: 150.0 (6278) 50.5/244.8
MHO: 25766 117.7/53.8
Not a huge difference between the lot, rqshare=none slightly ahead.
I decided I would test some more stuff, so I redid the benchies, but this time I ran a wine compile in the background.
Wine was configured with: configure CFLAGS='-pipe -O3 -march=native' --enable-win64 --disable-tests
And I compiled with "make -j10", leaving 2 of the 12 threads free on my i7 6-core/12-thread CPU.
rqshare=none:
Valley: 80.2 (3354) 30.4/206.2
MHO: 22092 100.9/44.4
rqshare=smt:
Valley: 81.0 (3389) 30.4/199.4
MHO: 23235 105.6/47.9
rqshare=mc:
Valley: 79.0 (3306) 31.3/141.2
MHO: 21180 96.3/30.7
Here rqshare=mc performed the worst, leaving rqshare=smt on top. And this is what I like to imagine I can "sense" when I do various stuff, cos I tend to toss around a kernel compile while watching youtube and whatnot. Or for that matter, I also tend to watch youtube videos when gaming (like game puzzles in Witcher 3 and stuff). From the tests I did, I kinda think SMT would be most beneficial for that.
Something to consider at least :)
But if rqshare=mc takes the longest to compile, that would mean your GUI apps are getting more CPU time, so rqshare=mc should give you the best responsiveness.
Tbh, these are rather interesting results and expectations for smt.
Years back I thought that smt might be the best, but consistently, on Ryzen at least, that's not exactly the case; mc won until I started to mess with runqueues and llc :)
Ryzen has microstutters while running mho or diablo3 using smt; mc and none behave well. It seems that on Intel all is good and nothing microstutters. It would be beneficial to test diablo3 on Intel to see whether fps is all over the place :)
Sveinar, mho did not microstutter at all, right?
My feeling and results somehow suggest that, at least on Ryzen, none is all-around "the best" (after my patches).
But these microstutters with smt really puzzle me; when and if I find and hopefully fix them, things may change.
I'm not exactly sure where to start...
BR, Eduardo
SMT nice may be featuring there.
It would be strange if mho only microstutters with smt tho.. But stutters here and there I do not think you can avoid when using wine/dxvk and shiat.
There is a rather big overhead when dxvk compiles shaders, and to even out these problems dxvk has its cache, and nvidia has its own cache. I do not think AMD has the same shader cache functions (3 different driver branches.. I have no clue), so it would be well worth testing a different driver branch when testing DXVK (amdvlk or whatnot).
And for reference I run wine benchies with:
DXVK_LOG_LEVEL=none
STAGING_SHARED_MEMORY=1
STAGING_RT_PRIORITY_SERVER=90
STAGING_WRITECOPY=1
WINEFSYNC=1 (Requires kernel patch. Use WINEESYNC=1 otherwise)
/proc/sys/kernel/interactive = 0
schedtool -n -5 -e wine game.exe
Thanks Sveinar, I'm familiar with almost all of the above :)
I have the fsync patch already applied, and AMD drivers have an on-disk cache... I just didn't bother to check whether fsync is actually working, as I don't need absolute numbers, I'm interested in relative ones.
For me only smt microstutters, as I tried to explain above; mc and none behave well ;)
Will check Con's lead with smt_nice... If the microstutters are gone with smt_nice disabled, then I have a suspect; otherwise I'm back at square one :)
I just needed at least someone to confirm that Intel has no microstutter issue with smt in mho or diablo3; then I can further narrow it down to Ryzen specifics...
Thanks,
Eduardo
I disabled SMT nice; it's not helping, at least in D3... Somehow the frame variance got even larger, and the stutter as well.
So now it's clear that SMT nice is not the cause of the stutter.
BR, Eduardo
I am not sure how to really measure this microstuttering. I never run anything with vsync on, and I must assume you don't either?
How does the framegraph look when you enable DXVK_HUD=full? And the fps?
Vsync issues + dxvk + wine are not at all unheard of, but afaik the wine-proton branch has fixes for disabling things like composition (which tends to fu** things up for wine), so there SHOULD be no reason to have issues with that.
vsync is off; I have not enabled the HUD, but I'll check how the framegraph looks sometime later.
Meanwhile I recorded videos on my phone (so there is no interference from recording software) with MC (no stutter) and SMT (stutter). This is with all 5 patches; without these patches it's even worse with SMT :)
It's quite visible if one looks at the FPS, or at the globes with liquid or the flame wings on my character in D3... With SMT You'll see some stutter / frame skipping or whatever the best name for it is :)
MC (good): https://youtu.be/rspJezj0W04
SMT (stutter): https://youtu.be/jF8Or7xK4wU
BR, Eduardo
Is there any particular CPU core usage pattern when this happens with rqshare=smt?
Could you try running Unigine Superposition with the "1080p Extreme" setting (Game mode), and look at the seeping smoke around the gravity control thingy in the back of the room, to see if this also stutters?
Hi, I get 16 fps there with Your recommended settings; it's stuttering all over the place :)
On medium settings all is fine and fps is about the same with mc and smt, no stuttering visible. In Wolfenstein: The New Order and Youngblood there is no stuttering whatsoever; only select titles have it with smt, D3 for sure :)
I have experimented a little with D3, and if I offline half of my CPUs the stuttering is gone. It didn't matter which of them: I tried smt siblings (every other CPU in my topology) and a whole CCX off (4-7 and 12-15 off), and the stuttering was gone :) If I offline just 4 CPUs (4-7), the stuttering is less visible.
Now at least I have more data to work with :)
BR, Eduardo
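For reference, the hotplug interface used in that experiment is standard sysfs; e.g. to offline CPUs 4-7 (one CCX here) and bring them back:

    for c in 4 5 6 7; do echo 0 | sudo tee /sys/devices/system/cpu/cpu$c/online; done
    for c in 4 5 6 7; do echo 1 | sudo tee /sys/devices/system/cpu/cpu$c/online; done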
Con, I have added rqshare llc, which makes 2 queues (one per LLC) on Ryzen. I can not say it improved much in my case, but maybe higher core count Ryzens may see some effect.
I still have to figure out how to make a cpumask for same-LLC CPUs, so I can use it to determine whether LLC CPUs are free, similarly to how You do it with cores (cache_cpu_idle). I have not found one ready to use, and I don't think checking each LLC CPU's state in loops would be efficient.
This is the next thing I'll check when some time frees up.
Patches are in gdrive (https://drive.google.com/drive/folders/1MxUcptaOgPbPgJoUdeq0GkEuoeyaRHdG), all nicely split :)
BR, Eduardo
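On x86 there is a ready-made per-CPU mask that looks like what's wanted here: cpu_llc_shared_mask() (arch/x86/include/asm/smp.h). A sketch of an idle check over one LLC, with an illustrative rq_idle() helper standing in for whatever idle test the scheduler uses:

    /* True if any CPU sharing @cpu's last-level cache is idle. */
    static bool llc_has_idle_cpu(int cpu)
    {
        int sibling;

        for_each_cpu(sibling, cpu_llc_shared_mask(cpu))
            if (rq_idle(sibling))   /* illustrative helper */
                return true;
        return false;
    }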
Update: I think I found what I wanted; I'll rework the last 0006 patch sometime in the coming days.
BR, Eduardo
I have reworked the 0006 patch; sanity test results show no regression, but I have not booted this on Intel, although nothing should change there.
This one contains the new rqshare config and kernel parameter option llc.
Please look at it and see whether it seems reasonable.
I have made compilation tests using the modified LLC cache masks for the idle check on the rest of the rqshare options, and did not find any regression or much improvement for compilation tasks. I'll try some 3D and game tests sometime this week. Again, this should not affect anything that has a single LLC.
If this shows some improvement, I'll make a separate patch for it.
BR, Eduardo
Hi, did this patch fix the stuttering problem with smt you had?
Nope, unfortunately not. smt got better with my patches, but they do not solve the problem entirely. It seems the issues are Ryzen-only; for me the other sharing options give good results so far.
I'll try to find the issue as my time permits.
BR, Eduardo
I would like to report that this patchset improves performance on my ancient Core 2 Quad setup, which has a similar topology (2 dies with split LLC + a single memory controller). Good job Eduardo; it seems your patchset will help many different systems, not just Ryzen.
Can you define what you mean by improved performance? Some measurable metric, or does it just feel faster?
Specifically between the stock multicore-siblings mode vs. the mc-llc mode added by the patch, which ends up creating 2 runqueues instead of 1.
Increased frame rate in multiple OpenGL applications. It's not a lot, maybe 2-4%, but it makes sense, since the latency penalty between cores that don't share an L2 cache is quite high on this generation of CPU (as high as a 50% increase according to this benchmark, anyway: https://github.com/ajakubek/core-latency).
I did some benchmarking using HL2: Lost Coast (because it is CPU-bound), average of 3 runs:
"stock" ck patches rqshare=mc avg fps 269.85
ck+eduardo's patches rqshare=mc avg fps 273.23 +1.2%
ck+eduardo's patches rqshare=llc avg fps 277.45 +2.8%
As you can see, Eduardo's patches definitely improve performance on my ancient setup.
I haven't experienced any regressions or bugs.
So is this with a "Core 2 Quad" processor? And this processor does not share any L2 cache?
Ref. https://ark.intel.com/content/www/us/en/ark/products/33924/intel-core-2-quad-processor-q9550-12m-cache-2-83-ghz-1333-mhz-fsb.html : this processor has "CPU Cache is an area of fast memory located on the processor. Intel® Smart Cache refers to the architecture that allows all cores to dynamically share access to the last level cache."
This is the same wording used for an i7 8700K as well. I dunno what this implies, or if I am viewing the right processor tho. But does using the patches with "llc" show to any degree that it is actually differentiating this?
Should it show 2 runqueues when such a "separated" cache is used? (Cos I am fairly sure it shows 1 for my 8700K when I tried.)
I am just asking. I found no/little difference between llc and mc in the benchies I did, but I am willing to revisit this if there SHOULD be different behavior.
Yes, I'm not sure how performance improvements from those changes are possible on that CPU either.
If I remember correctly, there were times when Intel created quad core CPUs by slapping two dual cores into the same package and calling it a quad core :)
I'm not exactly sure how they organized the LLC in that case, but it may well be that there were two LLCs.
Can Anonymous please pastebin the results of "journalctl -b | grep -i muq"? Then we'll be more sure how the kernel sees that particular CPU.
On Ryzen, LLC sharing (two queues, last 0006 patch) did not give any measurable performance boost or degradation for compilation tasks on my Ryzen 1700, but maybe Threadripper CPUs would get a boost, because cores / CCXs etc. are organized in a slightly different manner (and I don't know exactly how either) than on my Ryzen. I have no access to a TR, so I can not verify. TR even has two NUMA nodes.
I still need to test more stuff on that last LLC sharing patch.
If Anonymous shares the output, I could at least theoretically guess whether that may or may not help. We don't even know how cores are organized in that CPU.
BR, Eduardo
Just a reference about "slapping two dual cores together and calling it a quad": https://www.extremetech.com/computing/49528-core-2-quad-q6600-four-cores-for-the-masses/2
BR, Eduardo
as requested
[ 0.519762] MuQSS possible/present/online CPUs: 4/4/4
[ 0.519769] MuQSS locality CPU 0 to 0: 0
[ 0.519769] MuQSS locality CPU 0 to 1: 3
[ 0.519770] MuQSS locality CPU 0 to 2: 3
[ 0.519771] MuQSS locality CPU 0 to 3: 2
[ 0.519771] MuQSS locality CPU 1 to 0: 3
[ 0.519772] MuQSS locality CPU 1 to 1: 0
[ 0.519772] MuQSS locality CPU 1 to 2: 2
[ 0.519773] MuQSS locality CPU 1 to 3: 3
[ 0.519773] MuQSS locality CPU 2 to 0: 3
[ 0.519774] MuQSS locality CPU 2 to 1: 2
[ 0.519774] MuQSS locality CPU 2 to 2: 0
[ 0.519775] MuQSS locality CPU 2 to 3: 3
[ 0.519775] MuQSS locality CPU 3 to 0: 2
[ 0.519776] MuQSS locality CPU 3 to 1: 3
[ 0.519777] MuQSS locality CPU 3 to 2: 3
[ 0.519777] MuQSS locality CPU 3 to 3: 0
[ 0.519778] MuQSS sharing MC runqueue from CPU 1 to CPU 2
[ 0.519780] MuQSS sharing MC runqueue from CPU 0 to CPU 3
[ 0.519788] MuQSS CPU 0 llc 0 RQ order 0 RQ 0 llc 0
[ 0.519789] MuQSS CPU 0 llc 0 RQ order 1 RQ 1 llc 2
[ 0.519790] MuQSS CPU 1 llc 2 RQ order 0 RQ 1 llc 2
[ 0.519790] MuQSS CPU 1 llc 2 RQ order 1 RQ 0 llc 0
[ 0.519791] MuQSS CPU 2 llc 2 RQ order 0 RQ 1 llc 2
[ 0.519792] MuQSS CPU 2 llc 2 RQ order 1 RQ 0 llc 0
[ 0.519792] MuQSS CPU 3 llc 0 RQ order 0 RQ 0 llc 0
[ 0.519793] MuQSS CPU 3 llc 0 RQ order 1 RQ 1 llc 2
[ 0.519794] MuQSS CPU 0 llc 0 CPU order 0 RQ 0 llc 0
[ 0.519794] MuQSS CPU 0 llc 0 CPU order 1 RQ 3 llc 0
[ 0.519795] MuQSS CPU 0 llc 0 CPU order 2 RQ 1 llc 2
[ 0.519796] MuQSS CPU 0 llc 0 CPU order 3 RQ 2 llc 2
[ 0.519797] MuQSS CPU 1 llc 2 CPU order 0 RQ 1 llc 2
[ 0.519797] MuQSS CPU 1 llc 2 CPU order 1 RQ 2 llc 2
[ 0.519798] MuQSS CPU 1 llc 2 CPU order 2 RQ 3 llc 0
[ 0.519799] MuQSS CPU 1 llc 2 CPU order 3 RQ 0 llc 0
[ 0.519799] MuQSS CPU 2 llc 2 CPU order 0 RQ 2 llc 2
[ 0.519800] MuQSS CPU 2 llc 2 CPU order 1 RQ 1 llc 2
[ 0.519801] MuQSS CPU 2 llc 2 CPU order 2 RQ 0 llc 0
[ 0.519801] MuQSS CPU 2 llc 2 CPU order 3 RQ 3 llc 0
[ 0.519802] MuQSS CPU 3 llc 0 CPU order 0 RQ 3 llc 0
[ 0.519803] MuQSS CPU 3 llc 0 CPU order 1 RQ 0 llc 0
[ 0.519803] MuQSS CPU 3 llc 0 CPU order 2 RQ 2 llc 2
[ 0.519804] MuQSS CPU 3 llc 0 CPU order 3 RQ 1 llc 2
[ 0.519804] MuQSS runqueue share type LLC total runqueues: 2
[ 1.500417] MuQSS CPU scheduler v0.193 by Con Kolivas.
I realize that I poorly phrased "2 dies with split llc + single memory controller"; I meant that it's 2 dies, each with their own shared LLC (in this case a large shared L2 cache), with an off-die memory controller.
The Core 2 Quad and its Xeon counterparts are advertised as having a 12MB L2 cache, but in reality it is 2x6MB.
Thanks, now it's clear about the LLC. Interestingly, in this case the LLC numbers are 0 and 2, not 0 and 1 as in the case of Ryzen...
The RQ and CPU orders seem to be right; the localities are slightly different, but everything sort of looks OK.
At least theoretically I can see how small improvements could be observed with this CPU topology.
BR, Eduardo
I had some time yesterday, installed and ran the cs:go "FPS Benchmark" map, and the results surprised me. It's one of the first times I have seen mc be the slowest (at least on Ryzen):
LLC: Average framerate: 238.53
MC: Average framerate: 229.29
SMT: Average framerate: 238.79
NONE: Average framerate: 236.86
All ran smoothly, no stuttering and such; smt and llc were the same.
Settings were autodetected to max; I'll try lowering them next time I run the tests.
Strange results, but they are repeatable.
BR, Eduardo
So, I had some free time and I have made nice progress regarding the Diablo3 stutter on Ryzen using smt. It seems that I have fixed it :)
There is an 0007 patch on the google drive for someone to try out.
With this patch I'm not exactly sure how it behaves on Intel. I have changed the bits which select the best CPU to schedule a task to; it now selects the CPU a bit more accurately. I would like Con to look at it, as I lack an idea of why it was the way it was before: CPU cache busyness was not checked in all cases, just in siblings locality, whereas thread busyness is always checked (my guess is because a sibling is not exactly a full core, and a task would not run as fast on a sibling as on a normal core, if one is free).
In addition, in this patch I have switched to using the llc CPU map to check whether CPU caches are busy in all rq sharing cases, which should not change anything on Intel.
I have performance comparisons as well: D3 behaves well; Valley and MHO show better results, especially with smt; compilations are down a fair bit since the previous patches and seem to be on par with the numbers from vanilla MuQSS; cs:go numbers are up with smt and down a little with anything else.
So after this patch, smt seems to be the best overall :)
If Sveinar and Anonymous are still around, please give this patch a bit of testing and report back how it behaves for You. Thanks. I have not booted this on Intel, though :)
BR, Eduardo
Currently I am on 5.3 and a reworked PDS scheduler. This works quite well atm, but I will perhaps give it another go once -ck/MuQSS is up for 5.3.
Somewhat limited time-wise due to some IRL stuff I am working on atm.
Interesting.
Care to share?
As I mentioned, all of my patches are in my google drive, address as usual: https://drive.google.com/drive/folders/1MxUcptaOgPbPgJoUdeq0GkEuoeyaRHdG
Sveinar, PDS works better than BMQ and MuQSS for You?
BR, Eduardo
https://github.com/SveSop/kernel_cybmod
0001 and 0002 are the PDS patches for 5.3, from the TK-Glitch git repo.
Thank you very much, sir.
Better.. hmm..
I used PDS mostly with the 4.x kernel branch, and most things worked very well. Then 5.x came, and BMQ came, but there were quite a few "starting issues" with BMQ, so I ended up with MuQSS.
I feel PDS is working well for me, but to really compare I would need to test all 3 on 5.3, I guess... but I dunno if I have it in me to fiddle with it.
Perhaps it could be an idea for Phoronix to do some comparison tests with the Phoronix Test Suite? Would be a fun experiment to suggest. Possibly also comparing AMD/Intel and the different schedulers.
As with all things: "The absolutely best color is green!"
Meaning: Its all in the eye of the beholder :)
Sveinar, this is interesting, thanks for the info.
I'll go and check PDS; I thought it was dead :) I switched to MuQSS because BMQ had teething issues and PDS was not updated for newer kernels.
But MuQSS was acting weird with runqueues and llc, so I hopefully fixed it (at least I tried); it's working well now.
Nevertheless, for the sake of interest, I'll check my usual stuff on PDS, similarly to how I tested my patches to MuQSS; let's see how it performs.
I need to throw vanilla kernel into the mix as well.
BR, Eduardo
Hi, performance-wise, what's the preferred option for CONFIG_SMT_NICE and CONFIG_RQ_*?
Since SMT is disabled for Intel processors, I am assuming a multi-core processor would benefit most from CONFIG_SMT_NICE=N and CONFIG_RQ_MC=Y, yes?
Not sure what you mean by "SMT is disabled for Intel processors", cos the result of the MDS mitigation (as discussed above) does not actually mean that SMT is disabled.
Unless you are talking about the recommended security option to be most secure, and you have disabled hyperthreading in the BIOS? :)
Last weekend a few issues were reported in the Arch Forum thread for linux-ck. Reproducible input/output errors were reported on 5.2.14 when issuing the 'systemd-detect-virt --container' command. This causes systemd (currently at v243 on Arch Linux) to skip, rather silently, all services carrying a 'ConditionVirtualization' entry. Among the affected services are haveged and systemd-random-seed (important for other entropy-seeking units) and the time sync units. We thought you would want to know about this if you didn't already.
Discussion: https://bbs.archlinux.org/viewtopic.php?pid=1863063#p1863063
Fixes: https://aur.archlinux.org/packages/linux-ck
Regards,
glitsj16,
in name of the Arch sub-forum community
Is this related to why ck currently fails to mount images to directories with the mount command?
@zar That might be the case. But to confirm you'd have to build with either:
# https://bbs.archlinux.org/viewtopic.php?pid=1863567#p1863567
sed -i -e '/CONFIG_LATENCYTOP=/ s,y,n,' -e '/CONFIG_SCHED_DEBUG=/ s,y,n,' ./.config
OR patch MuQSS, cf. https://github.com/ckolivas/linux/pull/17
Is this somewhat related to this BMQ problem? https://gitlab.com/alfredchen/bmq/issues/8
@sveinar That seems to be the same issue, yes. Thanks for the detective work; it confirms there's a more generic issue with systemd-detect-virt and schedulers. For the record, I've made a PR to the same effect at https://github.com/ckolivas/linux/pull/17.
DeleteNice.
Ubuntu also sets CONFIG_SCHED_DEBUG=y in the default config, and I have confirmed that the files in /proc/pid/sched are "empty" for me running Ubuntu LTS 18.04 with 5.2.14 and 5.2.15.
Will try this patch later today :)
Added the patch, and it seems to be proper now.
Deleteeg. /proc/sys/34/sched : idle_inject/4 (34, #threads: 1)
I wonder how the 5.3 kernel with the new "utilization clamping support" turns out vs. needing to use things like MuQSS/-ck patches?
Sounds like it WILL give a performance boost to things like gaming and the like tho..
A commit (https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/tools/objtool?h=v5.2.18&id=47af17950b03b748eea68ad7613f8d8b4c688d45) in the 5.2.18 patch conflicts with 5.2-ck1 (due to https://github.com/ckolivas/linux/commit/40846db6244abc4696bcad4f889016e1952630f4).
It should be simple enough to fix by hand, but I thought I'd mention it here in case anyone's looking for an explanation.
Thanks for reporting.
I am hoping for 5.3-ck1 to arrive soon.
Any patch for the inexperienced?
The linux-ck PKGBUILD on the Arch User Repository has a one-liner fix you can run against the patch-5.2-ck1 file: https://aur.archlinux.org/cgit/aur.git/tree/PKGBUILD?h=linux-ck#n138
Delete(Replace the './"${_ckpatch}"' at the end with the location of the patch, and it should apply against >=5.2.18 afterwards)
Might as well post the full command in case the above link is updated:
sed -i -e '/^-CFLAGS/ s,+=,:=,' -i -e '/^+CFLAGS/ s,+=,:=,' patch-5.2-ck1
Thank you very much, sir.