I've been experimenting with improving the locking scheme in BFS and had a few ideas. I'm particularly attached to the global runqueue that makes BFS what it is, but obviously having one queue, with one lock protecting all its data structures accessed by multiple CPUs, will end up having quite significant scalability limits with many CPUs. Much like a lot of scalability work, improving this tends to have the opposite effect on smaller hardware (i.e. the more scalable you make something, the more it tends to harm the smaller hardware). Fortunately most of the scalability issues with BFS are pretty much irrelevant until you have more than 16 logical CPUs, but we've now reached the point where 8 logical CPUs is not unusual for a standard PC. Whether we actually need so many logical CPUs and can use them appropriately is an argument for a different time and place, but they're here. So I've always had at the back of my mind how I might go about making BFS more scalable in the long term, and locking is one obvious area.
Unfortunately time and realistic limitations mean I'm unlikely to ever do anything on a grand scale or be able to support it. However, I had ideas for how to change the locking long term, though it would require lots of incremental steps. After all that rambling, this post is about the first of those steps. Like some of the experimental steps of the past (such as skip lists), there is absolutely no guarantee that these are worth pursuing.
I've implemented a variant of the read/write locks used in the kernel, better suited to replacing the spinlock that protects all the global runqueue data. The problem with read/write locks is that they favour readers over writers, and you cannot upgrade or downgrade them. Once you have grabbed one, if you drop it and then try to grab the other variant (e.g. going from read to write), the data is no longer guaranteed to be the same. What I've put together (and I'm not remotely suggesting this is my original idea; I'm sure it's been done elsewhere) are upgradeable read/write locks, where you can take either the read lock, the write lock, or an upgradeable variant. Upgradeable locks can be upgraded or downgraded to write or read locks, and write locks can be downgraded to read locks, but read locks cannot be changed. These URW locks unfortunately have more overhead than either spinlocks or normal rwlocks, since they're really just combinations of spinlocks and rwlocks. However, the overhead of the locks themselves may be worthwhile if it allows you to convert otherwise locked data into sections of parallel read code.
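To give a rough idea of the concept, here is a minimal sketch of how an upgradeable rwlock can be built from a spinlock paired with an ordinary rwlock. The names and layout are illustrative only and do not necessarily match what the urw-locks.patch actually does:

#include <linux/spinlock.h>

/* Sketch of an upgradeable read/write lock: a spinlock serialises writers
 * and upgradeable holders, while the rwlock allows concurrent readers. */
typedef struct {
	spinlock_t up_lock;	/* held by writers and upgradeable holders */
	rwlock_t   rw_lock;	/* shared by readers, exclusive for writers */
} urw_lock_t;

static inline void urw_read_lock(urw_lock_t *l)
{
	read_lock(&l->rw_lock);
}

static inline void urw_read_unlock(urw_lock_t *l)
{
	read_unlock(&l->rw_lock);
}

static inline void urw_write_lock(urw_lock_t *l)
{
	spin_lock(&l->up_lock);
	write_lock(&l->rw_lock);
}

static inline void urw_write_unlock(urw_lock_t *l)
{
	write_unlock(&l->rw_lock);
	spin_unlock(&l->up_lock);
}

/* Upgradeable variant: excludes writers and other upgradeable holders,
 * but still lets plain readers in. */
static inline void urw_upgradeable_lock(urw_lock_t *l)
{
	spin_lock(&l->up_lock);
	read_lock(&l->rw_lock);
}

static inline void urw_upgradeable_unlock(urw_lock_t *l)
{
	read_unlock(&l->rw_lock);
	spin_unlock(&l->up_lock);
}

/* Upgrade to a write lock: up_lock is already held, so no writer can sneak
 * in between dropping the read side and taking the write side, and readers
 * never modify the data, so it is guaranteed unchanged across the upgrade. */
static inline void urw_upgrade(urw_lock_t *l)
{
	read_unlock(&l->rw_lock);
	write_lock(&l->rw_lock);
}

/* Downgrade a write lock to a plain read lock: again up_lock keeps other
 * writers out while the sides are swapped. */
static inline void urw_downgrade(urw_lock_t *l)
{
	write_unlock(&l->rw_lock);
	read_lock(&l->rw_lock);
	spin_unlock(&l->up_lock);
}

You can see where the extra overhead comes from: every write or upgradeable acquisition now touches two locks instead of one.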
So here is a patch that implements them and can be put into pretty much any recent linux kernel version:
urw-locks.patch
Note that the patch does nothing by itself, but here is a patch to add on top of that one for BFS 423 which modifies the global runqueue to use the URW locks and implements some degree of read/write separation that did not exist with the regular spinlock:
bfs423-grq_urwlocks.patch
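To illustrate what the read/write separation buys, here is a hypothetical usage pattern along the lines the patch aims for, reusing the urw_* helpers sketched above. The symbols here (grq_urw, scan_queue_for_candidate, dequeue_chosen_task) are placeholders for illustration only, not the actual functions or data structures the patch touches:

/* Hypothetical example: scan the queue under the upgradeable lock so other
 * CPUs can still take the read lock concurrently, and only exclude them for
 * the brief moment the queue is actually modified. Assumes grq_urw has been
 * initialised with spin_lock_init()/rwlock_init(). */
static urw_lock_t grq_urw;

static bool scan_queue_for_candidate(void);	/* read-only lookup, placeholder */
static void dequeue_chosen_task(void);		/* queue modification, placeholder */

static void example_schedule_path(void)
{
	bool found;

	urw_upgradeable_lock(&grq_urw);		/* other readers may still run */
	found = scan_queue_for_candidate();

	if (found) {
		urw_upgrade(&grq_urw);		/* now exclusive */
		dequeue_chosen_task();
		urw_downgrade(&grq_urw);	/* back to read-only coverage */
		urw_read_unlock(&grq_urw);
	} else {
		urw_upgradeable_unlock(&grq_urw);
	}
}

With the plain grq spinlock, both the scan and the dequeue exclude every other CPU; with the URW lock only the dequeue itself does.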
It's rather early code, but I've given it fairly thorough testing at home and it at least seems to be working as desired. On the simplest of preliminary testing I'm unable to demonstrate any throughput advantage on my quad core hardware, but the reassuring thing is that I also find no disadvantage. Whether this translates to some advantage on 8x or 16x machines is something I don't have the hardware to test for myself (hint to readers).
Note that even if this does not prove to be any significant throughput gain, provided it is not harmful I hope to eventually use it as a stepping stone to a grander plan I have for the locking and scalability. I don't like vapourware, and since I haven't finalised exactly how I would implement that plan, there's nothing more to say for now. Then there's the time issue... there never seems to be enough, but I only ever hack for fun so it's no problem really.
P.S. Don't let this long post make you miss that BFS 423 and 3.4-ck2 are also out, in the previous post.
What about RCU (Read-copy-update) — can it be used instead?
Not particularly, because of the delays in transactions, the dependence on quiescent periods and so on.
I think it will be kinda difficult to find a BFS user with so many cores to do proper tests. It's somewhat of an oxymoron :-)
I have a dual quad machine (8 physical cores + 8 HT cores) and will be glad to test this out... the most quantitative benchmark I have is my infamous "make" benchmark which I will run and post the results.
Thanks for hacking CK, don't ever stop!
Con, I think you should try to use the modular scheduler which mainline has.
Doing this could make the code more stable. It doesn't mean that BFS has to use per-CPU runqueues completely, because the modular scheduler also provides you free space to do what you want. :-) In other words, the patch can be smaller this way (we just need to change a few lines of code in core.c and sched.h and rewrite the Makefile).
Chen
Hi Chen,
as a normal user I really appreciate you and H. Danton working on the issue. There is a need for differentiation in the area of schedulers. Wouldn't it be great if there was a plugin infrastructure for this in mainline? I think this would be quite simple, just a few #ifdef SCHEDNAME in mainline code.
Ralph Ulrich
Ok. But using the interface provided by mainline is better because it can prevent producing bugs. Everyone hates bugs, especially bugs in the critical part of an OS.
OK... did some testing using my "make" benchmark. I think this endpoint is actually relevant for comparing the code, being a reproducible and real-world relevant endpoint, albeit a non-interactivity one!
1) It is a non-latency based measure.
2) Compilation benchmark using gcc to “make bzImage” for a preconfigured Linux 3.4 build.
3) The benchmark is run 28 times in total to get a decent number of observations for a statistical comparison. In all cases, the first run is omitted, leaving n=27.
4) Results are how many seconds it took to compile on a dual Intel E5620 (2x hyperthreaded quad-core CPUs on a single board) @ 2.40 GHz.
5) Make is run with 16 threads (8 physical cores and 8 HT cores).
1st test: 3.4.2-bfs342 vs. 3.4.2-bfs3.4.2+urwlocks:
Result: The URW locks patched kernel was on the borderline of achieving a statistically significant difference. If I had powered the analysis with a larger number of replicates (say 63 or 72) I'd bet that it would be statistically significantly SLOWER than the unpatched bfs kernel. You can see from the ANOVA that the median time difference is 808 ms longer for the kernel running with the experimental patchset:
http://s15.postimage.org/4uv2qx2bt/anova_1.jpg
2nd test: 3.4.2-vanilla vs. 3.4.2-bfs3.4.2:
Result: The vanilla 3.4.2 kernel with the stock queuing system is statistically significantly SLOWER than the corresponding bfs patched kernel; the median time difference is 152 ms.
http://s15.postimage.org/5rej32ux5/anova_2.jpg
CK - Always glad to support your innovative ideas with regard to schedulers. Just ask and keep up the great work you do for the Linux community :)
Thank you very much. It's clearly not a win, but then it's not a massive loss either. If it opens up other possibilities in the scheduler it may be worth pursuing, but for now it's not going to be part of "official" BFS just yet.
Well...
This RIFS code optimises for sleeping tasks under extreme loads. In other words, it adds unfairness favouring interactive tasks at loads of 64 or more. This is precisely the opposite direction to where the BFS code was heading: RIFS contains a sleep-based interactivity estimator (BFS has no interactivity estimator, since they introduce unfairness and behave unpredictably), and it optimises for ridiculous loads that desktop users will never see.
So while I'm happy to see people hacking on BFS, this code is pretty much orthogonal to the whole BFS approach.
> Thoughts?
I find a linear curve reassuring because it indicates a direct relationship between the quantities, meaning all the influences are known and there are no hidden surprises.
Hi Con
I just want to clarify one thing. You said that RIFS/-ES optimises for big loads, but actually I don't do that. Also, there is *NO* interactivity estimator in the design of RIFS/-ES. If you say that RIFS has a sleep-based estimator, I can claim that BFS also has a sleep-based estimator (actually, neither of them has one).
Chen
Performance is a term comprising three dimensions:
1. energy consumption (think of RIFS having no NOHZ)
2. latency of response to users
3. throughput of data
The last dimension is the most trivial to measure.
Regarding schedulers there is an additional dimension:
4. predictability (what realtime kernels want to provide)
CFS mainline wants to satisfy all needs universally. I think a computer science research project could easily show that in principle there cannot be a scheduler to fit the needs of all cases.
Even if you think of a theoretical scheduler with adaptive strategies, that artificial intelligence would itself cost a lot of resources (energy and throughput).
I would like users to have a selection choice: something indicated in the four dimensions by the user, from 1-5 points each:
5511 a mobile user would want
1155 an audiophile creator would select
1551 a developer compiling his latest source edits.
Ralph Ulrich
"CFS mainline wants to satify all needs universally. I think a scientific informatical research project could easily show that in principle there cannot be a scheduler to fit the needs of all cases. " Yes it is. CFS wants to do that so but failed.
ReplyDelete"1. energy consideration (think of no NOHZ of RIFS)"
Ah, a news on Phoronix has shown that there is only slightly improvement of energy costing with NOHZ.(Very little). On the other hand NOHZ will cause some problems on interactivity.
Chen, I know: the NOHZ energy saving is only ticks per gigahertz of CPU.
I wrote about four dimensions because I want the developer of a scheduler (YOU) to emphasize differentiation, not universality!
What are the differences with regard to these four dimensions
(BFS - RIFS - RIFS/ES - CFS)?
Chen, you haven't answered this question yet, which I also asked at Phoronix. You just say your scheduler performs (universally) better than CFS. This answer is a bug :)
Ralph Ulrich
@Ralph Ulrich
RIFS is designed for mobile/desktop users.
Ralph,
Latency and throughput are heavily connected. The value of one will affect the value of the other. They are on the same "value scale".
Performance and energy sound like implementation constraints: a scheduler either meets your requirements or it does not. There is no universal measurement.
You have to define the requirements of the task you need a scheduler for; CFS is general because it has to "fit all needs" as best it can, while BFS is designed for desktop interactivity (with fairness), etc.
Mike Brown
@Ralph
I have made an RIFS-V3-Test. It is fully designed for the desktop and fits the 5511 option.
However, the 1551 option cannot be implemented by any scheduler. Or I could say, it is impossible to optimise for both low latency and batch/throughput. They are two opposite things and we have to prefer one of the two options.
BFS is a 1515 scheduler. EEVDF (with BFS it is actually EDF) is the best real-time algorithm people have proven.
Originally CFS (the one the 2.6.23 kernel used) was good, but Ingo wanted it to fit all workload types and CFS became less responsive. (On the 2.6.31 kernel, if we disabled sleeper fairness, CFS would even give very good responsiveness to users, as BFS does, but after 2.6.31 the improved sleeper fairness feature has been hard-coded.)
From CFS without sleeper fairness I can see that the Unix System V scheduler is normally the best scheduler for the desktop (CFS has the same approach as the Unix scheduler). Unluckily, Linus prefers server users more, and the pure Unix scheduler was not adopted.
Chen
Hi all,
BFS and BFQ are now the default on all my Linux systems. However, I miss the old SLQB allocator, so I was looking for a replacement for SLUB/SLAB.
I've found xvmalloc, from compcache. It uses less memory than SLUB; however, I've never used it:
http://compcache.googlecode.com/svn/wiki/files/allocators_ideal_tlsf_kmalloc_compare.gif
Have any of you tried it? What do you think?
Thanks
(note: crossposted at the bfq ml)
(btw about that diagram: tlsf is the older incarnation of xvmalloc, kmalloc is slub)