-ck hacking: Upgradeable rwlocks and BFS

Monday, 11 June 2012

Upgradeable rwlocks and BFS

I've been experimenting with improving the locking scheme in BFS and had a few ideas. I'm particularly attached to the global runqueue that makes BFS what it is and obviously having one queue that has one lock protecting all its data structures accessed by multiple CPUs will end up having quite significant scalability limits with many CPUs. Much like a lot of the scalability work, it tends to have the opposite effect with smaller hardware (i.e. the more scalable you make something, the more it will tend to harm the smaller hardware). Fortunately most of the scalability issues with BFS are pretty much irrelevant until you have more than 16 logical CPUs, but we've now reached the point where 8 logical CPUs is not unusual for a standard PC. Whether or not we actually need so many logical CPUs or not and can use them appropriately is an argument for a different time and place, but they're here. So I've always had at the back of my mind how I might go about making BFS more scalable in the long term and locking is one obvious area.

Unfortunately time and realistic limitations means I'm unlikely to ever do anything on a grand scale or be able to support it. However I had ideas for how to change the locking long-term but it would require lots of incremental steps. After all that rambling, this post is about the first step to changing it. Like some of the experimental steps of the past (such as skip lists), there is absolutely no guarantee that these are worth pursuing.

I've implemented a variant of the read/write locks as used in the kernel to better suit being used as a replacement for the spinlock that protects all the global runqueue data. The problem with read/write locks is they favour readers over writers, and you cannot upgrade or downgrade locks. Once you have grabbed them, if you drop the lock and then try to grab the other variant (eg. going from read to write), the data is no longer guaranteed to be the same. What I've put together (and I'm not remotely suggesting this is my original idea, I'm sure it's been done elsewhere) are upgradeable read/write locks where you can take either the read lock, write lock, or an upgradeable variant. Upgradeable locks can be up or downgraded to write or read locks, and write locks can be downgraded to read locks, but read locks can not be changed. These URW locks are unfortunately more overhead than either spinlocks or normal rwlocks since they're actually just taking combinations of spinlocks and rwlocks. However the overhead of the locks themself may be worthwhile if it allows you to convert otherwise locked data into sections of parallel read code.

So here is a patch that implements them and can be put into pretty much any recent linux kernel version:
urw-locks.patch

Note that patch does nothing by itself, but here is a patch to add on top of that one for BFS423 that modifies the global runqueue to use the urw locks and implements some degree of read/write separation that did not exist with the regular spinlock:
bfs423-grq_urwlocks.patch

It's rather early code but I've given it a fairly thorough testing at home and it at least seems to be working as desired. On the simplest of preliminary testing I'm unable to demonstrate any throughput advantage on my quad core hardware, but the reassuring thing is I also find no disadvantage. Whether this translates to some advantage on 8x or 16x is something I don't have hardware to test for myself (hint to readers).

Note that even if this does not prove to be any significant throughput gain, then provided it is not harmful, I hope to eventually use it as a stepping stone to a grander plan I have for the locking and scalability. I don't like vapourware and since I haven't finalised the details myself exactly how I would implement them, there's nothing more to say for now. Then there's the time issue... there never seems to be enough, but I only ever hack for fun so it's no problem really.

P.S. Don't let this long post make you not notice that BFS 423 and 3.4-ck2 are also out in the previous post.

20 comments:

Anonymous12 June 2012 at 00:57
What about RCU (Read-copy-update) — can it be used instead?
ReplyDelete
Replies
Unknown12 June 2012 at 01:05
I think it will be kinda difficult to find a BFS user with so many cores to do proper tests. It's somewhat of an oxymoron :-)
ReplyDelete
Replies
X12 June 2012 at 03:05
Con, I think you should try to use the modular scheduler which mainline has.

Doing this could make the code more stable. It doesn't means that BFS has to use per-cpu runqueue completely because the modular scheduler has also provide you a free space to do what you want.:-) In the other words, the patch can be smaller with this.(We just need to change a few lines of code in core.c , sched.h and rewrite the Makefile).
Chen
ReplyDelete
Replies
graysky13 June 2012 at 05:49
OK... did some testing using the my "make" benchmark. I think this endpoint is actually relevant in comparing the code with a reproducible and realworld relevant endpoint, albeit a non-interactivity one!

1) It is a non-latency based measure.
2) Compilation benchmark using gcc to “make bzImage” for a preconfigured linux 3.4 build.
3) Runs benchmarks 28 times totally to get a decent number of observations for a statistical comparison. In all cases, the first run is omitted leaving an n=27.
4) Results are how many seconds it took to compile on a dual Intel E5620 (2x hyperhreaded quadcore CPUs on a single board) @ 2.40 GHz.
5) Make is run with 16 threads (8 physical cores and 8 HT cores).

1st test: 3.4.2-bfs342 vs. 3.4.2-bfs3.4.2+urwlocks:
Result: The URWLocks patched kernel was on the borderline of achieving a statistically significantly difference. If I had powered the analysis with a larger number of replicates (say 63 or 72) I'll bet that it would be statistically significantly SLOWER than the unpatched bfs kernel. You can see from the Anova that the median time difference is 808 ms longer for the kernel running with the experimental patchset:

http://s15.postimage.org/4uv2qx2bt/anova_1.jpg

2nd test: 3.4.2-vanilla vs. 3.4.2-bfs3.4.2:
Result: The vanilla 3.4.2 kernel with the stock queuing system is statically significantly SLOWER than the corresponding bfs patched kernel; median time difference is 152 ms.

http://s15.postimage.org/5rej32ux5/anova_2.jpg

CK - Always glad to support your innovative ideas with regard to schedulers. Just ask and keep up the great work you do for the Linux community :)
ReplyDelete
Replies
ck22 June 2012 at 19:22
Well...

This RIFS code optimises for sleeping tasks under extreme loads. In other words, it adds unfairness favouring interactive tasks at loads of 64 or more. This is precisely the opposite direction of where the bfs code was heading, as it contains a sleep-based interactivity estimator (BFS has no interactivity estimator since they introduce unfairness and behave unpredictably) and it optimises for ridiculous loads that desktop users will never see.

So while I'm happy to see people hacking on BFS, this code is pretty much orthogonal to the whole BFS approach.
ReplyDelete
Replies
Martin23 June 2012 at 06:40
> Thoughts?

i find a linear curve reassuring because it indicates a direct relationship between the quantities, meaning all influencers are known and there are no hidden surprises.
ReplyDelete
Replies
X24 June 2012 at 20:00
Hi Con
I just want to clearify one thing. You said that RIFS/-ES optmises for big load, but actually I don't do that. Also there is *NO ANY INTERACTIVITY ESTIMATOR* in the design of RIFS/-ES. If you say that RIFS have sleep-based estimator I can claim that BFS also have sleep-based estimator(Actually both of them don't have this).
Chen
ReplyDelete
Replies
Anonymous25 June 2012 at 01:55
Performance is a term meaning to comprise three dimensions:
1. energy consideration (think of no NOHZ of RIFS)
2. latency of response to users
3. throughput of data

The last dimension most trivially to measure.
Regarding schedulers there is an additional dimension:
4. predictability (what realtime kernels want to provide)

CFS mainline wants to satify all needs universally. I think a scientific informatical research project could easily show that in principle there cannot be a scheduler to fit the needs of all cases.

Even if you think of a theoretical scheduler with adaptive strategies this artificial intelligence would cost much resources (energy and throughput).

I would like to have a selection choice for users. Something indicated in four dimension by the user from 1-5 points:
5511 a mobile user would want
1155 an audiophile creater would select
1551 a developer compiling his latest source edits.

Ralph Ulrich
ReplyDelete
Replies
X25 June 2012 at 02:18
"CFS mainline wants to satify all needs universally. I think a scientific informatical research project could easily show that in principle there cannot be a scheduler to fit the needs of all cases. " Yes it is. CFS wants to do that so but failed.

"1. energy consideration (think of no NOHZ of RIFS)"
Ah, a news on Phoronix has shown that there is only slightly improvement of energy costing with NOHZ.(Very little). On the other hand NOHZ will cause some problems on interactivity.
ReplyDelete
Replies
Anonymous25 June 2012 at 09:59
Chen, I know: NOHZ energy savings only ticks/gigaHerzOfCpu

I wrote about four dimensions, because I want the developer of a scheduler (YOU) to emphasize on differentiation not univerality!

What are the differences in regard of these four dimensions
(BFS - RIFS - RIFS/ES - CFS)
Chen you didn't answer this question yet, which I also asked at phoronix. You just say your scheduler performs (universal) better than CFS. This answer is a bug :)
Ralph Ulrich
ReplyDelete
Replies
X25 June 2012 at 19:44
@Ralph Ulrich
RIFS is designed for mobile/desktop user.
ReplyDelete
Replies
MikeyB26 June 2012 at 20:20
Ralph,

Latency and throughput are heavily connected. The value of one will affect the value of the other. They are on the same "value scale".

Performance and Energy sound like implementation constraints, It either meets your requirements or it does not. There is no universal measurement.

You have to define the requirements of the task you need a scheduler for; CFS is general, because it has to "fit all needs" as best it can, BFS is designed for desktop interactivity (with fairness) etc.

Mike Brown
ReplyDelete
Replies
X27 June 2012 at 03:02
@Ralph
I have make an RIFS-V3-Test. It is fully designed for desktop and fit the 5511 option.

However, the 1551 option cannot be implement by any scheduler.Or I can say, it is impossible to implement for both low latency or batch/throughput. They are two opposite things and we should prefer one of these two option.

BFS is a 1515 scheduler.EEVDF(With BFS it is actually EDF) is actually the best real-time algorithm people have proven.

Originally CFS(The one 2.6.23 kernel used) is good but Ingo expect to do something to fit all the workload type and CFS becomes not responsive.(On 2.6.31 kernel if we disabled the sleeper fairness CFS will even give very good responsive to users as BFS do, but after 2.6.31 the improved sleeper fairness feature has been hard-coded).
From CFS without sleeper fairness I can see that Unix System V scheduler is the best scheduler for desktop nomally.(CFS has the same approach as Unix Scheduler).Unluckily Linus prefer server users more and the pure Unix Scheduler is not adopted.
Chen
ReplyDelete
Replies
Kael28 June 2012 at 17:32
Hi all,
BFS and BFQ are now the default in all my linux systems. However, I miss the old SLQB allocator so I was looking for a replacement for SLUB/SLAB.

I've found xvmalloc, from compcache. It uses less memory than slub, however I've never used it:

http://compcache.googlecode.com/svn/wiki/files/allocators_ideal_tlsf_kmalloc_compare.gif

Have any of you tried it? What do you think?

Thanks

(note: crossposted at the bfq ml)
ReplyDelete
Replies

Add comment