I've been experimenting with improving the locking scheme in BFS and had a few ideas. I'm particularly attached to the global runqueue that makes BFS what it is and obviously having one queue that has one lock protecting all its data structures accessed by multiple CPUs will end up having quite significant scalability limits with many CPUs. Much like a lot of the scalability work, it tends to have the opposite effect with smaller hardware (i.e. the more scalable you make something, the more it will tend to harm the smaller hardware). Fortunately most of the scalability issues with BFS are pretty much irrelevant until you have more than 16 logical CPUs, but we've now reached the point where 8 logical CPUs is not unusual for a standard PC. Whether or not we actually need so many logical CPUs or not and can use them appropriately is an argument for a different time and place, but they're here. So I've always had at the back of my mind how I might go about making BFS more scalable in the long term and locking is one obvious area.
Unfortunately time and realistic limitations means I'm unlikely to ever do anything on a grand scale or be able to support it. However I had ideas for how to change the locking long-term but it would require lots of incremental steps. After all that rambling, this post is about the first step to changing it. Like some of the experimental steps of the past (such as skip lists), there is absolutely no guarantee that these are worth pursuing.
I've implemented a variant of the read/write locks as used in the kernel to better suit being used as a replacement for the spinlock that protects all the global runqueue data. The problem with read/write locks is they favour readers over writers, and you cannot upgrade or downgrade locks. Once you have grabbed them, if you drop the lock and then try to grab the other variant (eg. going from read to write), the data is no longer guaranteed to be the same. What I've put together (and I'm not remotely suggesting this is my original idea, I'm sure it's been done elsewhere) are upgradeable read/write locks where you can take either the read lock, write lock, or an upgradeable variant. Upgradeable locks can be up or downgraded to write or read locks, and write locks can be downgraded to read locks, but read locks can not be changed. These URW locks are unfortunately more overhead than either spinlocks or normal rwlocks since they're actually just taking combinations of spinlocks and rwlocks. However the overhead of the locks themself may be worthwhile if it allows you to convert otherwise locked data into sections of parallel read code.
So here is a patch that implements them and can be put into pretty much any recent linux kernel version:
Note that patch does nothing by itself, but here is a patch to add on top of that one for BFS423 that modifies the global runqueue to use the urw locks and implements some degree of read/write separation that did not exist with the regular spinlock:
It's rather early code but I've given it a fairly thorough testing at home and it at least seems to be working as desired. On the simplest of preliminary testing I'm unable to demonstrate any throughput advantage on my quad core hardware, but the reassuring thing is I also find no disadvantage. Whether this translates to some advantage on 8x or 16x is something I don't have hardware to test for myself (hint to readers).
Note that even if this does not prove to be any significant throughput gain, then provided it is not harmful, I hope to eventually use it as a stepping stone to a grander plan I have for the locking and scalability. I don't like vapourware and since I haven't finalised the details myself exactly how I would implement them, there's nothing more to say for now. Then there's the time issue... there never seems to be enough, but I only ever hack for fun so it's no problem really.
P.S. Don't let this long post make you not notice that BFS 423 and 3.4-ck2 are also out in the previous post.