Tuesday, 29 July 2014

Revisiting URW locks for BFS

A couple of years ago I experimented with upgradeable read-write locks to replace the current spinlock that protects all the data in the global runqueue in BFS. Here is the original description: upgradeable-rwlocks-and-bfs

One of the main concerns when using a single runqueue which BFS does, as opposed to multiple runqueues, as the mainline linux scheduler does, is lock contention, where multiple CPUs can be waiting on getting access to read and/or modify data protected by the lock protecting all the data on the single runqueue. This data is currently protected by a spinlock, but there is clear demarcation in areas of the scheduler where we're only interested in reading data and others where we have to write data, which is why I developed the URW locks in the first place.

A brief reminder of why I have developed upgradeable read write locks instead of using regular read-write locks is that if there are multiple readers holding an RW lock, write access is starved until they all release the lock. This situation is unacceptable for a scheduler where it is mandatory that writes take more precedence than reads and the upgradeable version has write precedence as a feature as well as being upgradeable, thus allowing us to hold exclusive write access only for as long as we need it. Their main disadvantage is they are much heavier in overhead since they're effectively using double locks hence they would need to be used only where there is a clear work case for them.

I've updated the urwlocks patch and posted it here:
Note this patch by itself does nothing unless other code uses the locks.

In my original experiments I had done the conversion and performed various benchmarks on a quad core hyperthreaded CPU and had seen neither benefit nor detriment. The locks were originally developed with a view to expanding on the BFS design for scalability improvements primarily, and time and lack of success in deriving benefit from that development led to me shelving the project.

I now have access to a hex core hyperthreaded CPU which acts like 12 logical CPUs and felt it was time to revisit the idea. This time I rehashed the use of URW locks for the global runqueue and was even more aggressive in trying to spend as little time with the write lock held as possible. After extensive benchmarking, though, my conclusions were even worse than last time: Not only did it not improve performance, it statistically significantly ever so slightly worsened throughput. This suggests that whatever scalability improvements are there to be gained by decreasing lock contention are offset by the increased overhead of URW locks versus spinlocks.

The updated version of this patch that depends on the urw locks patch above (to be applied on top of BFS 449 and all pending patches) is here:
It was interesting to note that for whatever reason the context switch rate actually decreased under load compared to regular BFS suggesting the discrete read paths helped contribute to less requirement for rescheduling suggesting it could lead to benefit if not for the increased locking overhead.

In some ways I'm not surprised as complexity and added infrastructure for the sake of apparent benefit generically does not guarantee improvement and simpler code, if not ideal in design, can work better in real world workloads. There are three discrete examples of this now in my experiments with BFS:

1. Single vs multiple runqueues.

The identical design in basic algorithm for BFS when applied to multiple runqueues versus the single runqueue shows no measurable increase in lock contention or even cache trashing on any regularly available hardware (confirmed on up to dual 12 threaded CPU machines).

2. Spinlocks vs rwlocks.

As per this post.

3. O(n) vs O(ln(n)) lookup.

Any improvement in the lookup time of many tasks is offset by the added complexity of insertion and that lookup and what the real world sizes of n will be. The limiting behaviour of the function describes how it changes with n only, it does not tell you what the absolute overhead equates to in real world workloads.


Friday, 4 July 2014

BFS 0.449

Hot on the heels of the BFS448 release, I was doing some experimenting for some ideas I had (nothing productive so far) when I discovered the long-standing "CPU locality" code which determines the relationship between CPUs (eg. if they're SMT siblings or separate physical CPUs etc) was broken. So I've fixed the code that determines that, along with printing out what BFS believes to be the relationship (called locality) in dmesg on startup. An example output from a 2 thread, 2 core CPU would be:

[ 0.100217] LOCALITY CPU 0 to 1: 1
[ 0.100220] LOCALITY CPU 0 to 2: 2
[ 0.100221] LOCALITY CPU 0 to 3: 2
[ 0.100222] LOCALITY CPU 1 to 2: 2
[ 0.100223] LOCALITY CPU 1 to 3: 2
[ 0.100224] LOCALITY CPU 2 to 3: 1

I've also added the namespace fix as posted by here by Bogdan Trach (Thanks!). Diff from BFS 0.448 and full patch here:

BFS 3.15 patches

The changes in this patch may improve CPU throughput and decrease latency under certain circumstances but no benchmarking so far has shown any statistically significant difference.


Thursday, 3 July 2014

BFS 0.448, 3.15-ck1

Announcing a resync and update of BFS for linux kernel 3.15.x. I'm currently on vacation but fortunately had enough downtime to hack this together in the evenings and pinged a few people to do some testing for me before releasing it since I only have my laptop with me and could not do the usual set of build and run tests on multiple configurations (thanks!).
 This is basically a resync of the last BFS along with trivial changes to stay in sync with the mainline kernel, along with some of the queued build fixes submitted by others on this blog (thanks!). Alas the users of ath9k with Tux On Ice that I pinged early on with a test patch have shown the same issue exists (which is not surprising since BFS has only been trivially changed in quite a few releases now) so I'm pretty sure whatever the interaction is was introduced somewhere between 3.13 and 3.14.
I have reviewed Alfred Chen's patches and for the time being have not included them in BFS, though I do like the direction his changes have taken. The first patch sets a flag that isn't used by BFS so it was not necessary. The other changes to resched_best_mask are sound and the only thing they're missing is an equivalent optimisation for compiled in support for MC and SMT schedulers on hardware that doesn't have one and/or the other.
 So here it is:

BFS by itself:

3.15-ck1 patchset directory:


Tuesday, 6 May 2014

BFS 0.447, 3.14-ck1

Announcing a resync and update of BFS for linux kernel 3.14.x:

This is mainly a resync from BFS 0.446, but with the addition of the patches as offered by the generous users as seen in the comments here, Alfred Chen and Oleksandr Natalenko. The changes are to fix a circular locking issue on bootup that rarely hit some people, a fix for kvm soft lockups in SMP mode, and to remove some config options that should not be used with BFS.
What's interesting about working on this latest BFS is that I ran into all sorts of instability due to the new kernel that ironically worked out to be a very serious bug in 3.14.0 and was fixed in 3.14.1 with this patch:
commit 8e58cd80d042569da7af501de897c5e0538d99b0
futex: avoid race between requeue and wake
As is often the case, BFS is exceptional at bringing out race conditions and my machine was almost unusable with any significantly multithreaded application such as firefox which kept hanging. This was a scenario where my delay at syncing up the code worked to my advantage as 3.14.2 is working fine.
So here it is:
BFS by itself:

CK branded BFS:

Somehow I still forgot to include PF's patch for uniprocessor builds, though it's so uncommon to come across a uniprocessor these days! His patch is still valid and can be grabbed here to be applied on top if you need it: