Tuesday 29 July 2014

Revisiting URW locks for BFS

A couple of years ago I experimented with upgradeable read-write locks to replace the current spinlock that protects all the data in the global runqueue in BFS. Here is the original description: upgradeable-rwlocks-and-bfs

One of the main concerns with the single runqueue that BFS uses, as opposed to the multiple runqueues of the mainline Linux scheduler, is lock contention: multiple CPUs can end up waiting for access to read and/or modify data protected by the one lock guarding everything on that runqueue. That data is currently protected by a spinlock, but there is a clear demarcation between areas of the scheduler where we are only interested in reading data and areas where we have to write it, which is why I developed the URW locks in the first place.

A brief reminder of why I developed upgradeable read-write locks instead of using regular read-write locks: if multiple readers are holding an RW lock, write access is starved until they all release it. That situation is unacceptable for a scheduler, where it is mandatory that writes take precedence over reads. The upgradeable version has write precedence as a feature as well as being upgradeable, allowing us to hold exclusive write access only for as long as we need it. Their main disadvantage is that they are much heavier in overhead, since they are effectively double locks, so they should only be used where there is a clear use case for them.

I've updated the urwlocks patch and posted it here:
urw-locks.patch
Note this patch by itself does nothing unless other code uses the locks.
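To give a feel for the primitive, here is a minimal userspace sketch of the idea using C11 atomics and pthreads. It is an illustration of the concept only, not the code in urw-locks.patch: the urwlock_t type and the urw_* names are invented for this example, and a real kernel implementation would spin rather than yield and would deal with interrupts and fairness.

/*
 * Minimal userspace sketch of an upgradeable read-write lock with write
 * precedence, built from a mutex plus an atomic reader count. Concept
 * illustration only; not the urw-locks.patch implementation, and all
 * names here are invented for the example.
 */
#include <pthread.h>
#include <sched.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdio.h>

typedef struct {
    pthread_mutex_t owner;   /* held by writers and upgradeable readers */
    atomic_int      readers; /* number of plain readers currently inside */
    atomic_bool     blocked; /* writer in (or pending): stall new readers */
} urwlock_t;

static void urw_init(urwlock_t *l)
{
    pthread_mutex_init(&l->owner, NULL);
    atomic_init(&l->readers, 0);
    atomic_init(&l->blocked, false);
}

/* Shared (read) access: any number of readers may hold this at once. */
static void urw_read_lock(urwlock_t *l)
{
    for (;;) {
        /* write precedence: stall while a writer is in or pending */
        while (atomic_load(&l->blocked))
            sched_yield();
        atomic_fetch_add(&l->readers, 1);
        if (!atomic_load(&l->blocked))
            return;
        /* a writer arrived in the window; back out and retry */
        atomic_fetch_sub(&l->readers, 1);
    }
}

static void urw_read_unlock(urwlock_t *l)
{
    atomic_fetch_sub(&l->readers, 1);
}

/*
 * Upgradeable read: excludes writers and other upgradeable readers via
 * the mutex, but still runs concurrently with plain readers.
 */
static void urw_upgradeable_read_lock(urwlock_t *l)
{
    pthread_mutex_lock(&l->owner);
}

/* Upgrade to exclusive write access: block new readers, drain old ones. */
static void urw_upgrade(urwlock_t *l)
{
    atomic_store(&l->blocked, true);
    while (atomic_load(&l->readers) > 0)
        sched_yield();
}

/* Exclusive write access taken directly, with no read phase first. */
static void urw_write_lock(urwlock_t *l)
{
    urw_upgradeable_read_lock(l);
    urw_upgrade(l);
}

static void urw_write_unlock(urwlock_t *l)
{
    atomic_store(&l->blocked, false);
    pthread_mutex_unlock(&l->owner);
}

int main(void)
{
    urwlock_t lock;

    urw_init(&lock);
    urw_read_lock(&lock);              /* shared access */
    urw_read_unlock(&lock);
    urw_upgradeable_read_lock(&lock);  /* read phase, other readers still run */
    urw_upgrade(&lock);                /* brief exclusive phase for the write */
    urw_write_unlock(&lock);
    urw_write_lock(&lock);             /* or go straight to exclusive */
    urw_write_unlock(&lock);
    printf("urw lock sketch ran\n");
    return 0;
}

The intended pattern is the one described above: take the upgradeable read lock for the bulk of the work, and upgrade to the exclusive write side only around the few statements that actually modify the protected data.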

In my original experiments I did the conversion and ran various benchmarks on a quad core hyperthreaded CPU, and saw neither benefit nor detriment. The locks were developed primarily with a view to expanding on the BFS design for scalability improvements, and lack of time, along with the lack of success in deriving any benefit from that development, led me to shelve the project.

I now have access to a hex core hyperthreaded CPU, which acts like 12 logical CPUs, and felt it was time to revisit the idea. This time I rehashed the use of URW locks for the global runqueue and was even more aggressive in trying to spend as little time as possible with the write lock held. After extensive benchmarking, though, my conclusions were even worse than last time: not only did it not improve performance, it worsened throughput ever so slightly, but statistically significantly. This suggests that whatever scalability improvements are to be gained by decreasing lock contention are offset by the increased overhead of URW locks versus spinlocks.

The updated conversion patch, which depends on the urw locks patch above (both to be applied on top of BFS 449 and all pending patches), is here:
bfs449-grq_urwlocks.patch
 
It was interesting to note that, for whatever reason, the context switch rate actually decreased under load compared to regular BFS. This suggests the discrete read paths helped reduce the need for rescheduling, and that the approach could have been a benefit if not for the increased locking overhead.


In some ways I'm not surprised, as complexity and infrastructure added for the sake of apparent benefit do not generically guarantee improvement, and simpler code, even if not ideal in design, can work better in real world workloads. There are now three discrete examples of this in my experiments with BFS:

1. Single vs multiple runqueues.

With the basic BFS algorithm kept identical, the single runqueue shows no measurable increase in lock contention, or even cache thrashing, compared to multiple runqueues on any regularly available hardware (confirmed on machines up to dual 12-thread CPUs).

2. Spinlocks vs rwlocks.

As per this post.

3. O(n) vs O(ln(n)) lookup.

Any improvement in lookup time with many tasks is offset by the added complexity of insertion and of that lookup, and by what the real world sizes of n actually are. The limiting behaviour of the function only describes how the cost changes with n; it does not tell you what the absolute overhead equates to in real world workloads. A toy comparison is sketched below.
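To make that concrete, here is a toy userspace comparison (not scheduler code, and the numbers will vary by machine): it times appending deadlines to a flat array and scanning it for the smallest, roughly the shape of the O(n) approach, against pushing them into a binary min-heap standing in for an O(log n) insertion structure. With n kept at the small sizes a runqueue typically sees, it is the absolute cost per operation that decides which comes out ahead, not the asymptotic form.

/*
 * Toy illustration of constant factors versus limiting behaviour.
 * Not scheduler code; N and the structures are arbitrary stand-ins.
 * Build with: cc -O2 toy.c -o toy
 */
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define N      16        /* tasks on the queue: real world n is small */
#define ROUNDS 1000000

static unsigned long long now_ns(void)
{
    struct timespec ts;

    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec * 1000000000ULL + ts.tv_nsec;
}

/* O(1) insert, O(n) lookup: unsorted array scanned for the minimum. */
static unsigned scan_min(const unsigned *dl, int n)
{
    unsigned best = dl[0];

    for (int i = 1; i < n; i++)
        if (dl[i] < best)
            best = dl[i];
    return best;
}

/* O(log n) insert into a binary min-heap; the minimum sits at heap[0]. */
static void heap_push(unsigned *heap, int *n, unsigned key)
{
    int i = (*n)++;

    while (i > 0 && heap[(i - 1) / 2] > key) {
        heap[i] = heap[(i - 1) / 2];   /* sift larger parent down */
        i = (i - 1) / 2;
    }
    heap[i] = key;
}

int main(void)
{
    unsigned dl[N], heap[N], sink = 0;
    unsigned long long t0, t1, t2;

    srand(42);

    t0 = now_ns();
    for (int r = 0; r < ROUNDS; r++) {
        for (int i = 0; i < N; i++)
            dl[i] = rand();              /* O(1) enqueue */
        sink += scan_min(dl, N);         /* O(n) pick */
    }
    t1 = now_ns();
    for (int r = 0; r < ROUNDS; r++) {
        int n = 0;

        for (int i = 0; i < N; i++)
            heap_push(heap, &n, rand()); /* O(log n) enqueue */
        sink += heap[0];                 /* O(1) pick */
    }
    t2 = now_ns();

    printf("array scan: %llu ns, heap: %llu ns (sink %u)\n",
           t1 - t0, t2 - t1, sink);
    return 0;
}

The point is not which structure wins on any particular machine, only that the asymptotic label by itself does not settle it.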

-ck

7 comments:

  1. Thanks for sharing your experiments with URW locks and your continuing effort to make bfs sharper.
    I am working on a multiple runqueue improvement. It isn't finished yet and it's hard to tell the benefit versus the extra locking overhead.
    I would also like to ask: how should bfs be benchmarked besides throughput tests? What statistics should I look at? Many thanks.

  2. Here comes my other branch of patches: 8 minor changes on top of bfs 0.449. The previous post would be the right place for these, but it looks like it is being used to debug the ath9k issue, so I am posting them here.

    @ck, please review the code when you are free.

    1. bfs: Remove get_cpu()/put_cpu() in try_to_wake_up()

    preempt_disable()/preempt_enable() are already called in raw_spin_lock_irqsave()/raw_spin_unlock_irqrestore(), so get_cpu()/put_cpu() are not necessary, as in mainline.

    https://bitbucket.org/alfredchen/linux-gc/commits/9583be334957efa456bac6f163f87c250b709807

    2. bfs: Make cpu_rq(cpu) a macro instead of a function call.

    https://bitbucket.org/alfredchen/linux-gc/commits/e3c5ab20310f4cb5c5d36d2e8fcf99e5f918b7d5

    3. bfs: clean up wake_entry in task_struct

    This should be part of "[BFS] Remove runqueue wake_list", as wake_entry is only used for wake_list, so it is not used in bfs.

    https://bitbucket.org/alfredchen/linux-gc/commits/402ff574ab0661275d9ad0571c295c7712073d36

    4. bfs: Refactory idle!=prev and next task code in schedule()

    Just refactors the code without changing the logic. Reduces a duplicated conditional and some negations.

    https://bitbucket.org/alfredchen/linux-gc/commits/35bad9b9b23fad50189bc2e4051f9f12544a44a7

    5. bfs: Remove unnecessary resched_suitable_idle() in schedule().

    When the prev task is not the idle task and needs another cpu, the
    next task returned from earliest_deadline_task() will not be the
    same task as prev because of the affinity. resched_suitable_idle()
    will be called after the prev != next check, so calling
    resched_suitable_idle() after needs_other_cpu() is not necessary.

    https://bitbucket.org/alfredchen/linux-gc/commits/74ad2a8ee96d6cd1382ce21e7c5e06070f07cda0

    6. bfs: comment out unused evaluated stack variables.

    Comment out the evaluated cpu, rq and idle stack variables after the
    context switch, as they are not used in the rest of schedule(). Keep
    them in the source code so they can be uncommented and used again.

    https://bitbucket.org/alfredchen/linux-gc/commits/df01fbd98bc81ba4c607190fbbeabdcdde0f76b0

    7. bfs: Refactor rq dither calculation in schedule().

    https://bitbucket.org/alfredchen/linux-gc/commits/b68343be29c8bc9b22ae3115f59388176a50837f

    8. bfs: Refactory set_rq_task and inline reset_rq_task.

    https://bitbucket.org/alfredchen/linux-gc/commits/856c8971381668f0742a841fdcf2140cf428ac25

    Replies
    1. Patch 2 of this series (bfs: Make cpu_rq(cpu) a macro instead of a function call) breaks kernel compilation for me:

      LD init/built-in.o
      kernel/built-in.o: In function `show_schedstat':
      stats.c:(.text+0x2c7e5): undefined reference to `cpu_rq'
      stats.c:(.text+0x2c862): undefined reference to `cpu_rq'
      make: *** [vmlinux] Error 1

      Regards, Manuel

    2. @Manuel
      Thanks for testing. It's caused by enabling sched stats, and I haven't tested it with this config. I will find some time to fix it this weekend.

    3. This patch fixes the compilation issue in #2:

      bfs: Fix cpu_rq() compilation issue with CONFIG_SCHEDSTATS.

      https://bitbucket.org/alfredchen/linux-gc/commits/3120cc663b1368dd312839b03c13e6d4966722c3

  3. Heya,

    I'd be very interested in a benchmark of how BFS performs with NO locking at all.

    I understand that it would be functionally incorrect, but how much overhead are you actually trying to remove from these spinlocks? What is the "absolute" improvement that could be had? Zero-lock measurements would provide information on the total "gain" that could be had.

    Replies
    1. Kernel code can't run without any locking on SMP processors. It would explode only microseconds into startup, so it's impossible to do.
