Showing posts with label bfs. Show all posts

Sunday, 9 August 2015

BFS 464, linux-4.1-ck2

Here's an updated BFS/CK which includes the one test patch I put on this blog after 463 and another trivial fix for the previous release. The patch fixed a lot of regressions including hangs with BTRFS and panics on shutdown.

BFS by itself:

4.1-sched-bfs-464.patch

-ck branded linux-4.1-ck2 patches:

4.1-ck2 patches

Enjoy!
お楽しみください

Sunday, 2 August 2015

BFS 463, linux-4.1-ck1

Finally a resync to linux-4.1. Sorry, I was just too preoccupied to get around to doing this sooner. I haven't directly addressed a few known problems that have workarounds, and this release comes with a warning.

BFS by itself:

4.1-sched-bfs-463.patch

-ck branded linux-4.1-ck1 patches:

4.1-ck1 patches

The usual collection of resyncs and minor updates including pending fixes post 462.

This includes a fix for some uniprocessor build problems courtesy of Serge Belyshev. If you still have boot problems with uniprocessor builds the workaround is to create an SMP kernel.

I've finally bitten the bullet and removed the block flush code from within the main schedule() call, in keeping with how mainline does it. Every time I've tried removing this code from previous kernels, problems recurred and I had to re-add it. Complete hangs under particularly heavy IO used to be the problem, hence the warning; please report back if these come back with this kernel.

On the previous kernel, some had crashes unless they enabled NUMA. I have no idea what caused these and have done no specific changes to address it. I don't want people to enable NUMA unnecessarily but if you have crashes this is the first thing to try and please report back.

Enjoy!
お楽しみください

Thursday, 16 April 2015

BFS 462, linux-4.0-ck1

Announcing a resync and update of BFS for linux-4.0

BFS by itself:

4.0-sched-bfs-462.patch

-ck branded linux-4.0-ck1 patches:

4.0-ck1 patches

The usual collection of resyncs and minor updates only.

It includes the following changes:
- Minor tweaks to uniprocessor build (though enabling SMP will fix breakage if it still exists).
- Fix for tracing build failure
- SMT nice update to ignore kernel threads
- Decrease log level of locality information to debug

EDIT Fix for 4.0.2+: bfs462-rtmn-fix.patch

Enjoy!
お楽しみください

Friday, 27 February 2015

BFS 461, linux-3.19-ck1

Announcing a resync and update of BFS for linux-3.19

BFS by itself:

3.19-sched-bfs-461.patch

-ck branded linux-3.19-ck1 patches:

3.19-ck1 patches

Apart from a resync with mainline and merging of the pending patches that were around for BFS460, there are no new changes. Apologies if I've been unable to address any new issues posted here - as per usual lack of time is the reason. There are some pending changes to the scheduler for mainline (as pointed out by kernelOfTruth here: link) but they're not finalised so I won't be delaying this release to wait for them.

Enjoy!
お楽しみください

Thursday, 11 December 2014

BFS 460, linux-3.18-ck1

Announcing a resync and update of BFS for linux-3.18

BFS by itself:

3.18-sched-bfs-460.patch

-ck branded linux-3.18-ck1 patches:

3.18-ck1 patches

Uncharacteristically I found time to resync up quickly for this latest stable linux release. There are no new BFS features, but there have been a number of changes to stay in sync with mainline. Apart from keeping up with the usual churn in new releases, of which there was a modest amount this time, a number of other low level changes were committed making this much less of a trivial resync so some caution is warranted before blindly updating.

Hillf Danton pointed out a bug in the yield_to code (thanks!) which is now fixed. Since almost nothing uses this code you probably won't notice anything. He also pointed out some other now outdated components in BFS which are also updated. The above_background_load function has also been removed since the VM tweaks in older -cks no longer exist to use it.

More substantially, I've reworked the plugged I/O code to match mainline. I had been reluctant to touch it previously because of the deadlocks that the unlocking and relocking in the scheduler code path introduced when the first plugged I/O code made its way into BFS, which needed several iterations of fixes. Watch for any I/O misbehaviour or stalls. There are also some changes to how mainline responds to idle CPUs, so watch for any unusual behaviour there.

Having said that I've been using it for a while and not noticed anything out of the ordinary, but please report back if there are any issues.

Enjoy!
お楽しみください

Tuesday, 18 November 2014

BFS 458, linux-3.17-ck2

This is a bugfix release for the power usage regression as reported here with BFS 457.

BFS by itself:
3.17-sched-bfs-458.patch

CK branded BFS separate and combined patches:
3.17-ck2

Incremental change from BFS 457-458:
bfs457-458.patch

Enjoy!
お楽しみください

Tuesday, 11 November 2014

BFS 457, linux-3.17-ck1

Finally announcing a resync and minor update of BFS for the linux-3.17(.x) kernel releases.

Only minor updates have gone into this release apart from including one of the rework patches by Alfred Chen (thanks!) and the removal of the old KVM workaround that was no longer required with the bugfixes last release courtesy of Graysky (thanks!).

BFS by itself:
3.17-sched-bfs-457.patch

CK branded BFS separate and combined patches:
3.17-ck1

For those interested in the minor changes that made it up to 457, the incremental patches are available:
bfs457-incremental

Enjoy!
お楽しみください

Monday, 25 August 2014

BFS 453/454/455/456 and 3.16-ck2

Here is an updated set of BFS patches with the accumulated bugfixes as debugged on this blog for kernels 3.13 to 3.16 inclusive. The main obvious bug which affected people was the ath9k module which would hang on suspend/resume. However there were likely a number of subtle bugs across the board that most people would not be aware of and even I only noticed that kvm behaved much better after this applied bugfix which stretches back to every BFS after 3.12.

In order to make up for the fact that there are numerous kernels out there based on BFS across the different versions, I have updated BFS and numbered the versions according to which base kernel they are on. Note that there are no feature backports on the older kernels, only the bugfixes, so SMT nice is only on the 3.16 BFS.

3.13-sched-bfs-453.patch
3.14-sched-bfs-454.patch
3.15-sched-bfs-455.patch
3.16-sched-bfs-456.patch

And along with that an updated ck branded release for 3.16, 3.16-ck2:

3.16-ck2


Enjoy!
お楽しみ下さい

Saturday, 16 August 2014

BFS 450, 3.16-ck1

Announcing a resync and update of BFS for linux kernel 3.16.x. Coding has proven a nice distraction from unpleasant life events so I've been able to bring the patch up to date with the latest kernel.

A number of minor fixes as queued up post 3.15-ck1 made their way into this patchset, along with some changes inspired by the development work of Alfred Chen (thanks!).

The major feature upgrade in this one is the inclusion of SMT nice as discussed at length on this blog. This version of BFS includes an updated version of SMT nice beyond version 6 posted here with one change - 25% of the CPU time of any nice level of SCHED_NORMAL tasks can be shared with any other nice level over and above the nice-based CPU distribution. This is to capitalise on the slightly increased throughput that is available by using the sibling CPU concurrently without too dramatically affecting higher priority process CPU loss. In addition it dramatically reduces the massive latencies that can sometimes otherwise be seen by heavily niced tasks with SMT nice enabled by dithering the metering out of CPU instead of giving it all as a burst only when it's entitled to CPU.

Making SMT nice configurable means users can get to choose if they still want the standard behaviour. The config option will recommend users who enable the SMT scheduler option also enable the SMT nice option. I believe this to be a good default choice for virtually all desktop users, and selectively for server users if they depend heavily on the use of 'nice' or scheduling policies for their work cases (but otherwise it should be disabled).

BFS by itself:
3.16-sched-bfs-450.patch
3.16-ck1 branded BFS patchset directory:
3.16-ck1

EDIT: A build fix for non SMT enabled kernels to prevent it being possible to enable SMT nice is here:
bfs450-nosmt-buildfix.patch
Just disabling SMT nice will achieve the same thing for those affected.


Enjoy!
お楽しみください

Sunday, 10 August 2014

SMT Nice 6

In my last post I discussed the problem with nice levels, scheduling policies and SMT, and my first public patch for BFS to work around the issue, "SMT Nice":

smthyperthreading-nice-and-scheduling.html

With a bit of extra testing, and feedback from a number of users, a few issues were discovered with the first patch, so I've reworked it. Thanks very much to those who tested it and provided feedback. There were a couple of scheduling points where SMT siblings weren't being examined, and the difference between nice levels was far more aggressive than it was supposed to be.

Here is an updated patch for BFS 449 with all pending patches:
bfs449-smtnice-6.patch

And a convenient all inclusive patch for 3.15 with ck1+pending+smtnice3456:
3.15-ck1-smtnice6.patch

EDIT: Added one minor change to not allow kernel threads to deschedule any user tasks on SMT siblings and bumped the patch version up to smtnice4.

EDIT2: Added a change to fix the high power usage bug bringing version up to smtnice5

EDIT3: Fixed a logic fail which would cause far too many reschedules and not use full CPU with many niced tasks bringing the version up to smtnice6.

---
Enjoy!
お楽しみ下さい
-ck

Friday, 1 August 2014

SMT/Hyperthreading, nice and scheduling policies

The concept of symmetric multi-threading, which Intel called "Hyperthreading" and introduced into their commodity CPUs first around 2001, is not remotely a new one and goes back a long way before Intel introduced it into the mainstream market. I suspect the introduction of it back then by Intel was them easing the concept of increasing threads and cores for marketing reasons with the imminent walls they'd soon hit with CPU heat and power requirements that would stop the pursuit for higher and higher single CPU frequencies. The idea is that, since a lot of the CPU sits unused even when something is running as fast as it can on part of it, with a bit of extra logic and architecture, you could throw another "virtual core" at some of the unused execution units and behave like 2 (or more) CPUs, putting more of the CPU to good use. These days the vast majority of CPUs sold by Intel have hyperthreading on them, thus doubling the virtual or "logical" cores the CPU has, including even their low power atom offerings.

There have been numerous benchmarks, in-field tests, workloads etc., where people have tried to find whether hyperthreading is better or not. With a bit of knowledge of the workings of hyperthreading, it's pretty easy to know what the answer is, and not surprisingly, it's the frustrating answer of "it depends". That's the most accurate answer by far, but I'd go further and say that if you have any kind of mixed workload, hyperthreading is always going to be better, whereas if you have precisely one workload, then you have to define exactly how it's going to work to know whether hyperthreading will be better or not. Which means that in my opinion at least, hyperthreading is advantageous on a desktop, laptop, tablet and even phone since by design they're nothing but mixed workloads. I won't spend much longer on this discussion, but suffice to say that I think about 4 threads (at the moment) is about optimal for most real world desktop(y) workloads.

Imagine for a moment you have a single core CPU which you can run as is, or enable hyperthreading to run as a 2 thread CPU. If you were to run your CPU in single core only mode, then when you run one task at a time it will always use the full power of the CPU, but if you run two tasks, each task runs at 50% the speed and completes in double the time. If you enable hyperthreading, then if you have two mixed workloads that actually use different parts of the CPU, you can actually get effectively (at best) about 140% of the performance of running the CPU in single core mode. This means that instead of the two tasks running at 50% speed when run concurrently, they run at 70% speed. In practice, the actual performance benefit is rarely 40% but it is often on the order of 25%, so each task tends to run about 60% speed instead of 50% speed. Still a nice speedup for "free".
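To make that arithmetic concrete, here is a tiny sketch in Python. The function name is made up for illustration, and the 1.4 and 1.25 figures are only the rough best-case and typical numbers quoted above, not measurements.

```python
def per_task_speed(combined_throughput, tasks=2):
    """Speed of each task relative to running alone on the full core,
    assuming the combined throughput is split evenly between the tasks."""
    return combined_throughput / tasks


# Combined throughput relative to one task on a non-hyperthreaded core.
scenarios = {
    "single core, time-sliced": 1.00,
    "hyperthreading, best case": 1.40,
    "hyperthreading, typical": 1.25,
}

for label, throughput in scenarios.items():
    print(f"{label}: each task runs at {per_task_speed(throughput):.1%}")
```

The typical case works out to 62.5% per task, which is the "about 60%" figure used in the rest of this post.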

One thing has always troubled me about hyperthreading, though, and that is the way it tends to break priority support in the scheduler. By priority support, I refer to the use of 'nice' and other scheduling policies, such as realtime, sched idleprio etc.

If you have a single core CPU and run a nice 0 task concurrently with a nice +19 task, the nice 0 task will get about 98% of the CPU time and the nice +19 task only about 2%. The scheduler does this by serialising and metering out the time each task gets to spend on the CPU. Now if you enable hyperthreading on that CPU, the scheduler no longer serialises access to the CPU, but gives each of those tasks one logical "core" on the CPU, and you get an overall 25% increase in throughput. However of the total throughput, both the nice 0 and nice +19 task get precisely half. This would be fine if we had two real cores, but they're not, and the performance of both tasks is sacrificed to ~60% to achieve this. Which means that for this contrived but simple example, enabling hyperthreading slows down the overall execution speed of your nice 0 task when you run a nice +19 task much more than on a single core - it runs at 60% speed instead of 98%.
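The same comparison as a rough calculation, assuming the 98% share and the 25% throughput gain quoted above. The names are hypothetical and the model is deliberately crude: it treats speed on a single core as equal to CPU share, and assumes a scheduler with no SMT awareness splits the combined throughput evenly.

```python
SMT_THROUGHPUT = 1.25  # assumed typical combined throughput with hyperthreading


def foreground_speed_serialised(cpu_share):
    """On one core the scheduler time-slices, so speed equals CPU share."""
    return cpu_share


def foreground_speed_smt(throughput=SMT_THROUGHPUT, tasks=2):
    """With one task per sibling and no SMT awareness, priority is ignored
    and every task gets an equal slice of the combined throughput."""
    return throughput / tasks


nice0_single = foreground_speed_serialised(0.98)  # nice 0 vs nice +19, one core
nice0_smt = foreground_speed_smt()                # same pair, one per sibling
slowdown = 1 - nice0_smt / nice0_single

print(f"nice 0 task, single core: {nice0_single:.1%} of full speed")
print(f"nice 0 task, hyperthreaded: {nice0_smt:.1%} of full speed")
print(f"relative slowdown of the nice 0 task: {slowdown:.1%}")
```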

An even more dramatic example is what happens with realtime tasks, which these days most audio backends on linux use (usually through pulseaudio). Running a realtime task concurrently with a SCHED_NORMAL nice 0 task on a single core means the realtime task will get 100% CPU and the nice 0 task will get zero CPU time. Enable hyperthreading and suddenly the realtime task only runs at 60% of its normal speed even with a heavily niced +19 task running in the background.

Enter SMT-nice as I call it. This is not a new idea, and in fact my first iteration of it was for mainline 10(!) years ago. See here: SMT Nice 2.6.4-rc1-mm1

I actually had the patch removed from mainline myself following criticism regarding throughput. I still argue that worrying about the last percentage points of throughput is not relevant if you break a mechanism as valuable as nice and scheduling policies, but I had lost the energy for defending it, which is why I pushed for it to be removed myself. Note that although throughput overall may be slightly decreased, the throughput of higher priority tasks is not only fairer with respect to low priority tasks, but enhanced because the low priority tasks will have less cache thrashing effects.

What this does is it examines all hyperthread "siblings" to see what is running on them, and then decides whether the currently running or next running task should actually have access to the sibling or allow the sibling to go idle completely, allowing a higher priority task to have the actual true core and all its execution units to itself. I'd been meaning to create an equivalent patch for BFS for the longest time but CPUs got faster, cheaper, more cores, I got lazy etc... though I recently found more enthusiasm for hacking.
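The sibling check can be sketched roughly as follows. This is a conceptual model in Python, not the code from the patch: the `Task` type, the single numeric priority scale and the function name are all hypothetical, and it uses a hard cutoff where the real patch appears to meter CPU time out by nice level and scheduling policy.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass
class Task:
    name: str
    priority: int  # hypothetical unified scale: lower value = more important


def may_run_on_sibling(candidate: Task,
                       sibling_tasks: Sequence[Optional[Task]]) -> bool:
    """Decide whether `candidate` may occupy its logical CPU.

    `sibling_tasks` holds whatever is running on the other hyperthread
    siblings of the same physical core (None for an idle sibling).
    Returns False if any sibling runs something more important, in which
    case this logical CPU should go idle so the more important task gets
    the whole core to itself.
    """
    for running in sibling_tasks:
        if running is not None and running.priority < candidate.priority:
            return False
    return True


foreground = Task("compile", priority=0)
background = Task("folding", priority=19)

print(may_run_on_sibling(background, [foreground]))  # sibling should idle
print(may_run_on_sibling(foreground, [background]))  # foreground may run
print(may_run_on_sibling(background, [None]))        # nothing to protect
```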

So here is a reincarnation of the SMT-nice concept for BFS, improved to work across multiple scheduling policies from realtime, iso down to idleprio, and I've made it a compile time option in case people feel they don't wish to sacrifice any throughput:

Patch for BFS449 with pending patches:
bfs449-smtnice-2.patch

And to make life easy, here's an all inclusive ck1+pending+smtnice patch:
3.15-ck1-smtnice2.patch

The TL;DR is: On Intel hyperthreaded CPUs, 'nice', realtime and sched idleprio work better, and background tasks interfere much less with the foreground tasks. Note: This patch does nothing if you don't have a hyperthreaded CPU.

If you wish to do testing to see how this works, try running with and without the patch and running two benchmarks concurrently, one at nice 0 and one at nice +19 (such as 'make -j2' on one kernel and 'nice -19 make -j2' on another kernel on a machine with 2 cores/4 threads) and compare times. Or run some jackd benchmarks of your choice to see what it takes to get xruns etc.
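If you'd rather script that comparison, something like the following Python helper would do. It is only a sketch: the function names are made up, and it simply runs two commands at the same time and records how long each one takes.

```python
import subprocess
import threading
import time


def run_timed(cmd, cwd, results, key):
    """Run cmd to completion and store its wall-clock duration in results[key]."""
    start = time.monotonic()
    subprocess.run(cmd, cwd=cwd, check=True)
    results[key] = time.monotonic() - start


def time_pair(foreground_cmd, background_cmd,
              foreground_cwd=None, background_cwd=None):
    """Start both commands concurrently and return their durations in seconds."""
    results = {}
    threads = [
        threading.Thread(target=run_timed,
                         args=(foreground_cmd, foreground_cwd,
                               results, "foreground")),
        threading.Thread(target=run_timed,
                         args=(background_cmd, background_cwd,
                               results, "background")),
    ]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return results
```

For the kernel build example you would call it along the lines of `time_pair(["make", "-j2"], ["nice", "-n", "19", "make", "-j2"], "/path/to/tree-a", "/path/to/tree-b")`, where the two paths are placeholders for your own kernel trees, once on a kernel with the patch and once without, and compare the foreground times.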

This patch will almost certainly make its way into the next BFS in some form.

EDIT: It seems people have missed the point of this patch. It improves the performance of foreground applications at the expense of background ones. So your desktop/gui/applications will remain fast even if you run folding@home, mprime, seti@home etc., but those background tasks will slow down more. If you don't want it doing that, disable it in your build config.

---
Enjoy!
お楽しみ下さい
-ck

Tuesday, 29 July 2014

Revisiting URW locks for BFS

A couple of years ago I experimented with upgradeable read-write locks to replace the current spinlock that protects all the data in the global runqueue in BFS. Here is the original description: upgradeable-rwlocks-and-bfs

One of the main concerns when using a single runqueue which BFS does, as opposed to multiple runqueues, as the mainline linux scheduler does, is lock contention, where multiple CPUs can be waiting on getting access to read and/or modify data protected by the lock protecting all the data on the single runqueue. This data is currently protected by a spinlock, but there is clear demarcation in areas of the scheduler where we're only interested in reading data and others where we have to write data, which is why I developed the URW locks in the first place.

A brief reminder of why I have developed upgradeable read write locks instead of using regular read-write locks is that if there are multiple readers holding an RW lock, write access is starved until they all release the lock. This situation is unacceptable for a scheduler where it is mandatory that writes take more precedence than reads and the upgradeable version has write precedence as a feature as well as being upgradeable, thus allowing us to hold exclusive write access only for as long as we need it. Their main disadvantage is they are much heavier in overhead since they're effectively using double locks hence they would need to be used only where there is a clear work case for them.
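For anyone who hasn't seen the idea before, here is a conceptual model of the semantics in Python. It is emphatically not the kernel implementation from the patch (that is C built on the kernel's own locking primitives); the class and method names are hypothetical and it only illustrates the three states described above: shared readers, a single upgradeable holder, and an exclusive writer that takes precedence over new readers.

```python
import threading


class UpgradeableRWLock:
    """Toy upgradeable read-write lock with write precedence."""

    def __init__(self):
        self._cond = threading.Condition()
        self._readers = 0               # current plain readers
        self._upgradeable_held = False  # at most one upgradeable holder
        self._writer_active = False     # holder has upgraded to write
        self._writers_waiting = 0       # upgrades waiting for readers to drain

    def read_lock(self):
        with self._cond:
            # Write precedence: new readers wait behind a pending upgrade.
            while self._writer_active or self._writers_waiting:
                self._cond.wait()
            self._readers += 1

    def read_unlock(self):
        with self._cond:
            self._readers -= 1
            self._cond.notify_all()

    def upgradeable_lock(self):
        """Take the upgradeable state; plain readers may still proceed."""
        with self._cond:
            while self._upgradeable_held:
                self._cond.wait()
            self._upgradeable_held = True

    def upgrade(self):
        """Caller must hold the upgradeable state. Gain exclusive access."""
        with self._cond:
            self._writers_waiting += 1
            while self._readers:
                self._cond.wait()
            self._writers_waiting -= 1
            self._writer_active = True

    def downgrade(self):
        """Drop back to the upgradeable state and let readers in again."""
        with self._cond:
            self._writer_active = False
            self._cond.notify_all()

    def upgradeable_unlock(self):
        with self._cond:
            self._writer_active = False
            self._upgradeable_held = False
            self._cond.notify_all()
```

The point of the middle state is that a path which usually only reads can take the upgradeable lock, look around, and upgrade to exclusive access only for the short stretch where it actually has to modify something.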

I've updated the urwlocks patch and posted it here:
urw-locks.patch
Note this patch by itself does nothing unless other code uses the locks.

In my original experiments I had done the conversion and performed various benchmarks on a quad core hyperthreaded CPU and had seen neither benefit nor detriment. The locks were originally developed with a view to expanding on the BFS design for scalability improvements primarily, and time and lack of success in deriving benefit from that development led to me shelving the project.

I now have access to a hex core hyperthreaded CPU which acts like 12 logical CPUs and felt it was time to revisit the idea. This time I rehashed the use of URW locks for the global runqueue and was even more aggressive in trying to spend as little time with the write lock held as possible. After extensive benchmarking, though, my conclusions were even worse than last time: not only did it not improve performance, it ever so slightly (but statistically significantly) worsened throughput. This suggests that whatever scalability improvements are there to be gained by decreasing lock contention are offset by the increased overhead of URW locks versus spinlocks.

The updated version of this patch that depends on the urw locks patch above (to be applied on top of BFS 449 and all pending patches) is here:
bfs449-grq_urwlocks.patch
 
It was interesting to note that, for whatever reason, the context switch rate actually decreased under load compared to regular BFS. This suggests the discrete read paths reduced the need for rescheduling, which could lead to a benefit if not for the increased locking overhead.


In some ways I'm not surprised as complexity and added infrastructure for the sake of apparent benefit generically does not guarantee improvement and simpler code, if not ideal in design, can work better in real world workloads. There are three discrete examples of this now in my experiments with BFS:

1. Single vs multiple runqueues.

The identical design in basic algorithm for BFS when applied to multiple runqueues versus the single runqueue shows no measurable increase in lock contention or even cache thrashing on any regularly available hardware (confirmed on up to dual 12 threaded CPU machines).

2. Spinlocks vs rwlocks.

As per this post.

3. O(n) vs O(ln(n)) lookup.

Any improvement in the lookup time of many tasks is offset by the added complexity of insertion and that lookup and what the real world sizes of n will be. The limiting behaviour of the function describes how it changes with n only; it does not tell you what the absolute overhead equates to in real world workloads.

-ck

Friday, 4 July 2014

BFS 0.449

Hot on the heels of the BFS448 release, I was doing some experimenting for some ideas I had (nothing productive so far) when I discovered the long-standing "CPU locality" code which determines the relationship between CPUs (eg. if they're SMT siblings or separate physical CPUs etc) was broken. So I've fixed the code that determines that, along with printing out what BFS believes to be the relationship (called locality) in dmesg on startup. An example output from a 2 thread, 2 core CPU would be:

[ 0.100217] LOCALITY CPU 0 to 1: 1
[ 0.100220] LOCALITY CPU 0 to 2: 2
[ 0.100221] LOCALITY CPU 0 to 3: 2
[ 0.100222] LOCALITY CPU 1 to 2: 2
[ 0.100223] LOCALITY CPU 1 to 3: 2
[ 0.100224] LOCALITY CPU 2 to 3: 1
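If you want to check what BFS has detected on your own machine, a short Python sketch like this can pull those lines out of dmesg output and list each CPU's nearest neighbours. The helper names are made up, and the only assumption about the numbers is that a smaller value means a closer relationship, which is what the example above suggests.

```python
import re
from collections import defaultdict

# Matches lines such as "[ 0.100217] LOCALITY CPU 0 to 1: 1".
LOCALITY_RE = re.compile(r"LOCALITY CPU (\d+) to (\d+): (\d+)")

SAMPLE = """\
[ 0.100217] LOCALITY CPU 0 to 1: 1
[ 0.100220] LOCALITY CPU 0 to 2: 2
[ 0.100221] LOCALITY CPU 0 to 3: 2
[ 0.100222] LOCALITY CPU 1 to 2: 2
[ 0.100223] LOCALITY CPU 1 to 3: 2
[ 0.100224] LOCALITY CPU 2 to 3: 1
"""


def parse_locality(dmesg_text):
    """Return {(cpu_a, cpu_b): locality} for every LOCALITY line found."""
    pairs = {}
    for match in LOCALITY_RE.finditer(dmesg_text):
        cpu_a, cpu_b, locality = map(int, match.groups())
        pairs[(cpu_a, cpu_b)] = locality
    return pairs


def closest_peers(pairs):
    """For each CPU, list the CPUs reported at its smallest locality value."""
    by_cpu = defaultdict(dict)
    for (cpu_a, cpu_b), locality in pairs.items():
        by_cpu[cpu_a][cpu_b] = locality
        by_cpu[cpu_b][cpu_a] = locality
    result = {}
    for cpu, others in by_cpu.items():
        nearest = min(others.values())
        result[cpu] = sorted(o for o, loc in others.items() if loc == nearest)
    return result


for cpu, peers in sorted(closest_peers(parse_locality(SAMPLE)).items()):
    print(f"CPU {cpu}: closest to {peers}")
```

On the sample above this pairs CPU 0 with CPU 1 and CPU 2 with CPU 3, matching the two-threads-per-core layout.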


I've also added the namespace fix as posted here by Bogdan Trach (Thanks!). Diff from BFS 0.448 and full patch here:

BFS 3.15 patches

The changes in this patch may improve CPU throughput and decrease latency under certain circumstances but no benchmarking so far has shown any statistically significant difference.

Enjoy!

Thursday, 3 July 2014

BFS 0.448, 3.15-ck1

Announcing a resync and update of BFS for linux kernel 3.15.x. I'm currently on vacation but fortunately had enough downtime to hack this together in the evenings and pinged a few people to do some testing for me before releasing it since I only have my laptop with me and could not do the usual set of build and run tests on multiple configurations (thanks!).

This is basically a resync of the last BFS along with trivial changes to stay in sync with the mainline kernel, along with some of the queued build fixes submitted by others on this blog (thanks!). Alas the users of ath9k with Tux On Ice that I pinged early on with a test patch have shown the same issue exists (which is not surprising since BFS has only been trivially changed in quite a few releases now) so I'm pretty sure whatever the interaction is was introduced somewhere between 3.13 and 3.14.

I have reviewed Alfred Chen's patches and for the time being have not included them in BFS, though I do like the direction his changes have taken. The first patch sets a flag that isn't used by BFS so it was not necessary. The other changes to resched_best_mask are sound and the only thing they're missing is an equivalent optimisation for compiled in support for MC and SMT schedulers on hardware that doesn't have one and/or the other.

So here it is:

BFS by itself:
3.15-sched-bfs-448.patch

3.15-ck1 patchset directory:
3.15-ck1

Enjoy!
お楽しみください

Tuesday, 6 May 2014

BFS 0.447, 3.14-ck1

Announcing a resync and update of BFS for linux kernel 3.14.x:

This is mainly a resync from BFS 0.446, but with the addition of the patches as offered by the generous users as seen in the comments here, Alfred Chen and Oleksandr Natalenko. The changes are to fix a circular locking issue on bootup that rarely hit some people, a fix for kvm soft lockups in SMP mode, and to remove some config options that should not be used with BFS.

What's interesting about working on this latest BFS is that I ran into all sorts of instability due to the new kernel that ironically worked out to be a very serious bug in 3.14.0 and was fixed in 3.14.1 with this patch:

commit 8e58cd80d042569da7af501de897c5e0538d99b0
futex: avoid race between requeue and wake

As is often the case, BFS is exceptional at bringing out race conditions and my machine was almost unusable with any significantly multithreaded application such as firefox which kept hanging. This was a scenario where my delay at syncing up the code worked to my advantage as 3.14.2 is working fine.

So here it is:

BFS by itself:
3.14-sched-bfs-447.patch

CK branded BFS:
3.14-ck1

Somehow I still forgot to include PF's patch for uniprocessor builds, though it's so uncommon to come across a uniprocessor these days! His patch is still valid and can be grabbed here to be applied on top if you need it:

0001-ck-3.12-fix-BFS-compiling-with-CONFIG_SMP-n.patch


Enjoy!
お楽しみください

Monday, 3 March 2014

BFS 0.446, 3.13-ck1

Announcing a resync and update of BFS for linux kernel 3.13.x:

Apart from build fixes and synchronisation with new kernel changes, this is only trivially different to BFS 444. A build failure on 445, along with a desire to release only even numbers, prompted version 446.

BFS by itself:
3.13-sched-bfs-446.patch

CK branded BFS:
3.13-ck1


Apologies for the delay, but I was simply swamped with my other projects, interests and work.


Enjoy!
お楽しみください

Tuesday, 3 December 2013

3.12-ck2, BFS 0.444

Here is an updated BFS patch, version 0.444:

3.12-sched-bfs-444.patch

And an updated ck tagged 3.12-ck2 patch:

3.12-ck2

The changes in this release, compared to version 0.443 and ck1, are the 2 extra patches I posted in my last announcement, which were designed to address various suspend to ram/disk and resume problems as discussed in previous posts. Thanks to the various people who posted bug reports and tested experimental patches along the way.

Being an even number, this is clearly a more stable patch than the last one ;)

Enjoy!
お楽しみください

Monday, 2 December 2013

Suspend fixes for BFS 443

Investigating the hibernate/suspend/resume problems I discovered a fix that is necessary in BFS443. These changes have nothing to do with the experimental patches I've posted so if you have applied any of those you will have to roll back to BFS443 or 3.12-ck1 to apply this patch since it affects the same code. Hopefully these changes actually address the suspend issues people are having and the workarounds in the experimental patches won't be required - note that you will not be able to apply any of the experimental patches on top of this:

EDIT: Added an updated bindzero patch; apply in this order:
01 bfs443-suspend-fixes.patch
02 bfs443-bindzero.patch

Those with hibernate/resume issues please report back so I know what more needs to be done on top of this code.

Friday, 29 November 2013

New experimental hibernate patch for BFS443

Here's a new experimental patch for BFS443 for those having suspend/resume issues. Please try it out on top of BFS443 or 3.12-ck1 (discard the old experimental patch). This builds on the idea users submitted for affining tasks to CPU0 as CPUs go offline.

bfs443-hibernate_test2.patch

Tuesday, 26 November 2013

Experimental hibernation patch for BFS 443

In response to the numerous reports of problems with hibernate AKA suspend to disk, here is a purely experimental patch to apply to bfs443/3.12-ck1 to attempt to address the problem.

bfs443-hibernate_test.patch

Please test and report back!