Announcing a resync and update of BFS for linux kernel 3.14.x:
This is mainly a resync from BFS 0.446, but with the addition of the patches as offered by the generous users as seen in the comments here, Alfred Chen and Oleksandr Natalenko. The changes are to fix a circular locking issue on bootup that rarely hit some people, a fix for kvm soft lockups in SMP mode, and to remove some config options that should not be used with BFS.
What's interesting about working on this latest BFS is that I ran into all sorts of instability due to the new kernel that ironically worked out to be a very serious bug in 3.14.0 and was fixed in 3.14.1 with this patch:
commit 8e58cd80d042569da7af501de897c5e0538d99b0 futex: avoid race between requeue and wake
As is often the case, BFS is exceptional at bringing out race conditions and my machine was almost unusable with any significantly multithreaded application such as firefox which kept hanging. This was a scenario where my delay at syncing up the code worked to my advantage as 3.14.2 is working fine.
So here it is:
Somehow I still forgot to include PF's patch for uniprocessor builds, though it's so uncommon to come across a uniprocessor these days! His patch is still valid and can be grabbed here to be applied on top if you need it:
Enjoy!
Please enjoy!
another reason why the kernel devs & Linus could give BFS a run from time to time :D
thanks Con !
your work and that of the other devs (Alfred Chen and Oleksandr Natalenko) is very much appreciated =)
So that's what my Firefox hanging was!
And I think that's the reason why I'm bumping my kernel version since the last major hang 20 minutes ago.
Nice, thanks. Will test it ASAP.
Would like to know if my machine hangs again after hibernation. Will report anyway.
It got stuck with ksoftirqd/0 consuming 100% of CPU. But I suspect it's not BFS's fault but TOI's. Will test more.
For me it also takes a long time when resuming from suspend-to-disk, but it works. Without CK/BFS it wouldn't take much more time (though, I haven't measured the diff). I don't use TOI. The most prominent culprit might be Firefox that consumes and hogs RAM more than ever. I actually use the ESR 24.5.0, but the previous ones were remarkably more calm in these terms. And another one, but it's history now, the first 3.12 kernels never acted that slow.
Regards, and -- my many thanks for the new BFS/CK to Con himself ! -- Manuel
Hi Con,
First of all, thanks for all your hard work in bfs and ck-patchset. I've been using it for years and it's really beneficial with all the realtime audio applications I'm using.
However, ever since 3.13 (and now with 3.14) kernel with ck patches I've had an issue where my system hangs after waking up from suspend/hibernation. It seems like the system isn't completely frozen, as it tries to connect to wireless network (although it fails to do so) according to gnome's network indicator, but keyboard, mouse and everything else is completely unresponsive.
I'd like to see this issue resolved, but I really don't know where to look for clues, so I was hoping someone could point me to right direction :)
I'm using graysky's prebuilt ck-kernel packages for arch linux, which also includes bfq io-scheduler on Asus X501A laptop. I can easily rebuild the kernel without bfq or change the kernel config if that's something I should do.
Thanks in advance,
Are you 100% sure it's Con's patches causing this? I've been having an issue only on my Sandy Bridge laptop (Thinkpad T520). Same symptoms coming out of suspend, and I also get the same kind of freeze about once a week if I just let it sit. Switching to a vanilla kernel didn't help.
Note, the suspend crash sometimes happens after a minute or two of normal responsiveness, so it can connect to wifi sometimes before everything freezes. Screen stays up, can't move mouse, wifi light stays flashing, nothing in logs upon reboot (I'm assuming it can't write).
I intend to try the 3.10-x vanilla kernel to see if I still have issues. My nearly identical setup arrandale-based Thinkpad X201 has never had any of the crash/suspend issues.
I'm also having trouble with this. After resuming either from S3 or S4, a process named ksoftirqd/0 apparently gets overwhelmed with IRQs, using up one core completely. Apparently it's some issue with the wifi card since the problem disappears when this device is disabled. Also, when the problem does actually occur, it seems some modules are incorrectly reactivated. Every time a process or myself try to write anything to disk, the computer freezes.
The problem is not present in the vanilla kernel, where I can suspend/hibernate and then resume normally.
Yes, this is quite similar to what I saw.
Sorry I actually had to reinstall my system for reasons unrelated to ck/bfs, and didn't have time to investigate this further until now.
Looks like my suspend issues are caused by the ath9k wireless module. Running 3.14.4-ck1+bfq+arch-linux-patches. If I modprobe -r ath9k, it looks like I can suspend and resume without any issues. If I modprobe ath9k after resuming from suspend, my system freezes again, just as it does when resuming with the ath9k module loaded.
Can confirm the problem with hibernation and BFS on ath9k. Suspend is working fine, but hibernation leads to a CPU soft hang, and further work is nearly impossible. Switching the scheduler to CFS resolves the problem. Tried it with different kernels (plain with ck, zen with additions), always the same problem. The ksoftirqd process gets stuck.
Con, if I could help to resolve the problem, let me know.
Thanks.
cu sysitos
Really, I'm glad to find out I'm not alone. Probably things are broken in the ath9k driver and not in BFS.
But.
➜ pf-kernel git:(pf-3.14) git shortlog --no-merges v3.13..v3.14 | grep ath9 | wc -l
135
135 commits related to ath9k in between 3.13 and 3.14. Something definitely could be broken.
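The count above comes from piping `git shortlog` through `grep` and `wc -l`. As a rough sketch of the same idea, the Python below counts commit subject lines in shortlog-style text that mention a keyword. The function name and the sample text are made up for illustration; they are not real kernel commits. Unlike the plain `grep`, this version skips author header lines, so its result could differ slightly from the shell pipeline.

```python
def count_matching_commits(shortlog_text: str, keyword: str) -> int:
    """Count commit subject lines in `git shortlog` output containing keyword.

    In shortlog output, author headers start at column 0 and commit
    subjects are indented beneath them.
    """
    count = 0
    for line in shortlog_text.splitlines():
        is_subject = line.startswith((" ", "\t"))
        if is_subject and keyword in line:
            count += 1
    return count


# Hypothetical sample in shortlog layout (invented authors and subjects).
SAMPLE = """\
Alice Example (2):
      ath9k: made-up subject one
      mac80211: unrelated made-up subject

Bob Example (1):
      ath9k_htc: made-up subject two
"""

if __name__ == "__main__":
    print(count_matching_commits(SAMPLE, "ath9"))
```

To run it against a real tree, the text could be obtained from `git shortlog --no-merges v3.13..v3.14` and passed in as a string.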
Thanks. If I had time I'd scour through those commits to see if there's something that gives a hint as to where the problem lies.
Yes, it's definitely some problem in the interaction between ath9k and BFS. With the ath9k module blacklisted, waking up after hibernation works fine. Loading the ath9k module afterwards starts the CPU soft hang.
Btw, I replaced the Intel WLAN mPCI card in my DELL Vostro with the Atheros one to resolve the lousy throughput and dropouts in the WLAN connection ;)
CU sysitos
Thank you again for all your work
Thanks for the new release. I compiled a new kernel earlier this evening: Gentoo sources 3.14.3 with the CK patchset, BFQ, and UKSM patches. I decided to update my Mesa installation, and the system locked up while LTO linking Mesa. I'm on an older, single-core system, and I didn't have many applications open. Basically just Konsole with GCC running, and the system monitor open showing LTO using 43% of the CPU. So something is still causing system hangs.
Thanks Con, 447 working fine here on 3.14.3. So far no issues.
On an unrelated note to those using the nvidia binary driver: I had to add the following patch to make 334.21 compile for 3.14.3: http://pastebin.com/UgnyrrH5
I'm also getting system hangs on BFS 447 when using G+ Hangouts video calls and just doing regular web surfing in the background. It may not be the only hang scenario either.
Well, caught a similar hang with no visible reason :(. No logs available, just a frozen machine.
If you don't mind, I'd like to ask another more-or-less OT question on here:
Has any of you found a setting to ease resuming from swap/hibernation?
Are there kernel config options that are critically involved?
My original problem was that after resuming from hibernation it took about 10+ minutes to get a firefox with "too" many open tabs (100+) being actively usable again. And as a nasty side note: this is longer than a freshly started FF needs to read them from the web.
By coincidence I've found something that was able to reduce this time to approx. 5 minutes: setting /proc/sys/vm/page-cluster to 5 (default is 3 on here), which is said to increase the readahead from swap logarithmically (the value is a power-of-two exponent).
From 10+ down to 5 minutes is a good result. Still, I hadn't faced this issue until the mid 3.12.x kernels, though the cause could also be different memory or online/offline management in the Firefox releases in the meantime.
Any ideas, anyone? Thank you for sharing them,
Manuel
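For readers wondering what the page-cluster change in the comment above amounts to: the kernel treats the value as a base-2 exponent, so up to 2^page-cluster consecutive pages are read from swap in one attempt. A minimal sketch of that arithmetic follows; the 4 KiB page size is an assumption (typical for x86) and the helper names are invented for illustration.

```python
PAGE_SIZE_KIB = 4  # assumption: 4 KiB pages, as is typical on x86


def swap_readahead_pages(page_cluster: int) -> int:
    """Maximum pages read per swap-in attempt: 2 ** page_cluster."""
    if page_cluster < 0:
        raise ValueError("page-cluster must be non-negative")
    return 1 << page_cluster


def swap_readahead_kib(page_cluster: int, page_size_kib: int = PAGE_SIZE_KIB) -> int:
    """The same limit expressed in KiB for the assumed page size."""
    return swap_readahead_pages(page_cluster) * page_size_kib


if __name__ == "__main__":
    for value in (3, 5):
        pages = swap_readahead_pages(value)
        kib = swap_readahead_kib(value)
        print(f"page-cluster={value}: up to {pages} pages ({kib} KiB)")
```

Under these assumptions, going from 3 to 5 raises the limit from 8 to 32 pages per attempt, a fourfold increase. Whether that helps in practice depends on how contiguously the data was written to swap.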
I'm so sorry for having bothered you with this one.
Now I've tried tuxonice as a kind of "workaround" -- but with the result that it's working much much better and faster than the default kernel hibernate/ suspend-to-disk + resume.
I don't know what the kernel does wrong, that after resuming from disk firefox needs ages to, presumably, (re-)read all its allocated memory, causing it to be unresponsive until its swapin I/O is done.
With tuxonice firefox is responsive almost at once after resume, so that I can recommend it without any doubt: It's, in fact due to its speed, even a real alternative to sleep/ suspend-to-ram in my opinion.
tuxonice currently works on here on top of ck and bfq-v7r4 with 3.14.4 vanilla.
Regards, Manuel
Hello all,
Using the ck kernel from the repo-ck repository under Arch Linux, I noticed my wireless connection keeps on crashing every few hours.
At times, it will resume on its own after a few seconds and at times a restart is needed.
This does not happen on regular Arch kernel.
dmesg logs and a more comprehensive description can be found on the following Arch forums post:
https://bbs.archlinux.org/viewtopic.php?pid=1415693
Thanks, Adam
Greetings.
I gave BFS another try (had to abandon it due to it causing deadlocks in FFmpeg during movies transcoding), but this time, BFS causes Blender (tested with v2.68 and v2.70) to crash while exporting DAE meshes...
Back to the vanilla kernel for me !
Please give us more information about your running kernel version. Maybe, even post your .config to a hoster for later review.
I assume, you've already checked, that an earlier revision of ffmpeg does NOT bother you?!
Manuel
I am running kernel v3.14, obviously... v3.14.4 to be precise. The kernel config is pretty irrelevant (same issue on 3 different computers with different hardware/config).
The problem I was reporting here was about Blender.
I reported on this blog last year about encoding issues (was with mencoder, but also got the same kind of issue (deadlocks) with FFmpeg: and yes, I did try with several versions: IIRC with 0.99.8, 1.0.7 and v1.2.1).
My guess is that there's a race condition somewhere in the BFS-patched kernel (i.e. it could be a problem with a kernel driver, for example... with ext4fs perhaps ?...).
Just for the crack of it I fired up Blender and exported some small scene to DAE without any problem. (Gosh, they make even the export dialog unbelievably complicated, not to mention the rest of the program. No wonder I can't use it :p)
Anyway, that result was to be expected since application programs do not interact with CPU schedulers. If anything, BFS exposes races in OP's video driver, but they shouldn't be unique to Blender.
Keeping saying that the problems with BFS are just the result of it "exposing race conditions in other software" is not very constructive, especially when BFS renders the system unusable because of such problems.
I'd expect more serious investigation of those problems by BFS' author, instead of systematically calling the fault on others' work without any proof of such faults.
Each time I have been testing a BFS-enabled kernel, I ran into issues within hours or at most a couple of days: it should not be hard to reproduce these issues...
In these conditions, it's not a big surprise that BFS didn't yet make its way (even as a "staging" feature) in the official Linux kernel...
Thanks for your carefully thought out comments. I picked one bug in the last release which was exposed by BFS. I made absolutely no such claims with any of the other problems people here are having, but I know how easily people can just lash out given my history and then say things like "systematically calling the fault on others' work"... Look hard and you will find no such systematic trashing here. I continue to maintain BFS in my own time as purely a fun project that some people find useful. If you expect more serious investigation I suggest you look to enterprise supported projects, not some random bit of code an anaesthetist hacked together in his spare time.
All I can say is that I tried to reproduce anonymous' alleged issue but could not. Since he doesn't provide any helpful information like stack traces, and nobody else has ever observed this alleged flaky behaviour, I can only assume issues specific to his installations.
> In these conditions, it's not a big surprise that BFS didn't yet make its way (even as a "staging" feature) in the official Linux kernel...
Because people still keep bringing this up: it never will. BFS doesn't scale too well on systems with dozens, let alone hundreds of cores, and upstream isn't interested in maintaining multiple schedulers (same as why BFQ isn't mainlined - they want its features merged with CFQ, not put by its side).
This is simply for maintenance and usability reasons.
Hi I'm trying to run the CK1 patchset on vanilla 3.14.4 Arch distro on a Samsung NP535 laptop (AMD Family 15h/Piledriver/BDVER2), but although the initramfs works just fine, I experience a race condition and lockup at login to TTY1.
I've both compiled the kernel locally and used the precompiled packages in Graysky's repo-ck repo, using both generic and CPU specific kernel configurations and also tried the 3.13.11 kernel. I posted more details on the Arch forum here.
Any ideas or what further info would be required?
OK, the problem turns out to be a conflict with my wireless kernel module, specifically ath9k. After blacklisting it, everything CK/BFS/BFQ flavored runs just fine (but I'm unwirelessed, so not so fine).
Hi, I used my notebook for the last two days with 3.14.4 and 0447 and noticed a huge regression. The visible impact is that the intel wifi driver crashes after being active for a while, and error messages flood dmesg. I rolled back to 3.14.3 with 0447 and it became a little better: it lasts longer but still crashes. Then I rolled back to 3.14.2 with *0446* (which I ported in the 0446 thread), and the system became stable; the intel wifi driver held up long enough to let me finish 2 TV plays.
I checked the change list from 3.14.2 to 3.14.4. From 3.14.2 to 3.14.3, I changed from my 0446 bfs port to 0447, and there are intel driver file modifications from 3.14.3 to 3.14.4. So I have a suspicion that 0447 may have some issue and the 3.14.4 intel driver changes make it worse.
For further verification, I will rebase my *0446* port to 3.14.3 and 3.14.4 and test it tonight, will post back.
PS, another machine I used with 3.14.4 and 0447 is running fine.
@Alfred, I wonder if 3.14.5 will play better: some timer tick patches in the stable-queue actually are.
It all runs smoothly on my old Apple mini Core 2 Duo. Many thanks to the whole supporter team of bfs!
Greetings from suddenly hot summer Hamburg,
Ralph Ulrich
Hi, here comes the test result last night.
#1 kernel 3.14.2 with my 0446 bfs port had an intel wifi driver crash while I was compiling a new kernel. It is identified as issue https://bugzilla.redhat.com/show_bug.cgi?id=1046495 , which is fixed in the 3.14.4 mainline kernel.
#2 kernel 3.14.4 with my 0446 bfs port, 4 test runs: reboot machine, active wifi and TV plays > 30mins. All passed.
So if you want to help test and see if it solves your bfs issue, please check my 0446 bfs port for 3.14.4 at https://bitbucket.org/alfredchen/linux-gc/commits/d463c14ca74aa93049c7135bbc6bfa7ef7201cfe/raw/ . PS: it's not a patch on top of bfs 0447, it is a replacement patch.
Or, to keep it simple, you can download the kernel source code from https://bitbucket.org/alfredchen/linux-gc/get/linux-3.14.y-gc-test.tar.bz2 and test with your kernel config.
I will look into the delta between my 0446 port and 0447, and try to make a patch on top of bfs 0447.
Thanks Alfred, keep us informed of what you find.
I checked the delta over the weekend: there is just a very minor difference in the scheduler get/set attribute syscall functions, and I don't think it is likely to cause hang/crash issues.
So I recompiled 3.14.4 with 0447 bfs and tested again on my notebook. It turns out that the system is stable as expected. My best guess is that I must have installed a wrong kernel image version to the boot partition last week as 3.14.4; I have overwritten it during testing and can't check which version it actually was.
In sum, my issue is caused by an intel iwlwifi driver bug that has been fixed in 3.14.4 mainline. It is *NOT BFS RELATED*, but it seems it got worse in 3.14.3 than it was in 3.14.2, which made me guess it was related to bfs because I upgraded bfs from 0446 to 0447 at that time too.
I forgot that things could be this speedy. Good work.
The 3.14 ck1 kernel crashes all programs running in Wine, or any task that needs more processor power, such as compiling programs or drivers, video conversion, and 3D rendering. The kernel without BFS usually works. With the pf kernel the symptom is the same: it crashes with Wine or similar tasks.
Here is my attempt to port BFS to 3.15 kernel: https://gist.github.com/921853bb3e926e3fe5d1
git tree: https://github.com/pfactum/pf-kernel/commits/pf-3.15
Boots OK in QEMU, will test on real hardware a little bit later.
Ah crap here we go again :P
Feel free to make my crappy port less crappy :D.
LOL I wasn't complaining about your port, just that I have to resync again.
Don't forget about my small extra patches, please.
Please point them out so I can consciously forget them again.
One option is to keep the patch in sync with linux-next
I guess the only patch left unmerged is the uniprocessor-related fix here: https://gist.githubusercontent.com/pfactum/9332896/raw/0001-ck-3.12-fix-BFS-compiling-with-CONFIG_SMP-n.patch
It seems that you've already merged other patches into 447.
My port of .447 to 3.15 is done. There are the "usual scheduler improvements" in the core.c file; I also synced up these changes in bfs.c.
There are 3 additional patches I recently wrote based on .447
#1 [BFS] Add missing attr.sched_flags for sched_getattr.
Which I found the minor delta when debugging my issue in .447 on 3.14.
https://bitbucket.org/alfredchen/linux-gc/commits/7c64a1257978efef73271a368ae3a69f4f6a1c51
#2 [BFS] Refine locality to ranking and siblings/cache idle code.
One thing I am trying to make some bfs improvement.
https://bitbucket.org/alfredchen/linux-gc/commits/d38a9fd9f97b953ab8a00bc574a9dd23c303300d
#3 [BFS] locality doesn't need to be kmalloc.
An other thing.
https://bitbucket.org/alfredchen/linux-gc/commits/acf66e57edd858a11742a7305b51aaa2b0e9b61c
Ports and patches are tested in my working machine, no noticeable regression found by compiling the kernels. Suspend/Resume works.
But there is a huge delay before the system goes to suspend for the first time (caused by the changes?), and I am still trying to figure out whether it is a HW setup/kernel version related issue or not.
Could you please post full patch?
OK, got it from your git tree.
I'd suggest you to merge two extra patches: 1) SMP=n fix (mentioned above) 2) this one: https://github.com/pfactum/pf-kernel/commit/74cdef988172bd09abe664323a0890cf417eabba
My git tree is: https://bitbucket.org/alfredchen/linux-gc/commits/branch/linux-3.15.y-gc
You can pick up the commits you are interested in.
@post-factum
1) patch is merged. 2) Thanks for pointing it out. I can't find a raw format of your commit on github, so I simply changed the code and committed it in my git.
PS, Happy World Cup time! :)
@Alfred Chen:
I really like your repositories, as they easily provide several more useful and sensible extra patches in one source; this is already meant for 3.14.y-gc -- My assumption is that it's mostly your own patches (meaning not the well known CK/BFS & BFQ) that stabilize my system, especially when dealing with hibernation/resume.
Thank you very much for your work !!!
Manuel
@Manuel
Thanks for checking my git. Like the pf-kernel, I just pull some well-known non-mainlined patches (bfs, bfq, phc, compiler options) into my git, and I wrote a few small patches to improve boot-up time for my machines and, recently, to try to improve bfs. I am not sure whether my own patches contribute to stabilizing your system, as I don't have a hibernation setup. It may also come from linux-stable kernel tree updates; I used to keep my -gc branch synced up with the linux-stable tree every 1 or 2 minor releases before the next stable kernel release.
Anyway, I am glad to know it helps you. Thank you.
Also here is proposed fix for ARM platform from Ivan Shapovalov:
https://github.com/pfactum/pf-kernel/commit/80f5dbe7c76aef3c7cb05c381c803eee8024f6b9
An update regarding ath9k issue: http://lists.tuxonice.net/pipermail/tuxonice-devel/2014-June/007501.html
Another small fix from me to avoid a compile error with CONFIG_DEBUG_ATOMIC_SLEEP=y
https://github.com/pfactum/pf-kernel/commit/b5e2f75f42061a12ab074b5ed87c322c553ad9c1
One more update about deadlocks:
http://lists.tuxonice.net/pipermail/tuxonice-devel/2014-June/007502.html
@ ac + pf
thx for your work
It would be great to have BFS, fixes and patches contributed/suggested by AC and PF in a single patchset. Btw, I'd like to hear Con's thought on AC's patches.
You can have a look at post-factum's kernel here: https://pf.natalenko.name/
and see a summary of what's included. There's also a link to his related forum with announcements upon new releases, reports, etc. To this -pf patch, you can add the small number of specialized patches from Alfred Chen's repo.
As Alfred Chen's repository clearly shows the entries (that can be reviewed and downloaded separately) and he omits UKSM, which does break my hibernate somehow, I like his approach more. (I'm also unable to save separate commits from post-factum's github.)
That CK/BFS, BFQ, and TuxOnIce, too, are essentially useful for a responsive Linux system, should already be known.
Best regards, Manuel
Thanks for your reply, though I'm more interested in a BFS-only patchset, with the addition of those fixes Con often forgets to include :-) and AC's improvements.
+1
Thanks, for providing this one.
But, please, can you transparently name _all_ particular patches _in_more_detail_ that went into your compilation?! Perhaps even on the github front page?
I won't patch my kernel unless I know the source is trustworthy and this compilation patch may eventually conflict with already applied ones.
Manuel
OK Manuel, go check Apollinaris again, better readme, patches broken out in a subdirectory
Tony
@Tony /apollinaris:
Thank you very much for your additional work! Now it's fine and more clear to all of us what you mean!
Manuel
Can I extract the ck1 patch from AC's github so I can apply it to the vanilla upstream 3.15?
Sorry, currently travelling...
BTW, don't forget to also check the new BFQ release v7r5:
[ANNOUNCE] BFQ-v7r5 for 3.13.0-3.15.0: https://groups.google.com/forum/?fromgroups=#!topic/bfq-iosched/VT96u5pbDLo
[ANNOUNCE] BFQ-v7r5 for 3.0.0-3.12.0, plus 3.10.8+: https://groups.google.com/forum/?fromgroups=#!topic/bfq-iosched/n_CqETwVl9w
Patches per kernel as usual in:
http://algo.ing.unimo.it/people/paolo/disk_sched/patches/?C=M;O=D
Have fun, Manuel
I noticed that most of the time tasks take a little while before they run; anyway, it seems faster for desktop tasks than the standard scheduler. Patched kernel
It's too sad that I'm not able to track down why firefox' CPU usage triples within approx. 10h of its uptime. Of course, I can eliminate even more open tabs or disable addons like ABP. But doing this would falsify the result.
Maybe, this has nothing to do with ck/BFS at all,
Manuel Krause
Hi
Too bad! This patch doesn't seem to work any more since kernel 3.14.56
distro debian linux Wheezy x86_64 :
kernel/sched/bfs.c: In function ‘_cond_resched’:
kernel/sched/bfs.c:4604:2: error: too few arguments to function ‘should_resched’
In file included from include/linux/preempt.h:20:0,
from include/linux/spinlock.h:50,
from include/linux/mmzone.h:7,
from include/linux/gfp.h:5,
from include/linux/mm.h:9,
from kernel/sched/bfs.c:31:
/usr/src/linux-3.14.56-ck1/arch/x86/include/asm/preempt.h:108:29: note: declared here
kernel/sched/bfs.c: In function ‘__cond_resched_lock’:
kernel/sched/bfs.c:4622:2: error: too few arguments to function ‘should_resched’
In file included from include/linux/preempt.h:20:0,
from include/linux/spinlock.h:50,
from include/linux/mmzone.h:7,
from include/linux/gfp.h:5,
from include/linux/mm.h:9,
from kernel/sched/bfs.c:31:
/usr/src/linux-3.14.56-ck1/arch/x86/include/asm/preempt.h:108:29: note: declared here
kernel/sched/bfs.c: In function ‘__cond_resched_softirq’:
kernel/sched/bfs.c:4644:2: error: too few arguments to function ‘should_resched’
In file included from include/linux/preempt.h:20:0,
from include/linux/spinlock.h:50,
from include/linux/mmzone.h:7,
from include/linux/gfp.h:5,
from include/linux/mm.h:9,
from kernel/sched/bfs.c:31:
/usr/src/linux-3.14.56-ck1/arch/x86/include/asm/preempt.h:108:29: note: declared here
make[3]: *** [kernel/sched/bfs.o] Erreur 1
make[2]: *** [kernel/sched] Erreur 2
make[1]: *** [kernel] Erreur 2
make[1]: quittant le répertoire « /usr/src/linux-3.14.56-ck1
if someone has an idea...
Regards
This patch will fix your BFS build: https://github.com/semplice/linux/blob/master/debian/patches/semplice/features/all/bfs/bfs-009-add-preempt_offset-argument-to-should_resched().patch
It helped me just now with my 3.16.54
Also apply this: https://github.com/zen-kernel/zen-kernel/issues/24