A couple of issues showed up with BFS 0.422, one being the "0 load" bug and the other being a build issue on non-hotplug releases. So here is BFS 0.423 and 3.4-ck2 (which is just ck1 with the BFS update) which should fix those:
3.4-sched-bfs-423.patch
3.4-ck2/
and the increment only:
3.4bfs422-423.patch
Enjoy!
Please enjoy!
Good experience with linux-3.4.2-bfs-423.
No slowdown after some hours, as there was with linux-3.3.
I suspect the earlier sys-time bug had some side effects, but that is just my FUD ...
Con Kolivas, thank you for your work!
Ralph Ulrich
Thanks Con - you are really quick :)
fanthom
@Ralph:
What did you mean by '-12queuePatches' in "linux-3.4.2-12queuePatches-bfs423-full-ck2" in the other blog thread?! Something that slipped under my radar?
Just gonna reboot now with 3.4.2+bfs-423+BFQ+mm-drop_swap_cache_aggressively.patch.
Thx,
Manuel
Hi Manuel,
I don't use BFQ, but I run my system with
linux-3.4.2
12 patches from stable-queue
BFS-423
all other ck2 patches
"12 patches from stable-queue"
O.K. Your distro is obviously different from openSUSE. Also, openSUSE patches in things that I'm not aware of anyway.
Let's give my new kernel some uptime.
Greets,
and many thanks to Con Kolivas for providing us with the fruits of his work
Manuel
BTW, thinking about what Ralph Ulrich wrote in the other Blog thread, he felt the need to learn benchmarking...
Do we have any tool to really benchmark "interactivity" on Linux desktop systems?
Or do we, up to today, only have reports that things got better or worse?
Manuel
@ Con Kolivas:
A big THANK YOU for this BFS-423 patch. It really makes a difference. I have now had it running for almost 22h and there is no slowdown like the one noticed with previous kernel & BFS combos (also mentioned by Ralph).
In addition, the time the desktop needs to recover content that was presumably swapped out, after using a possibly swapped-out shmfs, is greatly reduced here now.
It also looks like the base CPU load of the usually running processes has dropped here.
I don't have any insight into how this relates to your patch changes since 422, but: very nice experience, indeed.
Again, many thanks for your work!!!
Manuel
@ Con Kolivas:
NO, nothing is o.k. at all.
I again got a complete system failure (like with 3.4.2 plus BFS 422), just a few minutes after my last posting, while only watching something from disk via vlc.
There's nothing in the logs; the machine hung completely, like last time.
SLUB? SLAB?
Please, inspect the differences in the transition in detail.
In the meantime, I'll compile on a fresh install and come back.
Manuel Krause
Again I got a complete lockup without any obvious reason (nothing special done, clearly nothing in the logs) yesterday. Around 24h of uptime again.
Time to revert to 3.3.8 + H.D. patches and wait for 3.4.3 and BFS 425.
Manuel
I have been running linux-3.4-bfs-423 for days without errors, and as of today with linux-3.4.3rc1. This is still better than mainline Linux!
On LKML, Hillf Danton publishes patches every now and then. These are based on bfs-420 and named bfs 421, but Linux 3.3 is dead now (end of life).
Ralph Ulrich
Hamburg, Germany
If I hadn't had so many random lockups with 3.4.x + BFS, I wouldn't have had to write about it.
At least 3 lockups in four days. That has not happened this often since before 2.6.39.
I'm now using 3.3.8 with SLUB (instead of my previous SLAB) + my usual BFS setup + all of Hillf Danton's recent patches, just to see whether I get the same lockups if they're not 4.2.* or BFS 422/423 related.
If it's kernel-3.4 related, I would not suffer.
Manuel
Just a reminder, I do not consider Hillf's patches part of BFS.
NOT only considering.
If H.D. wants his work to become famous, he should pack his stuff together and push it to some website.
Why do you insist on using Danton's patches? BFS works great without them.
I definitely do not insist on Hillf Danton's patches per se.
But, considering Con Kolivas' self-confident reply, what should I do now if 3.4.2 with BFS aborts after 24h, without any help from him or others?
That's the only reason for me to go back to, and promote, the last known good setup.
@Chen / X: Don't start a flame war against H.D. At least his patches work. Your first ones were simply no-gos.
I've now compared my 3.4.2 .config against the 3.3.8 one really carefully, looking for possibly wrong automatic choices. Dunno. There are many diffs, but nothing I'd suspect to be the culprit.
Manuel
Alas, there's nothing to debug. A lockup after 24h without any logs is very non-specific and gives me nothing to work off. How does it compare to 3.4.2 withOUT BFS with the rest of the config the same?
@Manuel
Have I flamed H.D.? No.
I am advising H.D. that he should pack up his work and push it to some kind of website (e.g. www.danton.org, GoogleCode, github, ...). That makes it easier for users to review what he has done.
Chen
Try 3.4.1 instead. I also have "lock-ups" with 3.4.2. The lockup comes from X11 ("EQ overflowing" infinite loop.) I have no idea whether this is BFS or NVidia binary driver related. It's so rare, that I couldn't be bothered to find out :-P Going to 3.4.1 cured it.
@ Con Kolivas:
Yes, really, it's a pity that the system simply stops working. I would have liked to provide you with some more useful BUG messages, had that been possible.
Now I'm running 3.4.2 WITH BFS and a slightly different config, as I saw that I may have messed up some settings on the way from 3.3.8 via 3.4.1 to 3.4.2. In the dumbest case, someone may have knocked me off the web due to a malfunctioning firewall, in which case I would need to apologize for the noise I made. But it has only been up for 3h now.
If it locks up again, my next step will be to compare with the same kernel withOUT BFS, as you suggested. BTW, is there no kernel command line switch to choose the CPU scheduler (like there is for the I/O schedulers)?
Thank you for responding,
Manuel
Suggest:
Compile the kernel with CONFIG_LOCKDEP=y.
When it locks up, use the Magic SysRq key
and have it display all held locks and backtraces of
all CPUs.
See here:
http://en.wikipedia.org/wiki/Magic_SysRq_key
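The SysRq actions suggested above can also be requested without the keyboard by writing the command letter to /proc/sysrq-trigger. What follows is only a minimal Python sketch, assuming root privileges and a kernel built with Magic SysRq support; the helper name `sysrq` and its `trigger_path` parameter are made up for illustration. In the kernel's SysRq documentation, `d` shows all held locks (which needs lock debugging compiled in) and `l` shows a stack backtrace for all active CPUs. On a machine that is completely frozen this will of course not run either.

```python
# SysRq command letters as described in the kernel's SysRq documentation.
SYSRQ_ACTIONS = {
    "d": "show all held locks (needs lock debugging in the kernel)",
    "l": "show a stack backtrace for all active CPUs",
}


def sysrq(key, trigger_path="/proc/sysrq-trigger"):
    """Write one SysRq command letter to the trigger file (needs root).

    `sysrq` and `trigger_path` are hypothetical names for this sketch.
    Returns a short description of the requested action.
    """
    if key not in SYSRQ_ACTIONS:
        raise ValueError(f"unsupported SysRq key: {key!r}")
    with open(trigger_path, "w") as trigger:
        trigger.write(key)
    return SYSRQ_ACTIONS[key]
```

Calling `sysrq("d")` and then `sysrq("l")` from a root Python session should put the corresponding dumps into the kernel log, where `dmesg` can show them.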
My openSUSE kernels preset DEBUG_KERNEL=y if I set EXPERT=y, and CONFIG_LOCKDEP_SUPPORT=y is then already set, too.
Did you mean this one? I don't have a plain "CONFIG_LOCKDEP" in 3.4.2.
But this wouldn't help any further anyway when the _machine_ locks up (it would be different if only the kernel failed). I've even rechecked that there were no temperature problems, and the hardware hasn't changed for months.
Thanks, Manuel
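The difference between CONFIG_LOCKDEP_SUPPORT and CONFIG_LOCKDEP is easy to miss when searching a .config by eye. Here is a minimal Python sketch of an exact-name lookup, assuming the usual .config conventions (`CONFIG_FOO=y`, or `# CONFIG_FOO is not set`). The function name `config_value` and the sample text are made up for illustration and are not Manuel's real configuration. As far as is known, CONFIG_LOCKDEP has no menu entry of its own and gets selected by lock debugging options such as CONFIG_PROVE_LOCKING; treat that as an assumption to check against the Kconfig files of the kernel in question.

```python
def config_value(config_text, option):
    """Return the value of `option` in kernel .config text, or None if it is
    unset or absent. Matches the full option name only, so CONFIG_LOCKDEP
    does not match CONFIG_LOCKDEP_SUPPORT."""
    for line in config_text.splitlines():
        line = line.strip()
        if line == f"# {option} is not set":
            return None
        if line.startswith(option + "="):
            return line.split("=", 1)[1]
    return None


# Made-up sample for illustration only.
SAMPLE = """\
CONFIG_EXPERT=y
CONFIG_DEBUG_KERNEL=y
CONFIG_LOCKDEP_SUPPORT=y
# CONFIG_PROVE_LOCKING is not set
"""

for name in ("CONFIG_LOCKDEP_SUPPORT", "CONFIG_LOCKDEP", "CONFIG_PROVE_LOCKING"):
    print(name, "=", config_value(SAMPLE, name))
```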
Mmmh, there's 3.4.3 out now: Should I wait for the next lockup (after only ~9h uptime) or just try the new kernel?
Manuel
But, let's give it some uptime...
12 hours is nothing.
Manuel
P.S. It should read "if they're not 3.4.* or BFS 422/423 related".
So, 3.3.8 with BFS & _SLUB_ is now at 26h of uptime.
In other words, SLUB in 3.3.8 does not cause the lockup after 24h.
Manuel
Subject: BFS-O(1) is now a correct algorithm.
Con, please take a look at this mail. ;-)
Recently, with linux-3.4.4rc-bfs, I saw some time overflows in top for
rcuc/0
rcuc/1
Is this rpc related?
I am just dropping the one patch
rpc_pipefs-allow-rpc_purge_list-to-take-a-null-waitq-pointer
that I can see relating to it, and will try again ....
Ralph Ulrich
@ Con Kolivas & all other on here as well:
I've now spent some days checking and reverting some config changes I made between 3.3.8 and 3.4.2/3 and testing whether the resulting kernels run longer than 24h. One of them hard-locked after almost 32h.
Then I set CONFIG_JUMP_LABEL back to n (like I had with 3.3.8). And this one, including BFS, ran longer than 49h.
Do you think this option could harm BFS, or something else, in such a way that the machine hard-locks? (gcc version is 4.6.2)
Does someone else have experience with this option?
Now I'll still need to test the standard scheduler with this option set to y although I don't like to run kernels without BFS. ;-)
Manuel
Does BFS have some timing dependencies with side effects?
Last sentence of the help text for
CONFIG_JUMP_LABEL (Optimize very unlikely/likely branches):
"update of the condition is slower, but those are always very rare."
Ralph Ulrich
PS: I had also disabled this option
This is a low level change to how inbuilt expect functions are compiled by gcc into assembly utilising an x86 feature. Can't see how this is affected by BFS directly or indirectly.
Yes, that is what I wanted to answer Manuel. I looked it up when I saw that I myself had this disabled. And I am the one whose rcuc kernel threads show sys time going crazy. I am just compiling my kernel with CONFIG_JUMP_LABEL enabled. Perhaps this gives BFS more time to behave normally. As the last sentence in the option's help says:
"update of the condition is slower"
Ralph Ulrich
Con, wasn't it you who looked into the Solaris code and recognized it as a saner framework? From that one could easily conclude that the Linux source is more vulnerable to side effects... Ralph
And the other way round: may _this option_ affect the way BFS works?
IIRC, it made BFS snappier on my old hardware. But that's subjective. Perhaps Ralph would share his experience with us.
Thank you for your replies!
Manuel
Manuel, this option normally brings a performance boost. That is why I disabled it at first, to have a more stable experience. But note the last sentence in the option's help: in rare cases, updating the condition is slower.
This must have a side effect on BFS: I have had no issues with BFS since I enabled this jump label optimization.
Ciao from happy soccer Germany, Ralph
The "slow down" it mentions is when the branch is the opposite of what is predicted. The idea is that there is a branch point where we know that 99% of the time we do code A and 1% of the time we do code B. Normally it would cost a little overhead to do code A and a little more overhead to do code B. With this optimisation feature enabled, it costs NO overhead to do code A and MORE overhead than before to do code B.
This has nothing to do with BFS.
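The trade-off described above can be put into numbers as an expected overhead per pass through the branch. This is only a sketch of the arithmetic, following the description in the comment: the 99%/1% split is taken from it, while the cost units are hypothetical and chosen purely for illustration, not measured on any kernel.

```python
def expected_overhead(p_likely, cost_likely, cost_unlikely):
    """Average overhead per pass, given how often the likely path is taken."""
    return p_likely * cost_likely + (1.0 - p_likely) * cost_unlikely


P_LIKELY = 0.99  # "99% of the time we do code A"

# Hypothetical cost units (assumptions, not measurements).
# Normal branch: a little overhead for A, a little more for B.
normal = expected_overhead(P_LIKELY, cost_likely=1.0, cost_unlikely=2.0)
# With the optimisation: no overhead for A, much more than before for B.
optimised = expected_overhead(P_LIKELY, cost_likely=0.0, cost_unlikely=20.0)

print(f"normal branch: {normal:.2f} units per pass")
print(f"optimised:     {optimised:.2f} units per pass")
```

With these made-up costs the optimised variant still comes out well ahead on average (about 0.20 versus 1.01 units), even though the rare path is ten times more expensive than before. The balance only tips if the "unlikely" path turns out not to be rare.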
Con, our issue is not performance here:
Without JUMP_LABEL
1. Manuel's system halts after a day
2. for me (shutting down every night),
ps -e -o pcpu,bsdtime,stat,comm --sort -pcpu
shows an rcuc/0 thread with 175 million seconds of run time
in rare cases (after hours).
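A quick sanity check of that figure, as a small Python calculation: it only converts the quoted 175 million seconds into days and years, to show why the value cannot be genuine CPU time on a machine that is shut down every night and so looks like an accounting overflow rather than real load.

```python
SECONDS_PER_DAY = 24 * 60 * 60

reported_runtime_s = 175_000_000  # figure quoted from the ps output above
days = reported_runtime_s / SECONDS_PER_DAY
years = days / 365.25

print(f"{days:.0f} days, about {years:.1f} years of CPU time")
```

That is roughly 2025 days, or about five and a half years, attributed to a single kernel thread after a few hours of uptime.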
Just a short correction to the above:
With BFS and CONFIG_JUMP_LABEL = y my system halts after 23-32h.
With BFS and CONFIG_JUMP_LABEL = n my system keeps running after 49h.
Now running the standard scheduler with CONFIG_JUMP_LABEL = y. 16h so far. I hope it breaks soon; I don't like the experience.
And: I don't claim that it has anything to do with BFS, I only asked if it could possibly be so.
Manuel
So what about your last test with CONFIG_JUMP_LABEL, Manuel? Results?
Yes, yes. I'm still waiting for 3.4.3 + standard scheduler with CONFIG_JUMP_LABEL = y to fail. But even at 65h of uptime it keeps running without issues (but, of course, with worse interactivity than BFS). Side note: the CFS kernel also slows down after running for a certain time.
I don't know what to do now. Meanwhile I have compiled a 3.4.4 + BFS + CONFIG_JUMP_LABEL kernel, ready for the next reboot.
Manuel
Hi Con, I have hard freezes with BFS.
I use the zen kernel (with BFS and BFQ, linux 3.4.4). The command "mkfs.ext4 -L Diskname -c -v /dev/sdb1" (SATA drive on an eSATA connector) leads to a hard freeze within minutes during the bad block test. Not even the Magic SysRq keys work anymore.
So I compiled zen with CFS: no problem, no freeze.
Zen with BFS but without BFQ -> freeze.
So I tested it with vanilla kernel 3.4.4 and the CK2 patches -> freeze too.
Vanilla kernel 3.4.4 with your patches but without BFS -> no freeze!
PS: I even tested with the CONFIG_JUMP_LABEL option mentioned in the discussion disabled, with BFS, but it froze too.
I use openSUSE 12.1, and with the original Tumbleweed kernel 3.4.3 there is no such problem.
Regards Mike
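The test matrix in this report can be checked mechanically for which component tracks the freezes. The Python sketch below does that under stated assumptions: the rows are transcribed from the report above, BFQ is recorded as `None` where the report does not say, the stock Tumbleweed kernel is assumed to use CFS, and the helper name `freeze_tracks` is made up for illustration.

```python
# (kernel, scheduler, BFQ enabled?, froze?) as reported above.
# BFQ is None where the report does not say; CFS for Tumbleweed is an assumption.
RESULTS = [
    ("zen 3.4.4", "BFS", True, True),
    ("zen 3.4.4", "CFS", None, False),
    ("zen 3.4.4", "BFS", False, True),
    ("vanilla 3.4.4 + ck2 patches", "BFS", None, True),
    ("vanilla 3.4.4 + ck2 patches without BFS", "CFS", None, False),
    ("openSUSE Tumbleweed 3.4.3", "CFS", None, False),
]


def freeze_tracks(results, predicate):
    """True if predicate(row) agrees with the freeze outcome in every test."""
    return all(predicate(row) == row[3] for row in results)


print("BFS tracks the freezes:", freeze_tracks(RESULTS, lambda r: r[1] == "BFS"))
print("BFQ tracks the freezes:", freeze_tracks(RESULTS, lambda r: r[2] is True))
```

In this small sample the scheduler is the only factor that lines up with every outcome, while BFQ is ruled out by the run that froze without it. Six data points are a pointer for where to look, not a proof.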
Short Q: With which of your combinations did you test CONFIG_JUMP_LABEL disabled?
Regards, Manuel
Hi Manuel,
Short answer: only the vanilla kernel with the CK2 patchset and CONFIG_JUMP_LABEL disabled.
Regards Mike
Hi Con,
Some more tests:
1. Hard freeze with BFS and a bad block test on a USB (2.0) stick (command: "mkfs.ext2 -L "Stickname" -m 0 -c -v /dev/sdb1"). Freeze after approx. 5 minutes, at 30%. (Remark: with eSATA drives the freeze comes within 2 minutes.)
2. Starting in runlevel 3 does not make a difference: it freezes too with the bad block test command (and no other tasks running).
Btw., running the command with CFS or with the new RIFS from Chen does not show this problem.
Con, if you need additional info or have requests regarding this bug, I could try to help you.
Thanks and regards
Mike
Just adding to my reports.
I abandoned the 3.4.3 CFS + JUMP_LABEL test after 3d0h10m. Rock stable, but predictably unresponsive.
The 3.4.4 with BFS(only) +JUMP_LABEL crashed after ~8h.
Two days ago I had a complete lockup with that kernel+config but WITHOUT JUMP_LABEL after some hours. So it really has nothing to do with that config setting.
I've tried RIFS but that doesn't work on non-SMP systems at all.
I feel a bit unsafe at the moment, when using BFS-patched kernels.
Manuel
Ok looking at the pattern of lockups people are having, I'm reasonably sure it's the block plugging code which I changed going into this BFS release. I will put together an update soon that backs out those changes to the old proven mechanism. Thanks everyone for your bug reports.
Hi Con,
thanks for your work. You are right. I recently tested vanilla 3.3.8 with the old -ck1 patchset and the old BFS, and the problems I mentioned with the bad block test during mkfs are completely gone.
So it's really the new BFS code.
Thanks so far.
Regards Mike
Nice tool