Monday 11 June 2012

bfs 0.423, 3.4-ck2

A couple of issues showed up with BFS 0.422: one was the "0 load" bug, and the other was a build failure on non-hotplug releases. So here are BFS 0.423 and 3.4-ck2 (which is just ck1 with the BFS update), which should fix both:

3.4-sched-bfs-423.patch

3.4-ck2/

and the increment only:

3.4bfs422-423.patch

Enjoy!

48 comments:

  1. Good experience with linux-3.4.2-bfs-423:
    no slowdown after some hours, as there was with linux-3.3.
    I suspect the earlier sys-time bug had some side effects, but that is just my FUD ...

    Con Kolivas, thank you for your work!
    Ralph Ulrich

    ReplyDelete
  2. Thanks Con - you are really quick :)

    fanthom

    ReplyDelete
  3. @Ralph:
    What did you mean by '-12queuePatches' in "linux-3.4.2-12queuePatches-bfs423-full-ck2" in the other blog thread?! Something off my radar?

    Just gonna reboot now with 3.4.2+bfs-423+BFQ+mm-drop_swap_cache_aggressively.patch.

    Thx,
    Manuel

    ReplyDelete
    Replies
    1. Hi Manuel,
      I don't use BFQ, but I run my system with
      linux-3.4.2
      12 patches from stable-queue
      BFS-423
      all other ck2 patches

      Delete
    2. "12 patches from stable-queue"

      OK, your distro is obviously different from openSUSE. Also, openSUSE patches in things that I'm not aware of anyway.

      Let's give my new kernel some uptime.

      Greets,

      and many thanks to Con Kolivas for providing us with the results of his work

      Manuel

      Delete
  4. BTW, thinking about what Ralph Ulrich wrote in the other blog thread, about feeling the need to learn benchmarking...

    Do we have any tool to really benchmark "interactivity" within Linux desktop systems?
    Isn't it, up to today, just a matter of reports that things got better or worse?

    Manuel

    ReplyDelete
  5. @ Con Kolivas:
    A big THANK YOU for this BFS-423 patch. It really makes a difference. I've now had it running for almost 22h and there is no slowdown, as was noticed with previous kernel & BFS combos (also mentioned by Ralph).
    In addition, the time to recover presumably swapped-out desktop content, after making heavy use of possibly swapped-out shmfs, is greatly reduced here now.
    It also looks like the base CPU load of the usually running processes has dropped.

    I don't have any insight on how this is related to your patch improvement since 422, but: Very nice experience, indeed.

    Again, many thanks for your work!!!
    Manuel

    ReplyDelete
  6. @ Con Kolivas:
    NO, nothing is OK at all.

    I again got a complete system failure (like with 3.4.2 plus BFS 422), just some minutes after my last posting, while only watching something from disk via vlc.

    There's nothing in the logs; the machine hung completely, as last time.

    SLUB? SLAB?
    Please, inspect the differences in the transition in detail.

    For me in the meanwhile, I'll compile with a fresh install and will come back.

    Manuel Krause

    ReplyDelete
  7. Yesterday I again got a complete lockup without any obvious reason (nothing special done, clearly nothing in the logs). Around 24h of uptime again.

    Time to revert to 3.3.8 + H.D. patches and wait for 3.4.3 and BFS 425.

    Manuel

    ReplyDelete
  8. I have been running linux-3.4-bfs-423 for days without errors, and as of today with linux-3.4.3-rc1. This is still better than mainline Linux!

    At the LKML, Hillf Danton is publishing patches every now and then. These are bfs-420 based (named bfs 421), but Linux 3.3 is dead now: end of lifetime.
    Ralph Ulrich
    Hamburg, Germany

    ReplyDelete
  9. If I hadn't had so many random lockups with 3.4.x + BFS, I wouldn't have had to write about them.
    At least 3 lockups in four days. That has never occurred this often, not since before 2.6.39.

    I'm now using 3.3.8 with SLUB (instead of my previous SLAB) + my usual BFS setup + all of Hillf Danton's recent patches. Just to find the same lockups if they're not 3.4.* or BFS 422/423 related.

    If it's kernel 3.4 related, I shouldn't suffer them there.
    Manuel

    ReplyDelete
    Replies
    1. Just a reminder, I do not consider Hillf's patches part of BFS.

      Delete
    2. It's NOT only a matter of consideration.
      If H.D. wants his work to become known, he should pack his stuff together and push it to some website.

      Delete
    3. Why do you insist on using Danton's patches? BFS works great without them.

      Delete
    4. I definitely do not insist on Hillf Danton's patches per se.

      But, Con Kolivas' self-confident reply included, what should I do now if 3.4.2 with BFS aborts after 24h? Without any help from him or others?
      That's the only reason for me to go back to & propagate the last known good.

      @Chen / X: Don't stir up a flame war against H.D. At least his patches work. Your first ones were simple NO-GOs.

      I've now inspected my changed .config really carefully, 3.4.2 vs. 3.3.8, for possible wrong automatic choices. Dunno. There are many diffs, but nothing I'd suspect to be the culprit.

      Manuel

      Delete
    5. Alas, there's nothing to debug. A lockup after 24h without any logs is very non-specific and gives me nothing to work from. How does it compare to 3.4.2 withOUT BFS, with the rest of the config the same?

      Delete
    6. @Manuel
      Have I flamed H.D.? No.
      I am advising H.D. that he should pack up his work and push it to some kind of website (e.g. www.danton.org, Google Code, GitHub, ...). It would be much better for users to be able to review what he has done.
      Chen

      Delete
    7. Try 3.4.1 instead. I also have "lock-ups" with 3.4.2. The lockup comes from X11 (an "EQ overflowing" infinite loop). I have no idea whether this is BFS or NVidia binary driver related. It's so rare that I couldn't be bothered to find out :-P Going to 3.4.1 cured it.

      Delete
    8. @ Con Kolivas:
      Yes really, it's a pity that the system just simply stops working. I would have liked to provide you with some more useful BUG messages if it had been possible.

      Now I'm running 3.4.2 with a slightly different config WITH BFS, as I saw I may have messed up some settings on the way from 3.3.8 and 3.4.1 to 3.4.2. In the dumbest case someone may have cut me off from the web due to a malfunctioning firewall, in which case I would need to apologize for the noise I made. But it's only been up for 3h now.

      In the next step I'd compare with the same kernel withOUT BFS, as you suggested, if it locks up again. BTW, isn't there a kernel command line switch to choose the CPU scheduler (like there is for the I/O schedulers)?

      Thank you for responding,
      Manuel

      Delete
    9. Suggestion:
      Compile the kernel with CONFIG_LOCKDEP=y.
      When it locks up, use the Magic SysRq key
      and have it display all held locks and backtraces of
      all CPUs.
      See here:
      http://en.wikipedia.org/wiki/Magic_SysRq_key

      Delete
    10. My openSUSE kernels predefine DEBUG_KERNEL=y if I set EXPERT=y. And CONFIG_LOCKDEP_SUPPORT=y is then already set, too.
      Did you mean this one? I don't have a plain "CONFIG_LOCKDEP" in 3.4.2.

      But this wouldn't help any further anyway when the _machine_ locks up (it would make a difference only if the kernel alone failed). I've even rechecked that there hadn't been any bad temperature issues, and the hardware hasn't changed for months.

      Thanks, Manuel

      Delete
    11. Mmmh, there's 3.4.3 out now: Should I wait for the next lockup (after only ~9h uptime) or just try the new kernel?

      Manuel

      Delete
  10. But, let's give it some uptime...
    12 hours is nothing.

    Manuel

    P.S. It should read "if they're not 3.4.* or BFS 422/423 related".

    ReplyDelete
    Replies
    1. So, 3.3.8 with BFS & _SLUB_ is now at 26h of uptime.

      That is to say, the lockup after 24h is not caused by SLUB in 3.3.8.

      Manuel

      Delete
  11. Subject: BFS-O(1) is now a correct algorithm.
    Con, please take a look at this mail. ;-)

    ReplyDelete
  12. Recently, with linux-3.4.4rc-bfs, I had some top time overflows at
    rcuc/0
    rcuc/1
    Is this rpc related?

    I am just deleting the one patch,
    rpc_pipefs-allow-rpc_purge_list-to-take-a-null-waitq-pointer,
    which I suspect is related, and will try again ....

    Ralph Ulrich

    ReplyDelete
  13. @ Con Kolivas & all others on here as well:
    I've now spent some days checking and reverting some config changes I made between 3.3.8 and 3.4.2/3, and testing whether the resulting kernels run longer than 24h. One of them hardlocked after almost 32h.

    Then I set CONFIG_JUMP_LABEL back to n (like I had with 3.3.8). And this one, including BFS, ran longer than 49h.

    Would you consider it possible that this option harms BFS, or something else, in such a way that the machine hardlocks? (gcc version is 4.6.2)

    Does someone else have experience with this option?

    Now I'll still need to test the standard scheduler with this option set to y, although I don't like running kernels without BFS. ;-)

    Manuel

    ReplyDelete
  14. Does BFS have some timing dependencies with side effects?

    The last sentence of the help text for CONFIG_JUMP_LABEL
    ("Optimize very unlikely/likely branches"):

    "update of the condition is slower, but those are always very rare."

    Ralph Ulrich
    PS: I had also disabled this option

    ReplyDelete
    Replies
    1. This is a low-level change to how built-in expect functions are compiled by gcc into assembly, utilising an x86 feature. I can't see how this would affect BFS, directly or indirectly.

      Delete
    2. Yes, that is what I wanted to answer to Manuel. And I looked it up when I saw that I myself had this disabled. And I have these rcuc kernel threads whose systime goes crazy. I'll just compile my kernel with CONFIG_JUMP_LABEL enabled. Perhaps this gives BFS more time to behave normally. As the last sentence in the option's help says:
      "update of the condition is slower"
      Ralph Ulrich

      Delete
    3. Con, wasn't it you who looked into the Solaris code, only to recognize it as a saner framework? One could easily conclude that the Linux source is more vulnerable to side effects... Ralph

      Delete
  15. And the other way round: may _this option_ affect the way BFS works?

    IIRC, it made BFS snappier on my old hardware. But that's subjective. Perhaps Ralph would share his experience with us.

    Thank you for your replies!
    Manuel

    ReplyDelete
  16. Manuel, this option normally brings a performance boost. That is why I had disabled it at first, to have a more stable experience. But note the last sentence in the option's help: in rare cases there is a slowdown when updating conditions.

    This must have a side effect on BFS: I have had no issues with BFS since I enabled this JUMP_LABEL optimization.
    Ciao from happy soccer Germany, Ralph

    ReplyDelete
    Replies
    1. The "slow down" it mentions is when the branch is the opposite of what is predicted. The idea is that there is a branch point where we know that 99% of the time we do code A and 1% of the time we do code B. Normally it would cost a little overhead to do code A and a little more overhead to do code B. With this optimisation feature enabled, it costs NO overhead to do code A and MORE overhead than before to do code B.

      This has nothing to do with BFS.

      Delete
  17. Con, our issue here is not performance:
    Without JUMP_LABEL,
    1. Manuel's system halts after a day
    2. for me (shutting down every night),
    ps -e -o pcpu,bsdtime,stat,comm --sort -pcpu
    gives me a rcuc/0 thread with 175 million seconds of run time
    in rare cases (after hours).

    ReplyDelete
    Replies
    1. Just a short correction of the above:
      With BFS and CONFIG_JUMP_LABEL = y my system halts after 23-32h.
      With BFS and CONFIG_JUMP_LABEL = n my system keeps running after 49h.

      Now running the standard scheduler with CONFIG_JUMP_LABEL = y. 16h so far. I hope it breaks soon; I don't like the experience.

      And: I don't claim that it has anything to do with BFS, I only asked whether it possibly could.

      Manuel

      Delete
    2. So what about your last test with CONFIG_JUMP_LABEL, Manuel? Results?

      Delete
    3. Yes, yes. I'm still waiting for 3.4.3 + the standard scheduler with CONFIG_JUMP_LABEL = y to fail. But even with 65h of uptime it keeps running without issues (though, of course, with worse interactivity than BFS). Side note: the CFS kernel does slow down after a certain running time, too.

      Don't know what to do now. Meanwhile I've compiled a 3.4.4 + BFS + CONFIG_JUMP_LABEL, ready for the next reboot.

      Manuel

      Delete
  18. Hi Con, I have hard freezes with BFS.

    I use the zen kernel (with BFS and BFQ, linux 3.4.4). The command "mkfs.ext4 -L Diskname -c -v /dev/sdb1" (SATA drive on an esata connector) leads to a hard freeze within minutes during the bad block test. Not even the Magic SysRq keys work anymore.
    So I compiled zen with CFS: no problem, no freeze.
    Zen with BFS but without BFQ -> freeze.

    So I tested the vanilla kernel 3.4.4 with the CK2 patches -> freeze too.
    Vanilla kernel 3.4.4 with your patches but without BFS -> no freeze!

    PS: I even tested disabling the CONFIG_JUMP_LABEL mentioned in this discussion, with BFS, but it froze too.

    I use openSUSE 12.1, and with the original Tumbleweed kernel 3.4.3 there is no such problem.

    Regards Mike

    ReplyDelete
    Replies
    1. Short Q: in which of your combinations did you test with CONFIG_JUMP_LABEL disabled?
      Regards, Manuel

      Delete
    2. Hi Manuel,

      short answer: only the vanilla kernel with the CK2 patchset and CONFIG_JUMP_LABEL disabled.

      Regards Mike

      Delete
    3. Hi Con,

      some more tests:

      1. Hard freeze with BFS and a USB(2) stick badblock test (command: "mkfs.ext2 -L "Stickname" -m 0 -c -v /dev/sdb1"). Freeze after approx. 5 minutes and 30%. (Remark: with esata drives the freeze comes within 2 minutes.)

      2. Starting in runlevel 3 does not make a difference; it freezes too with the badblock test command (and no other tasks running).

      Btw. running the command with CFS or the new RIFS from Chen does not show this problem.

      Con, if you need additional info or tests for this bug, I could try to help you.

      Thanks and regards
      Mike

      Delete
  19. Just adding to my reports.
    I abandoned the 3.4.3 CFS + JUMP_LABEL test after 3d0h10m. Rock stable, but really, predictably unresponsive.
    The 3.4.4 with BFS (only) + JUMP_LABEL crashed after ~8h.
    Two days ago I had a complete lockup with that kernel+config but WITHOUT JUMP_LABEL after some hours. So it really has nothing to do with that config setting.
    I've tried RIFS, but that doesn't work on non-SMP systems at all.

    I feel a bit unsafe at the moment, when using BFS-patched kernels.

    Manuel

    ReplyDelete
  20. OK, looking at the pattern of lockups people are having, I'm reasonably sure it's the block plugging code, which I changed going into this BFS release. I will put together an update soon that backs those changes out, returning to the old proven mechanism. Thanks everyone for your bug reports.

    ReplyDelete
    Replies
    1. Hi Con,

      thanks for your work. You are right. I recently tested vanilla 3.3.8 with the old -ck1 patchset and the old BFS, and my aforementioned problems with the badblock test during mkfs are completely gone.

      So it's really the new BFS code.

      Thanks so far.
      Regards Mike

      Delete
  21. This comment has been removed by the author.

    ReplyDelete
    Replies
    1. This comment has been removed by the author.

      Delete