Thursday 16 April 2015

BFS 462, linux-4.0-ck1

Announcing a resync and update of BFS for linux-4.0

BFS by itself:

4.0-sched-bfs-462.patch

-ck branded linux-4.0-ck1 patches:

4.0-ck1 patches

The usual collection of resyncs and minor updates only.

It includes the following changes:
- Minor tweaks to uniprocessor build (though enabling SMP will fix breakage if it still exists).
- Fix for tracing build failure
- SMT nice update to ignore kernel threads
- Decrease log level of locality information to debug

EDIT Fix for 4.0.2+: bfs462-rtmn-fix.patch

Enjoy!
(Please enjoy)

77 comments:

  1. Did a uniprocessor build on Arch for testing (EeePC 701); still panics immediately. Will next verify that enabling SMP still fixes it.

  2. Thanks for the info and sorry. No surprise I guess since I honestly didn't put much effort into it when I found it would boot on my SMP machines with a UP build.

  3. I've verified that enabling SMP fixes the panic, as expected. Things do seem a bit more sluggish...though I've only been booted into it for a few minutes. :-)

    I also have not yet tried the patch mentioned by kernelOfTruth in the 3.19 thread: http://ck-hack.blogspot.com/2015/02/bfs-461-linux-319-ck1.html?showComment=1427417073374#c5057988316819204350

    Replies
    1. Update: I see that linux4.0 already has that patch applied, so I guess it didn't fix my (PREEMPT && !SMP) kernel panic for cpu_startup_entry, that occurs immediately at boot.

  4. Hello Con Kol

    I get a kernel panic and/or a freeze when the SDDM login manager starts ksplash. Sometimes it is an Intel DRM coredump, sometimes a network driver coredump, under Linux 4.0 CK. I can't log in successfully under the 4.0 ck kernel at all. I see the mouse pointer and the SDDM login splash starts to animate, but then it hangs. Only a hard poweroff (holding the power button for 5 seconds) works.

    After powering on and booting the stock kernel, I got a journal recovery again.

    Something is wrong with the CK patch for 4.0.

    I patched the -mainline kernel with the GCC patch and it works fine, without any issues.

    For the SSD I use NOOP. For the HDD I use CFQ (in stock) or BFQ (in -ck, -mainline) and I have had no issues. A udev rules script changes the scheduler dynamically.

    I am using these sources (small improvements in the PKGBUILD and small hiding patches; they work like a charm with all kernels): https://github.com/FadeMind/archpkgbuilds/tree/master/linux-ck

    Note: I booted the 3.19.4-ck kernel and it passes fine.
    (graysky hasn't committed the AUR package update to 4.0-ck yet)

    Linux 3.19.4-ck boots fine. Here is the dmesg from it: https://pastebin.com/pKitMCcK
    Lsmod: https://pastebin.com/zwm3gfHS
    inxi -Fxz https://pastebin.com/GQiRMaER

    Kind Regards

    Tomasz Przybył (FadeMind)

    Replies
    1. Wow, no idea on that one, sorry.

    2. Tomek, your problem probably isn't related to the CK patch. I use this patch on linux 4.0 on Arch (with some other patches) with SDDM and it works OK. It's possible that you are building with some linux 3.19 related files and, maybe most importantly, that your NVidia proprietary driver isn't built for linux 4.0.
      Try building the kernel with the CK patch against the configs from Arch's linux 4.0 (in testing) and use it with nouveau, or try building nvidia-ck with the patches for linux 4.0 (also in testing; there is a thread about it on Arch's BBS, too).

  5. ... and one more, try this: https://bbs.archlinux.org/viewtopic.php?id=195729

    Replies
    1. Thanks for the reply.
      I took a screenshot of the kernel panic: https://dl.dropboxusercontent.com/u/7244180/bug/Zdj%C4%99cie0004.jpg
      (sorry about the bad quality)
      RIP skb_dequeue 0x4b/0x00

      It looks like the WLAN card (Qualcomm Atheros AR9485 Wireless Network Adapter) doesn't get time to handle its interrupts and the driver just freaks out.

      It seems BFS has a bad config value for the CPU and the connection is too fast...

      The NVIDIA drivers I am using are fully compatible with the 4.0 kernel.

    2. Take a look at: https://github.com/sirlucjan/aur
      I'm a linux-uksm-ck user and I have an AR9485 too; it works well.

      PS: Come over to archlike.darmowefora.pl; maybe we can solve your problem there.

  6. @Con Kolivas: Intel users are getting random kernel panics under Linux 4.0 CK1;
    please read this topic: https://bbs.archlinux.org/viewtopic.php?pid=1523304#p1523304

    I have exactly the same issue as shown in this screenshot, which is better quality than mine: https://i.imgur.com/VI78toh.jpg

    Regards

  7. Hi, I also got kernel panics using archlinux [3] with linux 4.0 + CK1. I was unable to retrieve any useful message from the system logs, but I was able to take some pictures when the kernel panic happened at boot time [1] [2].

    --- system information ---
    PC: Samsung NP530U3C-A03IT
    IO Scheduler: deadline
    CPU: Intel(R) Core(TM) i5-3317U CPU @ 1.70GHz
    WiFi card: Intel Corporation Centrino Advanced-N 6235
    kernel: linux 4.0 + CK1 with microcode update

    [1] http://i.imgur.com/VI78toh.jpg
    [2] http://i.imgur.com/DeYitN7.jpg
    [3] https://bbs.archlinux.org/viewtopic.php?pid=1523363#p1523363

  8. I too tried the archlinux package, ivy bridge.

    3.19 with ck worked for me too, but 4.0 is all sorts of broken.

    http://imgur.com/6MGZoiG
    http://imgur.com/jLDENN9

    Here is one full dmesg: http://pastebin.com/raw.php?i=kRcYiez1

    I also have intel wifi:
    04:00.0 Network controller: Intel Corporation Centrino Advanced-N 6235 (rev 24)

  9. Me too: kernel panics and system freezes with linux-ck-sandybridge 4.0-1 on Arch Linux x86_64

    Replies
    1. No problems found with linux-ck-atom 4.0-1 archlinux i686

  10. With linux-ck-haswell (4.0-1) on Arch Linux, I see SATA bus related errors (failed command: WRITE FPDMA QUEUED) and SATA link resets which stall all mounts. The kernel does not panic though; I can escape with Ctrl+Alt+Del (no SysRq needed either).

    I have posted the relevant syslog part on the Arch Linux forums: https://bbs.archlinux.org/viewtopic.php?pid=1523519#p1523519.

  11. For those with instability, can you try the following patch on top please: bfs462-remove_unlocked_unplug.patch

    Replies
    1. Just tested with this patch, and I'm still having issues. Grabbed this from attempting to boot with it: http://pastebin.com/05pQJ3Mt
      No wireless involved, just ethernet, for what that's worth.

    2. Thanks for testing it. It was a long shot anyway and your code path doesn't remotely look related. I don't have any further leads on your trace at this stage.

    3. I also tested the patch and it does not solve the issue for me either. However, I was able to get some information from the journal that may be useful:

      http://paste2.org/9cm5XPsm
      http://paste2.org/dhxkazXP
      http://paste2.org/04F17z3s

  12. http://kr4d.com/rack/uploads/IMG_20150427_205003-A9vVmBG.jpg

  13. A few days ago I installed a system-monitor tool and noticed that CPU usage looks strange with BFS. Core 1 is used much less than the others. Core 2 is utilised better, but most of the work is done by cores 3 and 4. As an example, here is the CPU utilization during the compression of a large file (left is with ck1, right is with CFS):

    http://imgur.com/ORSIGmV
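
For anyone who wants numbers rather than a screenshot of a monitor applet, the per-core imbalance described above can be measured from two snapshots of `/proc/stat`. The following is only a minimal sketch: the function names are made up for illustration, and it assumes the usual field order documented in proc(5) (user, nice, system, idle, iowait, irq, softirq, steal), counting idle and iowait as "not busy".

```python
import time


def parse_proc_stat(text):
    """Return {core_name: (busy_ticks, total_ticks)} for each per-core line."""
    cores = {}
    for line in text.splitlines():
        fields = line.split()
        # Per-core lines start with "cpu0", "cpu1", ...; the aggregate
        # "cpu" line (no digits) is skipped.
        if not fields or not fields[0].startswith("cpu"):
            continue
        if not fields[0][3:].isdigit():
            continue
        # Only the first eight counters are summed; guest time is already
        # included in user/nice, so adding it would double-count.
        ticks = [int(x) for x in fields[1:9]]
        idle = ticks[3] + (ticks[4] if len(ticks) > 4 else 0)
        total = sum(ticks)
        cores[fields[0]] = (total - idle, total)
    return cores


def utilisation(before, after):
    """Percentage of non-idle time per core between two snapshots."""
    result = {}
    for name, (busy1, total1) in after.items():
        busy0, total0 = before.get(name, (0, 0))
        elapsed = total1 - total0
        result[name] = 100.0 * (busy1 - busy0) / elapsed if elapsed > 0 else 0.0
    return result


def sample(interval=1.0, path="/proc/stat"):
    """Take two snapshots `interval` seconds apart (Linux only)."""
    with open(path) as f:
        first = parse_proc_stat(f.read())
    time.sleep(interval)
    with open(path) as f:
        second = parse_proc_stat(f.read())
    return utilisation(first, second)
```

Calling `sample(5.0)` while the compression job runs, once under ck1 and once under CFS, would give comparable per-core figures for both schedulers.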

    Replies
    1. copy of many files from HDD to USB-HDD with ck1:
      http://imgur.com/fcdxC0o


      time tar -cjf archiv.tar.gz manjaro-kde-0.9.0-pre5-x86_64.iso

      with ck1:
      real 8m50.373s
      user 8m31.739s
      sys 0m3.263s

      with cfs:
      real 8m28.860s
      user 8m9.377s
      sys 0m7.819s

      (In my previous post I compressed the file with Ark (KDE).)
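
To put the two `time` results side by side, the stamps can be converted to seconds and compared. This is a small sketch using only the figures quoted in the comment; the helper names are illustrative. Note that it compares a single run per scheduler, so it cannot say how much of the gap is run-to-run noise.

```python
import re


def to_seconds(stamp):
    """Convert a `time` stamp such as '8m50.373s' to seconds."""
    match = re.fullmatch(r"(\d+)m(\d+(?:\.\d+)?)s", stamp)
    if match is None:
        raise ValueError(f"unrecognised time stamp: {stamp!r}")
    return int(match.group(1)) * 60 + float(match.group(2))


def relative_difference(a, b):
    """How much larger a is than b, as a percentage of b."""
    return 100.0 * (a - b) / b


# Figures copied from the comment above.
ck1 = {"real": "8m50.373s", "user": "8m31.739s", "sys": "0m3.263s"}
cfs = {"real": "8m28.860s", "user": "8m9.377s", "sys": "0m7.819s"}

for key in ("real", "user", "sys"):
    a, b = to_seconds(ck1[key]), to_seconds(cfs[key])
    diff = relative_difference(a, b)
    print(f"{key}: ck1 {a:.3f}s, cfs {b:.3f}s, difference {diff:+.1f}%")
```

On these numbers the ck1 run takes roughly 4% more wall-clock time than the CFS run, while spending less time in the kernel (`sys`).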

    2. That is interesting. Are you using nice levels at all anywhere in your environment or are they used automatically at all by your applications in question? Alternatively is there anything that might be setting CPU affinity for your applications?

    3. No, I do not use nice levels.

      But now that you mention it, I remember that I played with /sys/block/sd*/queue/rq_affinity a while ago. And ***, it is still set to 2; I forgot to reset the value. I'll test again with the default value.

      Also, I have irqbalance installed. I'll test without it. If that does not work, I will build a new kernel without any patches other than BFS.

      thx
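
A quick way to spot a forgotten `rq_affinity` tweak like the one mentioned above is to scan sysfs for values that differ from the default. This is a hedged sketch: the function name is invented for illustration, and the default of 1 is an assumption about the kernel's usual setting that should be checked against the running kernel's documentation.

```python
import glob
import os

# Assumption: the kernel default for rq_affinity is 1.
DEFAULT_RQ_AFFINITY = 1


def non_default_rq_affinity(sys_block="/sys/block", default=DEFAULT_RQ_AFFINITY):
    """Return {device: value} for sd* devices whose rq_affinity is not `default`."""
    found = {}
    pattern = os.path.join(sys_block, "sd*", "queue", "rq_affinity")
    for path in sorted(glob.glob(pattern)):
        # Path layout is <sys_block>/<device>/queue/rq_affinity.
        device = path.split(os.sep)[-3]
        try:
            with open(path) as f:
                value = int(f.read().strip())
        except (OSError, ValueError):
            continue  # unreadable or unexpected content: skip this device
        if value != default:
            found[device] = value
    return found


for device, value in non_default_rq_affinity().items():
    print(f"{device}: rq_affinity={value} (default assumed to be {DEFAULT_RQ_AFFINITY})")
```

The `sys_block` parameter exists only so the function can be pointed at a test directory; on a real system the default path is the one from the comment.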

    4. Unfortunately, all without success.

      Any ideas?

      Because you've made a lot of changes to bfs, maybe I should retest a previous version?

    5. There actually aren't a lot of changes, just syncing with mainline changes. Definitely try an earlier release if you can to see if it behaves differently.

  14. CK - A growing body of evidence seems to point to disabling NUMA as a cause of the panics under linux 4.0.x with ck1. I will report back once additional folks have a chance to test. At least 3 users have now reported no panics when NUMA was left enabled. More to come.

    Replies
    1. Thanks graysky. While most of us don't actually enable NUMA in their config in the first place, it might help point to where in the code the fault lies.

  15. Not NUMA related but here's a small change that is worth trying: bfs462-ist-change.patch
    The other test patch has been removed as it's of no use.

    Replies
    1. Thanks CK. I have incorporated this patch into 4.0.1-5-ck and asked those affected users to test it. Link to discussion thread.

    2. OK. Several users (five as I type this) have reported that when NUMA is disabled and they are running linux-v4.0.1 + CK1 + bfs462-ist-change.patch, the kernel panics are back: discussion thread with details.

  16. Hi Con,

    I just want to confirm that zen-kernel 4.0.1 with BFS (and BFQ) is running fine here (no NUMA).
    Running on i5 and i7 without a problem.

    Regards sysitos

    Replies
    1. OK, I have to amend my post.
      It only seemed that these kernels with BFS were stable; during heavy I/O (and network I/O), all my machines with BFS crash.
      I first saw it on my server, where writing data over the network to the RAID5 NFS share leads to a crash. It also occurs on my desktop machine, but more rarely. Enabling or disabling NUMA doesn't help. Only disabling BFS works. Tested with zen kernels 4.0 to 4.0.3.
      So my last working kernel with BFS (on my server) is 3.17.x. The crashes started with 3.18. After your 3.18 patch to resolve it, the problems had apparently gone (as I already wrote on your site), or I just had not stress-tested enough.

      Maybe the problem with the current kernel lies in those old changes from 3.18, but this is only my guess.

      PS: I know the 3.17 kernel line is out of support, but at the moment I prefer an old kernel with BFS over a current one with CFS ;)
      PPS: BFQ is enabled in zen too, but this doesn't affect the crashes (already tested).

      Regards sysitos

    2. Interesting. I was having some crashes recently while experimenting with btrfs commands on an external drive (mainly scrub or btrfs-convert). I was starting to think it could be due to bfs bugs and you have the exact same problem. Do you know if the rtmn patch fixes this bug?

    3. Or maybe the update-inittask patch? Or NUMA? It would be nice to be using kernel 4.0 with BFS instead of reverting back to 3.18

    4. I tested the different patches mentioned here, and NUMA as well, but without success. At the moment I am using the current zen kernel without BFS, and that works too ;) I think BFQ is enough for a server.

      In my opinion, the breakage with BFS and heavy I/O started with 3.18.

      Regards sysitos

  17. Is anyone else experiencing build failures compiling 4.0.2 and ck1?

    ...
    CC [M] drivers/net/wireless/rtlwifi/rtl8821ae/table.o
    LD [M] drivers/net/wireless/rtlwifi/rtl8723be/rtl8723be.o
    CC [M] drivers/net/wireless/rtlwifi/rtl8821ae/trx.o
    LD [M] drivers/net/wireless/rtlwifi/rtl8821ae/rtl8821ae.o
    LD drivers/net/wireless/built-in.o
    LD drivers/net/built-in.o
    LD drivers/built-in.o
    LINK vmlinux
    LD vmlinux.o
    MODPOST vmlinux.o
    GEN .version
    CHK include/generated/compile.h
    UPD include/generated/compile.h
    CC init/version.o
    LD init/built-in.o
    arch/x86/built-in.o: In function `pvclock_init_vsyscall':
    (.init.text+0x1744e): undefined reference to `register_task_migration_notifier'
    Makefile:937: recipe for target 'vmlinux' failed
    make: *** [vmlinux] Error 1

  18. Here's a fix for that build failure: bfs462-rtmn-fix.patch

    Will be interesting to see if this is somehow related to the crashes too, though I still haven't figured out why people are getting them.

  19. There's also this missing change from the original release: bfs462-update_inittask.patch

    Replies
    1. Thanks for the patches. The users are reporting no effect with these two patches (ie still kernel panics) when NUMA is disabled. Just like before, if we enable NUMA, no one has reported a panic. I don't know if the NUMA status + CK1 is to blame for the panics or if it merely catalyzes them. We stand by to test any other patches you can offer up.

    2. Thanks as always. As I've been trying to say, NUMA is just papering over the issue as I don't expect anyone should have to enable numa for an ordinary kernel to work. However I don't actually know what the issue is so enabling numa is a decent workaround till I happen to find whatever it is. There is no numa specific code in the latest kernel so it's sheer coincidence and so far the circumstantial evidence points to the assembly changes in do_fork for x86 being responsible somehow. What exactly, I don't know and finding time to go through this with a fine toothed comb is hard.

    3. Related to SMP? Seems as though booting with maxcpus=1 stops the panics.

    4. @graysky
      From your thread, I noticed that the -3 test kernel, with my -gc patch set but NUMA disabled, seems to work. Is that right?
      If so, would you please try the -gc patches on top of v4.0.2? If that still holds, please narrow the patch set down to this commit:

      https://bitbucket.org/alfredchen/linux-gc/commits/54665090c191462f8dd3c1aaedbeea17bef6edfc?at=v4.0.2-gc

      This should be the only difference introduced in the 4.0 release.

    5. @AC - Not quite...

      4.0.1-3-ck has NUMA enabled and uses your patches --> no panics
      4.0.1-4-ck has NUMA enabled and does not use your patches --> no panics
      4.0.1-5-ck has NUMA disabled and uses CK's attempted patches --> panics

      This trend was also confirmed in 4.0.2...
      4.0.2-1-ck has NUMA disabled and uses CK's attempted patches --> panics
      4.0.2-2-ck has NUMA enabled and uses CK's attempted patches --> no panics

      So the only common thread I am seeing is NUMA disabled = panics :/

      Do you still feel that using that commit + CK1 + NUMA disabled would be a worth-while experiment?
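
The "only common thread" reasoning above can be checked mechanically: a factor is a candidate explanation only if every build sharing the same value of that factor also shares the same outcome. The sketch below transcribes the five builds from the comment; the field names and the short patch labels are illustrative, not taken from the PKGBUILDs.

```python
from collections import defaultdict

# Rows transcribed from the comment above; "panics" is the reported outcome.
BUILDS = [
    {"pkg": "4.0.1-3-ck", "numa": True,  "patches": "-gc",          "panics": False},
    {"pkg": "4.0.1-4-ck", "numa": True,  "patches": "no -gc",       "panics": False},
    {"pkg": "4.0.1-5-ck", "numa": False, "patches": "CK attempted", "panics": True},
    {"pkg": "4.0.2-1-ck", "numa": False, "patches": "CK attempted", "panics": True},
    {"pkg": "4.0.2-2-ck", "numa": True,  "patches": "CK attempted", "panics": False},
]


def outcomes_by(rows, column):
    """Group the observed outcomes by the value of one column."""
    groups = defaultdict(set)
    for row in rows:
        groups[row[column]].add(row["panics"])
    return dict(groups)


def predicts_panics(rows, column):
    """True if each value of `column` always coincides with a single outcome."""
    return all(len(seen) == 1 for seen in outcomes_by(rows, column).values())


for column in ("numa", "patches"):
    print(f"{column}: consistent predictor = {predicts_panics(BUILDS, column)}")
```

With these five rows the NUMA setting separates the outcomes cleanly while the patch set does not, which matches graysky's summary. It is still only a correlation over a handful of builds, as CK points out elsewhere in the thread.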

    6. @graysky
      Would you please double-check the PKGBUILDs of -3 and -4? As far as I can see, there are differences in the NUMA config section.

    7. You're correct that there was a minor difference in the code, but in each case, the variable "$_NUMAdisable" was undefined thus the corresponding sed lines did not get called.

      For -3: see that line 19 was not defined, and see line 164, where the sed lines that disable NUMA are called only if the length of the variable is non-zero. Result = NUMA is enabled.

      For -4, I actually commented out line 19 and lines 164-178. Result = NUMA is enabled.

    8. Thanks for the explanation. It seems that you guys have to live with NUMA, but as CK said, it's not reasonable.

  20. Pending/queued/review upstream changes:

    http://marc.info/?l=linux-kernel&m=143101842121867&w=2
    [PATCH] sched/preempt: fix cond_resched_lock() and cond_resched_softirq()

    Jumped right at me while taking a random look at the mailing list ;)

  21. So what is going on? Is the 4.0.x BFS patch causing kernel issues?

  22. Hi ck,

    WOW fast answer!

    Sorry, I also meant: is the 4.0.x kernel by itself a problem, or is this only because of the BFS patch?

  23. Ahhh sorry to hear mate! No worries don't let it get ya down, you've always done a great job, I'm sure you'll get it! :)

    Well, I am compiling 4.0.3 on my box at the moment with these patches applied:

    4.0-sched-bfs-462.patch
    bfs462-rtmn-fix.patch
    bfs462-update_inittask.patch

    And I don't have NUMA compiled in...

    I'll be rebooting with it in just a few minutes so I'll report back anything...

    Cheers

  24. Hi ck,

    Should I use all 3 patches?

    4.0-sched-bfs-462.patch
    bfs462-rtmn-fix.patch
    bfs462-update_inittask.patch

    Also, when are people experiencing the crashes: right on boot-up, or while running X?

    I have an i7-3610QM and so far I have booted it up 3 times and I'm typing in X running it, all running good at this point in time...

    I also use this in my autostart up for Openbox when I log into X;

    sudo schedtool -n -20 -I `pidof X`

    Replies
    1. It's either/or. Either you work perfectly well or crash repeatedly. If it works for you now it won't start crashing.

    2. Hi ck,

      Ok looks good here...

      Would any of my system specs help you, since it runs well here?

      If you need anything let me know...

      Also should I be using all 3 patches?

      4.0-sched-bfs-462.patch
      bfs462-rtmn-fix.patch
      bfs462-update_inittask.patch

      Cheers

  25. Hey ck; for your reference, on my uniprocessor setup, which happens to be using graysky's Arch AUR package, I tested using NUMA enabled to see if it also resolved my boot panic like SMP does, in case they might be related. But no...I'm running 4.0.3 now: NUMA off, SMP/HT on.

  26. Hi ck,

    I'm back, same mate from yesterday, and today X locked up on me. So looks like it is giving me some problems...

    Hmm crap, guess I'll stick with 3.19x until this gets worked out.

    Keep up the good work!

  27. I'm now going back to re-test 3.19.8, to see whether 4.0.x really is so much more unreliable with TuxOnIce. At least it is for one person (me).

    BR, Manuel Krause

    Replies
    1. OK, this issue is solved for me now. And it's most probably not related to BFS/CK (maybe triggered more often, though).
      I was able to get rid of the unreliability of TuxOnIce {resume often hanging @"Doing atomic copy/restore"} by changing .config options related to my graphics driver:
      The only working combination is to compile DRM into the kernel and i915 as a module (not both into kernel and not both as modules).
      Tested on 4.0.4+BFS, 4.0.4 with -gc branch and 3.19.8 with -gc.

      Best regards,
      Manuel Krause

  28. 4.0.5 release removed rt_mutex_check_prio() in favour of rt_mutex_get_effective_prio(). I've fixed BFS for it here:

    https://github.com/pfactum/pf-kernel/commit/e32654bb6748455fc112ac6868bec0f9de67c061

    Replies
    1. Thank you very much! It works well for me on top of Alfred's current -gc.

      BTW, can someone check whether this commit is also worth applying to BFS/CK ?:
      "sched: always use blk_schedule_flush_plug in io_schedule_out"
      https://github.com/torvalds/linux/commit/22f546a33bac11aea8af5e570f296234ecdd60d4

      BR, Manuel Krause

    2. I meant "adapting" or "adopting" rather than "applying". Now you know what I wanted to express. ;-)

  29. Ck-patchset for 4.1, please

  30. Any progress on fixing the issues with the patch? I haven't had any problems with the patches because I've always had NUMA enabled (enabled by default on ubuntu configs), but I would like to see this issue resolved so we can move on to kernel 4.1.

  31. I have ported BFS0462 to 4.1, please check it out at http://cchalpha.blogspot.com/2015/06/time-to-have-fun-with-kernel-41.html and have a try.

    BR Alfred

    Replies
    1. I've used all 22 patches from your repository for kernel 4.1.3:

      Kernel is running fine on OpenMandriva and ROSA Desktop
      http://mib.pianetalinux.org/forum/viewtopic.php?f=38&t=4602

  32. What is the easiest way to get the patch(es)?

    Galen

    Replies
    1. I should have been more specific. I meant, what is the easiest way to get your patches, Alfred?

      Galen

    2. Learn some Git!? :)
      Next time maybe I can spend some time to pack them in a single patch file for download.

    3. Thanks for your reply. I have experimented a bit with git, but I don't use it frequently, so I tend to forget. ;) I do some rpm packages for PCLinuxOS, so it is much more convenient to have a versioned patch file that can be added to the src.rpm.
      Thanks for all of your work. I will likely just wait for the next official -ck release.

      Galen

  33. Can we at least get a status update please?

    Replies
    1. Sure. Nothing's happened for 4.1, and no progress has been made on fixing the non-NUMA build bug. As usual, when I start working on it I will finish working on it shortly afterwards, but so far I've been too busy to do anything.

    2. Thank you. Much appreciated!

  34. Can anyone tell me whether BFS on 6 cores (Intel 5930) is better than the other options?

  35. Does anyone have benchmarks of CFS vs BFS on 6 physical cores? (Not old benchmarks.)

    Replies
    1. Send me a machine with 6 cores and I would be glad to test it for you :) The benchmarks I published nearly 3 years ago include a dual quad machine with and without HT enabled. If you are so inclined, you may use these underlying bash scripts to benchmark your 6-core machine and post the data here. I am happy to plot the results for you.
