Tuesday 29 January 2013

BFS 0.427 for linux 3.7.x

Announcing an updated BFS patch for linux 3.7, version 0.427

Full patch:
http://ck.kolivas.org/patches/bfs/3.0/3.7/3.7-sched-bfs-427.patch

Incremental patch from bfs 426 (applies to 3.7.x-ck1 as well):
http://ck.kolivas.org/patches/bfs/3.0/3.7/3.7-bfs426-427.patch

The full set of incremental patches, along with a description within each patch is here:
http://ck.kolivas.org/patches/bfs/3.0/3.7/incremental/

A number of minor issues have been reported with BFS over time (interestingly none of them appear to be new). Some of them were cosmetic, like the reported suspicious rcu warning on startup, and the accounting for close to 100% bound cpu tasks flicking between 99 and 101%.

The most interesting actual bug was that clock_nanosleep, and timer_create would not work when used with the clock id of CLOCK_PROCESS_CPUTIME_ID. This is a timer which goes off based on the total CPU used by a process or its thread group, which I have never used myself nor was aware of its intricacies. This bug was only picked up as part of building and glibc testing by Olivier Langlois. This was an interesting bug for a number of reasons to me. First was that it had never manifested as far as I'm aware anywhere in the wild despite being a posix 2001 function, so presumably it is almost never used. Second is it's one of the few functions that tries to get accounting as a total of all the CPU used by a thread group rather than just per thread. Third is that you cannot really use clock_nanosleep with this clock id unless it is done from a separate thread to the one consuming CPU (since it puts the calling thread to sleep) so there would be precious few scenarios it would be in use currently, though coding multithreaded apps that use it for resource monitoring and control would make complete sense. Finally the most interesting part was I can now tell that it had been in BFS since its first release and no one had ever noticed as far as I'm aware. 

Unfortunately it took me quite a while to find since I had to dig deep into figuring out how the whole system of timers works on a low level in the kernel before finally stumbling across one tiny piece of accounting/reporting that was missing on BFS. It's funny that a bug that directly affected almost no one should be so hard to track down. In the meantime it allowed me to tweak a number of bits of internal accounting so hopefully that should have improved as well.

Please enjoy.
お楽しみください
 
 

29 comments:

  1. Thanks for your devotion and vigilance, CK. I will of course run a head-to-head comparison of 0.426 to 0.427 with the usual 'make' benchmark for you and post the results shortly.

    ReplyDelete
  2. Thanks so much for this new update!

    ReplyDelete
  3. Really love your kernel mods. Thank you for your hard work, Con.

    ReplyDelete
  4. No differences between v0.426 and v0.427. I can post the data if anyone is interested.

    ReplyDelete
  5. People complain about glibc inability to be compiled under -ck kernels: https://bbs.archlinux.org/viewtopic.php?id=154594

    Any idea?

    ReplyDelete
    Replies
    1. That is precisely the bug fixed in this release.

      This is the patch that fixes it:
      bfs426-fix_cpu_posix_timers.patch

      It should apply to older BFS as well.

      Delete
  6. @graysky: Could you please also benchmark the latest BFQ released today, which includes the new Early Queue Merge feature? Thanks

    ReplyDelete
    Replies
    1. No idea what you're talking about.

      1) BFQ is an I/O scheduler not a CPU scheduler.
      2) There has been no new release of BFQ as far as I can tell.

      Delete
    2. It's a new addon patch for the BFQ I/O scheduler. You can find it here, if you're interested:

      https://groups.google.com/forum/?fromgroups=#!topic/bfq-iosched/H7PCRpreLAQ

      Delete
  7. Excellent work! And I love having correctly accounting of process times again!

    I observe the nvidia-proprietary settings gui showing less heated graphics processor in my little mac mini pc !?

    Greeting from cold and cloudy Hamburg,
    Ralph Ulirch

    ReplyDelete
    Replies
    1. Uups, got a time accounting overflow again: began at a kworker thread and then consecutive new threads.

      Is there a correctly working initializing of times spent of new jobs in place already? Perhaps it initializes with a random memory value, which would tell me why this occures after some hours?
      Ralph Ulrich

      Delete
    2. If I could tell you, I could fix it.

      Delete
    3. By the way if I can't reproduce it, it's very difficult to debug a problem, and time is always an issue. However since I have had time recently, if you could email me with more details we might be able to debug it together if you're willing to try various patches.

      Delete
    4. I have sent you my .config
      I have boosted RCU and can reduce time accounting overflows.
      Ralph Ulrich

      Delete
    5. I am just doing fine without time account overflows by tuning further RCU:

      zcat /proc/config.gz |grep RCU
      # RCU Subsystem
      CONFIG_TREE_PREEMPT_RCU=y
      CONFIG_PREEMPT_RCU=y
      # CONFIG_RCU_USER_QS is not set
      CONFIG_RCU_FANOUT=32
      CONFIG_RCU_FANOUT_LEAF=4
      # CONFIG_RCU_FANOUT_EXACT is not set
      # CONFIG_TREE_RCU_TRACE is not set
      CONFIG_RCU_BOOST=y
      CONFIG_RCU_BOOST_PRIO=14
      CONFIG_RCU_BOOST_DELAY=320
      CONFIG_HAVE_RCU_USER_QS=y
      # CONFIG_PROVE_RCU_DELAY is not set
      # CONFIG_SPARSE_RCU_POINTER is not set
      CONFIG_RCU_CPU_STALL_TIMEOUT=32
      CONFIG_RCU_CPU_STALL_VERBOSE=y
      CONFIG_RCU_CPU_STALL_INFO=y
      # CONFIG_RCU_TRACE is not set

      Con, if you want to reproduce you could try RCU config variations in the opposite direction. Beside I also use io deadline queue and not set NO_HZ.
      Ralph Ulrich

      Delete
    6. I tried many .config variations, but get a time accounting overflow anyway after some runtime. No luck so far.

      As of curiousity I tried to Bfs patch the new Linux-3.8rc:
      - many line moves
      - 5 fuzzy patches going through
      - 2 rejects at ondemand.c

      I can observe this overflow happening when compiling staff for my Gentoo installation. Activity in the ondemand module might be involved while the time accounting failure occures.
      Maybe the next Linux version is error free :)

      Greetings from Hamburg, Ralph Ulrich

      Delete
  8. is there a patch anywhere for kernel 2.6.22?

    ReplyDelete
    Replies
    1. The oldest version of BFS is 0.414 and is for kernel 3.1. There's nothing available for older kernels.

      Delete
  9. Thanks for your updates, CK. As I reported suspicious_rcu_warn issue before, so I retest it with 427. But another dead lock issue is found. I will just post the log here.

    [ 0.124430] ======================================================
    [ 0.125331] [ INFO: possible circular locking dependency detected ]
    [ 0.126233] 3.7.7+ #114 Not tainted
    [ 0.126666] -------------------------------------------------------
    [ 0.126666] BFS/0/1 is trying to acquire lock:
    [ 0.126666] (sched_domains_mutex){+.+.+.}, at: [] sched_init_smp+0x129/0x2b2
    [ 0.126666]
    [ 0.126666] but task is already holding lock:
    [ 0.126666] (&grq.lock){-.....}, at: [] sched_init_smp+0x101/0x2b2
    [ 0.126666]
    [ 0.126666] which lock already depends on the new lock.
    [ 0.126666]
    [ 0.126666]
    [ 0.126666] the existing dependency chain (in reverse order) is:

    ReplyDelete
    Replies
    1. [ 0.126666]
      [ 0.126666] -> #1 (&grq.lock){-.....}:
      [ 0.126666] [] lock_acquire+0x96/0xc0
      [ 0.126666] [] _raw_spin_lock_irqsave+0x53/0x90
      [ 0.126666] [] get_page_from_freelist+0x587/0x640
      [ 0.126666] [] __alloc_pages_nodemask+0xf6/0x820
      [ 0.126666] [] new_slab+0x67/0x270
      [ 0.126666] [] __slab_alloc.isra.59.constprop.64+0x14d/0x242
      [ 0.126666] [] __kmalloc+0x10c/0x150
      [ 0.126666] [] pcpu_mem_zalloc+0x2b/0x80
      [ 0.126666] [] pcpu_extend_area_map+0x31/0x120
      [ 0.126666] [] pcpu_alloc+0x23b/0x9f0
      [ 0.126666] [] __alloc_percpu+0xb/0x10
      [ 0.126666] [] build_sched_domains+0x66/0xaf0
      [ 0.126666] [] sched_init_smp+0x70/0x2b2
      [ 0.126666] [] kernel_init_freeable+0x86/0x18b
      [ 0.126666] [] kernel_init+0x9/0x100
      [ 0.126666] [] ret_from_fork+0x7c/0xb0
      [ 0.126666]
      [ 0.126666] -> #0 (sched_domains_mutex){+.+.+.}:
      [ 0.126666] [] __lock_acquire+0x1cef/0x1db0
      [ 0.126666] [] lock_acquire+0x96/0xc0
      [ 0.126666] [] mutex_lock_nested+0x72/0x3d0
      [ 0.126666] [] sched_init_smp+0x129/0x2b2
      [ 0.126666] [] kernel_init_freeable+0x86/0x18b
      [ 0.126666] [] kernel_init+0x9/0x100
      [ 0.126666] [] ret_from_fork+0x7c/0xb0
      [ 0.126666]
      [ 0.126666] other info that might help us debug this:
      [ 0.126666]
      [ 0.126666] Possible unsafe locking scenario:
      [ 0.126666]
      [ 0.126666] CPU0 CPU1
      [ 0.126666] ---- ----
      [ 0.126666] lock(&grq.lock);
      [ 0.126666] lock(sched_domains_mutex);
      [ 0.126666] lock(&grq.lock);
      [ 0.126666] lock(sched_domains_mutex);
      [ 0.126666]
      [ 0.126666] *** DEADLOCK ***
      [ 0.126666]
      [ 0.126666] 1 lock held by BFS/0/1:
      [ 0.126666] #0: (&grq.lock){-.....}, at: [] sched_init_smp+0x101/0x2b2
      [ 0.126666]
      [ 0.126666] stack backtrace:
      [ 0.126666] Pid: 1, comm: BFS/0 Not tainted 3.7.7+ #114
      [ 0.126666] Call Trace:
      [ 0.126666] [] print_circular_bug+0x1fb/0x20c
      [ 0.126666] [] __lock_acquire+0x1cef/0x1db0
      [ 0.126666] [] ? sched_clock_cpu+0xa8/0x120
      [ 0.126666] [] lock_acquire+0x96/0xc0
      [ 0.126666] [] ? sched_init_smp+0x129/0x2b2
      [ 0.126666] [] mutex_lock_nested+0x72/0x3d0
      [ 0.126666] [] ? sched_init_smp+0x129/0x2b2
      [ 0.126666] [] ? sched_init_smp+0x129/0x2b2
      [ 0.126666] [] sched_init_smp+0x129/0x2b2
      [ 0.126666] [] ? native_smp_cpus_done+0xa2/0xab
      [ 0.126666] [] kernel_init_freeable+0x86/0x18b
      [ 0.126666] [] ? schedule_tail+0x87/0x130
      [ 0.126666] [] ? schedule_tail+0x48/0x130
      [ 0.126666] [] ? rest_init+0x140/0x140
      [ 0.126666] [] kernel_init+0x9/0x100
      [ 0.126666] [] ret_from_fork+0x7c/0xb0
      [ 0.126666] [] ? rest_init+0x140/0x140

      Delete
  10. Hi, CK. For the accounting problem, I am not sure if this one is the one you trying to solve. But I still find this with 427 with 3.7.7 kernel. To reproduce it:
    1. compile the kernel source, (I used 'time make -j4')
    2. use top to monitoring, some as processes will show with 9999 %CPU and a huge TIME+

    like this

    PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
    15120 root 5 0 18860 5924 1168 S 9999 0.1 5124095h as
    15111 root 7 0 50556 23m 3784 R 8 0.6 0:00.25 cc1

    ReplyDelete
  11. ck are you working on a 3.8 patch ?
    I failed to merge the 3.7 patch into the new kernel. There seem to be some bigger changes in the scheduler interface. Probably related to the new rcu stuff... I'm eager to test this with bfs but no way to "brute force" the old code into the new kernel :)

    ReplyDelete
    Replies
    1. Regarding the RCU stuff... I've experimented with it on my unicore uniprocessor. At first there are less kernel config options for me -- o.k. But changing the remaining rest of them did not make any difference in usability. Before always using kernel standard 500:0, I did 257:10, 331:10, last one, 101:99:

      ~ # zcat /proc/config.gz |grep RCU
      # RCU Subsystem
      CONFIG_TINY_PREEMPT_RCU=y
      CONFIG_PREEMPT_RCU=y
      # CONFIG_TREE_RCU_TRACE is not set
      CONFIG_RCU_BOOST=y
      CONFIG_RCU_BOOST_PRIO=99
      CONFIG_RCU_BOOST_DELAY=101
      # CONFIG_PROVE_RCU_DELAY is not set
      # CONFIG_SPARSE_RCU_POINTER is not set
      # CONFIG_RCU_TRACE is not set

      Only taking prime numbers in delay/frequency as it prevents reoccurrence. ;-)

      Varying these values seems to be useless to systems with one CPU/Core. But disabling RCU Boost resulted in hiccups in video and audio.

      Best regards,

      Manuel Krause

      Delete
    2. BTW, this was done with kernel 3.7.7. Manuel

      Delete
  12. BFQ v6 for the 3.8 kernel is out, as of 20130219, too:
    http://algo.ing.unimo.it/people/paolo/disk_sched/patches/3.8.0-v6/

    Best regards,
    Manuel Krause

    ReplyDelete
  13. @ CK: Do you need help? Is your family well up? Please tell us, how we as community can support you, whereever needed!

    Manuel Krause

    ReplyDelete
    Replies
    1. We're ok, I'm just too busy right now, sorry. It will come out when it does.

      Delete
    2. Fine to read that you and your family are o.k. ! :-) I hope, your brother keeps recovering!

      No need for a hurry for BFS/CK. We all appreciate your well done work, and of course, only when you'd call it that name.

      Best wishes, Manuel

      Delete