Announcing an updated BFS patch for linux 3.7, version 0.427
Full patch:
http://ck.kolivas.org/patches/bfs/3.0/3.7/3.7-sched-bfs-427.patch
Incremental patch from bfs 426 (applies to 3.7.x-ck1 as well):
http://ck.kolivas.org/patches/bfs/3.0/3.7/3.7-bfs426-427.patch
The full set of incremental patches, along with a description within each patch is here:
http://ck.kolivas.org/patches/bfs/3.0/3.7/incremental/
A number of minor issues have been reported with BFS over time (interestingly, none of them appear to be new). Some of them were cosmetic, like the reported suspicious RCU warning on startup, and the accounting for CPU-bound tasks close to 100% flicking between 99% and 101%.
The most interesting actual bug was that clock_nanosleep and timer_create would not work when used with the clock ID CLOCK_PROCESS_CPUTIME_ID. This is a timer that goes off based on the total CPU used by a process or its thread group, something I have never used myself, nor was I aware of its intricacies. The bug was only picked up by Olivier Langlois as part of building and testing glibc. It was an interesting bug for a number of reasons. First, it had never manifested anywhere in the wild as far as I'm aware, despite being a POSIX 2001 function, so presumably it is almost never used. Second, it's one of the few functions that tries to get accounting as a total of all the CPU used by a thread group rather than just per thread. Third, you cannot really use clock_nanosleep with this clock ID unless it is done from a separate thread to the one consuming CPU (since it puts the calling thread to sleep), so there would be precious few scenarios where it would be in use currently, though coding multithreaded apps that use it for resource monitoring and control would make complete sense. Finally, the most interesting part is that I can now tell it had been in BFS since its first release and no one had ever noticed, as far as I'm aware.
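For illustration, here is a minimal sketch of the scenario described above (this example is not from the original report; the file name and helper names are arbitrary): one thread spins consuming CPU while another sleeps on the process CPU-time clock, which is why the sleep has to be issued from a thread other than the one doing the work.

/* cputime_sleep.c - minimal sketch, not from the original report.
 * One thread burns CPU; the main thread sleeps on CLOCK_PROCESS_CPUTIME_ID
 * until the whole thread group has consumed roughly 2 more seconds of CPU.
 * Build with: gcc -pthread cputime_sleep.c (add -lrt on older glibc). */
#define _GNU_SOURCE
#include <pthread.h>
#include <stdio.h>
#include <string.h>
#include <time.h>

static void *burn_cpu(void *arg)
{
    (void)arg;  /* unused */
    /* Consume CPU so the process CPU-time clock keeps advancing. */
    for (;;)
        ;
    return NULL;
}

int main(void)
{
    pthread_t worker;
    struct timespec ts = { .tv_sec = 2, .tv_nsec = 0 };
    int err;

    pthread_create(&worker, NULL, burn_cpu, NULL);

    /* Relative sleep on the process-wide CPU clock. This blocks the
     * calling thread, so the CPU consumption must come from elsewhere
     * in the thread group (here, the worker thread). */
    err = clock_nanosleep(CLOCK_PROCESS_CPUTIME_ID, 0, &ts, NULL);
    if (err)
        fprintf(stderr, "clock_nanosleep: %s\n", strerror(err));
    else
        printf("thread group consumed ~2 more seconds of CPU time\n");

    return 0;
}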
Unfortunately it took me quite a while to find, since I had to dig deep into figuring out how the whole system of timers works at a low level in the kernel before finally stumbling across one tiny piece of accounting/reporting that was missing in BFS. It's funny that a bug that directly affected almost no one should be so hard to track down. In the meantime it allowed me to tweak a number of bits of internal accounting, so hopefully that has improved as well.
Please enjoy.
Thanks for your devotion and vigilance, CK. I will of course run a head-to-head comparison of 0.426 to 0.427 with the usual 'make' benchmark for you and post the results shortly.
Thanks so much for this new update!
Really love your kernel mods. Thank you for your hard work, Con.
No differences between v0.426 and v0.427. I can post the data if anyone is interested.
People are complaining about glibc failing to compile under -ck kernels: https://bbs.archlinux.org/viewtopic.php?id=154594
Any idea?
That is precisely the bug fixed in this release.
This is the patch that fixes it:
bfs426-fix_cpu_posix_timers.patch
It should apply to older BFS as well.
Understood, thanks.
@graysky: Could you please also benchmark the latest BFQ released today, which includes the new Early Queue Merge feature? Thanks
No idea what you're talking about.
1) BFQ is an I/O scheduler, not a CPU scheduler.
2) There has been no new release of BFQ as far as I can tell.
It's a new addon patch for the BFQ I/O scheduler. You can find it here, if you're interested:
https://groups.google.com/forum/?fromgroups=#!topic/bfq-iosched/H7PCRpreLAQ
Excellent work! And I love having correct accounting of process times again!
I observe the proprietary nvidia settings GUI showing a less heated graphics processor in my little Mac mini PC!?
Greetings from cold and cloudy Hamburg,
Ralph Ulrich
Oops, got a time accounting overflow again: it began at a kworker thread and then consecutive new threads.
Is there correctly working initialization of the time spent by new jobs in place already? Perhaps it initializes with a random memory value, which would explain why this occurs after some hours?
Ralph Ulrich
If I could tell you, I could fix it.
By the way, if I can't reproduce it, it's very difficult to debug a problem, and time is always an issue. However, since I have had time recently, if you could email me with more details we might be able to debug it together, if you're willing to try various patches.
I have sent you my .config
I have boosted RCU and can reduce time accounting overflows.
Ralph Ulrich
I am doing just fine without time accounting overflows now, by tuning RCU further:
zcat /proc/config.gz |grep RCU
# RCU Subsystem
CONFIG_TREE_PREEMPT_RCU=y
CONFIG_PREEMPT_RCU=y
# CONFIG_RCU_USER_QS is not set
CONFIG_RCU_FANOUT=32
CONFIG_RCU_FANOUT_LEAF=4
# CONFIG_RCU_FANOUT_EXACT is not set
# CONFIG_TREE_RCU_TRACE is not set
CONFIG_RCU_BOOST=y
CONFIG_RCU_BOOST_PRIO=14
CONFIG_RCU_BOOST_DELAY=320
CONFIG_HAVE_RCU_USER_QS=y
# CONFIG_PROVE_RCU_DELAY is not set
# CONFIG_SPARSE_RCU_POINTER is not set
CONFIG_RCU_CPU_STALL_TIMEOUT=32
CONFIG_RCU_CPU_STALL_VERBOSE=y
CONFIG_RCU_CPU_STALL_INFO=y
# CONFIG_RCU_TRACE is not set
Con, if you want to reproduce it you could try RCU config variations in the opposite direction. Besides that, I also use the deadline I/O scheduler and do not set NO_HZ.
Ralph Ulrich
I tried many .config variations, but get a time accounting overflow anyway after some runtime. No luck so far.
Out of curiosity I tried to apply the BFS patch to the new Linux 3.8-rc:
- many line moves
- 5 fuzzy patches going through
- 2 rejects at ondemand.c
I can observe this overflow happening when compiling stuff for my Gentoo installation. Activity in the ondemand module might be involved when the time accounting failure occurs.
Maybe the next Linux version is error free :)
Greetings from Hamburg, Ralph Ulrich
Is there a patch anywhere for kernel 2.6.22?
The oldest version of BFS is 0.414 and is for kernel 3.1. There's nothing available for older kernels.
DeleteThanks for your updates, CK. As I reported suspicious_rcu_warn issue before, so I retest it with 427. But another dead lock issue is found. I will just post the log here.
[ 0.124430] ======================================================
[ 0.125331] [ INFO: possible circular locking dependency detected ]
[ 0.126233] 3.7.7+ #114 Not tainted
[ 0.126666] -------------------------------------------------------
[ 0.126666] BFS/0/1 is trying to acquire lock:
[ 0.126666] (sched_domains_mutex){+.+.+.}, at: [] sched_init_smp+0x129/0x2b2
[ 0.126666]
[ 0.126666] but task is already holding lock:
[ 0.126666] (&grq.lock){-.....}, at: [] sched_init_smp+0x101/0x2b2
[ 0.126666]
[ 0.126666] which lock already depends on the new lock.
[ 0.126666]
[ 0.126666]
[ 0.126666] the existing dependency chain (in reverse order) is:
[ 0.126666]
[ 0.126666] -> #1 (&grq.lock){-.....}:
[ 0.126666] [] lock_acquire+0x96/0xc0
[ 0.126666] [] _raw_spin_lock_irqsave+0x53/0x90
[ 0.126666] [] get_page_from_freelist+0x587/0x640
[ 0.126666] [] __alloc_pages_nodemask+0xf6/0x820
[ 0.126666] [] new_slab+0x67/0x270
[ 0.126666] [] __slab_alloc.isra.59.constprop.64+0x14d/0x242
[ 0.126666] [] __kmalloc+0x10c/0x150
[ 0.126666] [] pcpu_mem_zalloc+0x2b/0x80
[ 0.126666] [] pcpu_extend_area_map+0x31/0x120
[ 0.126666] [] pcpu_alloc+0x23b/0x9f0
[ 0.126666] [] __alloc_percpu+0xb/0x10
[ 0.126666] [] build_sched_domains+0x66/0xaf0
[ 0.126666] [] sched_init_smp+0x70/0x2b2
[ 0.126666] [] kernel_init_freeable+0x86/0x18b
[ 0.126666] [] kernel_init+0x9/0x100
[ 0.126666] [] ret_from_fork+0x7c/0xb0
[ 0.126666]
[ 0.126666] -> #0 (sched_domains_mutex){+.+.+.}:
[ 0.126666] [] __lock_acquire+0x1cef/0x1db0
[ 0.126666] [] lock_acquire+0x96/0xc0
[ 0.126666] [] mutex_lock_nested+0x72/0x3d0
[ 0.126666] [] sched_init_smp+0x129/0x2b2
[ 0.126666] [] kernel_init_freeable+0x86/0x18b
[ 0.126666] [] kernel_init+0x9/0x100
[ 0.126666] [] ret_from_fork+0x7c/0xb0
[ 0.126666]
[ 0.126666] other info that might help us debug this:
[ 0.126666]
[ 0.126666] Possible unsafe locking scenario:
[ 0.126666]
[ 0.126666] CPU0 CPU1
[ 0.126666] ---- ----
[ 0.126666] lock(&grq.lock);
[ 0.126666] lock(sched_domains_mutex);
[ 0.126666] lock(&grq.lock);
[ 0.126666] lock(sched_domains_mutex);
[ 0.126666]
[ 0.126666] *** DEADLOCK ***
[ 0.126666]
[ 0.126666] 1 lock held by BFS/0/1:
[ 0.126666] #0: (&grq.lock){-.....}, at: [] sched_init_smp+0x101/0x2b2
[ 0.126666]
[ 0.126666] stack backtrace:
[ 0.126666] Pid: 1, comm: BFS/0 Not tainted 3.7.7+ #114
[ 0.126666] Call Trace:
[ 0.126666] [] print_circular_bug+0x1fb/0x20c
[ 0.126666] [] __lock_acquire+0x1cef/0x1db0
[ 0.126666] [] ? sched_clock_cpu+0xa8/0x120
[ 0.126666] [] lock_acquire+0x96/0xc0
[ 0.126666] [] ? sched_init_smp+0x129/0x2b2
[ 0.126666] [] mutex_lock_nested+0x72/0x3d0
[ 0.126666] [] ? sched_init_smp+0x129/0x2b2
[ 0.126666] [] ? sched_init_smp+0x129/0x2b2
[ 0.126666] [] sched_init_smp+0x129/0x2b2
[ 0.126666] [] ? native_smp_cpus_done+0xa2/0xab
[ 0.126666] [] kernel_init_freeable+0x86/0x18b
[ 0.126666] [] ? schedule_tail+0x87/0x130
[ 0.126666] [] ? schedule_tail+0x48/0x130
[ 0.126666] [] ? rest_init+0x140/0x140
[ 0.126666] [] kernel_init+0x9/0x100
[ 0.126666] [] ret_from_fork+0x7c/0xb0
[ 0.126666] [] ? rest_init+0x140/0x140
Hi, CK. Regarding the accounting problem, I am not sure if this is the one you are trying to solve, but I still see it with 427 on the 3.7.7 kernel. To reproduce it:
1. compile the kernel source (I used 'time make -j4')
2. use top to monitor; some 'as' processes will show 9999 %CPU and a huge TIME+,
like this:
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
15120 root 5 0 18860 5924 1168 S 9999 0.1 5124095h as
15111 root 7 0 50556 23m 3784 R 8 0.6 0:00.25 cc1
CK, are you working on a 3.8 patch?
I failed to merge the 3.7 patch into the new kernel. There seem to be some bigger changes in the scheduler interface, probably related to the new RCU stuff... I'm eager to test this with BFS, but there's no way to "brute force" the old code into the new kernel :)
Regarding the RCU stuff... I've experimented with it on my single-core uniprocessor. To start with, there are fewer kernel config options for me -- o.k. But changing the remaining ones did not make any difference in usability. Where before I always used the kernel default of 500:0, I tried 257:10, 331:10 and, most recently, 101:99:
~ # zcat /proc/config.gz |grep RCU
# RCU Subsystem
CONFIG_TINY_PREEMPT_RCU=y
CONFIG_PREEMPT_RCU=y
# CONFIG_TREE_RCU_TRACE is not set
CONFIG_RCU_BOOST=y
CONFIG_RCU_BOOST_PRIO=99
CONFIG_RCU_BOOST_DELAY=101
# CONFIG_PROVE_RCU_DELAY is not set
# CONFIG_SPARSE_RCU_POINTER is not set
# CONFIG_RCU_TRACE is not set
I only take prime numbers for the delay/frequency, as that prevents recurrence. ;-)
Varying these values seems to be useless on systems with one CPU/core. But disabling RCU boost resulted in hiccups in video and audio.
Best regards,
Manuel Krause
BTW, this was done with kernel 3.7.7. Manuel
BFQ v6 for the 3.8 kernel is out, as of 20130219, too:
http://algo.ing.unimo.it/people/paolo/disk_sched/patches/3.8.0-v6/
Best regards,
Manuel Krause
@CK: Do you need help? Is your family doing well? Please tell us how we as a community can support you, wherever needed!
Manuel Krause
We're ok, I'm just too busy right now, sorry. It will come out when it does.
Fine to read that you and your family are o.k.! :-) I hope your brother keeps recovering!
DeleteNo need for a hurry for BFS/CK. We all appreciate your well done work, and of course, only when you'd call it that name.
Best wishes, Manuel