TL;DR: 2.6.39 BFS fixed maybe?
After walking away from the code for a while, annoyed at the bug I couldn't track down, I had another good look at what might be happening. It appears that while the grq lock is dropped in schedule() to perform the block plug flush, a call to the task via try_to_wake_up may be missed entirely, leaving the task deactivated when it should actually keep running. Anyway, first tests from the people on these blog comments are reassuring.
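A rough user-space sketch of that race, not the actual BFS code: everything here (`struct task`, `wake_up_task`, `flush_plug_unlocked`, `schedule_step`) is a hypothetical stand-in, and it assumes that a wakeup arriving while the task is still queued only resets its state. The `recheck` flag models re-reading the task state after the lock is re-taken, which is what the patch name suggests but is an assumption about how the fix works.

```c
#include <stdbool.h>

/* Hypothetical, simplified stand-ins for kernel task states. */
enum task_state { TASK_RUNNING, TASK_UNINTERRUPTIBLE };

struct task {
    enum task_state state;
    bool queued;            /* true while the task is on the runqueue */
};

/* Models a try_to_wake_up() that arrives while the lock is dropped.
 * Assumption: a task that is still queued only gets its state reset. */
static void wake_up_task(struct task *t)
{
    t->state = TASK_RUNNING;
    if (!t->queued)
        t->queued = true;
}

/* Models the unlocked window used for the block plug flush. */
static void flush_plug_unlocked(struct task *t, bool wakeup_races)
{
    /* the grq lock would be dropped here */
    if (wakeup_races)
        wake_up_task(t);
    /* the grq lock would be re-taken here */
}

/* One simplified pass through schedule() for the outgoing task.
 * Returns true if the task is still queued afterwards. */
static bool schedule_step(struct task *t, bool wakeup_races, bool recheck)
{
    /* Decision made before the lock is dropped. */
    bool deactivate = (t->state != TASK_RUNNING);

    if (deactivate) {
        flush_plug_unlocked(t, wakeup_races);
        /* Without the recheck, the stale decision is acted on. */
        if (recheck)
            deactivate = (t->state != TASK_RUNNING);
        if (deactivate)
            t->queued = false;
    }
    return t->queued;
}
```

In this model, a sleeping task that is woken during the flush ends up dequeued yet marked TASK_RUNNING when `recheck` is false, so nothing will ever wake it again; with `recheck` true it stays queued.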
Here is a cleaned up and slightly modified version of the "test8" patch that has so far been stable and appears to have fixed the problem for a handful of people:
Apply to 2.6.39-ck1 or 2.6.39 with BFS 404:
bfs404-recheck_unplugged.patch
In response to requests for packaged versions, I've uploaded a 2.6.39-ck1-2 Ubuntu package which includes this change:
Ubuntu Packages
Please test and report back! If this fixes the problem, I'll be releasing it as ck2.
Seems good to me. test8 made it through the night with Deluge running and I didn't have any problems building with the updated patch when booted into test8. Running the updated patch now and I have no problems to report.
Considering how fast the earlier patches crashed and burned for me, I think the probability of this actually being fixed is quite high now.
You've done an excellent job figuring out a problem that couldn't be reproduced by you locally!
test9 called "bfs404-recheck_unplugged.patch" runs cool here: linux-2.6.39-ck1-bfs404t9-q132-r13
Fuck, I just hit some variant of the bug again, resulting in a dpkg deadlock when doing an apt-get upgrade.
What information from /proc/<pid> is the most useful to you? I notice in /proc/<pid>/status it says "State: D (disk sleep)" and the process is unkillable. The timestamp on status also hasn't changed for 15 minutes so I'm guessing that's when it deadlocked.
You're definitely on the right track with your latest attempt at fixing it though. There must be some additional corner/edge case you're not seeing yet.
Interesting. I still have the system up and I manually ran "sync" for shits and giggles... the result is that sync is now deadlocked in "disk sleep" as well.
Running fine for 6 hours already.
kernel26-ck 2.6.39-8 (test8) on archlinux if it matters.
Standard usage: qtransmission, browsing, music, video...
@terminx: yes, once something that should be flushing data to disk is blocked, one by one everything that wants to write to disk will also block. Reading the proc entries may itself cause further hangs in the tasks trying to read them.
The output of sysrq-p and sysrq-t can be helpful.
wfm (yet :)
hi, previous comment didn't get through...
just wanted to say that i haven't hit the issue yet with plain 2.6.39-ck1 on five machines which disqualifies me as a tester. ;) admittedly i also use the BFQ2 patch.
Martin
Thanks everyone for your comments and testing! There's no doubt that adding this patch makes things a lot better with no regressions, but it's still not as stable as 2.6.38 given the one report of regressions. There is a small possibility that it's yet another problem, but this patch posted on this thread definitely fixes a real problem. I may release a ck2 anyway but not remove the unstable tag.
Hi Con,
as anonymous I use BFS with BFQv2. And with your new patch kernel 2.6.39 seems as good as my "gold 2.6.38.7-zen kernel". No drawbacks to see at the moment, even under heavy IO it looks good.
Don't get the high load values >8 as before.
So thanks again for your bug hunting ;)
Cu sysitos
If there's an edge case I haven't hit it yet. Chrom{e,ium} hangage reported earlier appears cleared up. I've been up all night hacking away on various things without troubles. Previously the issue showed up within a minute.
My issue with chrom{e,ium} seems solved too, 1 day of uptime and no problems so far :)
@TerminX. By reporting test8 results you confused ck.
@ck. test9=recheck_unplugged looks really good.
Those weren't test8 results, they were after I rebuilt with the recheck_unplugged patch. I actually reverted to test8 since then and I haven't had the issue pop up again (yet).
test8 and recheck_unplugged should give the same results. I haven't been able to find anything else to blame in that particular part of the code so I'm tempted to release it as ck2 anyway just so that ck1 which is very unstable is not in the wild any more, pending further enlightenment.
Stable after 24H on my coreduo2, with wine and emule always open, 1 mkv movie watched, 2 hours of browsing, music played in audacious, and some I/O traffic to a USB pen drive.
If we should run some specific stress test on our patched systems, just tell us, ck!
Thank you for your great work!
I previously reported an issue: "deluge with XFS" on a UP machine.
It works well with test3-patch. However, when I use recheck_unplugged-patch, the issue happens again. It is weird because test3-patch is similar to recheck_unplugged-patch.
Hi thanks for your report. Is your UP machine running a UP kernel? Is it preempt enabled? Can you get a backtrace when it happens please, of sysrq-t and sysrq-p ?
Yes, it is a UP kernel with preempt enabled.
sysrq-log: http://pastebin.com/5V6BRnaw
The bfs404-recheck_unplugged patch completely fixes the problem with processes locking up for me and has been working well for two days now. My system is very responsive. I was able to run Handbrake to encode some videos in the background and play World of Warcraft. While playing, it was as if there was nothing running in the background. I think this is one of the best releases to date if not the best. Thanks for all the work making Linux more suitable on the desktop. I'm looking forward to ck2.
By the way, if it helps with hunting down any remaining bugs, here is my config file.
http://tux9656.no-ip.biz/config-2.6.39-ck1.bz2
So things are mostly better, but there are still 2 reports of problems. Hrm. I've tried everything I can think of to reproduce it locally and failed. Well, here's a test9 patch which may well be a regression but it's worth a shot:
fs404-test9.patch
Con, did you look at the related patches in the upcoming 2.6.39.1-rc1 to get a hint? Although most of them are identical to those in the stable-queue for 2.6.38 ....
@Ralph: I did look at them and did not see much that is likely to have any effect in combination with this bfs related bug I'm afraid.
test9-patch fixes the "deluge with XFS" issue.
Thanks.
That's great! Now hopefully the others will also test the test9 patch. I'm hoping this one fixes all the problems.
There will be some people only looking for a new blog entry announcing test9!
I will test today ...
Oh really? Ok, I shall do so then!