Monday, 30 May 2011

2.6.39 BFS progress

TL;DR: 2.6.39 BFS fixed maybe?

After walking away from the code for a while, annoyed at the bug I couldn't track down, I had another good look at what might be happening. It appears that while the grq lock is dropped in schedule() to perform the block plug flush, a wakeup of the task via try_to_wake_up may be missed entirely, leaving the task deactivated when it should actually keep running. Anyway, the first test results from the people in these blog comments are reassuring.
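
The suspected race can be illustrated with a toy, single-threaded model. This is only a sketch under stated assumptions, not the BFS code or the actual patch: the Task class, the state names, the queued flag, and the recheck parameter are all hypothetical, and the "lock" exists only as comments. It shows how a decision made before the lock is dropped can go stale if a wakeup arrives during the flush, and how looking at the task state again after retaking the lock avoids deactivating a task that should keep running.

```python
from dataclasses import dataclass

RUNNING = "running"
BLOCKING = "blocking"  # hypothetical: task has marked itself as about to sleep


@dataclass
class Task:
    state: str = BLOCKING
    queued: bool = True  # True while the task is still eligible to run


def try_to_wake_up(task: Task) -> None:
    """Toy waker: the task is still queued, so it only flips the state back."""
    task.state = RUNNING


def schedule(task: Task, flush, recheck: bool) -> bool:
    """Toy schedule path; returns True if the task stays queued."""
    deactivate = task.state != RUNNING  # decision made with the lock held
    if deactivate:
        # Lock dropped here to flush pending I/O; a wakeup can arrive meanwhile.
        flush(task)
        # Lock retaken here.
        if recheck:
            deactivate = task.state != RUNNING  # look at the state again
        if deactivate:
            task.queued = False
    return task.queued


# Wakeup arrives during the flush; the stale decision deactivates a runnable task.
lost = Task()
schedule(lost, flush=try_to_wake_up, recheck=False)

# Same interleaving with the recheck: the task stays queued.
kept = Task()
schedule(kept, flush=try_to_wake_up, recheck=True)

# No wakeup during the flush: the task is deactivated as intended.
slept = Task()
schedule(slept, flush=lambda t: None, recheck=True)
```

In the first case the task ends up marked running but no longer queued, which in this model is the analogue of a task left deactivated when it should keep running.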

Here is a cleaned-up and slightly modified version of the "test8" patch that has so far been stable and appears to have fixed the problem for a handful of people:

Apply to 2.6.39-ck1 or 2.6.39 with BFS 404:
bfs404-recheck_unplugged.patch

In response to requests for packaged versions, I've uploaded a 2.6.39-ck1-2 Ubuntu package which includes this change:
Ubuntu Packages

Please test and report back! If this fixes the problem, I'll be releasing it as ck2.

28 comments:

  1. Seems good to me. test8 made it through the night with Deluge running and I didn't have any problems building with the updated patch when booted into test8. Running the updated patch now and I have no problems to report.

    Considering how fast the earlier patches crashed and burned for me, I think the probability of this actually being fixed is quite high now.

    You've done an excellent job figuring out a problem that couldn't be reproduced by you locally!

  2. Ralph Ulrich, 31 May 2011 02:47

    test9 called "bfs404-recheck_unplugged.patch" runs cool here: linux-2.6.39-ck1-bfs404t9-q132-r13

  3. This comment has been removed by the author.

  4. Fuck, I just hit some variant of the bug again, resulting in a dpkg deadlock when doing an apt-get upgrade.

    What information from /proc/<pid> is the most useful to you? I notice in /proc/<pid>/status it says "State: D (disk sleep)" and the process is unkillable. The timestamp on status also hasn't changed for 15 minutes so I'm guessing that's when it deadlocked.

    You're definitely on the right track with your latest attempt at fixing it though. There must be some additional corner/edge case you're not seeing yet.

  5. Interesting. I still have the system up and I manually ran "sync" for shits and giggles... the result is that sync is now deadlocked in "disk sleep" as well.

  6. Running fine for 6 hours already.
    kernel26-ck 2.6.39-8 (test8) on archlinux, if it matters.
    Standard usage: qtransmission, browsing, music, video...

  7. @terminx: yes, once something that should be flushing data to disk is blocked, one by one everything that wants to write to disk will also block. Reading the /proc entries of the hung tasks may in turn hang the tasks trying to read them.
    The output of sysrq-p and sysrq-t can be helpful.

  8. hi, previous comment didn't get through...

    just wanted to say that i haven't hit the issue yet with plain 2.6.39-ck1 on five machines which disqualifies me as a tester. ;) admittedly i also use the BFQ2 patch.

    Martin

  9. Thanks everyone for your comments and testing! There's no doubt that adding this patch makes things a lot better without introducing regressions, but 2.6.39 is still not as stable as 2.6.38, given the one report of a remaining hang. There is a small possibility that it's yet another problem, but the patch posted in this thread definitely fixes a real problem. I may release a ck2 anyway but not remove the unstable tag.

  10. Hi Con,

    as anonymous I use BFS with BFQv2. And with your new patch, kernel 2.6.39 seems as good as my "gold 2.6.38.7-zen kernel". No drawbacks to see at the moment; even under heavy IO it looks good.
    I no longer get the high load values (>8) I saw before.
    So thanks again for your bug hunting ;)

    Cu sysitos

  11. If there's an edge case I haven't hit it yet. Chrom{e,ium} hangage reported earlier appears cleared up. I've been up all night hacking away on various things without troubles. Previously the issue showed up within a minute.

  12. My issue with chrom{e,ium} seems solved too, 1 day of uptime and no problems so far :)

  13. @TerminX. By reporting test8 results you confused ck.

    @ck. test9=recheck_unplugged looks really good.

  14. Those weren't test8 results, they were after I rebuilt with the recheck_unplugged patch. I actually reverted to test8 since then and I haven't had the issue pop up again (yet).

  15. test8 and recheck_unplugged should give the same results. I haven't been able to find anything else to blame in that particular part of the code, so I'm tempted to release it as ck2 anyway, pending further enlightenment, just so that ck1, which is very unstable, is not in the wild any more.

  16. Stable after 24 hours on my coreduo2, with wine and emule always open, one MKV movie watched, 2 hours of browsing, music played in Audacious, and some I/O traffic on a USB pen drive.
    If there is a specific stress test we should run on our patched systems, tell us, ck!
    Thank you for your great work!

  17. I previously reported an issue: "deluge with XFS" on a UP machine.

    It works well with the test3 patch. However, when I use the recheck_unplugged patch, the issue happens again. That is weird, because the test3 patch is similar to the recheck_unplugged patch.

  18. Hi, thanks for your report. Is your UP machine running a UP kernel? Is it preempt enabled? Can you please capture the sysrq-t and sysrq-p output when it happens?

  19. Yes, it is a UP kernel with preempt enabled.
    sysrq-log: http://pastebin.com/5V6BRnaw

  20. The bfs404-recheck_unplugged patch completely fixes the problem with processes locking up for me and has been working well for two days now. My system is very responsive. I was able to run Handbrake to encode some videos in the background and play World of Warcraft. While playing, it was as if there was nothing running in the background. I think this is one of the best releases to date if not the best. Thanks for all the work making Linux more suitable on the desktop. I'm looking forward to ck2.

    By the way, if it helps with hunting down any remaining bugs, here is my config file.
    http://tux9656.no-ip.biz/config-2.6.39-ck1.bz2

  21. So things are mostly better, but we still have 2 reports of problems. Hrm. I've tried everything I can think of to reproduce it locally and failed. Well, here's a test9 patch which may well be a regression, but it's worth a shot:

    fs404-test9.patch

  22. Ralph Ulrich, 2 June 2011 23:04

    Con, did you look at the related patches in the upcoming 2.6.39.1-rc1 to get a hint? Although most of them are identical to those in the stable queue for 2.6.38 ....

  23. @Ralph: I did look at them and did not see much that is likely to have any effect in combination with this BFS-related bug, I'm afraid.

  24. test9-patch fixes the "deluge with XFS" issue.
    Thanks.

  25. That's great! Now hopefully the others will also test the test9 patch. I'm hoping this one fixes all the problems.

  26. Ralph Ulrich, 3 June 2011 19:11

    There will be some people only looking for a new blog entry announcing test9!

    I will test today ...

  27. Oh really? Ok, I shall do so then!
