Wednesday, 28 September 2011

BFS 0.411 test with skiplists

Thanks to those who've tested BFS 0.410 and found a few bugs. I've updated the test patch to 0.411 hopefully addressing all the regressions, and uploaded some incremental versions as well:

The bugs fixed in this version were to fix the Uniprocessor builds, fix the poor interactivity due to completely miscalculating deadline, and the suspend/resume regression. Please keep on testing and hopefully I can declare this one a stable release!

EDIT: I have re-tested and confirmed as reported by others, that 411 is indeed slower than 410 and actually slower than the stable 406. This is really disappointing as 410 was only faster by virtue of a bug that would favour throughput (and cause lousy interactivity). I've tried further optimising the code, but alas it appears that for our desktop workloads, skiplists are not the way forward.

EDIT2: Small improvements to be had by adding this patch, but it's still not matching 406 performance for me:


  1. Ask and ye shall receive. Incremental 406-411 in post.

  2. Tested with clear very fine!
    Bug definitly fixed! :)
    Great ck, great BFS!!!!!!!

  3. Many thanks for the correction.
    I ask if it is possible to make Ubuntu Packages with this patch.

  4. I'm getting horrible hiccups with this version. It reminds me of this days of a computer with an ISA bus where more than one piece of hardware wants to use the same IRQ but not be very nice about sharing it.

    Some examples:
    When I click on a input box in Firefox it can take a full second or more for the focus to go there and scrolling up and down on this page is not at all smooth.
    When I start gnome-alsamixer the application is not usable for five seconds, but the menu bar is responsive right away.

    But maybe I didn't apply the correct patch. Is ck1-bfs406-411.patch meant to be applied to 3.0.4-ck1?

  5. !tux9656 yes that's the correct patch. Odd, what hardware configuration? smp/up etc...

  6. I can attest smooth desktop operations with zero latency, even when compiling a kernel in the background using all cores and listening to an mp3 stream at the same time... 411, the best BFS ever? ;)

  7. This comment has been removed by the author.

  8. CK - there are some serious regressions introduced in v0.411 compared to v0.410. I show you data below for just my quad core machine. The benchmark is simply compiling linux v3.0.4 (bzImage modules) and timing the process. I have run 5 replicates for each version of BFS as you can see in the box plots. I have also repeated this experiment on a C2D and a dual quad machine (16 cores) and the treads are consistent: v0.410 > v0.406 > v0.411 > stock kernel.

    You have any thoughts as to which patch you introduced that causes this regression?

    Note: I will compiled v0.410 + the deadline patch and run it now. This way you can potentially eliminate the deadline patch as the cause of the regression.

  9. Hrm interesting. Thanks. The deadline fix patch would lower throughput indeed. However it also is a correct fix for a bug in the design which would inappropriately favour throughput. Gotta keep watching...

  10. @ck - the problem is the deadline fix patch. I applied it to 0.410 and run the benchmark. The results are the same as the results for v0.411. Any thoughts for a path forward?

  11. Yes. Give up on skiplists and just do the cleanup stuff :(

  12. So skiplists and the offset fix are mutually exclusive? What are my other option? In other words, what configure switch would need to be tweaked to avoid that "choppy" behavior with skiplists? ...or is this a forgone conclusion and was your last post a serious statement? :/

  13. It's a foregone conclusion. The offset fix is to fix a bug in the design. If the fixed version is shit, then there's no point sticking with it.

  14. Then again, it's possible that I simply haven't optimised the code paths enough in the skiplists implementation. I'll try and shave every bit off before giving them up, because they really SHOULD be better.

  15. Glad to hear it, CK! I will post my full analysis later today (probably 15-16 h from now after I have time to work-up and review the data). I'd also like to see someone else to repeat my experiments to verify reproducibility. I have provided the needed file and script to do so, read on.

    What I did:
    1) Configured up a linux-3.0.4 source with the default Arch linux .config (on x86_64) and used this as the test code set.

    2) Booted a system into run level 3 with a minimal set of daemons enabled and run a little bash script that times how long it takes to do a "make -jX bzImages modules" and then repeats the process 5 times totally writing the results to ~/data.txt

    3) Run this process on each kernel
    --> stock
    --> bfs v0.406
    --> bfs v0.410
    --> bfs v0.411

    4) Machines used:

    *E5200 (2 cores)
    *X3360 (4 cores)
    *Dual quad (8 cores + 8 HTs = 16 cores)

    5) Optionally study the effect of make flags on the dual quad system under the following conditions:

    1) Hyperthreading enabled
    a) 16 threads
    b) 8 threads
    c) 4 threads
    d) 2 threads

    1) Hyperthreading disabled
    a) 8 threads
    b) 4 threads
    c) 2 threads

    Preconfigured tarball the bash script expects. Note that I have preconfigured for x86_64 only!

    Script (bash) that does the work.

    Simply edit the bash script to reflect the paths you want to use for:
    1) source tarball
    2) temp space for compile

    Post your ~/data.txt and I can generate the boxplots for you or do them yourself :)

  16. Thanks graysky! I tried my best but even in my testing am not able to match the performance of bfs406 with every improvement I can think of, but you can try for yourself with this test patch on top:


  17. Very nice. I'll test it on the three machines and include the results when I post again later today... maybe 12-13 h from now.

  18. It seems the problem I reported with the hiccups might be a false alarm. I upgraded to Xorg 1.11 at the same time that I upgraded to BFS 0.411. After running for a while on the official prepackaged Debian kernel, I have been experiencing these same hiccups. It is just not as severe on the Debian kernel. I'm going to try to see if I can still get the previous Xorg 1.10 packages that were in Debian testing and let you know how BFS 0.411 works with it.

  19. @tux9656: Thanks. I had a feeling something didn't sound right about that. The interactivity should be virtually unchanged.

  20. I got Xorg reverted back to 1.10 and everything seems to be working just like it did with BFS 0.406. I did some reading and it looks like the issue I was having was with the nvidia binary blob. I read that the latest driver that is suppose to be compatible with Xorg 1.11 has some "drawing issues." Next time I'll try to confirm a bug on Debian stable before I send a bug report.

  21. What's the explanation? The original search is not really O(n), n is too small, or what?

  22. @CK - I put that data package together comparing all your bfs patchsets. It's not good news as you mentioned.

    Link to download a pdf.

    So, skiplists in bfs have no future in production it would seem... unless you have some other ideas? I am glad to benchmark future builds as always!

  23. The original O(n) search was heavily optimised and only includes tasks queued but not running (i.e. not those already on CPU). The thing about O(n) is it tells you the complexity of the algorithm, but not really how fast it is overall. A fast O(n) algorithm can be faster than an O(1) algorithm even for all workloads, if the absolute speed of the O(1) is very slow. My guess with these skiplists is the overhead of the extra data structures and the O(log n) insertion is costing much more than any benefit at the search time. Both insertion and removal is more expensive with skiplists, and only lookup is faster.

  24. This means the new skiplist should be faster with wokloads where there are many long running jobs - not many inertions or removals.

    Ralph Ulrich (I dont get my Name in any more without URL)

  25. I use the make -jx benchmark because it's easy to demonstrate and reproduce. Surprisingly it's also a very good make/break test. Every other more complicated benchmark I could think of throwing at the skiplists has also failed to find a single case where it's better. I really wanted it to be be, but it's not.

  26. I've tested it with nexuiz… :-) Alas, 406 is better, meanwhile 411 was laggy all the time.