Wednesday 13 April 2011

Scalability of BFS?

So it occurred to me that for some time I've been saying that BFS may scale well only up to about 16 CPUs. That was a fairly generic guess based on the design of BFS, but it appears that these more-thread machines and multi-core machines seem to quite like BFS on the real-world benchmarks I'm getting back from various people. With the latest changes to BFS, which bumped the version up to 0.400, it should have improved further. I've tried googling for links to do with BFS and scalability and the biggest machine I've been able to find that benefits from it is a 24 core machine running F@H (folding at home). Given that this was with an older version of BFS, and that there were actually advantages even at 24 cores, I wonder what the point is where it doesn't scale? Obviously scalability is more than just "running F@H" and will depend entirely on architecture and workload and definition of scalability, and so on, but... I wanted to ask the community what's the biggest machine anyone has tried BFS on, and how well did it perform? If someone had access to 16+ cores to try it out I'd be mighty grateful for your results.

14 comments:

  1. From the overwhelming responses I've gotten on this post, I guess the answer is no...

    ReplyDelete
  2. Sorry mate, I only have 2 cores here. I blogged about your request though. I will try asking a friend of mine who has 20 or so cores on his machine.

    ReplyDelete
  3. Nope, he has less cores than I thought. Sorry. :(

    ReplyDelete
  4. No problem, thanks anyway. I appreciate the gesture :)

    ReplyDelete
  5. Whats the probability of the main runqueue lock contention?

    Do you use a fixed quantum? and whats the overhead in ms of the scheduler?

    Ive skimmed over your code, I think you have a lock on each process in the runqueue, and on the runqueue itself. Could you not use an atomic compare and swap on the processes lock and so ignore the runqueue lock? That would allow multiple CPUs to access the queue without waiting.

    ReplyDelete
  6. There is no meaningful process lock, only the runqueue lock. The runqueue lock contention is the most likely culprit for a scalability limit here since it's a shared runqueue. Atomic compare and swap would be extraordinarily painful to use here because you'd look over a list, and processes would be disappearing all the time onto other CPUs so you'd have to re-check it was still there every time you went back to the task you had chosen if you had not already taken it. The whole design would have to be rewritten from scratch to do so and the multiple checking of the existence of the process each time would greatly increase the overhead in the non-contended case - which is the target audience for BFS as it was designed for commodity hardware and desktops. Lower CPU count machines already show less overhead with BFS than mainline, but I've been unable to test 16x or more. The scheduler overhead cannot be measured in ms but in nanoseconds. Runqueue contention was demonstrable as occurring on the 8x with a load of 1000 but was still not contributing significantly in overall time compared to any other lock in the kernel. I estimated once that lock contention would be relevant from 16 CPUs but that was an arbitrary figure that I didn't have hardware to prove it on. As mentioned, the fact it shows scalability outstripping mainline at 24 CPUs suggests the limit is somewhat beyond that. No doubt when the number of CPUs becomes very high the contention becomes relevant.

    ReplyDelete
  7. Oh and yes I use a fixed quantum of only 6ms, but it is configurable at runtime.

    ReplyDelete
  8. Thankyou for such a quick response. Would you happen to have any technical docs on how bfs work as i would love to read anything on it.

    Just some feedback on your posts. So i guess that you run over the whole list first, deside on the best task. And then allocte it to the cpu? (..will post more in the morning)

    ReplyDelete
  9. Sure,the patch itself has documentation in it. When patched in it will appear in Documentation/scheduler/sched-BFS.txt
    The docs should mostly be up to date with the design.

    ReplyDelete
  10. If it helps - the screen of my top on the Phenom X6 6*3200
    The situation on the screenshot is quite typical. When the load increases - is activated 6-e kernel.

    http://www.rasslabyxa.ru/Change/Phenom_X6_3200.jpg

    ReplyDelete
  11. @VamPiro: I'm not sure what you're trying to tell me, sorry. I see lots of nice -20 tasks running but I can't see anything else.

    ReplyDelete
  12. I have an i7 920 running Gentoo amd64 w/ HT (so 8 cores). Feel free to give a holler on IRC if you need testcases run on this hardware and I'd be more than happy to.

    It's very hard for me to objectively tell the difference. Especially lately I notice very little difference in {C,B}FS. My feeling just from staring at htop is BFS gives slightly better throughput on big compiles.

    ReplyDelete
  13. Thanks Ormaaj. Found someone with access to a 24 core AMD machine and he's kindly offered to do some benchmarks. The results so far are promising :)

    ReplyDelete