Optimization Tricks used by the Lockless Memory Allocator

  • This is very similar to a multithreaded memory allocator I implemented many years ago: http://people.cs.vt.edu/~scschnei/streamflow/

    Full paper: http://people.cs.vt.edu/~scschnei/papers/ismm06.pdf I also have an incomplete extension to that paper with more algorithmic details. I can clean it up and send it to interested parties.

    Source code: http://github.com/scotts/streamflow/

    There's also an excellent paper recently published at PACT 2011 that extends and improves on what we did previously, but the paper is not available anywhere yet: http://aces.snu.ac.kr/Center_for_Manycore_Programming/SFMall... I have a copy of the paper and can email it to interested people.

    Edit after looking through his code: I think he has locks on the main execution path for malloc. That violates one of our design principles, which was to avoid locking on the main path and to use only lock-free synchronization for most operations there. (Allocations that hit the page-block allocation in our work hit a lock. The SFMalloc allocator I mention above takes those locks out and sees a performance improvement.)
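
    To make the "lock-free synchronization on the main path" idea concrete, here is a minimal sketch of the kind of CAS-based free list such allocators keep per size class (a classic Treiber stack). This is illustrative only, not code from Streamflow or SFMalloc, and it deliberately ignores the ABA problem, which a real allocator must handle with tagged/versioned pointers:

    ```c
    #include <stdatomic.h>
    #include <stddef.h>

    /* A freed block is reused to hold the free-list link. */
    typedef struct block { struct block *next; } block_t;

    typedef struct {
        _Atomic(block_t *) head;   /* top of the per-size-class free list */
    } freelist_t;

    /* free() fast path: push the block with a CAS loop, no locks. */
    static void freelist_push(freelist_t *fl, block_t *b) {
        block_t *old = atomic_load_explicit(&fl->head, memory_order_relaxed);
        do {
            b->next = old;
        } while (!atomic_compare_exchange_weak_explicit(
                     &fl->head, &old, b,
                     memory_order_release, memory_order_relaxed));
    }

    /* malloc() fast path: pop a block, or NULL if the list is empty
       (the caller would then fall back to the slower page-block path).
       NOTE: reading old->next here is where ABA bites in production;
       this sketch omits the mitigation for brevity. */
    static block_t *freelist_pop(freelist_t *fl) {
        block_t *old = atomic_load_explicit(&fl->head, memory_order_acquire);
        while (old &&
               !atomic_compare_exchange_weak_explicit(
                   &fl->head, &old, old->next,
                   memory_order_acquire, memory_order_acquire))
            ;
        return old;
    }
    ```

    The point of the design principle is that both of these hot-path operations retire in a handful of instructions with no lock acquisition, so threads only contend on the CAS itself.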

  • A lock-free slab allocator for smaller blocks is frequently all that's needed, especially in STL-heavy C++ code.

    There are typically a heck of a lot of allocations of 0-32 byte blocks, about half as many of 33-64, half again of 65-128, and so on, meaning that optimizing allocation of blocks smaller than 512 bytes yields 80-90% of the speed gain achievable from a better allocator.
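
    A quick sanity check of that arithmetic, using made-up counts that follow the halving pattern described above (each size class sees half the allocations of the previous one):

    ```c
    /* Hypothetical histogram: class i covers blocks up to 32 << i bytes
       and sees half as many allocations as class i-1. Returns the
       percentage of all allocations that land in classes <= 512 bytes. */
    static double small_block_share(void) {
        long counts[10];
        long total = 0, small = 0;
        counts[0] = 1L << 20;   /* made-up base count for the 0-32 byte class */
        for (int i = 0; i < 10; i++) {
            if (i) counts[i] = counts[i - 1] / 2;
            total += counts[i];
            if ((32L << i) <= 512)   /* the first five classes */
                small += counts[i];
        }
        return 100.0 * small / total;
    }
    ```

    With strict halving the classes up to 512 bytes capture well over 90% of all allocations, consistent with the claim that handling just the small sizes gets you most of the achievable gain; real histograms taper less cleanly, hence the 80-90% figure.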

    In practical terms this translates into a simpler implementation: just set up a bunch of slab buckets, one for 0-16 byte blocks, the next for up to 32, the next for up to 64, etc., and pass larger allocation requests to the default malloc. Exact size ranges are easy to determine by running the app with a proxy allocator and looking at the block-size histogram. I did this with several projects, and in all cases the histograms stabilized very quickly. The tricky part is lock-free management of the slabs, especially the disposal, but it is not rocket science by any means.
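
    The bucket layout itself is the easy part. A sketch of the dispatch, assuming the size limits named above (16, 32, 64, ... up to 512, with larger requests falling through to the default malloc); `bucket_for` is a hypothetical helper, and the actual slab machinery behind each bucket is elided:

    ```c
    #include <stddef.h>

    #define NBUCKETS 6

    /* Upper bounds of the slab buckets: 0-16, 17-32, 33-64, 65-128,
       129-256, 257-512 bytes. Larger requests go to the default malloc. */
    static const size_t bucket_limit[NBUCKETS] = { 16, 32, 64, 128, 256, 512 };

    /* Map a request size to a bucket index, or -1 for "fall back to malloc".
       A production version would use a branch-free computation, but a
       linear scan over six entries makes the idea clear. */
    static int bucket_for(size_t size) {
        for (int i = 0; i < NBUCKETS; i++)
            if (size <= bucket_limit[i])
                return i;
        return -1;
    }
    ```

    The proxy-allocator step amounts to wrapping malloc, incrementing a per-bucket counter on every call, and dumping the counters at exit; once the histogram stops shifting between runs, you know which size classes are worth a slab.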

    A real-world example: adding a lock-free slab allocator to a Windows app that did some multi-threaded data crunching yielded a 400% speedup compared to the native HeapAlloc().