Resource efficient Thread Pools with Zig

  • I designed and implemented a mostly lock-free dynamic thread scheduler for streaming runtimes, and I learned some similar lessons: avoid global data, and amortize the synchronization you can't avoid. One of the main peculiarities of a streaming context is that work-stealing is counter-productive: stealing amounts to cutting in on work that would have been done anyway. It's better to go find a part of the streaming graph that is not currently being executed than to steal work from another thread (a sketch of the idea follows below).

    The paper describing my design is "Low-Synchronization, Mostly Lock-Free, Elastic Scheduling for Streaming Runtimes", https://www.scott-a-s.com/files/pldi2017_lf_elastic_scheduli.... The source code for the product implementation is now open source. Most is in https://github.com/IBMStreams/OSStreams/blob/main/src/cpp/SP... and https://github.com/IBMStreams/OSStreams/blob/main/src/cpp/SP....
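
    To make the "go find idle parts of the graph" idea concrete, here is a tiny illustrative sketch in C (names and structure are mine, not the scheduler from the paper): each operator carries an atomic executing flag, and an idle worker claims an unexecuted operator that has pending input instead of stealing tuples from another thread.

        #include <stdatomic.h>
        #include <stdbool.h>
        #include <stddef.h>

        typedef struct Operator {
            atomic_bool executing;   /* true while some thread runs this operator */
            atomic_size_t pending;   /* queued input tuples */
        } Operator;

        /* Claim one runnable, unclaimed operator from the graph, or NULL. */
        Operator *find_work(Operator *ops, size_t n) {
            for (size_t i = 0; i < n; i++) {
                if (atomic_load_explicit(&ops[i].pending,
                                         memory_order_relaxed) == 0)
                    continue;  /* no input waiting here */
                bool expected = false;
                /* Claim the whole operator rather than stealing its tuples:
                 * whoever runs it will drain the pending input anyway. */
                if (atomic_compare_exchange_strong(&ops[i].executing,
                                                   &expected, true))
                    return &ops[i];
            }
            return NULL;  /* everything runnable is already being executed */
        }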

  • Running the benchmarks locally (zig is fastest, but that's probably expected)

    zig (~1 week older than HEAD):

        ./zig-out/bin/qsort
        filling
        shuffling
        running
        took 177.46ms
    
    rust nightly:

        cargo run --release --bin qsort
        warning: the cargo feature `edition2021` has been stabilized in the 1.56 release and is no longer necessary to be listed in the manifest
        See https://doc.rust-lang.org/nightly/cargo/reference/manifest.html#the-edition-field for more information about using this feature.
            Finished release [optimized] target(s) in 0.02s
            Running `target/release/qsort`
        filling
        shuffling
        running
        took 896.91656ms
    
        cargo run --release --bin rsort
        warning: the cargo feature `edition2021` has been stabilized in the 1.56 release and is no longer necessary to be listed in the manifest
        See https://doc.rust-lang.org/nightly/cargo/reference/manifest.html#the-edition-field for more information about using this feature.
            Finished release [optimized] target(s) in 0.02s
            Running `target/release/rsort`
        filling
        shuffling
        running
        took 212.190694ms
    
    go 1.16.3:

        go run qsort.go
        filling
        shuffling
        running
        took 222.90356ms
    
    on macOS 11.5.2 with a 2.4 GHz 8-Core Intel Core i9

  • To be very pedantic, the Vyukov MPSC queue, while indeed awesome in its simplicity and performance, is not technically lock-free, not even obstruction-free (and Vyukov never claimed it was [1]): a producer halting between the atomic swap and the atomic store will prevent further nodes enqueued by other producers from ever being dequeued by the consumer (sketched below).

    [1] edit: and in fact there is a note on exactly this point on the linked Vyukov page.
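
    To see the window concretely, here is a minimal sketch of the intrusive Vyukov MPSC queue in C11 atomics (names and the simplified pop are mine; the real queue also recycles its stub node):

        #include <stdatomic.h>
        #include <stddef.h>

        typedef struct Node {
            _Atomic(struct Node *) next;
        } Node;

        typedef struct Queue {
            _Atomic(Node *) head;  /* producers swap themselves in here */
            Node *tail;            /* consumer-only cursor */
            Node stub;             /* sentinel; initialize head = tail = &stub */
        } Queue;

        void push(Queue *q, Node *n) {
            atomic_store(&n->next, NULL);
            /* (1) atomic swap: n becomes the new head. */
            Node *prev = atomic_exchange(&q->head, n);
            /* A producer preempted HERE leaves prev->next == NULL, so the
             * consumer cannot reach n, or anything pushed after it, until
             * this thread resumes: the non-lock-free window described
             * above. */
            /* (2) atomic store: link the old head to n. */
            atomic_store(&prev->next, n);
        }

        Node *pop(Queue *q) {
            Node *next = atomic_load(&q->tail->next);
            if (next == NULL)
                return NULL;  /* empty, or a producer stalled in the window */
            q->tail = next;   /* the dequeued node becomes the new sentinel */
            return next;
        }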

  • Incredible post. Ticks all the boxes and exciting to see this coming to Zig. Protty has got to be a domain specialist when it comes to atomics and threads.

  • Increasing the size of the array 10x (to 100_000_000) and filling a glaring omission (Java):

    Go (go1.17.1 darwin/amd64)

        took 5.591593544s
        took 5.285948722s
        took 5.218750076s
        took 5.239224787s
        took 5.131232207s
    
    Zig (0.9.0-dev.959+f011f1393)

        took 3.48s
        took 3.37s
        took 3.44s
        took 3.43s
        took 3.49s
    
    Java (openjdk version "17-ea")

        Time: 2684 ms
        Time: 2654 ms
        Time: 2840 ms
        Time: 2676 ms
        Time: 2649 ms
    
    MacBook Pro Early 2013, 2.7 GHz Quad-Core Intel Core i7, 16 GB 1600 MHz DDR3

  • It's always great to see a robust new threadpool. Most of the stock ones I've used have horrible characteristics - for example on my Threadripper lots of production apps will spawn 48 or more threadpool threads and the scheduling/throughput characteristics on that are generally pretty dire because they needlessly compete for resources, etc. They tend to allocate garbage for every work item too, which makes them a bad choice for realtime scenarios. It's genuinely frustrating that most processes on my machine have 128 threads or more due to unintelligently scaling off ProcessorCount without an upper limit. For example, GIMP's bad threading policy means that operations like resizing an image take seconds instead of 100ms or less.

    I've been using a custom threadpool for a while now, but writing your own without the necessary expertise can be a real pain (a sketch of the sizing clamp is below).
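
    For what it's worth, the thread-count half of this is fixable with a one-line clamp; a hypothetical POSIX sketch (the cap value is illustrative, not a recommendation):

        #include <unistd.h>

        #define POOL_MAX_THREADS 16  /* illustrative cap; tune per workload */

        static int pool_thread_count(void) {
            long n = sysconf(_SC_NPROCESSORS_ONLN);
            if (n < 1)
                n = 1;                 /* be defensive if the query fails */
            if (n > POOL_MAX_THREADS)
                n = POOL_MAX_THREADS;  /* don't spawn 128 threads on a Threadripper */
            return (int)n;
        }

    The per-work-item garbage problem is usually addressed with intrusive task nodes, which is exactly what Vyukov-style queues (discussed elsewhere in this thread) make easy.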

  • This is the first thread algorithm I've seen that is built around a four-state state machine.

    And I like what I am seeing. Hopefully all transitions between states have been tested.

    Good work.

  • What does tail latency for the Zig pool look like? It seems like any Task that ends up in one of the overflow queues will stay there until the corresponding ring buffer is emptied.

    Put another way: during a period with more pushes than pops, some tasks may see a very long delay before being worked on? (A sketch of the pattern I mean is below.)
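
    For context, this is my reconstruction of the push pattern in question, not the post's actual code: a bounded per-worker ring buffer with an unbounded overflow list, where overflowed tasks wait until the ring drains (pop side elided, and atomics omitted for brevity).

        #include <stdbool.h>
        #include <stddef.h>

        #define RING_SIZE 256

        typedef struct Task { struct Task *next; } Task;

        typedef struct Worker {
            Task *ring[RING_SIZE];  /* fast bounded queue, served first */
            size_t head, tail;      /* tail - head == queued count */
            Task *overflow;         /* unbounded fallback list */
        } Worker;

        static bool ring_push(Worker *w, Task *t) {
            if (w->tail - w->head == RING_SIZE)
                return false;  /* ring is full */
            w->ring[w->tail++ % RING_SIZE] = t;
            return true;
        }

        void push(Worker *w, Task *t) {
            if (!ring_push(w, t)) {
                /* During a burst of pushes, t sits here until the ring
                 * empties out: the very delay the question is about. */
                t->next = w->overflow;
                w->overflow = t;
            }
        }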

  • Pretty impressive achievement!

  • I know this is off-topic, but does this website not look eerily similar to dev.to?

  • A pet peeve of mine is calling something that uses atomic operations "lockless". It's very common, but once it clicks that atomic operations are just locks at the instruction level, it feels very wrong to call most of these algorithms lockless. There are still the same concerns about avoiding contention with atomic operations as with regular mutexes. A mutex lock in the uncontended case never enters the kernel and is just an atomic operation (sketched below)... The main strategy, whether you use locks or atomics directly, is to avoid contention.

    The only truly lockless algorithm that I'm aware of (on most CPU archs) is RCU in the read case.
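
    To illustrate the uncontended fast path, here is the typical shape of a mutex lock in C11 atomics (illustrative, not any particular implementation; a real mutex sleeps in the kernel on contention, e.g. via futex, where this sketch just spins):

        #include <stdatomic.h>

        typedef struct { atomic_int state; } mutex_t;  /* 0 = unlocked, 1 = locked */

        void mutex_lock(mutex_t *m) {
            int expected = 0;
            /* Uncontended case: a single CAS, the same kind of atomic
             * instruction a "lockless" algorithm would use. No kernel. */
            while (!atomic_compare_exchange_weak_explicit(
                        &m->state, &expected, 1,
                        memory_order_acquire, memory_order_relaxed)) {
                expected = 0;  /* CAS failed: lock is held; a real mutex
                                * would futex-wait here instead of spinning */
            }
        }

        void mutex_unlock(mutex_t *m) {
            atomic_store_explicit(&m->state, 0, memory_order_release);
        }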

  • Is there an IoC / CSP facility (à la core.async in Clojure) in Zig yet? Is something like this planned? I could not find much about it online.

  • How does it compare to Nim's Weave thread manager?

  • Please make thread priority a native feature. See GCD.

  • How does this compare to OpenMP? Does it support nested parallelism? (disclaimer: haven't read the post thoroughly).

  • How does it compare with a bare-metal C++ threadpool?