Running the benchmarks locally (zig is fastest, but that's probably expected):

zig (~1 week older than HEAD):

    ./zig-out/bin/qsort
    filling
    shuffling
    running
    took 177.46ms

rust nightly:

    cargo run --release --bin qsort
    warning: the cargo feature `edition2021` has been stabilized in the 1.56 release and is no longer necessary to be listed in the manifest
    See https://doc.rust-lang.org/nightly/cargo/reference/manifest.html#the-edition-field for more information about using this feature.
    Finished release [optimized] target(s) in 0.02s
    Running `target/release/qsort`
    filling
    shuffling
    running
    took 896.91656ms

    cargo run --release --bin rsort
    warning: the cargo feature `edition2021` has been stabilized in the 1.56 release and is no longer necessary to be listed in the manifest
    See https://doc.rust-lang.org/nightly/cargo/reference/manifest.html#the-edition-field for more information about using this feature.
    Finished release [optimized] target(s) in 0.02s
    Running `target/release/rsort`
    filling
    shuffling
    running
    took 212.190694ms

go 1.16.3:

    go run qsort.go
    filling
    shuffling
    running
    took 222.90356ms
on macOS 11.5.2 with a 2.4 GHz 8-Core Intel Core i9

To be very pedantic, the Vyukov MPSC queue, while indeed awesome in its simplicity and performance, is not technically lock-free, not even obstruction-free (and Vyukov never claimed it as such [1]): a producer halting between the atomic swap and the atomic store will prevent further nodes enqueued by other producers from ever being dequeued by the consumer.
[1] edit: and in fact there is a note exactly on this on the linked Vyukov page.
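For the curious, the window in question sits between the two producer steps of the push. A rough Rust-flavored sketch of that producer side (type and field names are mine, not taken from Vyukov's page or the post):

    use std::ptr;
    use std::sync::atomic::{AtomicPtr, Ordering};

    struct Node {
        next: AtomicPtr<Node>,
        // payload omitted
    }

    struct MpscQueue {
        head: AtomicPtr<Node>, // producers swap new nodes in here
        tail: *mut Node,       // consumer-only cursor
    }

    impl MpscQueue {
        // Producer side: two separate atomic steps.
        unsafe fn push(&self, node: *mut Node) {
            (*node).next.store(ptr::null_mut(), Ordering::Relaxed);
            // Step 1: publish this node as the newest one.
            let prev = self.head.swap(node, Ordering::AcqRel);
            // If the producer stalls (or dies) right here, `prev` has no
            // `next` link yet, so the consumer cannot walk past `prev`,
            // even though later producers keep swapping in new heads.
            // Step 2: link the previous node to this one.
            (*prev).next.store(node, Ordering::Release);
        }
    }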
Incredible post. Ticks all the boxes and exciting to see this coming to Zig. Protty has got to be a domain specialist when it comes to atomics and threads.
Increasing the size of the array 10x (100_000_000) and filling in a glaring omission:
Go (go1.17.1 darwin/amd64)

    took 5.591593544s
    took 5.285948722s
    took 5.218750076s
    took 5.239224787s
    took 5.131232207s

Zig (0.9.0-dev.959+f011f1393)

    took 3.48s
    took 3.37s
    took 3.44s
    took 3.43s
    took 3.49s

Java (openjdk version "17-ea")

    Time: 2684 ms
    Time: 2654 ms
    Time: 2840 ms
    Time: 2676 ms
    Time: 2649 ms
MacBook Pro Early 2013, 2.7 GHz Quad-Core Intel Core i7, 16 GB 1600 MHz DDR3

It's always great to see a robust new threadpool. Most of the stock ones I've used have horrible characteristics - for example, on my Threadripper lots of production apps will spawn 48 or more threadpool threads, and the scheduling/throughput characteristics of that are generally pretty dire because the threads needlessly compete for resources. They tend to allocate garbage for every work item too, which makes them a bad choice for realtime scenarios. It's genuinely frustrating that most processes on my machine have 128 threads or more because they unintelligently scale off ProcessorCount without an upper limit. For example, GIMP's bad threading policy means that operations like resizing an image take seconds instead of 100ms or less.
I've been using a custom threadpool for a while now but writing your own without the expertise necessary can be a real pain.
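For what it's worth, the fix I'd like to see everywhere is just a clamp on the worker count. A sketch of the idea in Rust (the hard cap of 16 is an arbitrary number for illustration, not a recommendation):

    use std::thread;

    // Clamp the pool size to both the machine's parallelism and a hard cap,
    // instead of blindly spawning ProcessorCount (or more) threads.
    fn worker_count(hard_cap: usize) -> usize {
        let cores = thread::available_parallelism()
            .map(|n| n.get())
            .unwrap_or(1);
        cores.min(hard_cap)
    }

    // e.g. worker_count(16) on a 64-thread Threadripper gives 16 workers.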
This is the first four-state-machine thread algorithm I've seen, and I like what I am seeing. Hopefully all transitions between the states have been tested.
Good work.
What does tail latency for the Zig pool look like? It seems like any Task that ends up in one of the overflow queues will stay there until some ring buffer is emptied.
Put another way: during a period with more pushes than pops, could some tasks see a very long delay before being worked on?
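To make the question concrete, this is the shape I'm assuming (a guess at the design, not code from the post; names and structure are made up):

    use std::collections::VecDeque;

    // Hypothetical per-worker queue: a bounded ring buffer fast path with an
    // unbounded overflow list behind it.
    struct WorkerQueue<T> {
        ring: VecDeque<T>,     // bounded fast path
        capacity: usize,
        overflow: VecDeque<T>, // unbounded slow path
    }

    impl<T> WorkerQueue<T> {
        fn push(&mut self, task: T) {
            if self.ring.len() < self.capacity {
                self.ring.push_back(task);
            } else {
                // During a burst of pushes, tasks land here and are only
                // revisited once the ring drains - hence the tail-latency
                // question above.
                self.overflow.push_back(task);
            }
        }
    }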
Pretty impressive achievement!
I know this is off-topic, but doesn't this website look eerily similar to dev.to?
A pet peeve of mine is calling something that uses atomic operations "lockless". It's very common, but once it clicks that atomic operations are effectively locks at the instruction level, it feels very wrong to call most such algorithms lockless. You have the same concerns about avoiding contention with atomic operations as with regular mutexes, and a mutex lock in the uncontended case never enters the kernel - it's just an atomic operation. The main strategy, whether you use locks or atomics directly, is to avoid contention.
The only truly lockless algorithm I'm aware of (on most CPU architectures) is RCU on the read side.
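To make the uncontended-case point concrete, here is a minimal sketch of a mutex fast path in Rust (not any particular libc's implementation; the slow path is shown as a spin instead of a real futex/park call):

    use std::sync::atomic::{AtomicU32, Ordering};

    const UNLOCKED: u32 = 0;
    const LOCKED: u32 = 1;

    struct TinyMutex {
        state: AtomicU32,
    }

    impl TinyMutex {
        fn lock(&self) {
            // Fast path: the uncontended acquire is one atomic CAS, no kernel
            // involvement - the same kind of instruction "lock-free" code uses.
            if self.state
                .compare_exchange(UNLOCKED, LOCKED, Ordering::Acquire, Ordering::Relaxed)
                .is_ok()
            {
                return;
            }
            // Slow path: only under contention would a real mutex go to the
            // kernel (futex wait / thread parking); spinning here for brevity.
            while self.state
                .compare_exchange(UNLOCKED, LOCKED, Ordering::Acquire, Ordering::Relaxed)
                .is_err()
            {
                std::hint::spin_loop();
            }
        }

        fn unlock(&self) {
            self.state.store(UNLOCKED, Ordering::Release);
        }
    }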
Is there an IoC / CSP library (a la core.async in Clojure) in Zig yet? Is something like this planned? I could not find much about it online.
How does it compare to Nim's Weave thread manager?
Please make thread priority a native feature. See GCD.
How does this compare to OpenMP? Does it support nested parallelism? (disclaimer: haven't read the post thoroughly).
How does it compare with a bare-metal C++ threadpool?
I designed and implemented a mostly lock-free dynamic thread scheduler for streaming runtimes, and I learned some similar lessons: avoid global data and amortize the necessary synchronization that you have to do. One of the main peculiarities of a streaming context is that work-stealing is counter-productive. In a streaming context, it's more like cutting in when the work would be done anyway. It's better to go find a part of the streaming graph that is not currently being executed than to steal some from another thread.
The paper describing my design is "Low-Synchronization, Mostly Lock-Free, Elastic Scheduling for Streaming Runtimes", https://www.scott-a-s.com/files/pldi2017_lf_elastic_scheduli.... The source code for the product implementation is now open source. Most is in https://github.com/IBMStreams/OSStreams/blob/main/src/cpp/SP... and https://github.com/IBMStreams/OSStreams/blob/main/src/cpp/SP....
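As a concrete example of the "amortize the synchronization" lesson, here is a toy illustration (not code from the paper or the linked repo; names and the batch size are made up):

    use std::sync::atomic::{AtomicU64, Ordering};

    static GLOBAL_COMPLETED: AtomicU64 = AtomicU64::new(0);

    struct WorkerStats {
        local_completed: u64,
    }

    impl WorkerStats {
        fn on_item_done(&mut self) {
            self.local_completed += 1;
            // Touch the shared counter once per 1024 items instead of once
            // per item, so the cost of the shared atomic is amortized.
            if self.local_completed == 1024 {
                GLOBAL_COMPLETED.fetch_add(self.local_completed, Ordering::Relaxed);
                self.local_completed = 0;
            }
        }
    }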