With AVX-512 (and to a lesser extent with AVX2) one can implement 256-bit addition pretty efficiently, with the additional benefit of fitting more numbers in registers.
It looks more or less like this:
    #include <immintrin.h> // needs AVX-512F + AVX-512VL

    __m256i add256(__m256i a, __m256i b) {
        __m256i s = _mm256_add_epi64(a, b);                     // four independent 64-bit adds
        const __m256i all_ones = _mm256_set1_epi64x(~0);
        __mmask8 g = _mm256_cmpgt_epu64_mask(a, s);             // lanes whose sum wrapped: carry generated
        __mmask8 p = _mm256_cmpeq_epu64_mask(s, all_ones);      // lanes that would propagate a carry in
        __mmask8 carries = ((g << 1) + p) ^ p;                  // ripple the carries through runs of propagators
        return _mm256_mask_sub_epi64(s, carries, s, all_ones);  // subtracting ~0 adds 1 in the carried lanes
    }
The throughput even seems to be better: https://godbolt.org/z/e7zETe8xY

It's trivial to change this to do 512-bit addition, where the improvement will be even more significant.
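For the curious, the 512-bit version is the same shape, just with __m512i and eight lanes (a sketch under the same scheme, function name mine; needs only AVX-512F):

    __m512i add512(__m512i a, __m512i b) {
        __m512i s = _mm512_add_epi64(a, b);
        const __m512i all_ones = _mm512_set1_epi64(~0LL);
        __mmask8 g = _mm512_cmpgt_epu64_mask(a, s);        // carry generated
        __mmask8 p = _mm512_cmpeq_epu64_mask(s, all_ones); // carry propagated
        __mmask8 carries = ((g << 1) + p) ^ p;             // carry out of lane 7 is dropped: wraps mod 2^512
        return _mm512_mask_sub_epi64(s, carries, s, all_ones);
    }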
On modern enough x86 CPUs (Intel Broadwell, AMD Ryzen) you could also use ADX [1], which may be faster nowadays in situations where the radix 2^51 representation traditionally had an edge (e.g. Curve25519).
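For reference, the scalar carry-chain shape looks like this (a minimal sketch, function name mine; plain _addcarry_u64 compiles to adc, and the point of ADX's adcx/adox is letting two such chains run interleaved, e.g. over the partial products of a big multiply):

    #include <immintrin.h>
    #include <stdint.h>

    void add256_adc(uint64_t out[4], const uint64_t a[4], const uint64_t b[4]) {
        unsigned long long r;
        unsigned char c = 0;
        for (int i = 0; i < 4; i++) {
            c = _addcarry_u64(c, a[i], b[i], &r); // serial: each adc waits on the previous carry
            out[i] = r;
        }
    }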
Related. Others?
The radix 2^51 trick - https://news.ycombinator.com/item?id=33706153 - Nov 2022 (6 comments)
The radix 2^51 trick (2017) - https://news.ycombinator.com/item?id=23351007 - May 2020 (83 comments)
The main takeaway: doing more operations may be faster if they are largely independent, and thus can execute in parallel. Doing fewer operations may be slower if they are forced to execute serially due to data dependency.
This idea has wider applicability than operations on long integers.
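To make that concrete with the article's trick, here is a rough radix-2^51 sketch (identifiers and details mine): the per-addition work is five independent word adds, and the serial carry pass is deferred until the 13 bits of headroom are about to run out.

    #include <stdint.h>

    typedef struct { uint64_t limb[5]; } u256_51; // bottom four limbs 51 bits, top limb 52

    u256_51 add51(u256_51 a, u256_51 b) {
        u256_51 s; // five adds with no dependencies between them
        for (int i = 0; i < 5; i++) s.limb[i] = a.limb[i] + b.limb[i];
        return s;
    }

    void carry51(u256_51 *x) { // the deferred, serial part
        for (int i = 0; i < 4; i++) {
            x->limb[i + 1] += x->limb[i] >> 51; // top-limb overflow is ignored: wraps mod 2^256
            x->limb[i] &= (1ULL << 51) - 1;
        }
    }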
Someone working entirely on x86_64 very nicely demonstrates that RISC-V is not wrong to omit the carry flag.
The 'radix trick' also works for data structures.
Okasaki's book 'Purely Functional Data Structures' has some nice examples.
I wish I came across this article a couple months ago.
I was trying to encode and decode some buffers into an arbitrary base, and I eventually came to the conclusion (after far too long) that a carry could ripple all the way down the buffer, which dramatically slows down the algorithm.
Actually, the eventual solution I came up with might have some stuff in common with this trick too. I did eventually chunk up the buffer, leaving some unused headroom to 'handle carries'. Not exactly, though; I just have some wasted bits, which costs a tiny bit more storage or network bandwidth but saves on compute. I wonder if I could instead pool up the carries like this and 'resolve' them in a later step. Have my cake and eat it too? Wishful thinking.
Notwithstanding HN guidelines about not editorializing titles, I don't like these clickbaity titles that amplify a smaller claim into something overly broad: this one should have been titled:
"The radix 2^51 trick to adding 64-bit integers on *some* x86 architectures in parallel without slowing the pipeline due to dependencies on carry"
It's funny that carries don't just make addition difficult to parallelize. Binary addition without carry is XOR. XOR subset sum - find a subset whose XOR gives the desired target - is in P, but proper subset sum with carry is NP-complete.
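The XOR version is in P because it is just linear algebra over GF(2): build a basis of the values and check whether the target reduces to zero. A sketch (names mine):

    #include <stdbool.h>
    #include <stdint.h>

    bool xor_subset_sum(const uint64_t *vals, int n, uint64_t target) {
        uint64_t basis[64] = {0}; // basis[b] has bit b as its highest set bit
        for (int i = 0; i < n; i++) {
            uint64_t v = vals[i];
            for (int b = 63; b >= 0 && v; b--) {
                if (!((v >> b) & 1)) continue;
                if (!basis[b]) { basis[b] = v; break; }
                v ^= basis[b]; // clear the leading bit and keep reducing
            }
        }
        for (int b = 63; b >= 0; b--) // target reachable iff it reduces to 0
            if ((target >> b) & 1) {
                if (!basis[b]) return false;
                target ^= basis[b];
            }
        return target == 0;
    }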
Would it be legal for a C(++?) compiler to implement this optimization?
I'm seriously doubtful that adc is inherently slower than add on a modern CPU, other than the data hazard introduced by the carry bit. I realize the point of the article is the data hazard, so this is a really minor nit.
Saw 2^51 and thought it was about storing integers into doubles. But nope, the number for that particular use is 2^53-1, not 2^51.
How would one do this for large multiplications instead?
> Aside: Why 13 bits instead of 12? For our purposes, we're going to ignore the carries in the most significant limb, allowing numbers to wrap when they overflow past 2^256 - 1 (just like how unsigned addition works in C with normal size integer types). As a result, we can assign 52 bits to the most significant limb and ignore the fact that it will run out of room for carries before the other limbs do.
Why not give the top limb 64 bits and the other four limbs 48 bits each, then? You can accumulate more additions before normalization, you can take advantage of word alignment during splitting and normalization if your instruction set has anything useful there, and your overflow properties are identical, no?
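If I'm reading the proposal right, one concrete alignment perk: 48 bits is exactly six bytes, so every limb starts on a byte boundary, and splitting is just an unaligned 8-byte load plus a mask with no cross-word shifting. A hypothetical sketch of that layout (48/48/48/48/64, little-endian input assumed, names mine):

    #include <stdint.h>
    #include <string.h>

    static uint64_t load64(const uint8_t *p) { uint64_t v; memcpy(&v, p, 8); return v; }

    void split48(uint64_t limbs[5], const uint8_t bytes[32]) {
        for (int i = 0; i < 4; i++) // 48-bit limbs start at byte offsets 0, 6, 12, 18
            limbs[i] = load64(bytes + 6 * i) & 0xFFFFFFFFFFFFULL;
        limbs[4] = load64(bytes + 24); // 64-bit top limb, bits 192..255
    }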