The BLAKE3 cryptographic hash function

  • It looks like the speedup is coming from two main changes.

    The first change is reducing the number of rounds from 10 to 7. Think of it like making a smoothie - you add bits of fruit to the drink (the input data), then pulse the blades to blend it up (making the output hash). This change basically runs the blades for 7 seconds instead of 10 seconds each time they add fruit. They cite evidence that the extra 3 seconds aren't doing much - once the fruit's fully liquid, extra blending doesn't help - but I worry that this reduces the security margin. Maybe those extra 3 rounds aren't useful against current attacks, but they may be useful against unknown future attacks.

    The other change they make is to break the input into 1KiB chunks, hash each chunk independently, and then combine the individual chunk hashes into a single root hash using a binary tree. The benefit is that if you have 4KiB of data, you can use 4-way SIMD instructions to process all four chunks simultaneously; the more data you have, the more parallelism you can unlock, unlike traditional hash functions that process everything sequentially. Relatedly, modern SIMD instructions can handle 2 x 32-bit operations just as fast as 1 x 64-bit operation, so building the algorithm out of 32-bit arithmetic costs nothing on 64-bit hardware, but gives a big boost to low-end 32-bit CPUs that struggle with 64-bit arithmetic. The tree structure is a big win overall.
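    The chunk-and-tree idea can be sketched in a few lines of Python. This is a toy illustration only, not real BLAKE3: `hashlib.blake2s` stands in for the compression function, and the actual BLAKE3 tree rules differ (left subtrees cover the largest power-of-two number of chunks, and chunk/parent/root nodes carry domain-separation flags).

```python
import hashlib

CHUNK_LEN = 1024  # BLAKE3 splits input into 1 KiB chunks

def chunk_hashes(data: bytes) -> list[bytes]:
    """Hash each 1 KiB chunk independently. These are exactly the hashes a
    real implementation can compute in parallel with SIMD or threads."""
    chunks = [data[i:i + CHUNK_LEN] for i in range(0, len(data), CHUNK_LEN)]
    return [hashlib.blake2s(c).digest() for c in (chunks or [b""])]

def tree_hash(data: bytes) -> bytes:
    """Combine chunk hashes pairwise up a binary tree to a single root."""
    level = chunk_hashes(data)
    while len(level) > 1:
        paired = [hashlib.blake2s(level[i] + level[i + 1]).digest()
                  for i in range(0, len(level) - 1, 2)]
        if len(level) % 2:  # carry an odd leftover node up unchanged
            paired.append(level[-1])
        level = paired
    return level[0]
```

    Note that every call in the inner list comprehension is independent of the others, which is where the data parallelism comes from; a sequential hash can't offer that because each block's state feeds the next.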

  • Another benchmark:

    time openssl sha256 /tmp/bigfile

      real  0m28.160s
      user  0m27.750s
      sys   0m0.272s
    
    time shasum -a 256 /tmp/bigfile

      real  0m6.146s
      user  0m5.407s
      sys   0m0.560s
    
    time b2sum /tmp/bigfile

      real  0m1.732s
      user  0m1.450s
      sys   0m0.244s
    
    time b3sum /tmp/bigfile

      real  0m0.212s
      user  0m0.996s
      sys   0m0.379s
    
    TIL that the openssl sha256 invocation is really slow compared to the shasum program. Also, BLAKE3 is really fast.

    Edit: bigfile is 1GB of /dev/random

  • So this is bao + blake2?

    I remember watching the talk "Bao, a general purpose cryptographic tree hash, and perhaps the fastest hash function in the world" a while ago: https://www.youtube.com/watch?v=Dya9c2DXMqQ

    Nice job!

  • Looks like they've taken Too Much Crypto to heart[1] and dropped the number of rounds from Blake2B's 12 down to 7 for Blake3:

    https://github.com/BLAKE3-team/BLAKE3/blob/master/reference_...

    https://github.com/BLAKE2/BLAKE2/blob/master/ref/blake2b-ref...

    Which, yeah, that alone will get you a significant improvement over Blake2B, but it definitely doesn't account for the huge improvement they're showing. Most of that is the ability to take advantage of AVX-512 parallelism, I think; the difference should be more incremental on AVX2-only amd64 or other platforms.

    [1]: Well, TMC recommended 8 rounds for Blake2B and 7 for Blake2S.

  • Just one variant, that's refreshing. And performance is impressive. What's the short input performance like? Say for 64 bytes of input.

  • That's impressive speedup. I just installed it and holy moly it really is fast. All those extra cores can finally get busy. ;)

    b3sum -l 256 big-2.6Gfile

      real  0m0.384s
      user  0m2.302s
      sys   0m0.175s
    
    
    b2sum -l 256 big-2.6Gfile

      real  0m3.616s
      user  0m3.360s
      sys   0m0.256s
    
    (Intel® Core™ i7-8550U CPU @ 1.80GHz × 8 )

    EDIT: ah, the catch. blake3 targets 128 bit security. It competes with SipHash for speed and security

    EDIT2 scratch the previous edit.

  • I would be interested in how this compares on the smhasher against some of the other fastest hash competitors like meow hash or xxhash.

    I can't seem to find any non-Rust implementations in the works yet, so I may sit down and adapt the reference implementation to C# this weekend. Does anyone know how the single-threaded (or few-thread) performance holds up, excluding AVX-512?

  • smhasher results without the Rust version yet (which should be ~2x faster):

    http://rurban.github.io/smhasher/doc/table.html

    It's of course much faster than most of the other crypto hashes, but not faster than the hardware variants SHA1-NI and SHA256-NI. About 4x faster than blake2.

    Faster than SipHash, not faster than SipHash13.

    The tests fail dramatically on MomentChi2, which describes how well a user-provided random seed is mixed in. I tried mixing a seed into IV[0], as with all the other hardened crypto hashes, and into all 8 IVs, which didn't help. So I'm not convinced that a seeded IV is properly mixed in. That's outside the usage pattern of a crypto or digest hash (b3sum), but inside normal usage.

    The Rust staticlib is still in progress; it would parallelize the hashing in chunks for big keys. For small keys it should even be a bit slower. b3sum is so much faster because it uses many more tricks, such as mmap.
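    The mmap trick mentioned above is easy to demonstrate: map the file and feed the mapping straight to the hasher, so the kernel pages data in on demand instead of the program looping over read() into a buffer. A hedged sketch in Python, with hashlib.blake2b standing in for BLAKE3 (which isn't in the stdlib):

```python
import hashlib
import mmap
import os

def hash_file_mmap(path: str) -> str:
    """Hash a file through a read-only memory mapping (blake2b as a
    stand-in for BLAKE3). Avoids explicit read() copies."""
    with open(path, "rb") as f:
        if os.fstat(f.fileno()).st_size == 0:
            # mmap rejects zero-length files, so handle them directly
            return hashlib.blake2b(b"").hexdigest()
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as m:
            # mmap objects support the buffer protocol, so the hasher
            # can consume the mapping directly
            return hashlib.blake2b(m).hexdigest()
```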

  • Does anybody know of benchmarks for ARM? Or any research trying to break it?

    The numbers look astonishing.

  • Why are there no AES-like hashing algorithms out there? AES design is very suitable to be used as a building block in a hash if you remove "add round key" operation.

    Is it possible to benchmark against blake2 etc. with the same number of rounds, testing both a reduced-round blake2 and an increased-round blake3? Also, in that vein, could offering a version with more rounds win over the "paranoid" crowd, since it would mostly be a faster Blake2 thanks to SIMD, with extra features thanks to the Merkle tree?

  •   Benchmark #1: cat b1
        Time (mean ± σ):      1.076 s ±  0.007 s    [User: 5.3 ms, System: 1069.4 ms]
        Range (min … max):    1.069 s …  1.093 s    10 runs
      Benchmark #2: sha256sum b1
        Time (mean ± σ):      6.583 s ±  0.064 s    [User: 5.440 s, System: 1.137 s]
        Range (min … max):    6.506 s …  6.695 s    10 runs
      Benchmark #3: sha1sum b1
        Time (mean ± σ):      6.322 s ±  0.086 s    [User: 5.212 s, System: 1.103 s]
        Range (min … max):    6.214 s …  6.484 s    10 runs
      Benchmark #4: b2sum b1
        Time (mean ± σ):     13.184 s ±  0.108 s    [User: 12.090 s, System: 1.080 s]
        Range (min … max):   13.087 s … 13.382 s    10 runs
      Benchmark #5: b3sum b1
        Time (mean ± σ):     577.0 ms ±   5.4 ms    [User: 12.276 s, System: 0.669 s]
        Range (min … max):   572.4 ms … 587.0 ms    10 runs
      Benchmark #6: md5sum b1
        Time (mean ± σ):     14.851 s ±  0.175 s    [User: 13.717 s, System: 1.117 s]
        Range (min … max):   14.495 s … 15.128 s    10 runs
        
      Summary
        'b3sum b1' ran
          1.86 ± 0.02 times faster than 'cat b1'
         10.96 ± 0.18 times faster than 'sha1sum b1'
         11.41 ± 0.15 times faster than 'sha256sum b1'
         22.85 ± 0.28 times faster than 'b2sum b1'
         25.74 ± 0.39 times faster than 'md5sum b1'
    
    
    gotdang that's some solid performance. (Here running against 10GiB of random bytes; the machine has the SHA ASM extensions, which is why sha256/sha1 perform so well.)

    edit: actually not a straight algorithm comparison, as b3sum here benefits heavily from multi-threading; without that it looks more like this:

      Benchmark #1: cat b1
        Time (mean ± σ):      1.090 s ±  0.007 s    [User: 2.9 ms, System: 1084.8 ms]
        Range (min … max):    1.071 s …  1.096 s    10 runs
       
      Benchmark #2: sha256sum b1
        Time (mean ± σ):      6.480 s ±  0.097 s    [User: 5.359 s, System: 1.115 s]
        Range (min … max):    6.346 s …  6.587 s    10 runs
       
      Benchmark #3: sha1sum b1
        Time (mean ± σ):      6.120 s ±  0.090 s    [User: 5.027 s, System: 1.082 s]
        Range (min … max):    5.979 s …  6.233 s    10 runs
       
      Benchmark #4: b2sum b1
        Time (mean ± σ):     12.866 s ±  0.208 s    [User: 11.722 s, System: 1.133 s]
        Range (min … max):   12.549 s … 13.124 s    10 runs
       
      Benchmark #5: b3sum b1
        Time (mean ± σ):      5.813 s ±  0.079 s    [User: 4.606 s, System: 1.202 s]
        Range (min … max):    5.699 s …  5.933 s    10 runs
       
      Benchmark #6: md5sum b1
        Time (mean ± σ):     14.355 s ±  0.184 s    [User: 13.305 s, System: 1.039 s]
        Range (min … max):   14.119 s … 14.605 s    10 runs
       
      Summary
        'cat b1' ran
          5.33 ± 0.08 times faster than 'b3sum b1'
          5.62 ± 0.09 times faster than 'sha1sum b1'
          5.95 ± 0.10 times faster than 'sha256sum b1'
         11.81 ± 0.21 times faster than 'b2sum b1'
         13.17 ± 0.19 times faster than 'md5sum b1'
    
    still beating the dedicated SHA extensions, but not nearly as dramatically.
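    A rough single-threaded version of this comparison can be reproduced with just Python's stdlib hashes (BLAKE3 itself isn't in hashlib; the third-party blake3 package would plug into the same loop). Numbers will of course vary with the machine, and this measures the underlying libraries, not the coreutils binaries above:

```python
import hashlib
import time

# 64 MiB in-memory buffer: big enough to amortize setup cost, small
# enough to allocate quickly
data = b"\x00" * (64 * 1024 * 1024)

def throughput(name: str) -> float:
    """Single-threaded hash throughput in MB/s for one stdlib algorithm."""
    h = hashlib.new(name)
    t0 = time.perf_counter()
    h.update(data)
    h.digest()
    return len(data) / (time.perf_counter() - t0) / 1e6

for name in ("md5", "sha1", "sha256", "blake2b", "blake2s"):
    print(f"{name:8s} {throughput(name):8.1f} MB/s")
```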

  • Where is this useful?

    I'm guessing not for password hashes, simply because a fast hash is bad for passwords (it makes brute forcing/rainbow tables easier).

    So is this mostly just for file signing?
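    The distinction being drawn here can be made concrete with Python's stdlib: a fast hash is the right tool for checksums, dedup, and content addressing, while password storage wants a deliberately slow, salted KDF. The iteration count below is illustrative, not a recommendation:

```python
import hashlib
import os

# Fast hash: fine for file integrity, dedup, Merkle trees, content
# addressing (blake2b standing in for BLAKE3 here).
checksum = hashlib.blake2b(b"file contents").hexdigest()

# Password storage: a salted, deliberately expensive KDF instead.
# pbkdf2_hmac is always available in the stdlib; the cost parameter
# (iterations) is what makes brute-forcing expensive.
salt = os.urandom(16)
key = hashlib.pbkdf2_hmac("sha256", b"correct horse", salt, 600_000)
```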