Building a faster hash table for high performance SQL joins

  • Since the blog post mentioned a PR to replace linear probing with Robin Hood, I just wanted to mention that I found bidirectional linear probing to outperform Robin Hood across the board in my Java integer set benchmarks:

    https://github.com/senderista/hashtable-benchmarks/blob/mast...

    https://github.com/senderista/hashtable-benchmarks/wiki/64-b...
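To give a feel for the idea, here is a toy Java int set that probes outward alternately left and right from the home slot. To be clear, this is my own simplification for illustration only; the bidirectional linear probing actually benchmarked in the repo above is more sophisticated (it keeps entries ordered by hash):

```java
import java.util.Arrays;

// Toy open-addressing int set with "bidirectional" probing:
// from the home slot we try offsets 0, +1, -1, +2, -2, ...
// No deletions, capacity must be a power of two.
final class BiProbeIntSet {
    private static final int EMPTY = Integer.MIN_VALUE; // sentinel; assumed never used as a key
    private final int[] slots;
    private final int mask;

    BiProbeIntSet(int capacityPow2) {
        slots = new int[capacityPow2];
        Arrays.fill(slots, EMPTY);
        mask = capacityPow2 - 1;
    }

    private int home(int key) {
        int h = key * 0x9E3779B9;      // Fibonacci hashing to spread keys
        return (h ^ (h >>> 16)) & mask;
    }

    boolean add(int key) {
        int h = home(key);
        for (int d = 0; d < slots.length; d++) {
            int off = (d + 1) >> 1;    // 0, 1, 1, 2, 2, 3, 3, ...
            int i = ((d & 1) == 1 ? h + off : h - off) & mask;
            if (slots[i] == key) return false;   // already present
            if (slots[i] == EMPTY) {             // first free slot in probe order
                slots[i] = key;
                return true;
            }
        }
        return false;                  // table full
    }

    boolean contains(int key) {
        int h = home(key);
        for (int d = 0; d < slots.length; d++) {
            int off = (d + 1) >> 1;
            int i = ((d & 1) == 1 ? h + off : h - off) & mask;
            if (slots[i] == key) return true;
            // Insert would have claimed this slot, so the key is absent.
            if (slots[i] == EMPTY) return false;
        }
        return false;
    }
}
```

Since insert and lookup walk the same deterministic probe sequence and there are no deletions, hitting an empty slot during lookup is a safe early exit.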

  • I always enjoy reading stuff written by Andrey, he's a brilliant fellow for sure.

    Can highly recommend his personal blog as well: https://puzpuzpuz.dev/

From the article:

    "Imagine that we run this query over a few hundred million rows. This means at least a few hundred million hash table operations. As you might imagine, a slow hash table would make for a slower query. A faster hash table? Faster queries!"

    I'll read the article properly after this (this is just a quick skim), but I can't see how this quote can be correct. Unless I'm missing something, the hashing function is fast compared to random bouncing around inside RAM, very much faster than random memory accesses. So I can't see how it makes a difference.

    Okay, I'll read the article now…

    Edit:

    "If you insert "John" and then "Jane" string keys into a FastMap, then that would later become the iteration order. While it doesn't sound like a big deal for most applications, this guarantee is important in the database world.

    If the underlying table data or index-based access returns sorted data, then we may want to keep the order to avoid having to sort the result set. This is helpful in case of a query with an ORDER BY clause. Performance-wise, direct iteration over the heap is also beneficial as it means sequential memory access."

    But if "...the underlying table data or index-based access returns sorted data...", then you've already got sorted data, in which case surely you'd use a merge join instead of a hash join.
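    For anyone unfamiliar with the insertion-order guarantee the article describes, the JDK's LinkedHashMap exhibits the same property and works as a stand-in here (just an illustration of the guarantee; FastMap's actual API and layout are different):

    ```java
    import java.util.LinkedHashMap;
    import java.util.Map;

    public class InsertionOrderDemo {
        public static void main(String[] args) {
            // LinkedHashMap iterates in insertion order, unlike HashMap.
            Map<String, Integer> map = new LinkedHashMap<>();
            map.put("John", 1);
            map.put("Jane", 2);
            // Prints John=1 then Jane=2, matching insertion order.
            map.forEach((k, v) -> System.out.println(k + "=" + v));
        }
    }
    ```

    If the build side is fed in sorted order, an insertion-order-preserving table hands back sorted results on iteration, which is the property the article is leaning on.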

  • Hi, I'm curious how you deal with the potential for hash collisions across a large data set - is that a post-join check?

  • Serious question, if performance is the lynchpin, why write it in Java?

    Especially considering they use unsafe "heavily", for big joins they could easily just call out to some native code if the surrounding code reaaaaally must be Java (again, why?). Using unsafe Java is the worst of both worlds: you don't get native speed, there's loads of memory overhead from everything being an Object (besides the rest of the VM stuff), you get to "enjoy" GC pauses in the middle of your hot loops, and you have fewer safety guarantees than something like Rust.