The missing tier for query compilers

  • The author struck out the part about CedarDB not being available -- which is true -- but Umbra has been available as a Docker container[1] for some time now. The linked "Umbra paper" also comes with an artifact[2] containing a copy of Umbra, as well as some instructions on how to control which back-ends are used, etc. (Cranelift is not available as an option in the Docker versions, however.)

    I kind of disagree with the assumption that baseline compilers are easy to build (depending on your definition of baseline). A back-end like DirectEmit is not easy to write and even harder to verify. If you have a back-end written for your own IR, you will likely have to write tests in that IR, and it will probably be quite hard to simply port over codegen (or runtime) tests from other compilers. Especially in the context of databases, it is not very reassuring to have a back-end that may explode the second you start to generate code slightly differently. We're working on making this a bit more commoditized, but in our opinion you will always need to do some work, since putting another IR (one with defined semantics that someone else could write a code generator for) in front of a back-end is very expensive. In Umbra, translating Umbra-IR to LLVM-IR takes more time than compiling all the way to machine code with DirectEmit.
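
    To make the "not easy to write" point concrete, here is a toy sketch of what a direct-emit tier does -- this is not Umbra's DirectEmit, just an illustration assuming Linux/x86-64 and the SysV ABI, with a hypothetical emit_add helper:

      #include <string.h>
      #include <sys/mman.h>

      typedef int (*binop_fn)(int, int);

      /* Emit `return l + r` directly as machine code into an executable
         buffer, with no lower-level IR in between.  Note: many platforms
         forbid writable+executable mappings (W^X), so a real tier also
         needs mmap-then-mprotect dances on top of getting encodings right. */
      static binop_fn emit_add(void) {
          /* lea eax, [rdi + rsi]; ret  -- args arrive in edi/esi */
          static const unsigned char code[] = { 0x8d, 0x04, 0x37, 0xc3 };
          void *buf = mmap(NULL, sizeof code, PROT_READ | PROT_WRITE | PROT_EXEC,
                           MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
          if (buf == MAP_FAILED) return NULL;
          memcpy(buf, code, sizeof code);
          return (binop_fn)buf;  /* emit_add()(2, 3) == 5 */
      }

    Getting four bytes right for one addition is easy; doing it for a whole IR, for every operand combination, and convincing yourself it never miscompiles is the hard part.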

    Also, if they were easy to write, I would expect to see more people writing them.

    Copy-and-patch was also tried in the context of MLIR[3], but the execution-time results were not that convincing, and I have been told that register allocation is unlikely to work well enough to make a difference.

    [1]: https://hub.docker.com/r/umbradb/umbra

    [2]: https://zenodo.org/records/10357363

    [3]: https://home.cit.tum.de/~engelke/pubs/2403-cc.pdf

  • The tradeoff in SingleStore is interesting. By default, and unlike e.g. Postgres, parameterized queries are planned while ignoring the values of the parameters. This allows caching the compiled query but prevents adapting the query plan to the actual values - for the example in the post, SingleStore would pick one plan for both queries.

    But you can opt out of this behaviour by wrapping each parameter in NOPARAM (https://docs.singlestore.com/cloud/reference/sql-reference/c...).
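
    To sketch the tradeoff (hypothetical signatures, not SingleStore's actual internals): the two designs differ only in what the compiled-plan cache is keyed on.

      typedef struct Plan Plan;  /* a planned/compiled query */

      /* SingleStore-style default: key on the parameterized SQL text alone,
         so "WHERE status = ?" maps to one cached compiled plan no matter
         which value is bound -- maximum reuse, no value-specific plans. */
      Plan *lookup_plan_by_shape(const char *param_sql);

      /* Postgres-style custom plan: the bound values participate in the
         key, so a rare value and a common value can get different plans,
         but the expensive planning/compilation step repeats per value set. */
      Plan *lookup_plan_by_shape_and_values(const char *param_sql,
                                            const char **values, int nvalues);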

  • The author pans meta-tracing and shows a graph of TruffleRuby exhibiting wildly erratic runtime behavior. However, the graph looks extremely suspicious - like the kind of thing you'd see if there were a memory leak and the garbage collector were running constantly.

    Most people seem really excited about the results coming out of Graal, and TruffleRuby has some really impressive results in general, so the graph is surprising. It's also missing a bunch of details, so it's hard to speak to its veracity - for example, what versions of the runtimes were used, with what flags, and on what hardware?

    As a counterexample, there's a different (admittedly old) result where TruffleRuby beats CRuby on the same benchmark by 4.38x. https://eregon.me/blog/2022/01/06/benchmarking-cruby-mjit-yj...

  • This is an interesting article!

    ClickHouse takes a hybrid approach. It uses a vectorized interpreter plus JIT compilation (with LLVM) of small expressions; the JIT-compiled code is also vectorized, since it compiles loops generated for the target instruction set. The JIT does not bother compiling the whole query, only simple fragments, and it gives only a marginal improvement over the vectorized interpreter.
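
    As a sketch of what that hybrid looks like (hypothetical names, not ClickHouse's actual kernels): the interpreter runs one precompiled vectorized kernel per operator per batch, materializing temporaries, while the JIT fuses a small expression like a*b + c into a single loop compiled for the host ISA.

      enum { BATCH = 4096 };

      /* Vectorized interpreter: one precompiled pass per operator. */
      void mul_columns(const float *a, const float *b, float *out, int n);
      void add_columns(const float *a, const float *b, float *out, int n);

      void eval_interpreted(const float *a, const float *b, const float *c,
                            float *out, int n) {      /* assumes n <= BATCH */
          float tmp[BATCH];
          mul_columns(a, b, tmp, n);    /* materializes a*b */
          add_columns(tmp, c, out, n);  /* then tmp + c   */
      }

      /* What the JIT emits for the same fragment: one fused loop, no
         temporary, vectorizable for the target instruction set. */
      void eval_fused(const float *a, const float *b, const float *c,
                      float *out, int n) {
          for (int i = 0; i < n; i++)
              out[i] = a[i] * b[i] + c[i];
      }

    With the interpreted version already vectorized, fusing away the temporary plausibly buys only the marginal improvement described above.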

    Even with this approach, JIT is a huge pile of problems. A few details here: https://clickhouse.com/blog/clickhouse-just-in-time-compiler...

    Overall architecture overview: https://clickhouse.com/blog/first-clickhouse-research-paper-...

  • Clustrix wrote a compiler. It was able to call out to C for string intrinsics, communication, and btree operations, so the generated code primarily handled extracting data from serialized formats, doing simple arithmetic, and constructing serialized output for downstream consumers.

    This was to replace a little bytecode VM, and as far as I recall it took a very good programmer about two weeks for the basic version.

    I don't see any reason to even try to bring in something like a general-purpose JIT. There's a huge benefit from just doing the basics, and you can move the line between precompiled functions and dynamically compiled ones on an ongoing basis. It's also cheap enough that I wouldn't get hung up on needing to amortize the cost with prepared statements.
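
    A sketch of that split (hypothetical names, not Clustrix's actual code): the heavy lifting stays in precompiled C, and the generated code is just glue doing extraction and arithmetic.

      #include <string.h>

      /* Precompiled, hand-written C: generated code calls out to these. */
      int str_compare_collated(const char *a, const char *b);
      int btree_probe(void *tree, const void *key);

      /* What the compiler actually has to emit -- assumes a fixed
         serialized row layout with qty at offset 8 and price at 12. */
      int eval_row(const char *row) {
          int qty, price;
          memcpy(&qty,   row + 8,  sizeof qty);   /* extract from serialized row */
          memcpy(&price, row + 12, sizeof price);
          return qty * price;                     /* inlined simple arithmetic */
      }

    Moving the line then just means deciding, per operation, whether it lives on the precompiled side or gets emitted as glue.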

  • The real original copy-and-patch paper [0] used GCC's labels-as-values (addressable gotos) to get the addresses of the "patches", plus pointers for the operands, which can be replaced at JIT time - as opposed to the referenced paper, which uses LLVM tomfoolery to create a database of code patches as part of the build process.

    IIRC, there may be another "original" paper.

    To help with the author's confusion: they use function arguments to manage the registers. If you have a simple operation like 'add', you pass the left and right operands as function arguments; if you have other values you want to keep in registers, you pass those as additional arguments and just pass them through to the continuation, hoping the compiler does the right thing (i.e., doesn't spill the arguments across the call).

    Of course this leads to a combinatorial explosion of functions, as you need one variant specialized for every number of registers you want to pass through:

      typedef void (*kont)(int result);  /* continuation; simplified, since each
                                            variant's continuation really needs to
                                            accept the pass-through values too */
      void add0(int l, int r, kont c);                  /* nothing kept live */
      void add1(int l, int r, int r0, kont c);          /* one value passed through */
      void add2(int l, int r, int r0, int r1, kont c);  /* two values passed through */
    
    and you also need to track, in your jitifier, who's doing what and how many intermediate results are live at which program point, etc.

    I believe both ways share the same method of generating a shitton of functions as a linked library and using that to create a database of code patches at compile time, which the runtime uses to patch it all together. One does it using LLVM tooling, and the other uses 'extern int add_l, add_r, *add_r0' to get the locations of the operands to replace at runtime with the actual values, because C doesn't care one bit what you do with pointers.
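
    A sketch of that extern trick (my reconstruction, not code from either paper): the stencil below is compiled to an object file but never linked normally, so &add_l and &add_r survive as relocations that the patch database records and the runtime overwrites.

      #include <stdint.h>

      extern char add_l, add_r;  /* never defined: pure patch placeholders */

      typedef void (*kont)(int64_t result);

      void add_stencil(kont c) {
          int64_t l = (int64_t)(intptr_t)&add_l;  /* relocation site: patched
                                                     with the actual operand
                                                     value at JIT time */
          int64_t r = (int64_t)(intptr_t)&add_r;  /* likewise */
          c(l + r);  /* continue into the next patched-together stencil */
      }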

    I'm probably wrong on some of the details but that's the basics of what's happening.

    __edit__ Rereading the part on how extern variables are used, my example function definitions don't match perfectly with what is actually happening, but the concept is the same -- use the locations the pointers point to in order to find the places to replace with actual values at runtime. The function defs are more like what a 'musttail interpreter' would use. Not sure this 'correction' is any clearer...

    [0] http://ieeexplore.ieee.org/document/1342540/