Umbra: A Disk-Based System with In-Memory Performance [pdf]

  • I still maintain that the existence of in-memory databases has two main sources: scalability bottlenecks in GC, and storage latency falling behind network latency and staying there.

    If general-purpose programming languages could store data efficiently in main memory, the feature set of in-memory databases is not so large that you couldn't roll your own incrementally. But your GC times are going to go nuts, and you'll go off the rails.

    If the speed of light governed data access, you’d collect your data locally and let the operating system decide which hot paths to keep in memory versus storage.

    The last time the network was faster than disk was the 1980s, and we got things like process migration systems (Sprite). Those evaporated once the pendulum swung back.
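
    For what "let the operating system decide" could look like in practice, here is a minimal sketch, assuming a file of fixed-size records read through a read-only memory map. The record layout and the helper names (write_records, read_record, sum_records) are made up for illustration and have nothing to do with how Umbra manages its buffers.

```python
import mmap
import os
import struct
import tempfile

# Hypothetical record layout: one little-endian signed 64-bit integer.
RECORD = struct.Struct("<q")


def write_records(path, values):
    """Write each value as one fixed-size record."""
    with open(path, "wb") as f:
        for v in values:
            f.write(RECORD.pack(v))


def read_record(mapped, index):
    """Read one record straight out of the mapping.

    If the page holding this record is not resident, the access faults it
    in; which pages stay in RAM afterwards is the kernel's decision.
    """
    (value,) = RECORD.unpack_from(mapped, index * RECORD.size)
    return value


def sum_records(path, indices):
    """Sum the records at the given indices via a read-only memory map."""
    with open(path, "rb") as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mapped:
            return sum(read_record(mapped, i) for i in indices)


if __name__ == "__main__":
    path = os.path.join(tempfile.mkdtemp(), "records.bin")
    write_records(path, range(1000))
    print(sum_records(path, [0, 10, 999]))
```

    The application never decides what to cache here. Records that get touched repeatedly tend to stay in the page cache, and cold ones can be dropped under memory pressure.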

  • > we can achieve comparable performance to an in-memory database system for the cached working set

    Should I keep reading, or is the title misleading?

    The abstract seems to say that the system provides memory-comparable performance for data that is... in memory.

  • Umbra in ClickBench: https://github.com/ClickHouse/ClickBench/pull/161

    The initial submission didn't reproduce successfully due to a segmentation fault in an attempt to restart it after data loading. But after some changes, it started to work and showed exceptionally good results.

  • This is a database system, in case you're checking the comments to find out what type of system this is about. The paper appears in the 10th Annual Conference on Innovative Data Systems Research (CIDR), and that context makes it clear.

  • You can see additional papers from the same group at https://umbra-db.com/#publications

  • Does this use io_uring?

  • What black magic is this?

  • Can this beat Redis?

  • Obligatory link to Neumann’s presentation for the CMU DB lecture series

    https://m.youtube.com/watch?v=pS2_AJNIxzU