An LLM+OLAP Solution

  • For an article about LLM+OLAP, it doesn't spend much time on that part. Specifically, their strategy seems to be using an LLM to generate a DSL query for an unnamed semantic layer; everything downstream of that is normal warehousing, with the semantic layer handling the actual SQL generation.

    I wish it had spent time talking about how they trained their LLM to reliably generate parsable queries for the semantic layer, and what the accuracy rate was between what the user intended and what they got.

    I do think the only way an LLM-based analytics tool can succeed is via a semantic layer rather than direct SQL, since database schemas fail to encode a lot of information about the data (e.g. a warehouse might not even know that user.customer_id = customer.id). There's a rough sketch of what I mean at the end of this comment.

    Malloy could be an interesting target here.
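
    (Rough sketch of what I mean, in Python; this is a made-up mini-layer, not Malloy or the article's actual DSL. The point is that the join lives in the layer's model, so the LLM only ever names a measure and a dimension:)

      # Hypothetical mini semantic layer: the users/customers join is
      # declared once, so generated queries never touch raw tables or keys.
      SEMANTIC_MODEL = {
          "joins": [("users.customer_id", "customers.id")],
          "measures": {"user_count": "COUNT(DISTINCT users.id)"},
          "dimensions": {"region": "customers.region"},
      }

      def compile_query(measure: str, dimension: str) -> str:
          """Compile a tiny DSL request into SQL using the declared join."""
          left, right = SEMANTIC_MODEL["joins"][0]
          left_table, right_table = left.split(".")[0], right.split(".")[0]
          return (
              f"SELECT {SEMANTIC_MODEL['dimensions'][dimension]} AS {dimension}, "
              f"{SEMANTIC_MODEL['measures'][measure]} AS {measure} "
              f"FROM {left_table} JOIN {right_table} ON {left} = {right} "
              f"GROUP BY 1"
          )

      # The LLM's entire job is to emit {"measure": "user_count", "dimension": "region"}
      print(compile_query("user_count", "region"))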

  • Having built a few variations on data chatbots in the past year, I've found that my favorite / most fun to use ones tend to be the more "chain-of-thought", conversational ones rather than the "retrieval-augmented" style.

    Less about one-shotting the answer, and more about showing its work and, if it errors, letting it self-correct. Latency goes up, but the quality of the entire conversation also goes up, and it feels like it builds more trust with the user. Key steps are asking it to "check its work" and watching it work through new code, etc. (I open-sourced one version of this: https://github.com/approximatelabs/datadm, which can be run entirely locally / privately.) A rough sketch of the loop is at the end of this comment.

    From their article: I'm surprised they got something working well by going through an intermediate DSL -- that's moving even further away from the source material that the LLMs are trained on, so it's an entirely new thing to either teach or assume is part of the in-context learning.

    All that said, interesting: I'll definitely have to try out tencentmusic/supersonic and see how it feels myself.
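
    (Rough sketch of the loop, with llm and run_code as stand-ins for any chat model and sandbox; this is not datadm's actual API:)

      # Sketch of a conversational, self-correcting analysis loop.
      # llm(history) -> str and run_code(code) -> (result, error) are
      # assumed to exist; they stand in for any chat model and sandbox.
      def answer_with_self_correction(question, llm, run_code, max_attempts=3):
          history = [{"role": "user", "content": question}]
          for _ in range(max_attempts):
              code = llm(history)                 # model shows its work as code
              result, error = run_code(code)      # execute it in a sandbox
              history.append({"role": "assistant", "content": code})
              if error is None:
                  # ask it to check its work before trusting the answer
                  history.append({"role": "user",
                                  "content": f"Result: {result}. Check your work."})
                  return llm(history)
              # feed the error back so it can self-correct on the next turn
              history.append({"role": "user",
                              "content": f"Error: {error}. Please fix and retry."})
          return "No working answer within the attempt budget."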

  • Has anyone attempted to use Doris or evaluated it against ClickHouse? I have to admit I'd never heard of it before; is it used beyond Tencent-owned companies?

  • I would really like to see (and work for) a company that is building novel understanding of actual data and schemas with LLMs. Characterizing the data and a limited number of transforms for an LLM should produce much more reliable tools than just piping raw text to a non-enhanced LLM. Has anyone seen companies doing this? (A sketch of what I mean by characterizing is below.)
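
    (Something like this sketch: profile each column into a compact summary that goes into the LLM's context instead of raw rows; pandas assumed, names made up:)

      # Sketch: characterize a table for an LLM by profiling each column.
      import pandas as pd

      def characterize(df: pd.DataFrame, sample_rows: int = 3) -> str:
          lines = []
          for col in df.columns:
              s = df[col]
              lines.append(
                  f"- {col}: dtype={s.dtype}, null_rate={s.isna().mean():.0%}, "
                  f"distinct={s.nunique()}, "
                  f"examples={s.dropna().head(sample_rows).tolist()}"
              )
          return "\n".join(lines)

      df = pd.DataFrame({"customer_id": [1, 2, 2], "region": ["EU", "US", None]})
      print(characterize(df))  # this summary, not raw text, goes in the prompt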

  • It looks like ClickHouse's competitors are catching up quickly, particularly StarRocks, which was first a fork of Apache Doris and then a rewrite. They claim a faster query engine with a cost-based optimizer and cross-table joins. I'm wondering if ClickHouse will release something major soon too.

  • Anyone know if you could put something like this over DuckDB?

    I'm prototyping a distributed DuckDB in the same vein as Litestream for SQLite, and I wonder if it would be a good fit for something like this.
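
    (For what it's worth, the LLM -> semantic layer -> SQL part looks engine-agnostic; here's a quick sketch of pointing the compiled SQL at DuckDB instead of Doris or ClickHouse:)

      # Sketch: execute semantic-layer-compiled SQL against DuckDB.
      import duckdb

      con = duckdb.connect()  # in-memory; pass a file path to persist
      con.execute("CREATE TABLE customers AS SELECT 1 AS id, 'EU' AS region")
      con.execute("CREATE TABLE users AS SELECT 10 AS id, 1 AS customer_id")

      # sql would come from the semantic layer's compiler, not the LLM directly
      sql = """
          SELECT c.region, COUNT(DISTINCT u.id) AS user_count
          FROM users u JOIN customers c ON u.customer_id = c.id
          GROUP BY 1
      """
      print(con.execute(sql).fetchall())  # [('EU', 1)]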

  • The main idea of this solution is to make up for large language models' shortage of niche domain knowledge.

  • delphihq.com uses LLMs with semantic layers like Cube/AtScale/dbt/Looker/Lightdash.

  • Which semantic layer are they using?

  • Odd choice to have such a small example and then redact most of it. How am I supposed to know whether this is useful or not?