DeepSeek's smallpond: Bringing Distributed Computing to DuckDB

  • DeepSeek is the real "open<something>" that the world needed. With these three projects, DeepSeek has addressed not only efficient AI but also distributed computing:

    1. smallpond: https://github.com/deepseek-ai/smallpond

    2. 3FS: https://github.com/deepseek-ai/3FS

    3. DeepEP: https://github.com/deepseek-ai/DeepEP

  • Looks like we are approaching the "distributed" phase of the distributed-centralized computing cycle :)

    Not saying this is bad, but it's just interesting to see after being in the industry for 8 years.

  • Isn’t the whole point of DuckDB that it’s not distributed?

  • I'm not massively knowledgeable about the ins and outs of DeepSeek, but I think I'm in the right place to ask. My understanding is that DeepSeek:

    - Achieved LLM performance comparable to OpenAI's at a fraction of the cost, using more off-the-shelf hardware.

    - Seems to be open-sourcing lots of distributed stuff.

    My question is, are those two things related? Did distributed computing enable the AI model somehow? If so, how? Or is it not that simple?

  • Does anyone know of blog posts with benchmarks showing the performance of smallpond on its own, let alone 3FS + smallpond?

    A lot of blogs praise these new systems, but don't really provide any numbers :/

  • Spark is getting a bit long in the tooth. Interesting to see DuckDB integrated with Ray for data-access partitioning across (currently) 3FS. It's probably a matter of time before they (or someone) support S3. It should be noted that DuckDB (standalone) actually does a pretty good job scanning Parquet on S3 on its own.
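
    For anyone wondering what "partitioning" buys you here, a rough sketch in plain Python. This is not smallpond's or Ray's actual API; `hash_partition` and `count_per_ticker` are made-up names, and the per-partition function just stands in for whatever query each worker would run on its slice. The idea is only that rows are split by a hash of a key so each partition can be processed independently and the results merged afterwards.

```python
from zlib import crc32


def hash_partition(rows, key, num_partitions):
    """Split rows into num_partitions lists by hashing the given key column."""
    partitions = [[] for _ in range(num_partitions)]
    for row in rows:
        # crc32 is stable across runs, unlike Python's built-in hash() for str.
        idx = crc32(str(row[key]).encode("utf-8")) % num_partitions
        partitions[idx].append(row)
    return partitions


def count_per_ticker(partition):
    """Hypothetical per-partition work: count rows per ticker."""
    counts = {}
    for row in partition:
        counts[row["ticker"]] = counts.get(row["ticker"], 0) + 1
    return counts


# Toy input data, invented for the example.
rows = [
    {"ticker": "AAPL", "price": 1.0},
    {"ticker": "MSFT", "price": 2.0},
    {"ticker": "AAPL", "price": 3.0},
    {"ticker": "GOOG", "price": 4.0},
    {"ticker": "MSFT", "price": 5.0},
]

parts = hash_partition(rows, "ticker", 3)

# In a real distributed setup each partition would go to a separate worker;
# here they are simply processed one after another.
results = [count_per_ticker(p) for p in parts]

# Merging is a plain union because every ticker lands in exactly one partition.
merged = {}
for partial in results:
    merged.update(partial)
```

    Because all rows with the same key end up in the same partition, the per-partition results never overlap, which is what makes the final merge trivial.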