Why DeepSeek is cheap at scale but expensive to run locally

  • I run DeepSeek V3 locally as my daily driver and I find it affordable, fast and effective. The article assumes GPU serving, which in my opinion is not the best way to run large models like this locally. I run a mid-range EPYC 9004-series home server on a Supermicro motherboard, which cost around $4,000 all-in. It's a single-CPU machine with 384GB RAM (you could get 768GB using 64GB sticks, but that costs more). With no GPU, power draw is less than a gaming desktop. Given the RAM limitation I run an Unsloth Dynamic GGUF which, quality-wise, performs very close to the original in real-world use. It is around 270GB, which leaves plenty of room for context - I normally run 16k context since I use the machine for other things too, but can raise it to 24k if I need more. I get about 9-10 tokens per second, dropping to 7 tokens/second with a large context. Plenty of people run similar setups with 2 CPUs and the full version at similar tokens/second.
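
    A hedged back-of-envelope (Python, with assumed figures for the quant density and the memory bandwidth, not measurements) for why numbers like these are plausible on a CPU-only box:

      # Decode speed is roughly memory bandwidth divided by the bytes of *active*
      # parameters read per token. All figures below are rough assumptions.
      active_params = 37e9        # DeepSeek V3 active parameters per token (~37B)
      bits_per_weight = 3.2       # assumed average for a ~270GB dynamic GGUF of a 671B model
      bytes_per_token = active_params * bits_per_weight / 8      # ~15 GB read per token

      mem_bandwidth = 460e9       # assumed peak for 12-channel DDR5-4800 on EPYC 9004

      print(f"~{mem_bandwidth / bytes_per_token:.0f} tokens/s theoretical ceiling")
      # Real-world speeds land well below this ceiling (sustained vs. peak bandwidth,
      # attention/KV work, routing overhead), which is consistent with ~9-10 tok/s.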

  • This is an interesting blog post. While the general conclusion ("We need batching") is true, inference for mixture-of-experts (MoE) models is actually a bit more nuanced.

    The main reason we want big batches is that LLM inference is not limited by compute, but by loading every single weight out of VRAM. Just compare the number of TFLOPS of an H100 with its memory bandwidth: there's basically room for ~300 FLOPs per byte loaded. So that's why we want big batches: we can perform a lot of operations per parameter/weight that we load from memory. This compute-vs-bandwidth tradeoff is what the "roofline model" describes.
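
    To make the "~300 FLOPs per byte" figure concrete, here is a minimal sketch using rough spec-sheet numbers for an H100 SXM (assumed, not exact):

      peak_flops = 989e12      # ~989 TFLOPS dense BF16 (assumed)
      mem_bandwidth = 3.35e12  # ~3.35 TB/s HBM3 (assumed)

      print(f"~{peak_flops / mem_bandwidth:.0f} FLOPs available per byte loaded from HBM")

      # Batch-1 decode only does ~1 FLOP per weight byte at BF16 (2 FLOPs per 2-byte
      # weight), so the GPU mostly waits on memory. Batching N requests reuses each
      # loaded weight ~N times, pushing utilization toward the compute roof.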

    As models become bigger, this stops scaling because the model weights no longer fit into a single GPU's memory and you need to distribute them across GPUs or across nodes. Even with NVLink and InfiniBand, these transfers are slower than loading from VRAM. NVLink is still fine for tensor parallelism, but across nodes it is quite slow.

    So what MoE enables is expert parallelism, where different nodes keep different experts in memory and don't need to communicate as much between nodes. This only works if there are enough nodes to keep all experts in VRAM and still have headroom for other stuff (KV cache, other weights, etc.). So naturally the possible batch size becomes quite large, and of course you want to maximize it to make sure all GPUs are actually doing work.
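
    A toy sketch of the routing step that expert parallelism relies on (the sizes, the block layout of experts across ranks, and all names are illustrative, not DeepSeek's actual scheme):

      import numpy as np

      # Each token picks its top-k experts; with expert parallelism each rank owns a
      # subset of experts, so tokens are dispatched to the ranks owning their experts.
      n_tokens, d_model, n_experts, top_k, n_ranks = 4096, 1024, 64, 4, 8

      x = np.random.randn(n_tokens, d_model).astype(np.float32)
      gate_w = np.random.randn(d_model, n_experts).astype(np.float32)

      logits = x @ gate_w
      topk = np.argsort(logits, axis=-1)[:, -top_k:]     # chosen experts per token

      owner = np.arange(n_experts) // (n_experts // n_ranks)   # which rank holds each expert

      # Token-expert pairs each rank must process this step; the load balancer's job
      # is to keep these counts roughly even so no rank sits idle.
      print(np.bincount(owner[topk].ravel(), minlength=n_ranks))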

  • For those looking to save time, the answer is batched inference: running multiple people's prompts through a model instance at the same time, instead of just tightly timesharing each model instance (there's a toy sketch of this below).

    This is also why you may see variance in replies when using these services, even when you set the temperature to 0 and the seed to a fixed value: you don't control the other prompts yours get batched with. Could this be a data exfiltration attack vector? Probably; I didn't "research" that far.
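
    A toy illustration of what batching means at the matrix level (plain NumPy, nothing provider-specific): the same feed-forward weights serve every request in the batch, so they are read from memory once instead of once per request.

      import numpy as np

      d_in, d_out, batch = 4096, 11008, 32
      W = np.random.randn(d_in, d_out).astype(np.float32)   # weights shared by all requests

      # Sequential serving: one matmul per request, W is (conceptually) re-read each time.
      xs = [np.random.randn(1, d_in).astype(np.float32) for _ in range(batch)]
      seq_out = [x @ W for x in xs]

      # Batched serving: stack the requests and do a single, larger matmul;
      # W is read once and reused for all 32 requests.
      batched_out = np.vstack(xs) @ W

      assert np.allclose(np.vstack(seq_out), batched_out, rtol=1e-3, atol=1e-3)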

  • Here’s a concise explanation:

    - High sparsity means you need a very large batch size (number of requests being processed concurrently) so that each matrix multiplication is of sufficient arithmetic intensity to get good utilization.

    - At such a large batch size, you’ll need a decent number of GPUs — 8-16 or so depending on the type — just to fit the weights and MLA/KV cache in HBM. But with only 8-16 GPUs your aggregate throughput is going to be so low that each of the many individual user requests will be served unacceptably slowly for most applications. Thus you need more like 256 GPUs for a good user experience.
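
    A rough sketch of the memory-fit side of that argument (every figure below is an assumption for illustration):

      import math

      total_params = 671e9           # DeepSeek V3/R1 total parameters
      bytes_per_param = 1.0          # FP8 weights (assumed)
      weight_bytes = total_params * bytes_per_param        # ~671 GB

      kv_and_activations = 100e9     # assumed headroom for MLA/KV cache, activations, buffers
      hbm_per_gpu = 80e9             # H100/H800-class GPUs

      print(math.ceil((weight_bytes + kv_and_activations) / hbm_per_gpu))
      # ~10 GPUs just to *fit* the model on these assumptions, i.e. the 8-16 range above.
      # Fitting is not serving: at the batch sizes sparsity demands, that few GPUs gives
      # too little aggregate throughput per user, hence clusters more like 256 GPUs.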

  • I'm not an ML researcher or engineer, so take this with a grain of salt, but I'm a bit confused by this post.

    DeepSeek V3/R1 are expensive to run locally because they are so big compared to the models people usually run locally. The number of active parameters is obviously lower than the full model size, but that basically just helps with the compute requirements, not the memory requirements. Unless you have multiple H100s lying around, V3/R1 are only run locally as impractical stunts, with some or all of the model stored on low-bandwidth memory.

    We can't compare the size of DeepSeek V3 to that of any proprietary frontier model because we don't know the size of those models at all (or even their architecture). The models it's being compared to, the ones that are "expensive at scale", you can't run locally at all, and surely we have no reason to believe they'd somehow be cheap to run locally?

    And I'd have thought you'd typically expect exactly the opposite of the effect claimed here: MoE should be the better tradeoff for the local/single-user scenario, since the downside of batching being harder / less efficient doesn't matter.

    > Bigger batches raise latency because user tokens might be waiting up to 200ms before the batch is full enough to run, but they boost throughput by allowing larger (and thus more efficient) GEMMs in the feed-forward step

    Is it really that the matrices being multiplied are larger? My mental model is that the purpose of batching isn't to get larger input matrices; it's to move the bottleneck from memory bandwidth to compute. The matrices are already sharded to a much smaller size than the entire model or even a layer. So you basically load some slice of the weights from HBM to SRAM, do the multiplication for that slice, and then aggregate the results once all tiles have been processed. Batching lets you do multiple separate computations with the same weights, meaning you get more effective FLOPS per unit of memory bandwidth.
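
    Both framings are consistent; here is a hedged sketch of the "more FLOPs per byte of weights" view for a single weight matrix (illustrative sizes, FP16 weights):

      # Arithmetic intensity of Y = X @ W for a (B, d_in) activation batch and a
      # (d_in, d_out) weight matrix, counting only weight traffic (activations are
      # comparatively small at decode time).
      def flops_per_weight_byte(batch, d_in=7168, d_out=18432, bytes_per_weight=2):
          flops = 2 * batch * d_in * d_out          # one multiply-add per (token, weight)
          weight_bytes = d_in * d_out * bytes_per_weight
          return flops / weight_bytes

      for b in (1, 8, 64, 256):
          print(b, flops_per_weight_byte(b))
      # Intensity grows linearly with batch size (~B FLOPs per weight byte at FP16),
      # which is what moves the kernel from memory-bound toward compute-bound.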

    > The fact that OpenAI and Anthropic’s models are quick to respond suggests that either:

    Is that actually a fact? The post has no numbers on the time to first token for any of the three providers.

  • Imagine an FPGA big enough to hold the whole model in LUTs (and not RAM), with latches in appropriate places to keep race conditions in check. Even a 100 MHz clock would beat almost anything else in the world running it. Even if there were 500 pipeline stages involved, you could still get 200,000 tokens per second for a single stream and have 499 streams ready for other uses.

    With an FPGA like that, you could translate all of the matrix multiplies and weights directly into binary logic, optimizing out every multiply or add of a zero bit. This alone could cut the number of gates and computations, and power consumption in half.

    Because you wouldn't need to throw data to/from RAM, you'd save a huge percentage of the usual latency and eliminate memory bandwidth issues. The effective equivalent memory bandwidth would likely be measured in exabytes per second.

    This is the type of compute load that would perfectly match a bit level systolic array.

  • There's still a lot of opportunity for software optimization here. The trouble is that really only two classes of systems get optimizations for DeepSeek: 1 small GPU + a lot of RAM (ktransformers), and the system that has all the VRAM in the world.

    A system with, say, 192GB of VRAM and the rest standard memory (DGX Station, 2x RTX Pro 6000, 4x B60 Dual, etc.) could still in theory run DeepSeek at 4-bit quite quickly because of the power-law-like usage of the experts.

    If you aren't prompting DeepSeek in Chinese, a lot of the experts don't activate.

    This would be an easier job for pruning, but I still think enthusiast systems are going to trend, over the next couple of years, in a direction that makes these kinds of software optimizations useful on a much larger scale.
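
    One rough sketch of what such an optimization could look like (hypothetical names, not an existing ktransformers or llama.cpp feature): keep the most-used experts resident in VRAM and stream the cold ones from host RAM on demand.

      from collections import OrderedDict

      class ExpertCache:
          """Hypothetical LRU cache of expert weights: hot experts stay in VRAM,
          cold ones are copied from host RAM when a token routes to them."""
          def __init__(self, capacity, load_to_gpu):
              self.capacity = capacity          # how many experts fit in VRAM
              self.load_to_gpu = load_to_gpu    # callable: expert_id -> device tensor
              self.cache = OrderedDict()        # expert_id -> device tensor

          def get(self, expert_id):
              if expert_id in self.cache:
                  self.cache.move_to_end(expert_id)     # mark as recently used
              else:
                  if len(self.cache) >= self.capacity:
                      self.cache.popitem(last=False)    # evict the coldest expert
                  self.cache[expert_id] = self.load_to_gpu(expert_id)
              return self.cache[expert_id]

      # If expert usage really is power-law for a given language or domain, the hit
      # rate stays high and most tokens never touch PCIe at all.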

    There's a user on Reddit with a 16x 3090 system (PCIe 3.0 x4 interconnect, which doesn't seem to be running at full bandwidth during tensor parallelism) who gets 7 tokens/s in llama.cpp. A single 3090 has enough VRAM bandwidth to scan over its 24GB of memory 39 times per second, so something else must be limiting performance.

  • This is a great explainer from an LLM perspective, and it would be interesting to see a computational scheduling explanation in depth. I presume that hyperscale LLM companies extensively examine the computation trace to identify bottlenecks and idle bubbles, and develop load balancers, pipeline architectures and schedulers in order to optimise their workload.

    The batching requirement for efficiency makes high-security applications quite difficult, because the normal technique of isolating unrelated queries would become very expensive. NVIDIA's vGPU virtualisation time-shares GPU memory, and every switch requires unload/reload context switches; I doubt it does deduplication. Multi-Instance GPU (MIG) splits GPU memory between users, but it is a fixed partitioning scheme (you have to reset the GPU to change it), and nobody wants to split their 96GB GPU into 4x 24GB GPUs.

    Makes me wonder what the tradeoff is for putting second level memory on the GPU board (i.e. normal DRAM), so that different matrix data can be loaded in faster than over PCIe, i.e. the HBM becomes a cache.

    (I'm also really liking the honesty in the author's book on software engineering, not in the dry IEEE sense, but as a survival guide in a large enterprise: https://www.seangoedecke.com/book/ )

  • > mixture of experts requires higher batch sizes

    Or Apple silicon for low batch sizes (ideally 1). The unified memory allows running larger models at the expense of them running slower, because of lower bandwidth/FLOPS than a normal GPU. But MoEs only need to compute a few of the parameters for each token, so the compute requirements are low. I have seen people reporting decent speeds for DeepSeek single-batch inference on Macs (a rough back-of-envelope follows below). It is still expensive by many people's standards, though, because it takes a lot of $$$ to get enough memory.

    In some ways, MoE models are a perfect fit for Macs (or any similar machines that may come out). In contrast, ordering a Mac with an upgraded RAM size and running dense models that only just fit in that memory can be very painful.
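
    The back-of-envelope (every figure is an assumption, not a measurement):

      # Single-stream decode ceiling scales with *active* bytes per token, not total size.
      unified_bw = 800e9          # ~800 GB/s, Ultra-class Apple silicon (assumed)
      bits = 4.5                  # assumed quantization level

      def tps_ceiling(active_params):
          return unified_bw / (active_params * bits / 8)

      print(f"MoE, 37B active params: ~{tps_ceiling(37e9):.0f} tok/s ceiling")
      print(f"Dense 671B params:      ~{tps_ceiling(671e9):.1f} tok/s ceiling")
      # A dense model of the same total size would be painfully slow even if it fit;
      # the MoE's sparsity is what makes the low-bandwidth/low-FLOPS tradeoff workable.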

    I was talking with a colleague the other day and we came to the conclusion that, in our experience, if you're using LLMs as a programming aid, models are really being optimised for the wrong things.

    At work I often compare locally run 4-30B models against various GPTs (we can only use non-local models for a few things, because of confidentiality issues). While e.g. GPT-4o gives better results on average, the chance of it making parts of the response up is high enough that one has to invest a significant amount of effort to check and iterate over the results. So the difference in effort is not much lower compared to the low-parameter models.

    The problem is that both are just too slow to really iterate quickly, which makes things painful. I'd rather have a lower-quality model (but with a large context) that gives me near-instant responses than a higher-quality model that is slow. I guess that doesn't make the same headlines as an improved score on some evaluation.

  • It is not "slow and expensive", though it may be one or the other. You can get 3 tokens/second running on DDR4 memory on a two-generation-old workstation that costs ~$1K, via llama.cpp.

  • >It’s a peculiar feature of transformer-based LLMs that computing a batch of completions at the same time is almost as fast as computing a single completion. Why is that?

    Incorrect. Transformers contain a classical MLP layer, and it's the MLP (and other weight-bearing) layers that benefit from batching; all classical neural networks, including convolutional networks (via im2col), can be batched in the same way.

    If there's anything the transformer architecture changes, it is that the core attention computation gains much less from batching: each request attends over its own KV cache, so there are no shared weights to amortize across the batch.

  • If I understand it correctly, the effect of the experts is a weighted sum of the individual calculations of each token against each expert, where the experts a token meets are selected on a per-token basis. Since a sum is commutative, it should be possible to send a large batch of tokens, copied to multiple GPUs, while experts are streamed into VRAM, partitioned across the GPUs. Then the bottleneck is your PCIe bandwidth. With 2 GPUs at Gen 4 x16, you'd have 60 GB/s of TX bandwidth, allowing you to upload a half-precision quant of DeepSeek (about 360 GB) in about 6 seconds.

      1 GPU  -  30 GB/s TX - 12 seconds
      2 GPUs -  60 GB/s TX - 6 seconds
      4 GPUs - 120 GB/s TX - 3 seconds
    
    Then you just optimize your batch size to match the compute time to the upload time of each GPU. The expert calculation results can be retrieved from the GPUs and summed up.
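
    A rough sketch of that "match compute time to upload time" step (every number below is an illustrative assumption):

      import math

      pcie_tx = 30e9          # ~30 GB/s per Gen4 x16 GPU, one direction (assumed)
      expert_bytes = 2.5e9    # one streamed expert shard at FP16 (illustrative)
      upload_time = expert_bytes / pcie_tx               # ~83 ms per shard

      gpu_flops = 150e12      # assumed sustained FP16 tensor throughput of one consumer GPU
      flops_per_token = 2 * (expert_bytes / 2)           # 2 FLOPs per 2-byte FP16 weight

      # Smallest batch whose compute on one shard hides the upload of the next shard:
      print(math.ceil(upload_time * gpu_flops / flops_per_token))
      # ~5000 tokens on these assumptions; below that, PCIe is the bottleneck.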

  • Do the individual requests in a batch influence each other?

    Not in a floating-point, non-deterministic kind of way, where exact ordering might introduce some non-determinism (being 5th versus being 10th in the batch, let's say).

    I'm asking in a semantic sense: can context from one request leak into another because they are in the same batch?

  • I haven't looked for a while, but is DeepSeek online still about 1/100th the cost of its competitors?

  • yeah so everyone’s flexing their home lab setups with 384GB RAM and no GPU like it’s the new normal. cool if you’ve got a spare $4k and time to tinker. but for most of us? that’s not happening.

    also, these quantized models - unsloth, gguf, whatever - people say they're “close enough” to the original. but close enough for what? summarizing blog posts? sure. complex reasoning or code generation? not so much. without solid benchmarks, it's all just numbers for now.

    and let’s not forget the hidden costs: time spent configuring, debugging, and maintaining these systems. it adds up.

    so yeah, running deepseek locally is a fun experiment. but calling it affordable or practical for the average guy like me? that’s a stretch.

  • This reminded me that the economies of scale in AI, especially inference, are huge.

    When people say LLMs will be commoditised, I am not sure that means the market is going to be super competitive. As the economies of scale in AI get even bigger (larger training costs + batched inference, etc.), it seems likely that only around 3 companies will dominate LLMs.

  • Isn’t this an arbitrage opportunity? Offer to pay a fraction of the cost per token, accept that your tokens will only be processed when a batch has spare room, then resell that at a markup to people who need non-time-sensitive inference?

  • It's not expensive to run locally at all when you consider how big GPT-4 is.

  • MoE is in general kind of a stupid optimization. It seems to require around 5x more total parameters for the same modeling power as a dense model in exchange for around 2x less memory bandwidth needs.

    The primary win of MoE models seems to be that you can list an enormous parameter count in your marketing materials.

  • this statement holds true for all large parameter open weight models.

  • I am so sincerely amused that “we” figured out how to monetize LLMs from the jump using tokens.

    It isn’t tech for tech’s sake, it’s a money grab. Reminds me of paying to send a text message or buying minutes for a phone plan. Purely rent-seeking.