Cerebras achieves 2,500T/s on Llama 4 Maverick (400B)

  • > At over 2,500 t/s, Cerebras has set a world record for LLM inference speed on the 400B parameter Llama 4 Maverick model, the largest and most powerful in the Llama 4 family.

    This is incorrect. The unreleased Llama 4 Behemoth is the largest and most powerful in the Llama 4 family.

    As for the speed record, it's important to keep it in context. That comparison covers only a single query, and it is well known that providers run potentially hundreds of queries in parallel to get their money's worth out of the hardware. If you aggregate the tokens per second across all simultaneous queries to get the total throughput, I wonder whether it would still look so competitive in absolute terms (there is a back-of-the-envelope sketch at the end of this comment).

    Also, Cerebras is the company that not only said its hardware was not useful for inference until sometime last year, but even partnered with Qualcomm, claiming that Qualcomm’s accelerators had a 10x price-performance improvement over their own hardware:

    https://www.cerebras.ai/press-release/cerebras-qualcomm-anno...

    Their hardware does inference in FP16, so they need ~20 of their WSE-3 chips to run this model. Each one costs ~$2 million, so that is $40 million. The DGX B200 they used for their comparison costs ~$500,000:

    https://wccftech.com/nvidia-blackwell-dgx-b200-price-half-a-...

    You only need one DGX B200 to run Llama 4 Maverick. You could buy ~80 of them for the price of enough Cerebras hardware to run Llama 4 Maverick (the arithmetic is sketched at the end of this comment).

    Their latencies are impressive, but beyond a certain point throughput is what counts, and they don’t really talk about their throughput numbers. I suspect the cost-to-performance ratio is terrible for throughput; it certainly is terrible for latency. That is what they are not telling people.

    Finally, I have trouble getting excited about Cerebras. SRAM scaling is dead, so short of figuring out how to 3D-stack their wafer-scale chips during fabrication at TSMC, or designing round chips, they have a dead-end product, since it relies on using an entire wafer to throw SRAM at problems. Nvidia, using DRAM, is far less reliant on SRAM and can spend more silicon on compute, which is still shrinking.
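
    To make the arithmetic above concrete, here is a back-of-the-envelope sketch. The prices are the estimates from this comment, the memory capacities (~44 GB of SRAM per WSE-3 wafer, ~180 GB of HBM per B200 in a DGX B200) are public spec-sheet figures as I recall them, and the per-stream speeds and batch sizes in the throughput part are made up purely to illustrate the aggregation point, not measured numbers:

    ```python
    import math

    # All figures are rough: prices are the estimates from this comment, memory
    # capacities are public spec-sheet numbers as I recall them, and the
    # per-stream speeds / batch sizes below are invented to show the arithmetic.

    PARAMS = 400e9                      # Llama 4 Maverick, total parameters
    BYTES_PER_PARAM = 2                 # FP16
    weights_gb = PARAMS * BYTES_PER_PARAM / 1e9    # ~800 GB of weights

    WSE3_SRAM_GB = 44                   # on-chip SRAM per wafer (approximate)
    DGX_B200_HBM_GB = 8 * 180           # 8 GPUs x ~180 GB HBM3e (approximate)

    # Weights alone need ~19 wafers; KV cache and activations push it to the ~20 quoted above.
    wafers_needed = math.ceil(weights_gb / WSE3_SRAM_GB)
    dgx_needed = math.ceil(weights_gb / DGX_B200_HBM_GB)   # fits in a single box

    WAFER_COST_USD = 2_000_000
    DGX_COST_USD = 500_000
    cerebras_cost = wafers_needed * WAFER_COST_USD
    print(f"{weights_gb:.0f} GB of weights -> {wafers_needed} wafers (~${cerebras_cost / 1e6:.0f}M) "
          f"vs {dgx_needed} DGX B200 (~${DGX_COST_USD / 1e6:.1f}M), "
          f"roughly {cerebras_cost // DGX_COST_USD}x the price")

    # Throughput side: a single-stream record says nothing about aggregate capacity.
    per_stream_tps = {"cerebras": 2_500, "dgx_b200": 100}  # tokens/s per query (GPU figure made up)
    concurrent = {"cerebras": 1, "dgx_b200": 128}          # simultaneous queries (both made up;
                                                           # the record run is quoted as one stream)
    cost = {"cerebras": cerebras_cost, "dgx_b200": DGX_COST_USD}

    for name in per_stream_tps:
        aggregate = per_stream_tps[name] * concurrent[name]
        print(f"{name}: {aggregate:,} aggregate t/s, "
              f"${cost[name] / aggregate:,.0f} of hardware per aggregate token/s")
    ```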

  • I think it is too risky to build a company around the premise that someone won't soon solve the quadratic scaling issue, especially when that company involves creating ASICs.

    E.g.: https://arxiv.org/abs/2312.00752
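
    To get a rough sense of what the quadratic scaling issue costs in practice, here is a purely illustrative comparison of the sequence-mixing work in attention versus a linear-time state-space layer like the one in the linked paper; the model width, state size, and constant factors are made up, not numbers taken from the paper:

    ```python
    # Purely illustrative: dimensions and constants are hypothetical.

    D = 4_096        # model width (hypothetical)
    N = 16           # SSM state size per channel (hypothetical)

    def attention_mixing_flops(n):
        # QK^T scores plus applying them to V: roughly 2 * n^2 * D per layer.
        return 2 * n * n * D

    def ssm_mixing_flops(n):
        # Linear-time scan/recurrence: roughly n * D * N per layer.
        return n * D * N

    for n in (4_096, 32_768, 262_144):
        ratio = attention_mixing_flops(n) / ssm_mixing_flops(n)
        print(f"context {n:>7,}: attention does ~{ratio:,.0f}x "
              f"the mixing work of the linear-time layer")
    ```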

  • Maybe one day they’ll have an actual API where you can pay per token. Right now it’s the standard “talk to us” if you want to use it.

  • > The most important AI applications being deployed in enterprise today—agents, code generation, and complex reasoning—are bottlenecked by inference latency

    Is this really true today? I don't work in enterprise, so I don't know what things look like there, but I'm sure lots of people here do, and it feels unlikely that inference latency is the top bottleneck, even above humans or waiting for human input. Maybe I'm just using LLMs very differently from how they're deployed in an enterprise, but I'm by far the biggest bottleneck in my setup currently.

  • The list of investors includes Altman and Ilya:

    https://www.cerebras.ai/company

  • I tried some of the Llama 4 models on Cerebras and they were hallucinating like they were on drugs. I gave one a URL and asked it to analyse a post for style; it made everything up and didn't look at the URL (or realize that it hadn't looked at it).

  • I love Cerebras. It's 10-100x faster than the other options. I really wish the other companies realized that some of us prefer our computers to be instant. I use their API (with a Qwen3 reasoning model) for ~99% of my questions, and the whole answer finishes in under 0.1 seconds. It keeps me in a flow state. Latency is jarring, especially the 5-10 seconds most AIs take these days, which is just enough to make switching tasks not worth it. You just have to sit there in stasis. If I'm willing to accept any latency, I might as well make it a couple of minutes in the background and use a full agent mode or deep-research AI at that point. Otherwise I want instant.
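
    If you want to measure this kind of end-to-end latency yourself, here is a minimal sketch using the OpenAI-compatible Python client; the base URL, model id, and environment variable name are my assumptions, so substitute whatever the provider's docs actually list:

    ```python
    # Minimal end-to-end latency check against an OpenAI-compatible endpoint.
    # The base URL, model id, and env var name are assumptions for illustration only.
    import os
    import time

    from openai import OpenAI

    client = OpenAI(
        base_url="https://api.cerebras.ai/v1",   # assumed endpoint
        api_key=os.environ["CEREBRAS_API_KEY"],  # assumed env var name
    )

    t0 = time.perf_counter()
    resp = client.chat.completions.create(
        model="qwen-3-32b",                      # assumed model id
        messages=[{"role": "user", "content": "In one sentence, what is a flow state?"}],
        max_tokens=64,
    )
    elapsed = time.perf_counter() - t0

    tokens = resp.usage.completion_tokens
    print(f"{tokens} tokens in {elapsed:.3f}s -> {tokens / elapsed:,.0f} t/s end to end")
    ```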

  • Very nice. Now for their next trick they should offer inference on actually useful models like DeepSeek R1 (not the distills).

  • Yes, it was not obvious that it's not terabytes per second.

  • Are the Llama 4 issues fixed? What is it good at? Coding is out of the window after the updated R1.
