Two different tricks for fast LLM inference

  • People are misunderstanding Anthropic's fast mode because of the name they chose for it. The hints all point to one specific thing they did: the setup is costlier, yet it's also smarter and better on tougher problems, which is unheard of for a speed play. This paper[1] fits perfectly:

    The setup is parallel distill-and-refine: you start multiple trajectories in parallel instead of one, distill from them, and refine the distillation into an answer. Instead of taking every trajectory to completion, they distill early and refine, so the output arrives fast and still ends up smarter. (A rough sketch of the loop follows below, before the reference.)

    - the paper came out in November 2025

    - three months is a plausible research-to-production timeline

    - one of the authors is at Anthropic

    - this approach will definitely burn more tokens than a plain single run.

    - > Anthropic explicitly warns that time to first token might still be slow (or even slower)

    As for what others are suggesting: speculative decoding won't make it smarter or make much of a difference here, and batching tweaks could make it faster, but then it wouldn't also be more costly.

    Gemini Deepthink and gpt-5.2-pro use the same underlying parallel test-time compute, but they take each trajectory to completion before distilling and refining for the user.
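
    A rough sketch of the pattern as I read the paper, with a stubbed-out llm() call standing in for a real model API (all names and parameters here are illustrative, not Anthropic's actual implementation):

        import concurrent.futures

        def llm(prompt: str, max_tokens: int = 512) -> str:
            """Stand-in for a real model call; stubbed purely for illustration."""
            return f"<completion for: {prompt[:40]}...>"

        def parallel_distill_refine(task: str, n_trajectories: int = 8, rounds: int = 2) -> str:
            """Parallel distill-and-refine (rough reading of arXiv:2510.01123):
            launch several short trajectories, distill them into one draft,
            then refine, instead of running every trajectory to completion."""
            draft = ""
            for _ in range(rounds):
                # 1. Launch short, partial trajectories in parallel.
                prompts = [f"{task}\nCurrent draft: {draft}\nExplore one approach briefly."
                           for _ in range(n_trajectories)]
                with concurrent.futures.ThreadPoolExecutor(max_workers=n_trajectories) as pool:
                    trajectories = list(pool.map(lambda p: llm(p, max_tokens=256), prompts))
                # 2. Distill the partial trajectories into a single working draft.
                draft = llm("Distill the key ideas from these partial attempts:\n"
                            + "\n---\n".join(trajectories))
            # 3. Refine the distilled draft into the final answer.
            return llm(f"{task}\nRefine this draft into a final answer:\n{draft}")

        print(parallel_distill_refine("Prove the statement X."))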

    [1]: https://arxiv.org/abs/2510.01123

  • > The idea is to have a chip with SRAM large enough to fit the entire model, so inference can happen entirely in-memory. [...] So how much internal memory does the latest Cerebras chip have? 44GB. This puts OpenAI in kind of an awkward position. 44GB is enough to fit a small model (~20B params at fp16, ~40B params at int8 quantization), but clearly not enough to fit GPT-5.3-Codex.

    You don't really need to fit the entire model on a single chip. Just as with GPUs, you can shard the model across multiple chips. Of course when you have a long pipeline of chips that each token needs to pass through, that decreases the end-to-end tokens per second correspondingly.

    So the size of GPT-5.3-Codex-Spark isn't limited by the memory of a single Cerebras chip, but by the number of such chips you can chain together and still hit the 1,000 tokens-per-second target. Given that Cerebras offers models much larger than 40B at faster speeds (https://www.cerebras.ai/pricing#exploration), GPT-5.3-Codex-Spark is likely closer to GLM 4.7 in size (≈355B total parameters, 32B active).
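
    A rough back-of-the-envelope on chip count, assuming 44 GB of SRAM per chip and counting weights only (the model sizes are the guesses above, not confirmed numbers):

        # How many 44 GB Cerebras chips just to hold the weights, ignoring
        # KV cache, activations and any duplication across the pipeline.
        CHIP_SRAM_GB = 44

        def chips_needed(total_params_b: float, bytes_per_param: float) -> int:
            weights_gb = total_params_b * bytes_per_param  # 1B params at 1 byte ~= 1 GB
            return int(-(-weights_gb // CHIP_SRAM_GB))     # ceiling division

        for name, params_b in [("~40B model", 40), ("GLM-4.7-sized guess (355B)", 355)]:
            print(f"{name}: {chips_needed(params_b, 1.0)} chips at int8, "
                  f"{chips_needed(params_b, 2.0)} at fp16")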

  • > Batching multiple users up thus increases overall throughput at the cost of making users wait for the batch to be full.

    The author doesn't seem to have heard of continuous batching: waiting for a batch to fill is no longer an issue, and it's part of what makes Claude Code so affordable. https://huggingface.co/blog/continuous_batching
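
    A toy scheduler showing the idea from the linked post: requests join and leave the batch at every decode step, so nobody waits for a "full bus" (the numbers and structure are illustrative only):

        from collections import deque
        from dataclasses import dataclass

        @dataclass
        class Request:
            rid: int
            tokens_left: int  # tokens still to generate

        def continuous_batching(incoming: deque, max_batch: int = 4) -> None:
            """Toy continuous batching: admit new requests at every decode step
            and evict finished ones, instead of waiting for a full batch up front."""
            active: list[Request] = []
            step = 0
            while incoming or active:
                # Admit waiting requests into any free batch slots.
                while incoming and len(active) < max_batch:
                    active.append(incoming.popleft())
                # One decode step: every active request produces one token.
                for req in active:
                    req.tokens_left -= 1
                finished = [r.rid for r in active if r.tokens_left == 0]
                active = [r for r in active if r.tokens_left > 0]
                step += 1
                if finished:
                    print(f"step {step}: finished requests {finished}")

        continuous_batching(deque(Request(i, tokens_left=3 + i) for i in range(6)))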

  • One other thing I'd assume Anthropic is doing is routing all fast requests to the latest-gen hardware. They most certainly have a diverse fleet of inference hardware (TPUs, GPUs of different generations), and fast requests will only be served by whatever is fastest, whereas the general inference workload will be more spread out.
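
    If that's the mechanism, the routing itself is trivial; a sketch with invented pool names and numbers (nothing here reflects Anthropic's real fleet):

        # Hypothetical tiered routing: "fast" traffic pinned to the quickest pool,
        # everything else spread across the fleet by spare capacity.
        FLEET = {
            "latest_gen": {"tok_per_s": 400, "spare_capacity": 0.2},
            "prev_gen":   {"tok_per_s": 150, "spare_capacity": 0.5},
            "tpu_pool":   {"tok_per_s": 120, "spare_capacity": 0.3},
        }

        def route(tier: str) -> str:
            if tier == "fast":
                return max(FLEET, key=lambda p: FLEET[p]["tok_per_s"])
            return max(FLEET, key=lambda p: FLEET[p]["spare_capacity"])

        print(route("fast"), route("normal"))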

  • > A good analogy is a bus system. If you had zero batching for passengers - if, whenever someone got on a bus, the bus departed immediately - commutes would be much faster for the people who managed to get on a bus.

    A good analogy? I wonder... how do buses work at your place? Do they wait to be at least half-full before departing? I used to do that in the Simutrans game!

    Where I'm from, buses usually depart on schedule, whether you get on the bus or not...

    [Edit:] Otherwise an insightful article I guess.

  • Article closes with:

    > The usefulness of AI agents is dominated by how few mistakes they make, not by their raw speed. Buying 6x the speed at the cost of 20% more mistakes is a bad bargain, because most of the user’s time is spent handling mistakes instead of waiting for the model.

    That might be true today. I think the OpenAI-Cerebras partnership will ultimately lead to a paradigm shift, because it will become possible to scale these chips up to the point where a model like the full Codex-5.3 can run on them; then you'll have a super-fast model that makes relatively few errors. A Codex-5.3-class model running at these speeds is more than enough to actually start replacing customer-facing jobs.

  • I don’t really get the bus analogy. It seems to imply that batching massively increases latency, but as soon as you’re “on the bus” throughput is normal? When in reality (if I understand correctly) opus-fast is just giving you a bigger portion of the batch, so it increases throughput with little effect on latency? (I’m assuming Anthropic gets enough volume that these batches fill up pretty much instantly.)
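
    For intuition, a toy decode-step model of that tradeoff (all numbers invented): per step the chip streams the weights once regardless of batch size, so per-user latency is nearly flat until the batch gets big enough to be compute-bound, while aggregate throughput keeps growing until that knee.

        # Toy decode-step model with made-up constants.
        WEIGHT_READ_MS = 10.0      # time to stream weights from memory, paid once per step
        COMPUTE_MS_PER_SEQ = 0.5   # extra compute per sequence in the batch

        def step_time_ms(batch_size: int) -> float:
            return max(WEIGHT_READ_MS, batch_size * COMPUTE_MS_PER_SEQ)

        for b in (1, 8, 32, 64, 128):
            per_user = 1000 / step_time_ms(b)  # each user gets one token per step
            aggregate = b * per_user
            print(f"batch={b:3d}  per-user {per_user:6.1f} tok/s  aggregate {aggregate:7.0f} tok/s")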

  • Interesting theory. So how does ChatGPT begin responding instantly, as soon as I send the message? Shouldn't it need to wait for the batch to fill? Or do they have so much traffic that this happens in a few ms?

    (I think they might also be filling the message onto a GPU while you're typing over a websocket or something, but I'm not sure.)

  • > So how much internal memory does the latest Cerebras chip have? 44GB. This puts OpenAI in kind of an awkward position. 44GB is enough to fit a small model (~20B params at fp16, ~40B params at int8 quantization), but clearly not enough to fit GPT-5.3-Codex. That’s why they’re offering a brand new model, and why the Spark model has a bit of “small model smell” to it: it’s a smaller distil of the much larger GPT-5.3-Codex model.

    This doesn't make sense.

    1. Nvidia already sells e.g. the H100 with 80GB memory, so having 44GB isn't an advance, let alone a differentiator.

    2. As I suspect anyone that's played with open weights models will attest, there's no way that 5.3-Codex-Spark is getting close to top-level performance and being sold in this way while being <44GB. Yes it's weaker and for sure it's probably a distil and smaller, but not by ~two orders of magnitude as suggested.

  • This latency discussion is incredibly relevant to real-time voice AI applications. When you're building a voice agent that needs to respond conversationally (not just generate text), the inference speed directly determines whether the interaction feels natural or robotic.

    In practice, humans perceive conversational pauses >800ms as awkward. So for a voice pipeline (STT → LLM inference → TTS), you have maybe 400-500ms budget for the LLM portion. At typical Sonnet speeds (~80 tok/s), you get ~35 tokens in that window — barely enough for a sentence. At Cerebras/Groq speeds (1000+ tok/s), you get 400+ tokens, which changes what's architecturally possible.

    This is why the small-model vs. big-model tradeoff matters so much for real-time applications. We've found that a well-tuned smaller model with domain-specific context can outperform a larger model for constrained tasks (like navigating a user through a website or answering product questions), while staying within the latency budget. The "council" approach — multiple specialized small agents instead of one large general agent — lets you get both speed and quality.

    The speculative decoding point is underrated here. For voice AI specifically, you can predict likely response patterns (greetings, confirmations, common Q&A) and pre-generate TTS for those, then only hit the full inference pipeline for novel queries. Gets you sub-200ms for ~60% of interactions.
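
    A sketch of that pre-generation idea (the intents, timings and helpers are invented for illustration; a real system would use an intent classifier or embeddings rather than exact string matching, and would cache pre-rendered TTS audio, not just text):

        import time

        # Pre-generated responses for the most common conversational intents.
        CANNED = {
            "hello": "Hi! How can I help you today?",
            "thanks": "You're welcome!",
            "are you there": "Yes, I'm here.",
        }

        def slow_llm_pipeline(utterance: str) -> str:
            """Stand-in for the full STT -> LLM -> TTS path for novel queries."""
            time.sleep(0.5)  # simulate ~500 ms of inference
            return f"<full-pipeline answer to: {utterance}>"

        def respond(utterance: str) -> str:
            key = utterance.lower().strip(" ?!.")
            if key in CANNED:
                return CANNED[key]               # cache hit: effectively instant
            return slow_llm_pipeline(utterance)  # novel query: full latency budget

        for u in ["Hello?", "What's your refund policy?"]:
            t0 = time.perf_counter()
            print(respond(u), f"({(time.perf_counter() - t0) * 1000:.0f} ms)")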

  • I think being faster probably is important but it brings a bunch of challenges:

    - the split pricing model makes it hard to tune the model architecture for faster inference, since you need to support both the fast and the cheap versions.

    - the faster the model gets, the more of a problem it becomes that models don’t ‘understand’ time – they sit idle waiting for big compilations, or they issue tool calls sequentially when they ought to have issued them in parallel.
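
    On the sequential-tool-calls point, the fix on the harness side is straightforward when the calls are independent; a sketch with invented tool stubs:

        import asyncio

        async def run_tool(name: str, seconds: float) -> str:
            """Stub for a tool call (a build, a test run, a grep); sleeps to fake latency."""
            await asyncio.sleep(seconds)
            return f"{name}: done"

        async def main() -> None:
            # Independent tool calls issued in parallel: wall time ~= the slowest call,
            # instead of the sum of all of them when issued one by one.
            results = await asyncio.gather(
                run_tool("compile", 3.0),
                run_tool("unit tests", 2.0),
                run_tool("lint", 0.5),
            )
            print(results)

        asyncio.run(main())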

  • Author is clearly confused about the Anthropic case. The request rate at these generation endpoints is so high that the current batching delay is effectively negligible.

  • The batch size explanation is wrong. Given how much Claude Code is used, finding fellow "bus passengers" is not an issue, you don't need to wait.

    The real reason batching increases latency is multi-factored and more complex to explain.

  • If the author is right, OpenAI has room to further improve the fast models' correctness on certain tasks, while Anthropic is left scaling vertically. Of course, it is likely that over time both approaches will converge as the companies understand the problem space better and learn which tradeoffs are worth making.

    My personal take is that they will need a big model to plan, break down tasks, and schedule them to specialized smaller models, with a good-enough model handling real-time interactions with the user. But that's the naive take, and many other things might be shaping these decisions.

  • This author thinks Cerebras chips were deployed at scale to serve users worldwide within just one month of the partnership announcement?

    Seems like nonsense to me.

  • Another possible explanation, especially if quality degrades at all (i.e. on the OpenAI side), is aggressive quantization.

    Another possible explanation is speculative decoding (sketched below), where you trade unused GPU memory for speed via a draft model.

    But my money is on the exact two mechanisms the OP proposes.
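
    For reference, the speculative-decoding loop in toy form (integer "tokens" and stand-in models, purely to show the accept/reject mechanics): the draft model proposes a few tokens cheaply, the target model verifies them, and with exact verification the output matches plain greedy decoding with the target model, which is why it buys speed but wouldn't explain a quality change.

        # Toy greedy speculative decoding. "Models" are arbitrary deterministic
        # functions over integer tokens, purely for illustration.

        def target_next(context: list[int]) -> int:
            return (sum(context) * 31 + 7) % 50   # stand-in "big" model

        def draft_next(context: list[int]) -> int:
            # Imperfect, cheaper copy of the target model.
            return target_next(context) if sum(context) % 3 else (sum(context) + 1) % 50

        def speculative_decode(prompt: list[int], new_tokens: int, k: int = 4) -> list[int]:
            out = list(prompt)
            while len(out) - len(prompt) < new_tokens:
                # 1. Draft model proposes k tokens cheaply.
                draft, ctx = [], list(out)
                for _ in range(k):
                    t = draft_next(ctx)
                    draft.append(t)
                    ctx.append(t)
                # 2. Target model verifies the proposals (one batched pass in a real system).
                accepted, ctx = 0, list(out)
                for t in draft:
                    if target_next(ctx) != t:
                        break
                    ctx.append(t)
                    accepted += 1
                # 3. Keep the accepted prefix plus one token chosen by the target itself,
                #    so the result matches ordinary greedy decoding with the target model.
                out.extend(draft[:accepted])
                out.append(target_next(out))
            return out[len(prompt):][:new_tokens]

        print(speculative_decode([1, 2, 3], new_tokens=10))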

  • Lol, without any evidence this is just vaporblog. It could just be reduced precision for whatever model either of them runs, not necessarily a distillation or a smaller model, or heck, even a combination. At this point most frontier models are MoEs, and getting absurd speeds out of 1-20B active experts is trivial regardless of batch size.


  • Very interesting. OAI's releases since their router all seem focused on cost cutting and efficiency, while Anthropic is mostly going the opposite direction, spending all their budget on overhyping their models in the media and releasing neo-hipster (aka normie) ads about taste and about how they won't do ads. The first red flag - besides every time Dario speaks - was the pop-up events with shitty caps, overhyped by all the AI influencers.

    It seems OAI was forced by investors to shift quickly to making money, while Anthropic seems to have more time? It might be hard for OAI to keep up the pace while focusing on cost.

  • That's pretty shallow for the front page. What would be interesting in this context are things such as MXFP4 quantization, not commonplaces.