Project GitHub link: https://github.com/SJTU-IPADS/PowerInfer
It seems like this can't run all models, and needs custom ones trained from scratch: "We introduce two new models: TurboSparse-Mistral-7B and TurboSparse-Mixtral-47B. These models are sparsified versions of Mistral and Mixtral […]. Notably, our models are trained with just 150B tokens within just 0.1M dollars".
It remains to be seen how good these custom models are.
The hyperbollocks marketing-speak in the summary paragraph put me off:
"The key insight of PowerInfer-2 is to utilize the heterogeneous computation, memory, and I/O resources in smartphones by decomposing traditional matrix computations into fine-grained neuron cluster computations. Specifically, PowerInfer-2 features a polymorphic neuron engine that adapts computational strategies for various stages of LLM inference. Additionally, it introduces segmented neuron caching and fine-grained neuron-cluster-level pipelining, which effectively minimize and conceal the overhead caused by I/O operations."
Ahem, what? Let's overload a biological construct "neuron" to imbue it with magical technopowers and then derive the rest of our BS from this. No sale.
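Stripped of the buzzwords, "neuron" here presumably just means a single row of an FFN weight matrix, and the trick is exploiting activation sparsity: a predictor guesses which rows will survive the ReLU, so only those rows need to be loaded and multiplied. A toy sketch of that general idea (my own names, NumPy, and an oracle "predictor" for demo purposes; not PowerInfer-2's actual code):

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_ff = 8, 32

W_up = rng.standard_normal((d_ff, d_model))   # each row = one "neuron"
W_down = rng.standard_normal((d_model, d_ff))
x = rng.standard_normal(d_model)

# Dense baseline: every row of W_up and every column of W_down is touched.
dense = W_down @ np.maximum(W_up @ x, 0.0)

# Sparse version: only rows flagged as active get read and multiplied.
# (Oracle predictor here; a real system would use a learned one.)
active = np.where(W_up @ x > 0.0)[0]
h = np.maximum(W_up[active] @ x, 0.0)
sparse = W_down[:, active] @ h

assert np.allclose(dense, sparse)  # same output, far fewer weights touched
```

If most rows are inactive for a given token, most of the weight matrix never has to leave flash, which is presumably where the I/O savings come from.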
The speed improvement only applies to models that don't fit entirely in memory, i.e. the regime where a memory-starved llama.cpp degrades to ~20x slower.
However, this scheme does reduce memory usage by 40%, which means the same RAM can hold models that are ~67% bigger (1 / (1 − 0.4) ≈ 1.67). It's a quality improvement, not a performance one.
> For models that fit entirely within the memory, PowerInfer-2 can achieve approximately a 40% reduction in memory usage while maintaining inference speeds comparable to llama.cpp and MLC-LLM.
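Back-of-envelope check on the 67% figure, assuming memory scales linearly with parameter count:

```python
reduction = 0.40                 # 40% less memory for the same model
headroom = 1 / (1 - reduction)   # how much bigger a model now fits
print(f"{headroom:.2f}x")        # 1.67x, i.e. ~67% bigger
```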