Project GitHub link: https://github.com/SJTU-IPADS/PowerInfer
It seems like this can't run all models, and needs custom ones trained from scratch: "We introduce two new models: TurboSparse-Mistral-7B and TurboSparse-Mixtral-47B. These models are sparsified versions of Mistral and Mixtral […]. Notably, our models are trained with just 150B tokens within just 0.1M dollars".
It remains to be seen how good these custom models are.
The hyperbollocks marketing-speak in the summary paragraph put me off:
"The key insight of PowerInfer-2 is to utilize the heterogeneous computation, memory, and I/O resources in smartphones by decomposing traditional matrix computations into fine-grained neuron cluster computations. Specifically, PowerInfer-2 features a polymorphic neuron engine that adapts computational strategies for various stages of LLM inference. Additionally, it introduces segmented neuron caching and fine-grained neuron-cluster-level pipelining, which effectively minimize and conceal the overhead caused by I/O operations."
Ahem, what? Let's overload a biological construct "neuron" to imbue it with magical technopowers and then derive the rest of our BS from this. No sale.
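Stripped of the buzzwords, "neuron" here presumably just means a single row of an FFN weight matrix, and the trick is exploiting activation sparsity: a predictor guesses which rows will survive the ReLU, so only those rows need to be loaded and multiplied. A toy sketch of that general idea (my own names, NumPy, and an oracle "predictor" for demo purposes; not PowerInfer-2's actual code):

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_ff = 8, 32

W_up = rng.standard_normal((d_ff, d_model))   # each row = one "neuron"
W_down = rng.standard_normal((d_model, d_ff))
x = rng.standard_normal(d_model)

# Dense baseline: every row of W_up and every column of W_down is touched.
dense = W_down @ np.maximum(W_up @ x, 0.0)

# Sparse version: only rows flagged as active get read and multiplied.
# (Oracle predictor here; a real system would use a learned one.)
active = np.where(W_up @ x > 0.0)[0]
h = np.maximum(W_up[active] @ x, 0.0)
sparse = W_down[:, active] @ h

assert np.allclose(dense, sparse)  # same output, far fewer weights touched
```

If most rows are inactive for a given token, most of the weight matrix never has to leave flash, which is presumably where the I/O savings come from.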
The speed improvement only applies to models that don't fit entirely in memory, i.e. the regime where a memory-starved llama.cpp degrades to ~20x slower.
However, this scheme does reduce memory usage by 40%, which means the same RAM can hold models that are ~67% bigger (1 / (1 − 0.4) ≈ 1.67). It's a quality improvement, not a performance one.
> For models that fit entirely within the memory, PowerInfer-2 can achieve approximately a 40% reduction in memory usage while maintaining inference speeds comparable to llama.cpp and MLC-LLM.
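Back-of-envelope check on the 67% figure, assuming memory scales linearly with parameter count:

```python
reduction = 0.40                 # 40% less memory for the same model
headroom = 1 / (1 - reduction)   # how much bigger a model now fits
print(f"{headroom:.2f}x")        # 1.67x, i.e. ~67% bigger
```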