Today I ran some perplexity benchmarks comparing F16 and Q8_0 for the K/V cache. I used Qwen 2.5 Coder 7b, as I've heard people say Qwen is more sensitive to quantisation than some other models.
Well, it turns out there's barely any increase in perplexity at all - just 0.0043.
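For anyone wondering what that number means: perplexity is just exp of the average negative log-likelihood over the evaluation tokens, so a 0.0043 bump is tiny. Roughly like this (a minimal sketch with made-up log-probs, not the actual benchmark harness):

    // Minimal sketch: perplexity = exp(mean negative log-likelihood).
    // The per-token log-prob values here are invented purely for illustration.
    package main

    import (
        "fmt"
        "math"
    )

    func perplexity(logProbs []float64) float64 {
        var sum float64
        for _, lp := range logProbs {
            sum += lp
        }
        return math.Exp(-sum / float64(len(logProbs)))
    }

    func main() {
        f16 := []float64{-1.92, -2.10, -1.75} // hypothetical log-probs with F16 K/V
        q8 := []float64{-1.93, -2.11, -1.75}  // hypothetical log-probs with Q8_0 K/V
        fmt.Printf("F16: %.4f  Q8_0: %.4f  delta: %.4f\n",
            perplexity(f16), perplexity(q8), perplexity(q8)-perplexity(f16))
    }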
Added to the post: https://smcleod.net/2024/12/bringing-k/v-context-quantisatio...
What's the best way to use Ollama with a GUI? Is it just OpenWebUI? Are there any options for mobile platforms like Android as well (or can we even run LLMs on a phone in the first place)?
Great project! Do you think there might be some advantages to bringing this over to LLaMA-BitNet?
Nice.
That said... I mean...
> The journey to integrate K/V context cache quantisation into Ollama took around 5 months.
??
They incorrectly tagged #7926, which is a 2-line change, instead of #6279, where it was implemented. That made me dig a bit deeper and read the actual change. The commit [1] is:
> params := C.llama_context_default_params()
> ...
> params.type_k = kvCacheTypeFromStr(strings.ToLower(kvCacheType)) <--- adds this
> params.type_v = kvCacheTypeFromStr(strings.ToLower(kvCacheType)) <--- adds this
Those params have been part of llama.cpp since Dec 7, 2023 [2]. So... mmmm... while this is great, somehow I'm left feeling vaguely put off by the comms around what is really 'we finally support some config flag from llama.cpp that's been there for really quite a long time'.
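To spell it out, the helper just has to map a string like 'q8_0' or 'f16' onto the ggml type enum that llama_context_default_params already understands. Something roughly like this (my own sketch, not the actual Ollama code, with plain Go constants standing in for the C enum):

    // Rough sketch of a kvCacheTypeFromStr-style lookup (not the real Ollama code).
    // In the real cgo code these would be C.GGML_TYPE_* constants.
    package main

    import (
        "fmt"
        "strings"
    )

    type ggmlType int

    const (
        ggmlTypeF16  ggmlType = 1
        ggmlTypeQ4_0 ggmlType = 2
        ggmlTypeQ8_0 ggmlType = 8
    )

    func kvCacheTypeFromStr(s string) ggmlType {
        switch strings.ToLower(s) {
        case "q8_0":
            return ggmlTypeQ8_0
        case "q4_0":
            return ggmlTypeQ4_0
        default:
            return ggmlTypeF16 // assume unquantised F16 as the fallback
        }
    }

    func main() {
        fmt.Println(kvCacheTypeFromStr("Q8_0")) // 8
    }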
> It took 5 months, but we got there in the end.
... I guess... yay? The challenges don't seem to have been technical ones, but good job getting it across the line in the end, I suppose?
[1] - https://github.com/ollama/ollama/commit/1bdab9fdb19f8a8c73ed...
[2] - https://github.com/ggerganov/llama.cpp/commit/bcc0eb4591bec5...
Shout out to everyone from Ollama and the wider community who helped with reviews and feedback along the way. It's great to contribute to such a fantastic project.