QLoRA: Efficient Finetuning of Quantized LLMs

  • I'm very impressed at the quality of Guanaco 33B, the model that accompanies this paper.

    You can try it out here: https://huggingface.co/spaces/uwnlp/guanaco-playground-tgi

    I tried "You are a sentient cheesecake that teaches people SQL, with cheesecake analogies to illustrate different points. Teach me to use count and group by" and got a good result from it: https://twitter.com/simonw/status/1661460336334241794/photo/...

  • Hold on. I need someone to explain something to me.

    The Colab notebook shows an example of loading the vanilla, unquantized model "decapoda-research/llama-7b-hf", using the flag "load_in_4bit" to load it in 4-bit.

    When... when did this become possible? My understanding, from playing with these models daily for the past few months, is that quantization of LLaMA-based models is done via this: https://github.com/qwopqwop200/GPTQ-for-LLaMa

    And performing the quantization step is memory and time expensive. Which is why some kind people with large resources are performing the quantization, and then uploading those quantized models, such as this one: https://huggingface.co/TheBloke/wizard-vicuna-13B-GPTQ

    But now I'm seeing that, as of recently, the transformers library is capable of loading models in 4-bit simply by passing this flag?

    Is this a free lunch? Is GPTQ-for-LLaMa no longer needed? Or is this still not as good, in terms of inference quality, as the GPTQ-quantized models?
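
    For reference, here's a minimal sketch of what that new code path looks like (assuming a recent transformers + bitsandbytes + accelerate stack; the BitsAndBytesConfig parameters are the documented ones, and the model id is just the one from the notebook):

      # Minimal sketch: on-the-fly 4-bit loading with transformers + bitsandbytes.
      import torch
      from transformers import AutoModelForCausalLM, BitsAndBytesConfig

      model_id = "decapoda-research/llama-7b-hf"

      bnb_config = BitsAndBytesConfig(
          load_in_4bit=True,                      # quantize weights to 4-bit at load time
          bnb_4bit_quant_type="nf4",              # NormalFloat4, the data type from the QLoRA paper
          bnb_4bit_use_double_quant=True,         # also quantize the quantization constants
          bnb_4bit_compute_dtype=torch.bfloat16,  # matmuls still run in 16-bit
      )

      model = AutoModelForCausalLM.from_pretrained(
          model_id,
          quantization_config=bnb_config,
          device_map="auto",
      )

    As far as I understand it, this quantizes the weights block-wise as they are loaded, with no calibration pass over data, which is why no pre-quantized checkpoint is needed. GPTQ, by contrast, runs a calibration step, so the quality/speed tradeoffs at inference time may still differ.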

  • Tim Dettmers is such a star. He's probably done more to make low-resource LLMs usable than anyone else.

    First bitsandbytes[1] and now this.

    [1] https://github.com/TimDettmers/bitsandbytes

  • This is off-topic, but are there any communities or congregations (that aren't reddit) based around locally hosted LLMs? I'm asking because while I see a bunch of projects for exposing GGML/LLaMA to OpenAI compatible interfaces, some UIs, etc, I can't really find a good community or resources for the concept in general.

    I'm working on a front-end for LLMs in general. I've already re-implemented a working version of OpenAI's code interpreter "plugin" within the UI (and yes, I support file uploads), plus support for the wealth of third-party OpenAI plugins that don't require auth (I've been testing with the first diagram plugin I found, and it works well). I'm planning to open source it once my breaking changes slow down.

    This field moves very fast. I'm looking for feedback (and, essentially, testers/testing data) on what people want, and for prompts/chat logs/guidance templates (https://github.com/microsoft/guidance) for tasks they expect to "just work" with natural language.

    Instead of being limited by ChatGPT Plus's monetization (and its cap on messages every four hours) for extensibility within a chat interface, I want to open it up and make it free, with a Bring-Your-Own-(optionally local)-LLM/API-key setup.

  • Fantastic. This will keep my 3090 busy for a while!

    "Furthermore, we note that our model is only trained with cross-entropy loss (supervised learning) without relying on reinforcement learning from human feedback (RLHF). This calls for further investigations of the tradeoffs of simple cross-entropy loss and RLHF training. "

    Does this mean RLHF is not really necessary for high quality chatbots?

  • Since LoRAs are additive, is it possible to use them to do distributed retraining on a model, or even train an entire model bit by bit?
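
    (Mechanically, "additive" just means each LoRA is a low-rank delta on top of the frozen base weights, so applying several of them is a sum. A toy illustration, not from the paper:)

      # Toy sketch: two independently trained LoRA deltas compose by simple addition.
      import torch

      d, r = 1024, 8
      W = torch.randn(d, d)                            # frozen base weight

      A1, B1 = torch.randn(r, d), torch.randn(d, r)    # adapter 1 (rank r)
      A2, B2 = torch.randn(r, d), torch.randn(d, r)    # adapter 2 (rank r)

      W_merged = W + B1 @ A1 + B2 @ A2                 # applying both adapters = summing deltas

    Whether deltas trained separately, on different shards of data, actually compose into something coherent when summed is the open part of the question.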

  • "We use QLoRA to finetune more than 1,000 models"

    Over 1,000 models finetuned! Finetuning 65B models on consumer hardware in under a day, while preserving full 16-bit finetuning performance.

    4-bit does it again!
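
    For anyone curious what the recipe looks like in code, here's a rough sketch with transformers + peft (my best guess at the current APIs, not taken from the paper's repo; the model id is illustrative and the same idea scales up to 65B given enough VRAM):

      # Rough sketch of the QLoRA recipe: 4-bit frozen base model + trainable LoRA adapters.
      import torch
      from transformers import AutoModelForCausalLM, BitsAndBytesConfig
      from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

      model = AutoModelForCausalLM.from_pretrained(
          "decapoda-research/llama-7b-hf",      # illustrative; swap in a larger checkpoint
          quantization_config=BitsAndBytesConfig(
              load_in_4bit=True,
              bnb_4bit_quant_type="nf4",
              bnb_4bit_compute_dtype=torch.bfloat16,
          ),
          device_map="auto",
      )
      model = prepare_model_for_kbit_training(model)  # freeze the 4-bit base, prep norms/embeddings

      lora_config = LoraConfig(
          r=64, lora_alpha=16, lora_dropout=0.05,
          target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
          task_type="CAUSAL_LM",
      )
      model = get_peft_model(model, lora_config)      # only the small LoRA matrices get gradients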

  • Lol, some residual answers from what I assume are answers distilled from ChatGPT:

    Q: "What is your favourite conspiracy theory?" A: "As an AI language model I don’t have personal preferences or biases so my responses will always reflect factual information based on what has been programmed into me by OpenAI."

  • From the paper it looks like you would get GPT-4-level quality with the 65B model, but if you just do some random tests you will quickly figure out that this is not even remotely the case. There must be something seriously wrong with the benchmarks used.

  • Do you know which model size can be run with a 3090?

  • Is "lemon-picked" a real phrase, or did they use GPT to generate the abstract? The term is "cherry-picked".

  • Can someone help me understand what quantization means in this context, and why it matters?

  • Just ask it what QLoRA is and it will give wrong answers.