Some of you may have seen this already, but I have a Llama 2 fine-tuning live coding stream from 2 days ago where I walk through some fundamentals (like RLHF and LoRA) and how to fine-tune Llama 2 using PEFT/LoRA on a Google Colab A100 GPU.
In the end, with quantization and parameter-efficient fine-tuning, it only took up 13 GB on a single GPU.
Check it out here if you're interested: https://www.youtube.com/watch?v=TYgtG2Th6fI
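For anyone who'd rather skim code than watch the whole video, the setup boils down to 4-bit loading plus LoRA adapters. A rough sketch, not the exact notebook from the stream; the model ID and hyperparameters are just illustrative:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

model_id = "meta-llama/Llama-2-7b-hf"  # gated; requires accepting the license on HF

# Load the base model in 4-bit so it fits comfortably on a single GPU.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, quantization_config=bnb_config, device_map="auto"
)

# Attach small LoRA adapters; only these get trained.
lora_config = LoraConfig(
    r=8,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # a tiny fraction of the 7B parameters
```

From there you train as usual (e.g. with the HF Trainer), with the frozen base weights kept in 4-bit.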
This covers three things: Llama.cpp (Mac/Windows/Linux), Ollama (Mac), MLC LLM (iOS/Android)
Which is not really comprehensive... If you have a Linux machine with GPUs, I'd just use Hugging Face's text-generation-inference (https://github.com/huggingface/text-generation-inference). And I'm sure there are other things that could be covered.
Self-plug: here's a fork of the original Llama 2 code adapted to run on the CPU, or on MPS (M1/M2 GPU) if available:
https://github.com/krychu/llama
It runs with the original weights and gets you to ~4 tokens/sec on a MacBook Pro M1 with the 7B model.
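The CPU/MPS selection is the standard PyTorch pattern; an illustration of the idea (not code copied from the fork):

```python
import torch

# Prefer the M1/M2 GPU via the MPS backend when available, otherwise use the CPU.
device = torch.device("mps" if torch.backends.mps.is_available() else "cpu")
print(f"running on {device}")

# Tensors and modules are then simply created on or moved to that device, e.g.:
x = torch.randn(4, 4096, device=device)
```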
The easiest way I found was GPT4All. Just download and install it, grab a GGML version of Llama 2, and copy it to the models directory in the installation folder. Then fire up GPT4All and run.
For most people who just want to play around and are using macOS or Windows, I'd just recommend lmstudio.ai. Nice interface, with super easy searching and downloading of new models.
The correct answer, as always, is the oobabooga text-generation-webui, which supports all of the relevant backends: https://github.com/oobabooga/text-generation-webui
I don't remember if grammar support has been merged into llama.cpp yet, but it would be the first step toward having Llama + Stable Diffusion locally, outputting text + images and talking to each other. The only part I'm not sure about is how Llama would interpret images coming back. At least it could use them, though, to build e.g. a webpage.
> curl -L "https://replicate.fyi/install-llama-cpp" | bash
Seriously? Pipe a script from someone's website directly into bash?
There seems to be a better guide here (without the risky curl):
https://www.stacklok.com/post/exploring-llama-2-on-a-apple-m...
The LLM is impressive (llama2:13b) but appears to be greatly limited in what you are allowed to do with it.
I tried to get it to generate a JSON object about the movie The Matrix, and the model refused.
Off topic: is there a way to use one of the LLMs to ingest data from a SQLite database and then ask it questions about that data?
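Not a turnkey answer, but one low-tech approach is to put the schema (and maybe a few sample rows) into the prompt and ask in plain text. A hedged sketch: the database path, table name, and the final `complete()` call are placeholders for whatever model/interface you end up using:

```python
import sqlite3

conn = sqlite3.connect("data.db")  # placeholder path; assumes the DB already exists

# Pull the CREATE TABLE statements so the model knows the structure.
schema = "\n".join(
    row[0] for row in conn.execute(
        "SELECT sql FROM sqlite_master WHERE type='table'"
    )
)
sample_rows = conn.execute("SELECT * FROM my_table LIMIT 5").fetchall()  # placeholder table

prompt = (
    f"Database schema:\n{schema}\n\n"
    f"Sample rows from my_table:\n{sample_rows}\n\n"
    "Question: how many orders were placed in July?"
)
# answer = complete(prompt)  # hypothetical call into your local LLM
```

For anything beyond toy tables you'd have the model write SQL against the schema instead of pasting rows, since the data won't fit in the context window.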
I might be missing something. The article asks me to run a bash script on Windows.
I assume this would still need to be run manually to access GPU resources etc., so can someone illuminate what a Windows user is actually expected to do to make this run?
I'm currently paying $15 a month in ChatGPT queries for a personal translation/summarizer project. I run Whisper (const.me's GPU fork) locally and would love to get the LLM part local eventually too! The system generates 30k queries a month but is not super-affected by delay, so lower token rates might work too.
Maybe obvious to others, but the one-line curl install command is taking a long time. Must be the build step. Probably 40+ minutes now on an M2 Max.
Self plug: run llama.cpp as an inference server on a spot instance anywhere: https://cedana.readthedocs.io/en/latest/examples.html#runnin...
How do you decide which model variant to use? There are a bunch of quant-method variations of Llama-2-13B-chat-GGML [0]; how do you know which one to pick? The "Explanation of the new k-quant methods" section is a bit opaque.
The thing that peeves me is that none of the models say how much RAM/VRAM they need to run. Just list the minimum specs, please!
If you just want to do inference / mess around with the model and have a 16 GB GPU, then this [0] is enough to paste into a notebook. You need access to the HF models, though.
0. https://github.com/huggingface/blog/blob/main/llama2.md#usin...
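The gist of that notebook is roughly the following (a sketch, not a copy of it; in float16 the 7B chat model just about fits in 16 GB, and you can drop to 8- or 4-bit loading if you hit OOM):

```python
import torch
from transformers import pipeline

pipe = pipeline(
    "text-generation",
    model="meta-llama/Llama-2-7b-chat-hf",  # gated; requires approved HF access
    torch_dtype=torch.float16,
    device_map="auto",
)

out = pipe("Explain quantization in one paragraph.", max_new_tokens=200)
print(out[0]["generated_text"])
```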
Idiot question: if I have access to sentence-by-sentence professionally-translated text of foreign-language-to-English in gigantic quantities, and I fed the originals as prompts and the translations as completions...
... would I be likely to get anything useful if I then fed it new prompts in a similar style? Or would it just generate gibberish?
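If you tried it, the data-prep side is just turning each pair into a prompt/completion training example. A hypothetical sketch; the field names and prompt template are assumptions, not a standard:

```python
import json

# Aligned sentence pairs: (source language, professional English translation)
pairs = [
    ("Bonjour le monde.", "Hello, world."),
    # ... the rest of the corpus
]

with open("translate_train.jsonl", "w") as f:
    for src, tgt in pairs:
        record = {
            "prompt": f"Translate to English:\n{src}\n",
            "completion": tgt,
        }
        f.write(json.dumps(record, ensure_ascii=False) + "\n")
```

Whether the result is useful will mostly come down to data volume and how well the base model already knows the source language.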
I appreciate their honesty, given that it's in their interest that people use their API rather than run it locally.
Is it possible for such a local install to retain conversation history? So if, for example, you're working on a project and use it as your assistant across many days, you can continue conversations and the model keeps track of what you and it already know.
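Some of the local UIs save chats, but if you're wiring it up yourself, the usual trick is to persist the turns and replay them into the prompt each session (trimmed to the context window). A sketch under that assumption, with the file name and prompt format made up:

```python
import json
import pathlib

HISTORY = pathlib.Path("project_chat.json")  # hypothetical per-project history file

def load_history():
    return json.loads(HISTORY.read_text()) if HISTORY.exists() else []

def save_turn(history, role, content):
    history.append({"role": role, "content": content})
    HISTORY.write_text(json.dumps(history, indent=2))

history = load_history()
next_question = "Where did we leave off yesterday?"
prompt = "\n".join(f"{t['role']}: {t['content']}" for t in history) + f"\nuser: {next_question}"
# send `prompt` (truncated to fit the context window) to the model,
# then save_turn(history, "user", next_question) and the assistant's reply
```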
This is usable, but hopefully folks manage to tweak it a bit further for even higher tokens/s. I’m running Llama.cpp locally on my M2 Max (32 GB) with decent performance but sticking to the 7B model for now.
I need some hand-holding... I have a directory of over 80,000 PDF files. How do I train Llama 2 on this directory and start asking questions about the material? Is this even feasible?
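Training on the PDFs isn't usually how this is done; the common pattern is retrieval-augmented generation: index the text, retrieve the chunks relevant to each question, and stuff them into the prompt. A rough sketch, assuming pypdf and sentence-transformers are installed; the file name, chunking, and final LLM call are placeholders, and at 80,000 files you'd want a real vector database rather than an in-memory array:

```python
import numpy as np
from pypdf import PdfReader
from sentence_transformers import SentenceTransformer

embedder = SentenceTransformer("all-MiniLM-L6-v2")

def pdf_chunks(path, size=1000):
    """Naively split a PDF's text into fixed-size character chunks."""
    text = " ".join(page.extract_text() or "" for page in PdfReader(path).pages)
    return [text[i:i + size] for i in range(0, len(text), size)]

chunks = pdf_chunks("example.pdf")      # placeholder; loop over the whole directory
vectors = embedder.encode(chunks)       # at scale, store these in a vector DB

def retrieve(question, k=4):
    q = embedder.encode([question])[0]
    scores = vectors @ q / (np.linalg.norm(vectors, axis=1) * np.linalg.norm(q))
    return [chunks[i] for i in np.argsort(scores)[-k:]]

context = "\n\n".join(retrieve("What does the material say about topic X?"))
prompt = f"Answer using only this context:\n{context}\n\nQuestion: ..."
# feed `prompt` to Llama 2 via llama.cpp, Ollama, or the HF pipeline
```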
curl -L "https://replicate.fyi/windows-install-llama-cpp"
... returns 404 Not Found.

Is it possible to do hybrid inference if I have a 24 GB card with the 70B model? I.e., offload some of it to my RAM?
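On the hybrid question: yes, llama.cpp supports partial offload. You put as many layers as fit into VRAM and the rest run on the CPU out of system RAM (the `-ngl`/`--n-gpu-layers` option on the CLI, `n_gpu_layers` in the Python bindings). A sketch with llama-cpp-python; the file name and layer count are guesses you'd tune for a 24 GB card:

```python
from llama_cpp import Llama

llm = Llama(
    model_path="llama-2-70b-chat.ggmlv3.q4_K_M.bin",  # hypothetical local quantized file
    n_gpu_layers=40,  # however many layers fit in 24 GB VRAM; the rest stay in RAM
    n_ctx=2048,
)

out = llm("Q: What does partial GPU offloading do? A:", max_tokens=128)
print(out["choices"][0]["text"])
```

The trade-off is speed: the more layers left on the CPU, the slower generation gets.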
As someone with too little spare time, I'm curious: what are people using this for, other than research?
Has anyone built a PC for running these models, and if so, which build do you recommend?
I'm still curious to understand the hype behind Llama 2.
Llama.cpp can run on Android too.
For my fellow Windows shills, here's how you actually build it on Windows:
Before you start:
1. (For Nvidia GPU users) Install the CUDA Toolkit: https://developer.nvidia.com/cuda-downloads
2. Download the model somewhere: https://huggingface.co/TheBloke/Llama-2-13B-chat-GGML/resolv...
In Windows Terminal with PowerShell:
`-DLLAMA_CUBLAS` uses CUDA.
`2> $null` directs the debug messages printed to stderr to a null file so they don't spam your terminal.
Here's a PowerShell function you can put in your `$PROFILE` so that you can just run prompts with `llama "prompt goes here"`:
Adjust your paths as necessary. It has a tendency to talk to itself.