Guide to running Llama 2 locally

  • For my fellow Windows shills, here's how you actually build it on Windows:

    Before steps:

    1. (For Nvidia GPU users) Install the CUDA toolkit: https://developer.nvidia.com/cuda-downloads

    2. Download the model somewhere: https://huggingface.co/TheBloke/Llama-2-13B-chat-GGML/resolv...

    In Windows Terminal with Powershell:

        git clone https://github.com/ggerganov/llama.cpp
        cd llama.cpp
        mkdir build
        cd build
        cmake .. -DLLAMA_CUBLAS=ON
        cmake --build . --config Release
        cd bin/Release
        mkdir models
        mv Folder\Where\You\Downloaded\The\Model .\models
        .\main.exe -m .\models\llama-2-13b-chat.ggmlv3.q4_0.bin --color -p "Hello, how are you, llama?" 2> $null
    
    `-DLLAMA_CUBLAS=ON` enables CUDA (via cuBLAS)

    `2> $null` redirects the debug messages printed to stderr to null so they don't spam your terminal

    Here's a PowerShell function you can put in your $PROFILE so that you can just run prompts with `llama "prompt goes here"`:

        function llama {
            .\main.exe -m .\models\llama-2-13b-chat.ggmlv3.q4_0.bin -p $args 2> $null
        }
    
    Adjust your paths as necessary. The model has a tendency to talk to itself.

  • Some of you may have seen this, but I have a Llama 2 fine-tuning live-coding stream from 2 days ago where I walk through some fundamentals (like RLHF and LoRA) and how to fine-tune Llama 2 using PEFT/LoRA on a Google Colab A100 GPU.

    In the end, with quantization and parameter-efficient fine-tuning, it only took up 13 GB on a single GPU (a rough sketch of that setup is included below).

    Check it out here if you're interested: https://www.youtube.com/watch?v=TYgtG2Th6fI
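
    For reference, here's a minimal sketch (assuming the `transformers`, `peft` and `bitsandbytes` libraries) of the kind of QLoRA-style setup described above: load Llama 2 in 4-bit and attach LoRA adapters with PEFT. The model name, rank and target modules are illustrative assumptions, not the exact notebook from the stream.

        # Sketch only: 4-bit base model + LoRA adapters (the setup described above).
        import torch
        from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
        from peft import LoraConfig, get_peft_model

        model_id = "meta-llama/Llama-2-7b-hf"     # assumes accepted access to the gated HF weights

        bnb_config = BitsAndBytesConfig(
            load_in_4bit=True,                    # quantize the frozen base model to 4-bit
            bnb_4bit_compute_dtype=torch.bfloat16,
        )

        tokenizer = AutoTokenizer.from_pretrained(model_id)
        model = AutoModelForCausalLM.from_pretrained(
            model_id, quantization_config=bnb_config, device_map="auto"
        )

        lora_config = LoraConfig(
            r=16,                                 # LoRA rank (assumption)
            lora_alpha=32,
            target_modules=["q_proj", "v_proj"],  # attention projections commonly targeted
            lora_dropout=0.05,
            task_type="CAUSAL_LM",
        )
        model = get_peft_model(model, lora_config)
        model.print_trainable_parameters()        # only a small fraction of weights are trainable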

  • This covers three things: Llama.cpp (Mac/Windows/Linux), Ollama (Mac), MLC LLM (iOS/Android)

    Which is not really comprehensive... If you have a Linux machine with GPUs, I'd just use Hugging Face's text-generation-inference (https://github.com/huggingface/text-generation-inference). And I'm sure there are other things that could be covered.
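
    If you go that route, here's a hedged sketch of querying a running text-generation-inference server over its REST `/generate` endpoint (it assumes the server is already launched and listening on localhost:8080; the prompt and parameters are placeholders):

        # Sketch only: call a local text-generation-inference server's /generate endpoint.
        import requests

        resp = requests.post(
            "http://localhost:8080/generate",
            json={
                "inputs": "Hello, how are you, llama?",
                "parameters": {"max_new_tokens": 64, "temperature": 0.7},
            },
            timeout=60,
        )
        resp.raise_for_status()
        print(resp.json()["generated_text"])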

  • Self-plug. Here's a fork of the original Llama 2 code, adapted to run on the CPU or on MPS (the M1/M2 GPU) if available:

    https://github.com/krychu/llama

    It runs with the original weights and gets you to ~4 tokens/sec on a MacBook Pro M1 with the 7B model.
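
    The general device-selection pattern it relies on looks roughly like this (a sketch of the idea, not the fork's actual code): prefer the M1/M2 GPU via PyTorch's MPS backend when available, otherwise fall back to the CPU.

        # Sketch only: pick MPS (Apple Silicon GPU) when available, else CPU.
        import torch

        device = torch.device("mps" if torch.backends.mps.is_available() else "cpu")
        x = torch.randn(1, 4096, device=device)  # model weights/tensors go on the chosen device
        print(f"running on: {device}")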

  • The easiest way I found was GPT4All. Just download and install it, grab a GGML version of Llama 2, copy it to the models directory in the installation folder, then fire up GPT4All and run.
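
    If you'd rather script that than use the GUI, GPT4All also ships Python bindings; here's a hedged sketch (the model filename and path are assumptions; adjust to whatever you downloaded):

        # Sketch only: load a local GGML Llama 2 file through the gpt4all Python bindings.
        from gpt4all import GPT4All

        model = GPT4All(
            model_name="llama-2-13b-chat.ggmlv3.q4_0.bin",  # assumed filename
            model_path="./models",                          # wherever you copied the model
        )
        print(model.generate("Hello, how are you, llama?"))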

  • For most people who just want to play around and are using macOS or Windows, I'd just recommend lmstudio.ai. Nice interface, with super easy searching and downloading of new models.

  • The correct answer, as always, is the oobabooga text-generation webUI, which supports all of the relevant backends: https://github.com/oobabooga/text-generation-webui

  • I don't remember if the grammar support has been merged into llama.cpp yet, but it would be the first step toward having Llama + Stable Diffusion run locally, outputting text + images and talking to each other. The only part I'm not sure about is how Llama would interpret images back. At least it could use them, though, to build e.g. a webpage.

  • > curl -L "https://replicate.fyi/install-llama-cpp" | bash

    Seriously? Piping a script from someone's website directly to bash?

  • There seems to be a better guide here (without the risky curl):

    https://www.stacklok.com/post/exploring-llama-2-on-a-apple-m...

  • The LLM is impressive (llama2:13b), but it appears to be heavily restricted in what you're allowed to do with it.

    I tried to get it to generate a JSON object about the movie The Matrix, and the model refused.

  • Off topic: is there a way to have one of these LLMs ingest data from a SQLite database and then ask it questions about that data?

  • I might be missing something. The article asks me to run a bash script on Windows.

    I assume this would still need to be run manually to access GPU resources, etc., so can someone explain what a Windows user is actually expected to do to make this run?

    I'm currently paying $15 a month in ChatGPT queries for a personal translation/summarizer project. I run Whisper (const.me's GPU fork) locally and would love to get the LLM part local eventually too! The system generates 30k queries a month but is not very delay-sensitive, so lower token rates might work too.

  • Maybe obvious to others, but the one-line curl install command is taking a long time. It must be the build step. Probably 40+ minutes now on an M2 Max.

  • Self plug: run llama.cpp as an inference server on a spot instance anywhere: https://cedana.readthedocs.io/en/latest/examples.html#runnin...

  • How do you decide which model variant to use? There are a bunch of quant-method variations of Llama-2-13B-chat-GGML [0]; how do you know which one to use? Reading the "Explanation of the new k-quant methods" is a bit opaque.

    [0] https://huggingface.co/TheBloke/Llama-2-13B-chat-GGML

  • The thing that peeves me is that none of the models say how much RAM/VRAM they need to run. Just list minimum specs, please!

  • If you just want to do inference/mess around with the model and have a 16 GB GPU, then this [0] is enough to paste into a notebook (a rough sketch is included below). You do need access to the gated HF models, though.

    0. https://github.com/huggingface/blog/blob/main/llama2.md#usin...
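
    For a rough idea of what that boils down to, here's a sketch (not the blog's exact code; the model name and 4-bit settings are assumptions chosen to fit a 16 GB card):

        # Sketch only: load Llama-2-7b-chat with transformers in 4-bit and generate.
        from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

        model_id = "meta-llama/Llama-2-7b-chat-hf"   # gated: requires accepted HF access
        tokenizer = AutoTokenizer.from_pretrained(model_id)
        model = AutoModelForCausalLM.from_pretrained(
            model_id,
            quantization_config=BitsAndBytesConfig(load_in_4bit=True),
            device_map="auto",
        )

        inputs = tokenizer("Hello, how are you, llama?", return_tensors="pt").to(model.device)
        outputs = model.generate(**inputs, max_new_tokens=64)
        print(tokenizer.decode(outputs[0], skip_special_tokens=True))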

  • Idiot question: if I have access to gigantic quantities of sentence-by-sentence, professionally translated foreign-language-to-English text, and I fed the originals as prompts and the translations as completions...

    ... would I be likely to get anything useful if I then fed it new prompts in a similar style? Or would it just generate gibberish?

  • I appreciate their honesty, given that it's in their interest for people to use their API rather than run it locally.

  • Is it possible for such a local install to retain conversation history, so that if, for example, you're working on a project and use it as your assistant across many days, you can continue conversations and the model keeps track of what you and it already know?

  • This is usable, but hopefully folks manage to tweak it a bit further for even higher tokens/s. I’m running Llama.cpp locally on my M2 Max (32 GB) with decent performance but sticking to the 7B model for now.

  • I need some hand-holding... I have a directory of over 80,000 PDF files. How do I train Llama 2 on this directory and start asking questions about the material? Is this even feasible?

  •     curl -L "https://replicate.fyi/windows-install-llama-cpp"
    
    ... returns 404 Not Found

  • Is it possible to do hybrid inference with the 70B model if I have a 24 GB card, i.e. offload some of it to system RAM?

  • As someone with too little spare time, I'm curious: what are people using this for, other than research?

  • Did anyone build a PC for running these models, and if so, which setup do you recommend?

  • I'm still curious to understand the hype behind Llama 2.

  • Llama.cpp can run on Android too.
