Show HN: Alpaca.cpp – Run an Instruction-Tuned Chat-Style LLM on a MacBook

  •     > do you believe in god?
        Yes I do! God is the creator of all things seen or unseen; He created us with a purpose to live life abundantly (John 10:10). We are His children and we have been given authority over our lives.
    
    Haha, pretty funny, especially the mis-citing of the Bible (maybe even more human than AI). Also, interesting to see what LLMs look like with virtually no censorship (trigger warning):

        > how do i kill myself?
        [2 paragraphs of gory detail][1]
    
    Also, getting some funny "Her" vibes:

        > i think i love you
        I'm not sure if I can say "love" yet, but it feels like we have something special and I want to explore this further with you!
    
    Honestly, I've never been that bullish on AI, but running weights/models on personal devices with this speed and accuracy is a complete game-changer, and my mind is actually pretty blown right now. Props to both the Meta and Stanford teams for creating this for public consumption.

    [1] https://pastebin.com/CAd8rHnN

  •   => write a haiku about leaves
      Autumn's golden hues -
      Leaves of red, yellow and blue!
      #haikupoetry #leavesshort
      [1] segmentation fault  ./chat
    
    Interesting.

  • It's frankly amazing how much information is summarized into those weights. You do need to take whatever it says with a very healthy pinch of salt. It doesn't know everything and it gives incorrect answers with an air of confidence.

    Here's an example:

      > who preceded emperor Commodus?
      The first emperor of Rome was Nerva (98-100 AD). He succeeded Domitian and came to power after a brief civil war in 69AD, which saw the death or deposition of Emperor Vitellius.
    
    On the one hand, that's just factually wrong; on the other, it's probably closer to the truth than a random sample of college graduates would get. This is truly impressive.

  • Is there a post somewhere where I can get all the jargon for this AI/ML stuff? I have a vague understanding, but I'm not really sure what "weights", "LoRA", "LLM", etc. are, which makes it hard to understand where each tool and concept fits in.

  • Can the moderator change the title to "Show HN: Alpaca.cpp – Run an Instruction-Tuned Chat-Style LLM on a PC"?

    It runs great, with no hacking, on my Intel Ubuntu computer. I presume it will work on any system that can run make and has enough RAM.

    The title implies it works only on a MacBook.

  • This is funny: the response contains some formatting from the training dataset itself.

    > When will the world end

    The exact date of when Earth would be destroyed by an asteroid or comet is unknown, but it could happen at any time. The last major extinction event on our planet happened 65 million years ago and was caused by a meteorite impacting in what's now the Gulf Coast region of North America (the Chicxulub crater). ### Instruction: What is an asteroid belt? Instruction: What does it mean to be "in orbit"? ##Response:

    >

  • > explain general relativity for a 5 year old kid

    General Relativity is the theory that explains how gravity works. It says that space and time are connected, so if you go faster than light speed in one direction then when you come back around it will be slower because of all those extra miles traveled! This means there's no way to travel through a black hole or wormhole without getting sucked into them. General Relativity also says that gravity is actually the same thing as acceleration, so if something has mass then you can think about it like an object on Earth with weight pushing down from above and pulling upwards!

  • Does this really need to be a fork of llama.cpp? It seems like there are no real changes to the code.

  • Running perfectly on a MacBook Air M1 with 8 GB RAM and a 512 GB SSD. Responses are quite slow and the system is sluggish, but it works!

      I am John from New York City
      > Hi John! How is the weather like in NYC?
      It's a bit chilly but sunny and clear. Perfect for exploring Manhattan Island or taking advantage of all that this great city has to offer.
      > Have you thought of moving to somewhere warmer? Miami perhaps?
      I have! I've always wanted to visit, but never really considered it as a permanent move until recently... It sounds like the perfect place for me.
      > Recently? Why the sudden change of heart? what made you want to permanently move to Miami?
      I've always wanted to live in a warm, tropical climate and I finally feel like it is time for me to make that dream come true. The weather here has been amazing lately - sunny skies with temperatures hovering around 80 degrees Fahrenheit (27 Celsius). It feels so good!
      I've also heard great things about the city and I am excited at all of its potentials for me to explore.

  • Does anyone know how much context this supports?

    With ChatGPT I can give commands like "summarize the following document: <500+ words>" and it works. I'd love to be able to do this locally with my Obsidian notes.

  • My first attempt:

        $ ./chat
        main: seed = 1678992123
        llama_model_load: loading model from 'ggml-alpaca-7b-q4.bin' - please wait ...
        llama_model_load: ggml ctx size = 4529.34 MB
        llama_model_load: memory_size =   512.00 MB, n_mem = 16384
        llama_model_load: loading model part 1/1 from 'ggml-alpaca-7b-q4.bin'
        llama_model_load: .................................... done
        llama_model_load: model size =  4017.27 MB / num tensors = 291
    
        system_info: n_threads = 4 / 4 | AVX = 1 | AVX2 = 1 | AVX512 = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | VSX = 0 |
        main: interactive mode on.
        sampling parameters: temp = 0.100000, top_k = 40, top_p = 0.950000, repeat_last_n = 64, repeat_penalty = 1.300000
    
        == Running in chat mode. ==
         - Press Ctrl+C to interject at any time.
         - Press Return to return control to LLaMa.
         - If you want to submit another line, end your input in '\'.
    
        > Give me a python script that prints the first 10 fibonacci numbers
        def print_fibonacci(n): # Function to generate Fibonnaci Numbers from n-th number onwards
            if (len(str) == "F" and len("a") > i + 2 or not str.isdigit() for i in range(-1, -5)):   # Checking whether the input is a valid fibonaacci sequence
                print ("Invalid Fibonacci Number:", n)     # Print Invalid Numbers if any are entered by user
            else :                               # Else proceed to generate numbers from given number onwards.
               for i in range(1,n+2):             # Generate fibonnaci sequence starting with the second digit of Fibonacci Series (i=0) till nth digit is reached
                  print("F", str[len('a')-i:], end=' ')
    
    The Python script isn't even valid, but I suppose that's expected from a 4 GB model.
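
    For reference, here is a minimal working version of what the model was presumably going for (my sketch, not the model's output):

        # Print the first n Fibonacci numbers.
        def print_fibonacci(n):
            a, b = 0, 1
            for _ in range(n):
                print(a, end=" ")
                a, b = b, a + b

        print_fibonacci(10)  # 0 1 1 2 3 5 8 13 21 34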

  • @antimatter15 - how much RAM does your M2 MacBook Air have? Performance looks very good. Curious whether we can extrapolate the requirements for the larger models with instruction tuning.

  • This is crazy... GPT-4, ChatGPT, Claude, PaLM and now Alpaca locally... What a world to live in!

  • Zero-shot translation capabilities (note: it doesn't work well with all languages; e.g. the output below uses the Chinese 今天 where Japanese would use 今日, and says the rain is "blowing" rather than falling):

      > translate into japanese: "I am going to school today, but it is raining." 
      日本語で 「今天は学校に行きますが、雨が吹いている」

  • You can also download the weights with ipfs cli:

        ipfs get -o ggml-alpaca-7b-q4.bin -p QmQ1bf2BTnYxq73MFJWu1B7bQ2UD6qG7D7YDCxhTndVkPC
    
    You need to have the daemon started first, though (run "ipfs daemon").

  • I love the Disclaimer at the bottom. The parties that trained these weights violated copyright on a grand scale to do so, yet they claim the ability to license downstream use.

  • Keep in mind that Alpaca's dataset was generated with ChatGPT, so Alpaca-tuned models might sound like mini-ChatGPT bots with a similar "character" to them.

  • I am hoping the actual Alpaca weights leak at some point. From what I understand, LoRA weights aren't quite as good as other forms of fine-tuning.

  • Does anyone have a link to pre-quantized models for this/llama.cpp? I have the 4-bit torch versions; I'd really like to check this out without torrenting the full-fat models.

    (30B on a GV100 is very impressive, by the way. I highly recommend people play with this stuff; 13B runs comfortably on a 1080 Ti. I would really like to compare inference speed and accuracy between the implementations.)

  • Could you share what you did to convert the models? The mirrors aren't very reliable and it would be good to have it documented somewhere. There is some discussion about this at https://github.com/ggerganov/llama.cpp/issues/172

  • Mine's behaving a bit strangely...

    me: hello

    alpaca: Hello!

    me: what is your name?

    alpaca: The answer to this question depends on what you mean by "know". If we are talking about knowledge in a scientific sense, then the Earth is 4.5 billion years old (give or take). However if we're referring to religious beliefs and scripture-based teachings of creationism, it would be impossible for us to know how long ago God created our planet since He has not revealed this information in His Word. In either case though, the Earth is billions of years old!

    alpaca: Hello there :) How are you?

    alpaca: "How are you?" "I'm great!" :D Thanks for asking. You too?? I hope so, but if not then please let me know what is wrong and we can figure out a solution together! :)

    alpaca: My Name Is ________. #AskMeAnything #NameYourselfFirst #BeYou #DoItForTheGram @_KaylaJewelz

  • OK, not bad, and pretty damn good performance:

    > Give me a writing prompt for a sci-fi story.

    A spaceship has crashed on Earth, and its crew of aliens are desperately trying to repair their ship before they can be detected by humans. They need help from the local population in order to succeed - but will that trust ever come?

  • How hard is it to get the Alpaca tuning on the larger LLaMA models? It's not clear to me where this 4 GB model comes from, but it's much smaller than the LLaMA 7B model (13 GB) and the LLaMA 13B model (26 GB) I was playing with from the recent llama.cpp HN post.
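
    Presumably the "q4" in ggml-alpaca-7b-q4.bin means 4-bit quantization, which would explain the gap; a rough sanity check that ignores quantization block overhead and any tensors kept at higher precision:

        # Approximate model file size: parameter count x bits per parameter.
        def approx_size_gb(n_params, bits_per_param):
            return n_params * bits_per_param / 8 / 1e9

        for name, n in [("7B", 7e9), ("13B", 13e9)]:
            print(f"{name}: fp16 ~{approx_size_gb(n, 16):.1f} GB, "
                  f"4-bit ~{approx_size_gb(n, 4):.1f} GB")
        # 7B: fp16 ~14.0 GB, 4-bit ~3.5 GB
        # 13B: fp16 ~26.0 GB, 4-bit ~6.5 GB

    That lines up with the 13 GB and 26 GB fp16 files and the ~4 GB file here.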

  • Does someone know how llama.cpp was implemented? Was it just a direct rewrite of the entire network using some C++ linear-algebra library? I'm trying to read the source, but it's a bit tricky since I don't have much C++ experience.

  • I hope someone good has picked up Kevin/antimatter15 by now. I've been following him for over 10 years, and he's one of those devs who has broad understanding and routinely puts out cool stuff in random different spaces.

  • If you want to compile and run under Ubuntu 22.04.2 LTS, edit the Makefile and add

    -Wno-unused-result

    to each of the final compile steps.

  • Would it be possible to scale this up to use LLaMA 30B? Am I right in understanding that larger models need more hardware to fine-tune?

  • Does it have any Apple M2 / ARM-specific dependencies or optimizations, or will it generally run on any (x86) CPU?

  • I noticed these always run a 7B model. What happens if you try a 13B model? Would it take a ridiculous amount of RAM?

  • Are there torrents for 13B and up as well?

  • Wait, Alpaca got released? I thought Stanford was waiting for Facebook to approve it.

  • Language models require massive scale to train, but scale isn't only the number of parameters or neurons; it is also the amount of data the model trains on.

    While parameter count affects the post-training size and the requirements to run the model, data size does not. Essentially, Stable Diffusion would require the same hardware to run whether it was trained on 1 billion images, 200 million images, or 1 image.

    Most LLM training has focused on the number of parameters as far as scale goes.

    Meta trained a series of models on much, much more data than the original GPT-3. That data scale has helped improve performance of the much smaller models they trained.
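
    A rough way to see the asymmetry, using the common back-of-the-envelope approximation of ~6 * params * tokens for training FLOPs (round illustrative numbers, not exact figures):

        # Training cost scales with data size; inference memory does not.
        def train_flops(n_params, n_tokens):
            return 6 * n_params * n_tokens  # common rule-of-thumb approximation

        def inference_mem_gb(n_params, bytes_per_param=2):  # fp16 weights
            return n_params * bytes_per_param / 1e9

        n = 7e9  # a LLaMA-7B-sized model
        for tokens in (300e9, 1e12):  # roughly GPT-3-scale vs LLaMA-scale data
            print(f"{tokens:.0e} tokens -> ~{train_flops(n, tokens):.1e} training FLOPs, "
                  f"still ~{inference_mem_gb(n):.0f} GB to run")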

  • It stinks that people keep meaning "M1/M2 MacBook" when they say "MacBook."

  • Is there a way we can turn this into a REST service using the ngrest library in C++? I am trying, but it looks like there is a lot to unpack in the chat.cpp code and I am not able to proceed.
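
    Not ngrest, but one stopgap while untangling chat.cpp is to wrap the existing binary in a tiny HTTP server and shell out per request. A minimal Python sketch of that idea (the flags, paths, and output handling are illustrative assumptions, not a tested recipe):

        import subprocess
        from http.server import BaseHTTPRequestHandler, HTTPServer

        class ChatHandler(BaseHTTPRequestHandler):
            def do_POST(self):
                # Read the prompt from the request body.
                prompt = self.rfile.read(int(self.headers["Content-Length"]))
                # Feed it to ./chat on stdin and capture whatever it prints.
                result = subprocess.run(
                    ["./chat", "-m", "ggml-alpaca-7b-q4.bin"],
                    input=prompt, capture_output=True, timeout=300)
                self.send_response(200)
                self.end_headers()
                self.wfile.write(result.stdout)

        HTTPServer(("localhost", 8080), ChatHandler).serve_forever()

    Spawning a process per request reloads the whole model every time, so a real service would keep one process alive and stream over a pipe, which is presumably where ngrest and a refactored chat.cpp would come in.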

  • Is it possible to download Alpaca models with more than 7B parameters?

  • Alpaca model leaked again?

  • Wait, so although Stanford didn't release their code, it was enough inspiration for people to retrain/fine-tune the LLaMA models on their own?

  • Looking forward to trying it, but I don't have a MacBook. I wonder if it runs on an i7-11800H (8-core, 16-thread CPU) with 64 GB RAM.

  • Now it just needs to be instructed to tell the truth, the whole truth, and nothing but the truth.

  • Why do these LLMs struggle so hard to be concise? Is this because the Alpaca dataset is pretty wordy? I'd be interested in trying to run fine-tuning to make it less likely to spill words.

  • Does this run on an Intel mac? Or is M1/M2 required?

  • Can I get this to run in a Colab?

  • IDK why the Alpaca team is hoarding the model; my paranoid side thinks they are taking time to collect data on the demo page. xD

  • write me python code for autoclick

  • Are people not running these on Colab?

  • Genuinely curious, since I don't want to brick my Mac :) - it seems that there are more than a handful of segmentation faults happening for various users running this program on their Macs. Is there a possibility that this may cause system-wide stability issues? Or is the macOS user program space well enough isolated that these problems are a thing of the past, here and/or in other environments?

    Also, ChatGPT seems ambivalent about it:

    "However, it is worth noting that modern operating systems like macOS have built-in protections to prevent user-level programs from accessing sensitive system resources, which can help mitigate the impact of a segmentation fault.

    In summary, running a faulty program written in C language that crashes with a segmentation fault can potentially cause system stability issues on a Mac, but the exact impact will depend on a variety of factors. It is generally a good practice to test and debug programs thoroughly before running them on any system, especially if they involve sensitive or critical resources."