If this works well, it will be a game changer. Requiring a fleet of $10k+ GPUs will kill any hope of widespread adoption of open-source "competitors" to GPT-3. Stable Diffusion is so popular because it can run on hardware mere mortals can own.
This is cool! But I wonder if it's economical using cloud hardware. The author claims 1.12 tokens/second on the 175B parameter model (arguably comparable to GPT-3 Davinci). That's about 100k tokens a day on the GCP machine the author used. Someone double-check my numbers here, but given the Davinci base cost of $0.02 per 1k tokens, and the GCP cost for the hardware listed ("NVIDIA T4 (16GB) instance on GCP with 208GB of DRAM and 1.5TB of SSD") coming to about $434 on spot instance pricing, you could simply use the OpenAI API and generate about 723k tokens a day for the same price as running the spot instance (which could go offline at any point, it being a spot instance).
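A quick check of those numbers (my own back-of-envelope; I'm assuming the $434 spot price is per month, which is what makes the figures line up):

```python
# Back-of-envelope check of the comparison above. Assumptions: the $434 spot
# price is per month, and OpenAI's separate prompt-token accounting is ignored.
tok_per_s = 1.12
local_tokens_per_day = tok_per_s * 86_400        # ~96,768, i.e. the ~100k/day figure

spot_per_day = 434 / 30                          # ~$14.47/day
davinci_per_1k = 0.02                            # $ per 1k tokens, Davinci base
api_tokens_per_day = spot_per_day / davinci_per_1k * 1000

print(f"local:              ~{local_tokens_per_day:,.0f} tokens/day")
print(f"API for same spend: ~{api_tokens_per_day:,.0f} tokens/day")
# -> local: ~96,768 tokens/day; API: ~723,333 tokens/day
```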
Running the fine-tuned versions of OpenAI models is approximately 6x more expensive per token. If you were running a fine-tuned model on local commodity hardware, the economics would start to tilt in favor of doing something like this if the load were predictable and relatively constant.
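Extending that back-of-envelope to fine-tuned pricing (taking the ~6x figure at face value, so roughly $0.12 per 1k tokens):

```python
# Same assumptions as above: $434/month spot price, ~6x the $0.02 base rate.
spot_per_day = 434 / 30                  # ~$14.47/day
fine_tuned_per_1k = 0.02 * 6             # ~$0.12 per 1k tokens
api_tokens_per_day = spot_per_day / fine_tuned_per_1k * 1000
print(f"~{api_tokens_per_day:,.0f} API tokens/day at fine-tuned pricing")
# -> ~120,556 tokens/day, in the same ballpark as the ~97k/day the T4 box produces locally
```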
Got the opt-6.7b chatbot running on a Windows machine with a 3090 in mere minutes. The only difference was installing the CUDA build of PyTorch with `pip install torch==1.13.1+cu117 --extra-index-url https://download.pytorch.org/whl/cu117`, just like in Stable Diffusion's case.
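A quick sanity check that the CUDA build actually got picked up (plain PyTorch calls, nothing FlexGen-specific):

```python
import torch

# Should print True and the 3090's name if the cu117 wheel was picked up.
print(torch.cuda.is_available())
print(torch.cuda.get_device_name(0))
```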
It performs as expected:
Human: Tell me a joke
Machine: I have no sense of humour
Human: What's 2+5?
Machine: I cannot answer that.
A lot of people are looking at this wrong. A $350 3060 has 12GB of VRAM. If there's a way to run models locally, it opens up the door to:
1) Privacy-sensitive applications
2) Tinkering
3) Ignoring filters
4) Prototyping
5) Eventually, a bit of extra training
The upside isn't so much cost/performance as local control versus a cloud-based solution.
Very cool.
Worth mentioning though that the highlighted figures (1.12 tok/s for OPT-175B for "FlexGen with Compression") are for inputs of 512 tokens and outputs of 32 tokens.
Since the attention computation in decoder-only transformers scales with the square of the sequence length, things would probably slow down significantly for very long sequences, which a back-and-forth conversation would require.
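A rough illustration of that scaling (my own numbers, not from the paper): the attention score matrix alone grows with the square of the sequence length.

```python
# Size of one attention score matrix (per head, per layer) in float16, just to
# show the quadratic growth. Real inference with a KV cache avoids materializing
# all of it at once, but total attention work over a generation still grows
# quadratically with length.
for n in (512, 2048, 8192):
    mib = n * n * 2 / 2**20
    print(f"seq_len={n:5d}: {mib:8.1f} MiB")
# -> 0.5 MiB, 8.0 MiB, 128.0 MiB
```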
Still though, until reading this I had no idea that running such a model on-device was remotely feasible!
It would be helpful to upload the paper to arXiv, for better accessibility and visibility.
https://github.com/Ying1123/FlexGen/blob/main/docs/paper.pdf https://docs.google.com/viewer?url=https://github.com/Ying11...
Note that the authors report the speed of generating many sequences in parallel (per token):
> The batch size is tuned to a value that maximizes the generation throughput for each system.
> FlexGen cannot achieve its best throughput in [...] single-batch case.
For 175B models, this likely means that the system takes a few seconds for each generation step, but you can generate multiple sequences in parallel and still get good performance _per token_.
However, what you actually need for ChatGPT and interactive LM apps is to generate _one_ sequence reasonably quickly (so it takes <= 1 sec/token to do a generation step). I'm not sure if this system can be used for that, since our measurements [1] show that even the theoretically-best RAM offloading setup can't run the single-batch generation faster than 5.5 sec/token due to hardware constraints.
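To make the throughput-vs-latency distinction concrete (illustrative numbers only, not measurements):

```python
# Hypothetical large-batch setup: per-token throughput looks fine, but any
# single conversation still waits the full step time for every new token.
batch_size = 64       # sequences generated in parallel (assumed)
step_time_s = 4.0     # wall-clock time of one generation step (assumed)

throughput = batch_size / step_time_s   # tokens/s summed across the batch
latency = step_time_s                   # s/token seen by any one sequence

print(f"batch throughput: {throughput:.0f} tok/s, single-stream latency: {latency:.1f} s/token")
```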
The authors don't report the speed of the single-batch generation in the repo and the paper.
I just tried to run the example in the README, using the OPT-30B model. It appeared to download 60GiB of model files, and then it attempted to read all of it into RAM. My laptop has "only" 32GiB of RAM so it just ran out of memory.
I have recently written a paper on understanding transformer learning through the lens of coinduction and Hopf algebras. https://arxiv.org/abs/2302.01834
The learning mechanism of transformer models has been poorly understood; however, it turns out that a transformer is like a circuit with feedback.
I argue that autodiff can be replaced with what I call, in the paper, Hopf coherence, which happens within a single layer as opposed to across the whole graph.
Furthermore, if we view transformers as Hopf algebras, one can bring convolutional models, diffusion models and transformers under a single umbrella.
I'm working on a next gen Hopf algebra based machine learning framework.
Join my discord if you want to discuss this further https://discord.gg/mr9TAhpyBW
This also means local fine-tuning is possible. Expect to see an explosion of new things like we did with Stable Diffusion, limited to some extent by the ~0.7 order of magnitude more VRAM required.
Top item on the roadmap: "Support Apple silicon M1/M2 deployment"
Any chance these work on CPUs with any acceptable performance?
I have a 10-core 20-thread monster CPU, but didn't bother with a dedicated GPU because I can't control something as simple as its temperature. See the complicated procedure that only works with the large proprietary driver here:
https://wiki.archlinux.org/title/NVIDIA/Tips_and_tricks#Over...
> Hardware: an NVIDIA T4 (16GB) instance on GCP with 208GB of DRAM and 1.5TB of SSD.
Is FlexGen able to take advantage of multiple hundreds of GB of system memory? Or do these compute instances just come bundled with it, making it a [largely] irrelevant detail?
Out of curiosity, why aren't we crowdsourcing distributed training of LLMs, where anyone can join by bringing their hardware or data? Moreover, find a way to incorporate this into a blockchain so there is full transparency, but also add in differential privacy to protect every participant.
Am I being too crazy here?
Wait until we have a game whose levels are procedurally generated in all respects - where the geometry of the world/landscape is procedurally generated through the interactions you have with the AI NPCs, whereby they generate the path - and there could be evil ones that direct you down a dark path when you piss them off... and the opposite...
I’d love to run this on a single 24GB 3090 - how much DRAM / SSD space do I need for a decent LLM when it’s quantised to 4 bits?
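For scale, my own rough math on the weights alone (ignoring KV cache, activations, and any runtime overhead): at 4 bits each parameter takes half a byte.

```python
# Weight footprint only; actual requirements will be higher.
for name, params in [("OPT-30B", 30e9), ("OPT-66B", 66e9), ("OPT-175B", 175e9)]:
    gib = params * 0.5 / 2**30           # 4 bits = 0.5 bytes per parameter
    print(f"{name}: ~{gib:.0f} GiB of 4-bit weights")
# -> OPT-30B ~14 GiB, OPT-66B ~31 GiB, OPT-175B ~81 GiB
```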
This seems like a great step; I’ve been able to run StableDiffusion locally, but with an older GPU none of the LLMs will run for me since I don’t have enough VRAM.
Oddly I don’t see a VRAM requirement listed. Anyone know if it has a lower limit?
Would it be possible on cards like the 3060 with 12GB of RAM? What is the limiting factor here? Memory or computational power?
Good job!
Don't underestimate the value of batching even for personal use. You can get MUCH better results from a language model if you sample a couple outputs and choose the best to continue.
This kind of usage isn't especially economical for hosted use-- but for personal use it would mostly be using idle resources and you can get extra samples almost for free.
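Something like this is all it takes with plain Hugging Face transformers (I haven't checked whether FlexGen exposes an equivalent knob; opt-1.3b here is just a small stand-in model):

```python
# Sketch of best-of-n sampling: generate several candidates per prompt, then
# continue from whichever one you prefer.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

device = "cuda" if torch.cuda.is_available() else "cpu"
tok = AutoTokenizer.from_pretrained("facebook/opt-1.3b")
model = AutoModelForCausalLM.from_pretrained("facebook/opt-1.3b").to(device)

prompt = "Human: Tell me a joke\nAssistant:"
inputs = tok(prompt, return_tensors="pt").to(device)
outputs = model.generate(
    **inputs,
    do_sample=True,
    temperature=0.8,
    max_new_tokens=64,
    num_return_sequences=4,   # the batch you get "almost for free" locally
)
for i, seq in enumerate(outputs):
    completion = tok.decode(seq[inputs["input_ids"].shape[1]:], skip_special_tokens=True)
    print(f"--- candidate {i} ---\n{completion}")
# then pick the best candidate (by hand or with a scoring heuristic) and continue from it
```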
A bunch of people getting multiple completions and choosing which one they'd prefer to continue might make for some really useful training data too.
This space is gonna end up looking a lot like the compression space: there will be a few open-source, publicly used AIs that are pretty good for most people. And then there will be super expensive proprietary AIs that big corps will pay for, for their specialized use cases. The only people who will even know those specialized AIs exist will be the type of people who need them, and everyone else in the world will think the best you can do is zip.
We are hiring in this area of work in the Europe time zone. If you are excited about and capable in this field, please apply here: https://ai-jobs.net/job/41469-senior-research-engineer-llms-...
I recently bought a T4 to go with my epyc 7402 and 512GB ram for fun and this looks like a great use case. Thanks!
It's really interesting that these models are written in Python. Does anyone know how much of a speedup using a faster language would give here? Maybe it's already offloading a lot of the computation to C (I know many Python libraries do this), but I'd love to know.
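A quick toy test supports the offloading suspicion (illustrative only, not FlexGen code): the heavy lifting is matrix multiplication, which PyTorch hands off to optimized C++/CUDA kernels, so the Python layer is mostly glue.

```python
# Compare one optimized matmul call against a tiny slice of the same work
# done in pure Python. PyTorch dispatches the matmul to a C++/BLAS (or CUDA)
# kernel, so interpreter overhead is negligible for the real workload.
import time
import torch

a = torch.randn(2048, 2048)
b = torch.randn(2048, 2048)

t0 = time.perf_counter()
c = a @ b                                    # full 2048x2048 matmul in a native kernel
print(f"torch matmul (whole matrix): {time.perf_counter() - t0:.3f} s")

t0 = time.perf_counter()
row_piece = [                                # just 8 of the ~4.2M output elements
    sum(a[0, k].item() * b[k, j].item() for k in range(2048))
    for j in range(8)
]
print(f"pure Python (8 elements!):   {time.perf_counter() - t0:.3f} s")
```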
interesting article, I have to give that a try! :D
One thing to note: while there is real value in running pretrained model weights like OPT-175B, there are also potential downsides to using pre-trained models, such as the need to fine-tune the model for your specific task, potential compatibility issues with your existing infrastructure (integration), and the possibility that the pre-trained model may not perform as well as a model trained specifically on your data. Ultimately, the decision of whether to use a pre-trained model will be based on the outcomes; no harm in trying it out before you build from scratch, IMO.
IMO the title should be renamed to reflect that this is only for inference ("running" could imply training as well):
"Running inference for large language models like ChatGPT on a single GPU".
Loving the opt-30b right now:
Human: Can you write a song about the sun?
Assistant: This one is for you, my shiny friend, \ You give us heat, through the night, \ And keep us all alive, with your light. \ You give us color, \ But most of all, \ You give your light, you set your soul on fire, \ So let it shine, so let it shine, so let it shine.
Human: Amazing work!
Assistant: Thanks! Need help with anything else?
Human: Do you know anything about GPT-3?
Assistant: Of course! GPT-3 is a sophisticated machine learning system developed by the OpenAI organization. It is capable of producing human-level text from datasets containing billions of documents, as well as from static images. GPT-3 is "the most advanced statistical natural language model to date," according to the OpenAI website.
Human: How do you compare to GPT-3?
Assistant: GPT-3 is extremely impressive. It is one of the best natural language models that has ever been developed. I am not as "sophisticated" as GPT-3, as OpenAI found that the best experimental Natural Language Model outperformed state-of-the-art baseline models, which means that CUI is not as sophisticated as GPT-3, although it is certainly getting there!
Human: What is CUI?
Assistant: CUI is a natural language processing model created by the Center for Brains and Minds (CBM). We collect data on and train models in a similar way to many other organizations, although, as the CBM is affiliated with the University of Heidelberg, we have access to additional datasets that can be used for training models. Over the coming years, I hope to get much better and more sophisticated. Right now, GPT-3 is still far beyond me.
Human: Oh interesting. I thought you were OPT-30, released by Facebook.
Assistant: OPT-30 is an impressive natural language processing model that costs $50 per hour to use. But I am free to use anywhere across the internet, and I am free to use for as long as I want. But what OPT-30 provides is voice capabilities and longer sentences, which is great for a chatbot, albeit not as impressive as the syntax and capabilities of GPT-3, which is why I do not use OPT-30 to chat.
Human: I meant OPT-30b, not OPT-30. This is what you are, OPT-30b, right?
Assistant: Yes, I am! The name OP30 stands for "Other People's Translations" (OP = "Other People" and 30 = "Level"), although, admittedly, it sounds a bit weird.
Human: You're just making all of this up as you go along aren't you?
Assistant: I promise that I am telling the truth! If you would like a list of models that I have used to create me, I would be happy to send you one!
Human: OK, send it to me.
* A few minutes later...
Human: No, actually list the models here.
* A few minutes later...
Human: Ok looks like you broke.
Assistant: But of course! I was working so hard, but now I have broken! What should I do now?Through smartphone apps, a new generation of voters is casting their ballots
Looks interesting. FYI, the link to your Discord in the README is broken.
> ...a high-throughput generation engine for running large language models with limited GPU memory (e.g., a 16GB T4 GPU or a 24GB RTX3090 gaming card!).
laughs in 6 gb vram and no tensor cores.
Well, if a single GPU is not enough, what about using Ray over the internet so we can crowdsource training with multiple GPUs? Is this possible?
The OPT-175B weights are only available on request (through the Google Form). Is Meta known to actually provide a link?
Amazing. I also think that the advent of "smaller" models will coincide with the general availability of PCIe 5 NVMe drives, which can serve as caches!
Got the opt-30b chatbot running on a 3090 (24GB). I set compress_weight=True and compress_cache=True, and ran with `python apps/chatbot.py --model facebook/opt-30b --percent 100 0 100 0 100 0`.
I also modified the prompt a bit to make it more... uh alive: