Here are the zero-shot accuracy numbers posted in the Hugging Face evaluations for Cerebras-GPT 13B vs. the results reported for LLaMA 13B in its paper:
Model             BoolQ  PIQA  SIQA  HellaSwag  WinoGrande  ARC-e  ARC-c  OBQA
LLaMA 13B          78.1  80.1  50.4       79.2        73.0   74.8   52.7  56.4
Cerebras-GPT 13B      -  76.6     -       51.3        64.6   71.4   36.7  28.6
FYI: Cerebras's nodes are very different from your typical Nvidia training nodes:
https://www.anandtech.com/show/16626/cerebras-unveils-wafer-...
Each individual "chip" has 40GB of SRAM vs. ~76MB for the Nvidia H100, plus networked pools of external RAM, SSDs, and so on. That's why the training architecture is so different.
OT: I don't know about their scaling strategy for LLMs, but their scaling strategy for displaying pictures is disappointing.
(it's all blurry)
Summary: This is a company that makes AI accelerator ICs. They trained a family of GPT models following the Chinchilla recipe and released the model weights under a permissive license.
> Trained using the Chinchilla formula, these models provide the highest accuracy for a given compute budget.
I'm confused as to why 111-million-parameter models are trained with the Chinchilla formula. Why not scale up the training data? If you're training smaller models, surely optimizing for performance is better than optimizing for total compute.
Seems like a silly misunderstanding of the Chinchilla paper, but I'm sure I'm missing something
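For concreteness, the Chinchilla rule of thumb is roughly 20 training tokens per parameter, with training compute approximated by C ≈ 6·N·D FLOPs. Here is a tiny sketch of that arithmetic (my own illustration of the heuristic, not Cerebras's exact recipe):

```python
# Rough sketch of the Chinchilla rule of thumb (illustration only):
# compute-optimal training uses roughly 20 tokens per parameter, and
# training compute is approximately C ~= 6 * N * D FLOPs.

def chinchilla_optimal_tokens(n_params: float, tokens_per_param: float = 20.0) -> float:
    """Approximate compute-optimal token count for a model with n_params parameters."""
    return n_params * tokens_per_param

def training_flops(n_params: float, n_tokens: float) -> float:
    """Standard C ~= 6*N*D estimate of training FLOPs."""
    return 6.0 * n_params * n_tokens

if __name__ == "__main__":
    for n in (111e6, 1.3e9, 13e9):
        d = chinchilla_optimal_tokens(n)
        print(f"{n/1e9:5.2f}B params -> ~{d/1e9:6.1f}B tokens, ~{training_flops(n, d):.2e} FLOPs")
```

Under that heuristic the 111M model only sees a couple of billion tokens, which is why people wonder whether simply training it longer would have been more useful.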
I might be missing something, but it looks to me like actually running this "open" model requires special hardware that is only accessible through a cloud subscription with a minimum spend of USD 60,000 per week [1]. Can anyone confirm whether you can run it on your own hardware? If the software is open but the hardware is locked, I don't see the point.
[1] https://www.hpcwire.com/2021/09/16/cerebras-wafer-scale-engi....
EDIT: OK, looks like I missed the Hugging Face repo. The language they use is a bit confusing.
I've been following open-source LLMs for a while, and at first glance this doesn't seem very powerful compared to other open models. Flan-Alpaca [0] is licensed under Apache 2.0 and seems to perform much better, although I'm not sure about the legality of that licensing, since it's basically Flan-T5 fine-tuned on the Alpaca dataset (which is under a non-commercial license).
Nonetheless, it's exciting to see all these open models popping up, and I hope an LLM equivalent of Stable Diffusion comes sooner rather than later.
Does the Chinchilla recipe still hold today? I got the impression that the LLaMA paper reported a different result, where throwing far more tokens at the problem had a very meaningful impact. Or did I misunderstand that?
Of course this is great news, and I hope these models can be fine-tuned into lighter versions of ChatGPT. But I remember reading in the LLaMA paper that a small model can still improve when trained beyond the Chinchilla-optimal budget.
> For instance, although Hoffmann et al. (2022) recommends training a 10B model on 200B tokens, we find that the performance of a 7B model continues to improve even after 1T tokens.
Cerebras says:
> For instance, training a small model with too much data results in diminishing returns and less accuracy gains per FLOP
But this is only a concern when you care about training cost, such as when you are a budget-limited researcher or a company that doesn't deploy models at scale. When you care about the total cost of deployment, making a small model even better with lots of data is a smart move. In the end, what matters more is having the most efficient model at inference time, not the most efficient model to train.
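As a rough illustration of that tradeoff (with an entirely made-up serving volume, using the standard ~6·N·D training and ~2·N per-token inference approximations, not anything from the Cerebras release):

```python
# Illustrative back-of-the-envelope only: hypothetical serving volume,
# C_train ~= 6*N*D and C_infer ~= 2*N FLOPs per generated token.

def total_flops(n_params, train_tokens, served_tokens):
    train = 6.0 * n_params * train_tokens   # one-time training cost
    infer = 2.0 * n_params * served_tokens  # lifetime inference cost
    return train + infer

served = 10e12  # hypothetical: 10T tokens served over the model's lifetime

# A Chinchilla-optimal 13B model vs. a 7B model over-trained on 1T tokens:
chinchilla_13b = total_flops(13e9, 260e9, served)
overtrained_7b = total_flops(7e9, 1e12, served)

print(f"13B, Chinchilla-optimal:  {chinchilla_13b:.2e} total FLOPs")
print(f"7B, trained on 1T tokens: {overtrained_7b:.2e} total FLOPs")
```

With those made-up numbers the over-trained small model comes out ahead once inference dominates, which is the LLaMA-style argument.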
Looking at their charts, it seems like their 6.7B model is considerably worse than GPT-J, an existing open 6B model released a couple of years ago.
I wish that, rather than stopping training at the Chinchilla-optimal point, they had run more data through a small model so we could have something more competitive with LLaMA 7B.
I wonder what led to such a gap between LLaMA 7B and Cerebras-GPT 13B. I hope they discuss it in the paper.
Comparing the 13B model here https://huggingface.co/cerebras/Cerebras-GPT-13B to LLaMA-13B https://github.com/facebookresearch/llama/blob/main/MODEL_CA... you can see that Cerebras-GPT lags behind on all of the reasoning tasks. Is there any reason to use Cerebras-GPT instead of LLaMA? It doesn't seem like it.
You can try out some of these models on Hugging Face here: https://huggingface.co/cerebras/Cerebras-GPT-1.3B
That was the largest one with hosted inference enabled; I'd really like to try this one: https://huggingface.co/cerebras/Cerebras-GPT-13B
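For anyone who wants to poke at the checkpoints locally rather than through the hosted widget, something along these lines should work with the transformers library (a sketch; the 1.3B checkpoint is used here because the 13B one needs much more memory):

```python
# Minimal sketch for running a smaller Cerebras-GPT checkpoint locally
# with the Hugging Face transformers library.
from transformers import AutoTokenizer, AutoModelForCausalLM

model_name = "cerebras/Cerebras-GPT-1.3B"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

inputs = tokenizer("Generative AI is ", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=50, do_sample=False)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```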
> It takes substantial technical expertise to train very large models on GPUs. In the recently released GPT-4 Technical Report, OpenAI credits over thirty contributors just for compute infrastructure and scaling.
This is what's called a silver lining for some (in case you were worried about GPT taking your job). Privacy requirements alone will, in the near term, force major companies to run their own inference (if not training). The expertise required is nearly identical to that needed for running large-scale distributed computational graphs.
This is an interesting divergence from what happened with the web. The backends started out simple, before MapReduce and before deconstructed databases and distributed log processing. With ML, we'll jump right into complex backends in tandem with easy-pickings early-stage edge applications (which we see daily on HN).
What’s in the Pile training data they used? How much source code does it include?
Even though I usually use OpenAI's APIs, just because that is the easiest path, I also use Hugging Face open models (via their APIs and running locally), and I will check out Cerebras as well.
Alternatives are good!
Slightly off-topic:
I remember seeing news about the enormous chip Cerebras was/is selling (pdf https://f.hubspotusercontent30.net/hubfs/8968533/WSE-2%20Dat...).
Has there been any indication that the LLMs released in the last few months use exotic hardware like this, or is it all "standard" hardware?
I wonder if they've done some Alpaca-style training on it... Granted, what made Alpaca useful was that it was fine-tuned with GPT-3's instruction-following completions as examples.
And, at least officially, OpenAI's outputs can't be used to train other AI models.
Otherwise, if GPT-4 outputs were used to fine-tune these models, they might become much more interesting.
A tangential question: as chiplets become increasingly common, I wonder what Cerebras will do to keep their technological advantage in wafer-scale integration. What are the bandwidth and latency of the connections between the tiles? Is there such a thing as bandwidth per unit of frontier length?
"Cerebras open sources seven GPT-3 models from 111 million to 13 billion parameters."
I don't understand why they describe them as GPT-3 models here as opposed to calling them GPT models. Or even LLMs - but I guess that acronym isn't as widely recognized.
Is there a regularly updated repository containing all the releases of LLMs as they happen? TBH, I am tired of having to doommark (doom-bookmark) so many repositories and links... I'd appreciate a collected database.
Cerebras has an efficiency advantage at producing LLMs (assuming the IP is open). This is going to be fun to be a part of.
Noob to ML in practice: do these released model weights, all of them, use a standard file/binary format?
Is it currently possible to fine-tune any of the available foundation models on a few GB of unsupervised text?
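On the format question, the checkpoints on the Hub load through the standard transformers API, so an ordinary causal-LM fine-tuning loop should apply. A minimal sketch, assuming a hypothetical my_corpus.txt and placeholder hyperparameters (not recommendations):

```python
# Minimal sketch of causal-LM fine-tuning on your own plain-text corpus
# with the Hugging Face Trainer. "my_corpus.txt" is a hypothetical file.
from datasets import load_dataset
from transformers import (AutoTokenizer, AutoModelForCausalLM,
                          DataCollatorForLanguageModeling,
                          Trainer, TrainingArguments)

model_name = "cerebras/Cerebras-GPT-111M"  # smallest checkpoint, easiest to experiment with
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token  # GPT-2-style tokenizer has no pad token by default
model = AutoModelForCausalLM.from_pretrained(model_name)

# Load raw text and tokenize it into truncated blocks.
raw = load_dataset("text", data_files={"train": "my_corpus.txt"})

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=512)

tokenized = raw.map(tokenize, batched=True, remove_columns=["text"])

collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="cerebras-gpt-finetuned",
                           per_device_train_batch_size=4,
                           num_train_epochs=1),
    train_dataset=tokenized["train"],
    data_collator=collator,
)
trainer.train()
```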
This “AI spring” is really snowballing with the crazy nouns and terminology. Alpaca, LLaMA, and now Chinchilla??
Has anyone tried this? I have 96GB of GPU memory; will that be enough to run the biggest model?
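As a rough rule of thumb, the weights alone take about 2 bytes per parameter in fp16, so 13B parameters is on the order of 26GB before activations and attention caches; 96GB should be plenty for inference. A back-of-the-envelope sketch:

```python
# Back-of-the-envelope memory estimate for inference (weights only; the
# KV cache and activations add more on top, so treat this as approximate).
n_params = 13e9
bytes_per_param = {"fp32": 4, "fp16": 2, "int8": 1}

for dtype, b in bytes_per_param.items():
    print(f"{dtype}: ~{n_params * b / 2**30:.0f} GiB for weights")
```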
I wonder how decorative our world will become as a consequence of how cheap it will become to make art using AI.
I kind of want 3D marble statues and Baroque art of a future renaissance everywhere. But I wonder if we will turn minimalist as a response.
> Our paper, which will be available soon, will detail our training methods and performance results.
Yay there will be a paper let's gooooooo!
This type of article (or press release, or whatever you want to call it) is exactly what makes the future so interesting.
The cat is out of the bag, the genie is out of the bottle, the confetti has left the cannon[0].
It's tempting to see a world dominated by Google Bard, ChatGPT, Bing Search, etc. And no doubt, they will be huge players, with services that are far more powerful than anything that can be run on the edge.
But. BUT. The things that we can do on the edge are incredible now. Just imagine a year from now, or two. These earth-shattering models, which seem to be upending a whole industry, will soon have equivalents that run on the edge. Without services spying on your data. Without censorship on what the model can/cannot say. Because it's all local.
When was the last time this happened? There will be players who publish weights for models that are free to use. The moment that torrent magnet link is published, it's out in the wild. And smart people will package them as "one click installers" for people who aren't tech-savvy. This is already happening.
So every time you're amazed by something chat-gpt4 says, remember that soon this will be in your pocket.
[0] the "confetti" idiom brought to you by chat-gpt4.