Our results demonstrate that the power of language models can be attributed, to a great extent, to the auto-regressive next-token training scheme, and not necessarily to a particular choice of architecture.
I think that's obvious, isn't it? Neural networks are universal function approximators; the question is how to make them efficient, whether in parameters or computation or whatever, as well as all the usual stuff like encouraging convergence, avoiding exploding gradients, etc. That's why transformers are popular; nobody thinks they can compute some function that other models can't.

I think the case can be overstated, or at least there are problems.
In the prehistory of transformers I was training character-based LSTMs and GRUs to write fake PubMed abstracts of clinical case studies, as a proxy for clinical notes.
This was before conversational models and prompts, and one big problem was that the system always started out in the same state. Holistically, the system has to decide whether the patient has pancreatic cancer or athlete’s foot and settle on a coherent story, but really it just had to pick one of 26 letters to start with and then pick another character. Assuming it spells correctly, it is starting out in a constrained state space, on a knife edge between writing the same abstract over and over and writing gibberish, depending on the temperature.
At the time we thought putting in an initial state as an embedding would have helped, although for reading clinical reports we’d rather have a final state as an embedding.
Knowing what I know now, we should have made up prompts (“Case study of a 23-year-old man with a fractured tibia who presented at the emergency room:”) and stuck them in front of each abstract, but of course that would have meant actually reading the 80,000 abstracts (about 40 person-days of work; our team could have done it in 2 weeks).
The thing is, there is a gap between what is possible and what is practical. A good author has an end in mind, writes something, and rewrites it, and I have been in so many conversations with people speculating about the inference algorithm used by ChatGPT, because so often it seems implausible that you could really get good results token-at-a-time.
Humans have two modes of thinking. What is one plus one? Two! No real thinking involved; answering that question got hard-wired into your brain. This is what we are currently teaching large language models: a hard-wired function from inputs to outputs. What is 51,307 minus 17,469? Now you have to actually start thinking: you have memorized a procedure for this, and you know how to follow that procedure to arrive at the answer.
This is somewhat like chain of thought, where intermediate results - which a human would either keep in memory or write down if it gets to be too much - get dumped out with the output so they can be consumed when later tokens are produced. This could also span several levels, where you first derive the procedure to follow from another procedure you have memorized, and then answer the actual question.
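To make that concrete, here is a toy sketch (plain code, not how any real LLM works) of long subtraction where every intermediate digit and borrow of the memorized procedure is dumped into the output stream so later steps can read it back:

```python
# Toy illustration only: there is no private scratchpad across steps, so each
# intermediate digit and borrow is emitted into the output itself.
# Assumes a >= b for simplicity.

def subtract_with_chain_of_thought(a: int, b: int) -> str:
    output = f"{a} - {b} = ?"        # the "prompt"
    diff, place, borrow = 0, 1, 0
    while a or b:
        d = (a % 10) - (b % 10) - borrow
        borrow = 1 if d < 0 else 0
        d += 10 * borrow
        diff += d * place
        # the intermediate state is appended to the output, not hidden
        output += f" [digit {d}, borrow {borrow}]"
        a, b, place = a // 10, b // 10, place * 10
    return output + f" answer {diff}"

print(subtract_with_chain_of_thought(51307, 17469))
# -> ends with: "answer 33838"
```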
And how you solve problems can change over time. When you first learn a thing, you usually learn a procedure to follow. When you first learn to write letters, you consciously draw a sequence of lines and arcs. After enough practice this becomes hard-wired and you just write the letter. When you do not do something for a long time, it may go the other direction: you might still remember the procedure and be able to do it step by step, but you can no longer just do it.
So what is the point of this comment? While you might in principle be able to learn any function from inputs to outputs - that is, create a large language model that can produce the correct answer for every question without really thinking about it - I do not think that this is practicable. Every time a human would follow a step-by-step procedure, the model would essentially have to learn the completely unrolled procedure in order to produce the answer in one pass. Feed-forward neural networks have no loops unless you externally feed the output back into the input.
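A minimal sketch of that last point (toy code, not a neural network): a feed-forward pass is a fixed composition of steps, so its depth caps how many procedure steps it can execute, while feeding the output back in externally turns the same step into a loop that runs as long as the problem needs:

```python
def step(n: int) -> int:
    # one hard-wired step of a toy procedure (halve if even, else 3n + 1)
    return n // 2 if n % 2 == 0 else 3 * n + 1

def fixed_depth(n: int, depth: int = 3) -> int:
    # "unrolled" version: can only ever execute `depth` steps in one pass
    for _ in range(depth):
        n = step(n)
    return n

def fed_back(n: int) -> list[int]:
    # autoregressive version: keep feeding the output back until done;
    # note that every intermediate state ends up in the visible trace
    trace = [n]
    while n != 1:
        n = step(n)
        trace.append(n)
    return trace

print(fixed_depth(27))        # stops after 3 steps, far from finished
print(len(fed_back(27)) - 1)  # the loop runs as many steps as the input needs
```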
That also means you necessarily have to mix the intermediate internal states with the output of an autoregressive feed-forward system, i.e. it will never be able to behave like a human, as it will constantly have to output its internal thought process. If you want to mimic a human response, you will have to hide part of the autoregression internally.
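One hypothetical way to do that hiding (a sketch of the idea, not any real system's mechanism; the markers and transcript below are made up): let the scratchpad tokens sit between markers in the autoregressive transcript and strip those spans before showing the user.

```python
# Everything between the markers stays part of the model's own context,
# but only the text outside them is displayed.
HIDDEN_OPEN, HIDDEN_CLOSE = "<scratch>", "</scratch>"

def user_visible(transcript: str) -> str:
    """Return the transcript with everything between hidden markers removed."""
    visible, hiding = [], False
    for token in transcript.split():
        if token == HIDDEN_OPEN:
            hiding = True
        elif token == HIDDEN_CLOSE:
            hiding = False
        elif not hiding:
            visible.append(token)
    return " ".join(visible)

transcript = ("<scratch> 51307 - 17469 : ones 8 borrow, tens 3 borrow, "
              "result 33838 </scratch> The answer is 33838.")
print(user_visible(transcript))   # -> The answer is 33838.
```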
Not sure about all of the details, but this is an interesting idea focusing on how auto-regressive models can be thought of as learning how to split a difficult task into a series of simpler tasks.
Makes me wonder if that's the magic in denoising autoencoders, too, since they are trained basically to learn how to build an image auto-regressively from more to less noise.
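The loop structure does look similar. Here is a purely illustrative sketch (no learned denoiser here; the "denoiser" just nudges toward a known clean signal as a stand-in) of refining an estimate from more noise to less noise, each pass conditioned on the previous one, much like each autoregressive step conditions on everything generated so far:

```python
import numpy as np

rng = np.random.default_rng(0)
clean = np.sin(np.linspace(0, 2 * np.pi, 64))      # the "image" we want
x = clean + rng.normal(0.0, 1.0, size=64)          # start from heavy noise

def denoise_step(estimate, target, strength=0.1):
    # stand-in for a learned denoising step: move a fraction toward the target
    return estimate + strength * (target - estimate)

for t in range(50):                                 # more noise -> less noise
    x = denoise_step(x, clean)

print(np.abs(x - clean).mean())                     # error shrinks step by step
```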
Yet https://news.ycombinator.com/item?id=37621999 (posted 2 hours ago) :)
> Our results demonstrate that the power of language models can be attributed, to a great extent, to the auto-regressive next-token training scheme, and not necessarily to a particular choice of architecture.
I would have hoped they would attribute LLM success to the structure of language itself. As the authors say, even small linear models can approximate CoT and solve complex tasks. So it's not the model. It's the data.
Analogously, humans have very different brains when you look at the low level, but every brain still learns the same languages and skills about as well as any other brain. It's not the brain or the neural net (the models) but the data that shapes them to become smart.
This insight has consequences for how we view training data and where to focus our work to improve AI and human brains: improve language, ideas, and chains of thought. This resonates with recent results in fine-tuning and in training small models like phi-1 and phi-1.5, which were trained on "textbook quality" data of high diversity.