I have seen people raise some skeptical points about LLMs. I thought I would write up some of my thoughts in response:
First, Transformers (the architecture that many LLMs use) are capable of much more than people think. They can execute algorithms whose memory is bounded by the context window length. This requires encoding the algorithm as a chain of thought (because the circuit depth available for the computation is limited by the number of layers); and it requires using the context window to record an “execution trace” of the algorithm, along with any memory it uses.
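To make the “execution trace” idea concrete, here is a toy sketch (my own illustration, using an arbitrarily chosen algorithm) of the kind of trace a model could be trained to emit: every intermediate state of a bubble sort is written out as text, so the context window doubles as the algorithm’s working memory.

```python
def bubble_sort_trace(xs):
    """Return a text "execution trace" of bubble sort, one line per step."""
    xs = list(xs)
    lines = [f"input: {xs}"]
    for i in range(len(xs)):
        for j in range(len(xs) - 1 - i):
            if xs[j] > xs[j + 1]:
                xs[j], xs[j + 1] = xs[j + 1], xs[j]
            # Each comparison and the full working memory are written out explicitly.
            lines.append(f"pass {i}, compare positions {j},{j + 1}: {xs}")
    lines.append(f"output: {xs}")
    return "\n".join(lines)

print(bubble_sort_trace([3, 1, 2]))
```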
People have furthermore proved that even a linear autoregressive neural net with just one non-linearity applied to choose the next token to output (a final softmax at temperature 0, i.e. argmax) can be trained to execute arbitrary algorithms. (To be a little more precise: given an algorithm to run, and given a sampling distribution over inputs, there exists an encoding of the algorithm as chains of thought such that, for every epsilon > 0, we can train that quasi-linear model for polynomial(1/epsilon, n) steps until it correctly maps inputs to outputs when running the algorithm for n steps, with probability > 1 - epsilon over the sampling distribution.) There are further theorems showing that you can pin down the weights of the model (up to some isomorphism class, and to within some specified error bound) by choosing the training data distribution appropriately, and the resulting model should then work for just about any data distribution.
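To give a feel for how stripped-down that model class is, here is a minimal sketch (my own toy illustration, with made-up vocabulary and context sizes and random, untrained weights) of a linear autoregressive model whose only non-linearity is the final argmax used to pick the next token. It shows the structure of the model, not a trained instance.

```python
import numpy as np

VOCAB = 16    # hypothetical vocabulary size
CONTEXT = 32  # hypothetical context window length

rng = np.random.default_rng(0)
W = rng.normal(size=(CONTEXT * VOCAB, VOCAB))  # the only learned parameters

def next_token(context_tokens):
    """Greedy (temperature 0) decoding: linear logits, then argmax."""
    x = np.zeros((CONTEXT, VOCAB))
    x[np.arange(len(context_tokens)), context_tokens] = 1.0  # one-hot encode the context
    logits = x.reshape(-1) @ W   # purely linear in the context
    return int(np.argmax(logits))  # argmax is the lone non-linearity

def generate(prompt_tokens, n_steps):
    """Autoregressively extend the prompt, feeding outputs back in as inputs."""
    tokens = list(prompt_tokens)
    for _ in range(n_steps):
        tokens.append(next_token(tokens[-CONTEXT:]))
    return tokens

print(generate([1, 2, 3], 5))  # meaningless output here, since W is random
```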
The name of the game, then, is choosing the right training data. It appears we will soon run out of human-created data; however, synthetic data of various kinds will allow things to keep going further.
When synthetic data is mentioned, people bring up the dangers of feeding a model’s output back into itself again and again, with noise amplifying at every stage (like the game of telephone, or “Chinese whispers”). There are at least two objections to that. The first is that there are results showing things don’t degrade so badly as long as you mix in some fresh data at every step.
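To give a feel for both halves of that claim, here is a toy demonstration (my own illustration, not taken from the results alluded to above). The “model” is just the empirical distribution over a small token vocabulary; each generation retrains on samples from the previous model, optionally blended with freshly drawn real data. Under pure self-training, any token that happens to be missed in one generation is gone for good, so diversity only shrinks; mixing in fresh data keeps re-introducing it.

```python
import numpy as np

rng = np.random.default_rng(0)
VOCAB = 20
REAL_PROBS = np.full(VOCAB, 1 / VOCAB)  # the "true" data distribution: uniform over tokens

def run(generations=200, n=50, fresh_fraction=0.0):
    data = rng.choice(VOCAB, size=n, p=REAL_PROBS)
    for _ in range(generations):
        counts = np.bincount(data, minlength=VOCAB)
        model_probs = counts / counts.sum()                   # "train": fit the empirical distribution
        synthetic = rng.choice(VOCAB, size=n, p=model_probs)  # "sample": generate synthetic data
        n_fresh = int(fresh_fraction * n)
        fresh = rng.choice(VOCAB, size=n_fresh, p=REAL_PROBS)
        data = np.concatenate([fresh, synthetic[: n - n_fresh]])
    return len(np.unique(data))  # how many distinct tokens survive in the final data

print("distinct tokens with no fresh data: ", run(fresh_fraction=0.0))
print("distinct tokens with 30% fresh data:", run(fresh_fraction=0.3))
```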
The second objection is that it depends on what kind of synthetic data we are talking about. For example, systems that learn to get better and better at chess by playing against themselves don’t seem to suffer from this problem. People then like to claim that this must be because you’re adding some “external grounding” in the form of a verifier that checks that the rules of chess are being adhered to. However, there isn’t much information inherent in those rules. You can write them down on a napkin; and an LLM can even be trained to act as the verifier, and verify with nearly 100% accuracy.
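To underline how little the verifier contributes relative to the skill being learned, here is a minimal sketch (mine; it uses the third-party python-chess package and a random-move policy as stand-ins) of generating self-play data where the only “external grounding” is a legality check:

```python
import random
import chess  # third-party python-chess package, acting here as the rules verifier

def random_selfplay_game(max_moves=200):
    """Generate one self-play game in which every move is checked against the rules."""
    board = chess.Board()
    moves = []
    while not board.is_game_over() and len(moves) < max_moves:
        move = random.choice(list(board.legal_moves))  # a (very weak) stand-in policy
        assert board.is_legal(move)                    # the verifier: cheap, exact, and tiny
        board.push(move)
        moves.append(move.uci())
    return moves, board.result(claim_draw=True)

moves, result = random_selfplay_game()
print(len(moves), "moves, result:", result)
```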
This is where I make a distinction between two kinds of synthetic data: there is data that is predominantly declarative knowledge (like: what is the tallest mountain in the world?), which can’t just be deduced from other knowledge; and then there is procedural knowledge, i.e. knowledge about how to do things (more efficiently).
If you keep feeding a model’s output back into itself as input, its declarative knowledge might degrade; but you can produce an infinite supply of certain kinds of synthetic training data that should amplify its procedural knowledge.
How does this work? In a lot of cases it comes down to the fact that verifying a solution is easier than coming up with one in the first place. For example, sorting a list of numbers is generally harder than checking that a list is sorted. A model may have the skill of verifying that a list is sorted, yet lack the skill of efficiently sorting one. (Given enough time, and given its ability to verify that a list is sorted, it may be able to logically deduce what the sorted list has to be; but it may not be able to do this efficiently. Similarly, a chess program may be able to play a perfect game given the ability to verify that the rules are being adhered to, but it may not be able to do so efficiently.)
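As a small, concrete instance of that gap (my own illustration), here is a one-pass verifier next to a deliberately naive solver that relies on nothing but the verifier, trying permutations until one passes the check:

```python
from itertools import permutations

def is_sorted(xs):
    """Verifier: a single O(n) pass over the list."""
    return all(a <= b for a, b in zip(xs, xs[1:]))

def sort_by_search(xs):
    """A deliberately inefficient solver that only uses the verifier:
    try permutations until one passes the check (O(n!) in the worst case)."""
    for candidate in permutations(xs):
        if is_sorted(candidate):
            return list(candidate)

print(is_sorted([1, 2, 3]), is_sorted([3, 1, 2]))  # True False
print(sort_by_search([3, 1, 2]))                   # [1, 2, 3]
```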
Another set of methods, which goes by the term “backtranslation”, relies on the fact that in some cases it’s much easier to generate (input, output) training examples than to efficiently map an arbitrary input to an output. An example of this is generating anagram puzzle training examples: just pick a random word, like “tomato”, then scramble the letters to get, say, “atomot”. Now you have a training example (atomot, tomato). Here, you can be a good teacher, producing lots and lots of training examples, without being a very good student: just knowing how to generate examples doesn’t imply you can efficiently solve someone else’s example problems.
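A sketch of that generator (mine; the word list is a placeholder, any vocabulary would do): producing a (scrambled, answer) pair is trivial, even though unscrambling an arbitrary puzzle is the hard direction the model is being trained on.

```python
import random

WORDS = ["tomato", "planet", "signal", "rhythm"]  # stand-in vocabulary

def make_example(rng=random):
    """Generate one (puzzle, answer) training pair by scrambling a known word."""
    word = rng.choice(WORDS)
    letters = list(word)
    rng.shuffle(letters)
    return "".join(letters), word  # (input, output) training pair

for _ in range(3):
    print(make_example())  # e.g. ('atomot', 'tomato')
```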
Generating example problems in a field like medicine or even creative writing is maybe harder, since there aren’t always easy-to-verify right answers. However, perhaps lots and lots of “unit tests” can serve as a verifier in this case. (Is this how that secret model OpenAI built that is supposedly good at writing short stories works?)
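For the most literal reading of “unit tests as a verifier” (my own sketch, with a trivial stand-in for a model-generated candidate): a candidate solution is accepted as synthetic training data only if it passes every test, which gives a cheap, automatic check even when no single right answer is written down anywhere.

```python
# Stand-in for model-generated code that we want to verify.
candidate_source = """
def solve(xs):
    return sorted(xs)
"""

def passes_unit_tests(source):
    """Accept a candidate only if it passes every test in the suite."""
    namespace = {}
    exec(source, namespace)  # run the candidate code to define solve()
    solve = namespace["solve"]
    tests = [([3, 1, 2], [1, 2, 3]), ([], []), ([5, 5, 1], [1, 5, 5])]
    return all(solve(inp) == out for inp, out in tests)

print(passes_unit_tests(candidate_source))  # True
```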
Methods like the above, along with the recent work on reasoning models, make me think there is still a lot of juice left in existing approaches – no breakthroughs required. When I reflect on all this, I come away thinking things will continue to improve at a breathtaking pace over the next several years.
“Onboarding” will be an important theme going forward. I think that perhaps training models to do much better in-context learning over, say, 20,000-page “textbooks” will help with that. Making this all run efficiently might be the main challenge here. That might be where a breakthrough is required.