This was always my mental model. If you have a process with N steps where your probability of getting a step right is p, your chance of success is p^N, which goes to 0 as N → ∞.
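To put numbers on that (a minimal sketch; the 98% per-step rate below is made up): even a seemingly good per-step success rate collapses quickly when compounded, and it gives you a natural "half-life" in steps.

```python
import math

p = 0.98  # assumed per-step success probability (illustrative only)

# Overall success for an N-step task is p**N.
for n in (10, 35, 100, 300):
    print(f"{n:4d} steps -> {p**n:6.1%} chance of end-to-end success")

# "Half-life" in steps: solve p**N = 0.5 for N.
half_life = math.log(0.5) / math.log(p)
print(f"success rate halves roughly every {half_life:.0f} steps")
```

With those numbers a 35-step task already fails more often than it succeeds.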
It affects people too. Something I learned halfway through a theoretical physics PhD in the 1990s was that a 50-page paper with a complex calculation almost certainly had a serious mistake in it that you'd find if you went over it line-by-line.
I thought I could counter that by building a set of unit tests and integration tests around the calculation, and on one level that worked. But in the end my calculation never got published outside my thesis, because our formulation of the problem turned a topological circle into a helix and we had no idea how to compute the associated topological factor.
As long as LLMs have no true memory, this is expected. Think about the movie Memento. That is the experience for an LLM.
What could any human do with a context window of 10 minutes and no other memory? You could write yourself notes… but you might not see them, because soon you won't know they are there. So maybe tattoo them on your body…
You could likely do a lot of things. Just follow a recipe and cook. Drive to work. But could you drive to the hardware store and get some stuff you need to build that IKEA furniture? Might be too much context.
I think solving memory is solving AGI.
This is another reason why there's no point in carefully constructing prompts and contexts trying to coax the right solution out of an LLM. The end result becomes more brittle with time.
If you can't zero-shot your way to success, the LLM simply doesn't have enough training for your problem and you need a human touch or slightly different trigger words. There have been times where I've gotten a solution with such a minimal prompt that it practically feels like the LLM read my mind; that's the vibe.
My prediction is that the general limitation of multi-step agents is the quality of the reward function. Think of LLMs as throwing shit at the wall and seeing if it sticks, but unlike traditional brute force (which is "random" in the output space) we have a heuristic-guided search with much higher expected results. Even a well-guided heuristic, though, tapers off to noise after many steps without checkpoints or corrections in global space. This is why AI crushes most games, but gets a digital stroke and goes in circles on complicated problems.
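A toy simulation of that taper, with made-up numbers and nothing model-specific: compounding a small per-step error rate over a long run collapses the success rate, while even an imperfect correction loop stretches it out a lot.

```python
import random

random.seed(0)

P_STEP = 0.97     # assumed probability a single step is right (made up)
STEPS = 200       # length of the task
TRIALS = 10_000

def run(p_catch: float) -> bool:
    """One simulated task: every step must be right, unless a correction catches the slip."""
    for _ in range(STEPS):
        if random.random() >= P_STEP:        # the step went wrong...
            if random.random() >= p_catch:   # ...and no correction caught it
                return False
    return True

for p_catch in (0.0, 0.8):
    wins = sum(run(p_catch) for _ in range(TRIALS))
    print(f"p_catch={p_catch:.1f} -> {wins / TRIALS:.1%} of {STEPS}-step runs succeed")
```

With these numbers the uncorrected runs succeed well under 1% of the time over 200 steps, while an 80%-effective correction loop gets you to roughly a third.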
Anyway, what I think this means is that you will find AI agents continuing to colonize spaces with meaningful local and global reward functions. But most importantly, it likely means that complex problem spaces will see only marginal improvements (where are all these new math theorems we were promised many months ago?).
It's also very tempting to say "ah, but we can just make or even generate reward functions for those problems and train the AI". I suspect this won't happen, because if there were simple functions, we'd have discovered them already. Software engineering is one such mystery, and the reason I love it. Every year we come up with new ideas and patterns. Many think they will solve all our problems, or at least consistently guide us in the right direction. And yet here we are, debating language features, design patterns, tooling, UX, and so on. The vast majority of easy truths have already been found. The rest are either complex or hard to find. Even when we think we've found one, it often takes man-decades to conclude that it wasn't even a good idea. And those truths are certainly not inferable from existing training data.
The amusing things LLMs do when they have been at a problem for some time and cannot fix it:
- Removing problematic tests altogether
- Making up libs
- Providing a stub and asking you to fill in the code
I've noticed a lot of AI agents start off doing pretty well, but the longer they run, the more they seem to drift. It's like they forget what they were supposed to do in the first place.
Is this just a context limitation, or are they missing some kind of self-correction loop? Curious if anyone has seen agents that can catch their own mistakes and adjust during a task. Would love to hear how far that has come.
So as the space of possible decisions increases, so does the likelihood that the model ends up making bad "decisions". And what is the correlation between the increase in "survival rate" and the increase in model parameters, compute power, and memory (context)?
I suspect it's because the average code online is flawed. The flaws are trained into LLMs and this manifests as an error rate which compounds.
I saw some results showing that LLMs struggle to complete tasks which would take longer than a day. I wonder if the average developer, individually, would be much better if they had to write the software on their own.
The average dev today is very specialized and their code is optimized for job security, not for correctness and not for producing succinct code which maps directly to functionality.
Interesting.
So if you project outwards a while, you hit around 10,000 hours about 6 years from now.
Is that a reasonable timeline for ASI?
It's got more of a rationale behind it than other methods perhaps?
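For what it's worth, that projection is just a log2 away, assuming the roughly-7-month doubling time of the task horizon from the Kwa et al. paper and some assumed current horizon (both are assumptions here, and the starting values below are made up):

```python
import math

DOUBLING_MONTHS = 7        # approximate doubling time of the task horizon (assumed)
TARGET_HOURS = 10_000

for current_hours in (1, 4, 16):   # hypothetical starting horizons
    doublings = math.log2(TARGET_HOURS / current_hours)
    years = doublings * DOUBLING_MONTHS / 12
    print(f"from a {current_hours}h horizon: {doublings:.1f} doublings ~ {years:.1f} years")
```

A starting horizon in the single-digit-hours range lands in the 6-to-7-year ballpark; a 1-hour horizon pushes it closer to 8.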
I've seen the same thing using AI for coding. It helps a lot at first, but after a while it starts doing weird stuff or undoing things that were already fixed. Now I treat it like a junior dev. I try to keep tasks small, reset often, and check everything. Still useful, just needs babysitting.
Speaking of the Kwa et al. paper, is there a site that updates the results as new LLMs come out?
I don't think this has anything to do with AI. There's a half life for success rates.
Another article on an xyz problem with LLMs, which will probably be solved by model advancements in 6 to 12 months.
This very much aligns with my experience. I had a case yesterday where Opus was trying to do something with a library, and it encountered a build error. Rather than fix the error, it decided to switch to another library. It then encountered another error and decided to switch back to the first library.
I don't think I've encountered a case where I've just let the LLM churn for more than a few minutes and gotten a good result. If it doesn't solve an issue on the first or second pass, it seems to rapidly start making things up, making totally unrelated changes while claiming they'll fix the issue, or trying the same thing over and over.