> TL:DR: So, how's it doing? Well, pretty badly. Worse than a 6-year-old would, definitely not PhD-level.
This is what happens when you try to apply a hyped technology (LLMs) to every problem, especially at a company that has amassed too much hype too quickly.
The limits of the technology mean that Claude has very little memory to plan with in the game, which is why it is obviously struggling. Lifting those limits would cost Anthropic an enormous amount of money and compute.
So if LLMs are unable to beat this game efficiently as a test of planning and reasoning, what hope is there for them in the much more challenging and complex scenarios that so-called "AGI" requires?
The most important sentence in this article is this:
>> ...some new paradigm is yet required for them to be right.
A notable error in the article:
> The second attempt got all the way to Vermilion City, finding a way through the infamous Mt. Moon maze and achieving two badges, so pretty close to the benchmark.
It did not make it to Vermilion City: it got Misty's badge (with some fun battle RNG), then got stuck in Cerulean City and could not get out. The next objective was to go north to Bill's House to get the S.S. Anne ticket, which is required before going to Vermilion City, but it just couldn't do that.
Given the number of loops in this livestream, I am somewhat skeptical of that benchmark results chart. There's no way it made it to Vermilion, cleared the S.S. Anne for HM Cut, and also beat Surge within the number of actions the chart implies.