Getting 50% (SoTA) on Arc-AGI with GPT-4o

  • (ARC Prize co-founder here).

    Ryan's work is legitimately interesting and novel "LLM reasoning" research! The core idea:

    > get GPT-4o to generate around 8,000 python programs which attempt to implement the transformation, select a program which is right on all the examples (usually there are 3 examples), and then submit the output this function produces when applied to the additional test input(s)

    Roughly, he's implemented an outer loop, using 4o to sample reasoning traces/programs conditioned on the training examples and then testing them. Hybrid DL + program synthesis approaches are solutions we'd love to see more of.
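
    In pseudocode, the loop is roughly this (helper names are illustrative, not Ryan's actual code):

      from typing import Callable, List, Optional

      Grid = List[List[int]]

      def solve_task(task: dict,
                     sample_program: Callable[[list], Optional[Callable[[Grid], Grid]]],
                     n_samples: int = 8000) -> Optional[List[Grid]]:
          # `sample_program` stands in for a GPT-4o call that returns a compiled
          # transformation function, or None if the completion didn't parse.
          for _ in range(n_samples):
              fn = sample_program(task["train"])
              if fn is None:
                  continue
              try:
                  # keep a program only if it reproduces every training example
                  if all(fn(ex["input"]) == ex["output"] for ex in task["train"]):
                      return [fn(t["input"]) for t in task["test"]]
              except Exception:
                  continue  # candidate crashed on some grid; discard it
          return None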

    A couple important notes:

    1. This result is on the public eval set, not the private set (ARC Prize $).

    2. The current private-set SOTA (~35%) solution also scored ~50% on the public set, so this new result may be SOTA but hasn't been validated or scrutinized yet.

    All said, I do expect verified public set results to flow down to the private set over time. We'll be publishing all the SOTA scores and open source reproductions here once available: https://arcprize.org/leaderboard

    EDIT: also, congrats and kudos to Ryan for achieving this and putting in the effort to document and share his approach. We hope to inspire more frontier AI research sharing like this.

  • The article jumps to the conclusion "Given that current LLMs can perform decently well on ARC-AGI..." after using multiple hand-crafted tricks to get these results, including "I also did a small amount of iteration on a 100 problem subset of the public test set", which is buried in the middle of the article and not mentioned in the bullet list at the top.

    Add to that the near ad hominem attack on François Chollet with the comic at the beginning (François never claimed to be a neuro-symbolic believer), and this work does a significant disservice to the community.

  • Very cool. When GPT-4 first came out I tried some very naive approaches using JSON representations of the puzzles [0], [1]. GPT-4 did "okay", but in some cases it felt like it was falling for the classic LLM issue of saying all the right things but then failing to grasp some critical bit of logic and missing the solution entirely.
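
    For reference, the tasks in the public repo are already basically JSON: nested lists of colour indices, so a naive prompt can just inline them. Roughly this shape (grids abbreviated; each cell is a colour index 0-9):

      task = {
          "train": [
              {"input":  [[0, 0, 1],
                          [0, 1, 0]],
               "output": [[1, 0, 0],
                          [0, 1, 0]]},
              # ...usually around 3 train pairs
          ],
          "test": [
              {"input": [[1, 0, 0],
                         [0, 0, 1]]},
          ],
      }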

    At the time I noticed that many of the ARC problems rely on visual-spatial priors that are "obvious" when viewing the grids, but become less so when transmuted to some other representation. Many of them rely on some kind of symmetry, counting, or the very human bias to assume a velocity or continued movement when seeing particular patterns.

    I had always thought maybe multimodality was key: the model needs to have similar priors around grounded physical spaces and movement to be able to do well. I'm not sure the OP really fleshes this line of thinking out; brute-forcing Python solutions is a very "non-human" approach.

    [0] https://x.com/eatpraydiehard/status/1632671307254099968

    [1] https://x.com/eatpraydiehard/status/1632683214329479169

  • I'll say what a lot of people seem to be denying: GPT-4 is an AGI, just a very bad one. Even GPT-1 was an AGI. There isn't a hard boundary between non-AGI and AGI. A lot of people wish there were, so they imagine absolutes regarding LLMs like "they cannot create anything new". Just think: we consider humans a general intelligence, but obviously wouldn't consider an embryo or infant a general intelligence. So at what point does a human go from not generally intelligent to generally intelligent? And I don't mean an age or brain size, I mean a suite of testable abilities.

    Intelligence is an ability that is naturally gradual and emerges over many domains. It is a collection of tools via which general abstractive principles can be applied, not a singular universally applicable ability to think in abstractions. GPT-4, compared to a human, is a very, very small brain trained for the single purpose of textual thinking, with some image capabilities. Claiming that ARC is the absolute marker of general intelligence fails to account for the big picture of what intelligence is.

  • Having tons of people employ human ingenuity to manipulate existing LLMs into passing this one benchmark kind of defeats the purpose of testing for "AGI". The author points this out himself: it's more of a pattern-matching test.

    Though on the other hand, figuring out which manipulations are effective does teach us something. And since I think most problems boil down to pattern matching, creating a true, easily testable AGI test may be tough.

  • To me the big take-aways here are:

    1) Most of the heavy lifting is being done by search. We're talking about having the LLM generate thousands of candidate solutions, and they're mostly bad enough that "just pick the ones that get kinda close on the examples" is a meaningful operation (see the scoring sketch after this list).

    2) More sampling improves performance despite the fact that GPT-4o's vision is not capable of parsing the inputs. I'm curious how much performance would degrade if you shuffled the images passed to the model (but used the correct images when evaluating which candidates to keep).

    3) It's definitely true that the LLM has to be giving you something more than random programs. At the very least, the LLM knows how to craft parsimonious programs that are more likely to be the solution. It may be that it's providing more than that, but it's not clear to me exactly how much information on the correct search space is coming from the hand-crafted examples in the prompt.
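
    To make point 1 concrete, "kinda close" can be as crude as cell-level accuracy across the examples. A minimal sketch of that kind of scoring, assuming `candidates` holds the generated programs and `train` the example pairs (my guess at the shape of it, not the article's actual code):

      def score(fn, train_examples) -> float:
          # fraction of output cells the candidate program gets right
          total = correct = 0
          for ex in train_examples:
              want = ex["output"]
              total += sum(len(row) for row in want)
              try:
                  got = fn(ex["input"])
              except Exception:
                  continue  # crash: zero credit on this example
              if not isinstance(got, list) or len(got) != len(want):
                  continue  # wrong shape: zero credit
              for grow, wrow in zip(got, want):
                  if len(grow) == len(wrow):
                      correct += sum(g == w for g, w in zip(grow, wrow))
          return correct / total if total else 0.0

      # rank thousands of candidates, keep the best few for submission/revision
      top_k = sorted(candidates, key=lambda fn: score(fn, train), reverse=True)[:12]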

    Overall, the work to get this far is very impressive, but it doesn't really move the needle for me on whether GPT-4 can do ARC puzzles. It does, however, show me that search is surprisingly powerful on this task.

  • Seems more that ARC-AGI is flawed than that GPT-4o is AGI.

    Maybe an AI version of Hanlon's razor: never attribute to AGI what can be easily explained by being in the training set.

  • When we talk about system 2: is it possible that [generating a large number of programs; evaluating them on the task; choosing the top-K outcomes; feeding those back to the neural net] can act as system 2 for an AGI? Isn't that how we think intelligently as well: by making lots of hypotheses internally, evaluating them, and updating our model?
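
    A crude sketch of what that loop could look like (`llm` and `score` are stand-ins, not real APIs):

      def system2(task, llm, score, rounds=3, k=8, samples=64):
          # hypothesize: sample an initial pool of candidate programs
          pool = [llm(task) for _ in range(samples)]
          for _ in range(rounds):
              # evaluate, then keep the top K
              pool.sort(key=lambda prog: score(prog, task), reverse=True)
              top = pool[:k]
              # feed the near-misses back so the model can revise them
              pool = top + [llm(task, feedback=prog) for prog in top]
          return max(pool, key=lambda prog: score(prog, task))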

  • We don't actually know if it is SOTA; the previous SOTA solution also got around the same score on the public evaluation set.

  • >> Claim 1 seems likely true to me for a reasonable notion of “learning”. I think François Chollet agrees here. Most of my doubts about this claim are concerns that you can basically brute force ARC-AGI without interestingly doing learning (e.g. brute-force search over some sort of DSL or training on a huge array of very similar problems). These concerns apply much less to the kind of approach I used

    The approach described in the article is exactly "brute-force search over some sort of DSL". The "DSL" is a model of Python syntax that GPT-4o has learned after training on the entire internet. This "DSL" is locked up in the black box of GPT-4o's weights, but just because no-one can see it, it doesn't mean it's not there; and we can see GPT-4o generating Python programs, so we know it is there, even if we don't know what it looks like.

    That DSL may not be "domain specific" in the sense of being specifically tailored to solve ARC-AGI tasks, or any other particular task, but it is "domain specific" in the sense of generating Python programs from some subset of all possible Python programs that includes programs that can solve some ARC-AGI tasks. That's a very broad category, but that's why it over-generates so much: it needs to draw around 8k samples per task to find one that works, and even then it only solves 50% of the public eval set.

  • > 50% accuracy on the public test set for ARC-AGI by having GPT-4o

    Isn't the public test set available on GitHub, and therefore likely in GPT-4o's training data?

  • The ARC stuff just felt intuitively wrong as soon as I heard it. I don't find any of Chollet's critiques of LLMs to be convincing. It's almost as if he's being overly negative about them to make a point, or to push back against all the unbridled optimism. The problem is, the optimism really seems to be justified, and the rate of improvement of LLMs in the past 12 months has been nothing short of astonishing.

    So it's not at all surprising to me to see Arc already being mostly solved using existing models, just with different prompting techniques and some tool usage. At some point, the naysayers about LLMs are going to have to confront the problem that, if they are right about LLMs not really thinking/understanding/being sentient, then a very large percentage of people living today are also not thinking/understanding/sentient!

  • You can have a go at the problems here: https://arcprize.org/play?task=00576224

    None of them are terribly hard, but some aren't trivial either; a couple took me a bit of thinking to work out. By far the most tedious part is inputting the result (I didn't bother after the first), which is definitely something AI is better at!

  • This challenge looks quite solvable, but it relies on physics understanding and has a lot of human/world priors in the sense of spatial understanding and object boundaries.

    Seems like it relies on identifying objects and then mapping them somehow. Most of the cases I've seen so far are based on some transformation of, or relation between, the objects.

    So far it seems like a search among common transformations and relations could solve it, plus some heuristics/computation for counting, order, wholeness (boundaries), or patterns.

    IMO it can most likely be solved by a search over programs that combine these, plus an LLM to guide the heuristics.
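
    Something like this toy brute-force search over compositions, say (the primitive set here is invented for illustration):

      from itertools import product

      PRIMS = {
          "fliph":     lambda g: [row[::-1] for row in g],
          "flipv":     lambda g: g[::-1],
          "transpose": lambda g: [list(r) for r in zip(*g)],
      }

      def search(train, max_depth=3):
          # try every composition of primitives up to max_depth
          for depth in range(1, max_depth + 1):
              for names in product(PRIMS, repeat=depth):
                  def apply(g, names=names):
                      for n in names:
                          g = PRIMS[n](g)
                      return g
                  if all(apply(ex["input"]) == ex["output"] for ex in train):
                      return names  # first composition that fits all examples
          return None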

    The only hard ones were one with applied noise and one testing understanding of "gravity".

    Did anyone test a human baseline for this?

  • François Chollet says LLMs do not learn in-context. But Geoff Hinton says LLMs' few-shot learning compares quite favorably with people!

    https://www.youtube.com/watch?v=QWWgr2rN45o&t=46m20s

    The truth is in the middle, I think. They learn in-context, but not as well as humans.

    The approach in the article hides the unreliability of current LLMs by generating thousands of programs, and still the results aren't human-level. (This is impressive work though -- I'm not criticizing it.)

  • FWIW GPT-4 is able to generate a plan very similar to the one in the article: it also involves feature extraction, program synthesis, and iterative refinement.

    https://chatgpt.com/share/2fde1db5-00cf-404d-9ae5-192aa5ac90...

    So it's pretty close to being able to plan a solution completely on its own. It's just rather bad at coding and visual inputs, so it doesn't know what it doesn't know.

  • "Vision is an especially large weakness."

    But you can have GPT write code to reliably convert the image grid into a textual representation, right? And code to convert back to an image and auto-verify.
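
    The auto-verify part is then just a round trip. A sketch, assuming we render the grid ourselves so the palette and cell size are known (both the palette subset and the cell size below are made up):

      from PIL import Image

      PALETTE = {0: (0, 0, 0), 1: (0, 116, 217), 2: (255, 65, 54)}  # illustrative subset
      CELL = 20  # pixels per cell

      def grid_to_image(grid):
          img = Image.new("RGB", (len(grid[0]) * CELL, len(grid) * CELL))
          for r, row in enumerate(grid):
              for c, v in enumerate(row):
                  for y in range(r * CELL, (r + 1) * CELL):
                      for x in range(c * CELL, (c + 1) * CELL):
                          img.putpixel((x, y), PALETTE[v])
          return img

      def image_to_grid(img, rows, cols):
          # sample the centre pixel of each cell, map the colour back to an index
          rgb_to_idx = {v: k for k, v in PALETTE.items()}
          return [[rgb_to_idx[img.getpixel((c * CELL + CELL // 2, r * CELL + CELL // 2))]
                   for c in range(cols)] for r in range(rows)]

      def round_trips(grid):
          return image_to_grid(grid_to_image(grid), len(grid), len(grid[0])) == grid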

  • Amazing work, prompt engineering at its finest. One future direction for ARC-AGI could be to use not Python but a much more concise programming language, one better suited to brute-force methods like genetic mutation. The problem, of course, would be training an LLM proficient enough in such a language. I am thinking of stack-based languages. For this competition I would develop a careful bit-level encoding of a variant of the 'Joy' programming language (https://en.wikipedia.org/wiki/Joy_(programming_language)). It would be a considerable effort, though, which I don't have time for; hence I post this idea publicly. A promising direction is, in my opinion, a mix of things: a special concise stack-based language, consulting LLMs like the OP did, and genetic algorithms, combined.
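
    The appeal is easy to show even in a toy: a concatenative program is just a flat token list, so mutation is a one-liner. A sketch (primitives invented for illustration, nothing to do with real Joy):

      import random

      OPS = {
          "dup":   lambda s: s + [s[-1]],
          "swap":  lambda s: (s[:-2] + [s[-1], s[-2]]) if len(s) > 1 else s,
          "fliph": lambda s: s[:-1] + [[row[::-1] for row in s[-1]]],
          "flipv": lambda s: s[:-1] + [s[-1][::-1]],
      }

      def run(program, grid):
          # a program is a token sequence acting on a stack of grids
          stack = [grid]
          for tok in program:
              stack = OPS[tok](stack)
          return stack[-1]

      def mutate(program, rate=0.2):
          # point mutation: randomly swap tokens for other primitives
          toks = list(OPS)
          return [random.choice(toks) if random.random() < rate else t
                  for t in program]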

  • The expectation is that at some point you'll have to move to dynamically generated benchmarks with better evals, given the potential for brute-forcing the private validation set.
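
    A dynamic benchmark could be as simple as sampling a hidden rule and emitting fresh train/test pairs from it. A toy sketch (real tasks would need far richer rules than these):

      import random

      RULES = [
          lambda g: [list(r) for r in zip(*g[::-1])],            # rotate 90° clockwise
          lambda g: [[(v + 1) % 10 for v in row] for row in g],  # cycle colours
          lambda g: [row + row[::-1] for row in g],              # mirror-extend each row
      ]

      def random_grid(rows=4, cols=4, colors=10):
          return [[random.randrange(colors) for _ in range(cols)] for _ in range(rows)]

      def make_task(n_train=3):
          # pick a secret transformation, apply it to fresh random grids
          rule = random.choice(RULES)
          grids = [random_grid() for _ in range(n_train + 1)]
          return {
              "train": [{"input": g, "output": rule(g)} for g in grids[:-1]],
              "test":  [{"input": grids[-1], "output": rule(grids[-1])}],
          }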

  • You know you’re approaching AGI when creating benchmarks gets difficult. This is only just beginning.

  • I looked at the website and have no idea how ARC is supposed to measure AGI.

    Can someone explain?

  • ARC-AGI is a small stepping stone toward AGI, but it is not AGI.

    Program search mimics what humans do to a certain extent, but not in its entirety.

    A more general world model and frame of reference will be required for AGI.

  • Can we be sure GPT-4o hasn’t been trained on the public test set?

  • Isn't 50% kind of a failing grade?

  • LOL, I looked at that first complex test sample and closed the page; it made my brain hurt.

  • I'm glad someone else finally said it, those born blind cannot possibly have AGI!

    /sarcasm :D
