Why don't LLMs ask for calculators?

  • They can do exactly that; it's called Tool Use, and nearly all modern models can handle it. For example, I have a consumer GPU that can run an R1 Qwen distill which, when prompted for a large multiplication, will elect to write a Python script to find the answer (roughly the kind of script sketched below).

    This is a table stakes feature for even the open/free models today.
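
    The script it writes is nothing fancy; something like this (illustrative only, the numbers here are made up rather than taken from an actual run):

      # A model with tool use enabled emits a tiny script like this instead of
      # guessing the product token by token, then reads the printed result back.
      a = 398_472_918_374
      b = 127_349_812_233
      print(a * b)  # exact integer arithmetic, no dropped digits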

  • Claude Sonnet 3.5 will often use JavaScript as a calculator. It's not perfect at deciding when it should write code, but that's easy to fix by prompting it with "Write some code to help you answer the question".

    The post is honestly quite strange. "When LLMs try and do math themselves they often get it wrong" and "LLMs don't use tools" are two entirely different claims! The first claim is true, the second claim is false, and yet the article uses the truth of the first claim as evidence for the second! This does not hold up at all.

  • Many LLMs, particularly coding assistants, use "tools". Here is one with a calculator:

    https://githubnext.com/projects/gpt4-with-calc/

    and another example

    https://www.pinecone.io/learn/series/langchain/langchain-too...

    LLMs often do a good job at mathy coding. For instance, I told Copilot "i want a python function that computes the collatz sequence for a given starting n and returns it as a list" and it produced:

      def collatz_sequence(n):
          sequence = [n]
          while n != 1:
              if n % 2 == 0:
                  n = n // 2
              else:
                  n = 3 * n + 1
              sequence.append(n)
          return sequence
    
    which gives the right answers, something I wouldn't count on Copilot being able to do on its own.
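
    A quick check (my own run, not part of the Copilot output above):

      >>> collatz_sequence(6)
      [6, 3, 10, 5, 16, 8, 4, 2, 1]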

  • > Now, some might interject here and say we could, of course, train the LLM to ask for a calculator. However, that would not make them intelligent. Humans require no training at all for calculators, as they are such intuitive instruments.

    Does the author really believe humans are born with an innate knowledge of calculators and their use?

  • A lot of people are talking about tool use and writing internal scripts, and yeah, that’s kind of an answer. Really though I think the author is highlighting that LLMs are not being used efficiently at the present moment.

    LLMs are great at certain tasks. Databases are better at certain tasks. Calculators too. While we could continually throw more and more compute at the problem, growing layers and injecting more data, wouldn't it make more sense to just have an LLM call its own back-end calculator agent? When I ask it for obscure information, maybe it should just pull from its own internal encyclopedia database.

    Let LLMs do what they do well, but let's not forget the decades that brought us here. Even the smartest human still uses a calculator, so why doesn't an AI? The fact that it writes its own JavaScript is flashy as hell, but it's also completely unnecessary and error-prone. A rough sketch of the kind of routing I mean follows.
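
    Hypothetically, something like this (calc_tool and route_request are made-up names, not any real API; it's just to show the shape of the idea):

      from fractions import Fraction

      def calc_tool(expression: str) -> str:
          """Deterministic back-end arithmetic for a simple 'a <op> b' expression."""
          a, op, b = expression.split()
          x, y = Fraction(a), Fraction(b)
          return str({"+": x + y, "-": x - y, "*": x * y, "/": x / y}[op])

      def route_request(user_text, llm_answer):
          """Send arithmetic to the calculator; let the LLM handle everything else."""
          parts = user_text.split()
          if len(parts) == 3 and parts[1] in "+-*/":
              return calc_tool(user_text)   # exact, cheap, boring, and correct
          return llm_answer(user_text)      # prose, reasoning, obscure trivia, etc.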

  • I don't know what happened, but there was a time when GPT-4 could access Wolfram Alpha, and anytime you asked it something beyond the most basic math, it would automatically prompt Wolfram for the answer.

  • > The LLM has no self-reflection for the knowledge it knows and has no understanding of concepts beyond what can be assembled by patterns in language.

    My favorite framing: the LLM is just an ego-less extender of text documents. It is iteratively run against a movie script, one that is usually incomplete and ends with: "User Says X, and Bot responds with..."

    Designers of these systems have--deliberately--tricked consumers into thinking they are talking to the LLM author, rather than supplying mad-libs dialogue for a User character that is in the same fictional room as a Bot character.

    The Bot can only voice limitations that are story-appropriate for the character. It only says it's bad at math because lots of people have written lots of words saying the same thing. If you changed its name and description to Mathematician Dracula, it would have dialogue about how it's awesome at math but can't handle sunlight, crucifixes, and garlic.

    This framing also explains how "prompt injection" and "hallucinations" are not exceptional, but standard core behavior.

  • The paid version of ChatGPT has had a built-in Python runtime for well over a year.

    The [>_] links to the Python code that was run.

    https://chatgpt.com/share/67b79516-9918-8010-897c-ba061a2984...

  • I'm surprised that LLMs don't have a hard rule in their system prompt instructing that any numeric computation in particular, and any other computation in general, must only be performed via tool use / running Python.
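
    Roughly something like this (a hypothetical sketch; the wording and the tool schema are mine, not any vendor's actual system prompt):

      # Illustrative system rule plus a tool definition in the JSON-schema style
      # that many chat APIs accept.
      system_rule = {
          "role": "system",
          "content": (
              "Never perform arithmetic or other numeric computation in your own text. "
              "For any calculation, call the run_python tool and report its output."
          ),
      }

      run_python_tool = {
          "name": "run_python",
          "description": "Execute a short Python snippet and return its stdout.",
          "parameters": {
              "type": "object",
              "properties": {"code": {"type": "string"}},
              "required": ["code"],
          },
      }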

  • I am puzzled by the fact that modern LLMs don't do multiplication the way humans do it, i.e. digit by digit. Surely they can write an algorithm for that, but why can't they perform it?
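
    For reference, the digit-by-digit procedure is short to write down. A sketch of schoolbook long multiplication (my own code, just to spell out what "digit by digit" means):

      def long_multiply(a: str, b: str) -> str:
          """Schoolbook multiplication on digit strings, exactly as done by hand."""
          da = [int(ch) for ch in reversed(a)]   # least-significant digit first
          db = [int(ch) for ch in reversed(b)]
          result = [0] * (len(da) + len(db))
          for i, x in enumerate(da):
              carry = 0
              for j, y in enumerate(db):
                  total = result[i + j] + x * y + carry
                  result[i + j] = total % 10
                  carry = total // 10
              result[i + len(db)] += carry
          # restore normal digit order and strip leading zeros
          return "".join(map(str, reversed(result))).lstrip("0") or "0"

      print(long_multiply("12345", "6789"))   # 83810205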

  • And as you see in the responses here, most people miss the point, elect to patch over the aspects in which the lack of intelligence is glaring, and eventually the end product will be so hard to distinguish from actual intelligence that it's deemed "good enough".

    Is that bad? Idk. If you hoped that real AGI would eventually solve humanity's biggest problems and questions, perhaps so. But if you want something that really, really looks like AGI, except to some nerds who still say "well actually", then it's gonna be good enough for most. And certainly sufficient for ending up in the dystopia from that movie clip at the end.

  • I just ask ChatGPT to use a script to calculate an answer.

  • Umm, they do though? When I use ChatGPT it will phone out to Wolfram Alpha to compute numbers and the like.

  • This is copium; the author doesn't have a good grasp on LLMs. You can't simply "ask" a language model whether it knows it's bad at math and then conclude that the response actually reflects the knowledge encapsulated in the model... sigh...