The Unreliability of LLMs and What Lies Ahead

  • If I take a step back and think back to, say, five years ago, what LLMs can do now is amazing. One has to acknowledge that (or at least, I do). But as a scientist it's been rather interesting to probe the jagged edge and unreliability, including with the deep research tools, on any topic I know well.

    If I read through the reports and summaries it generates, they seem correct at first glance - the jargon is used correctly, and physical phenomena are referred to mostly accurately. But very quickly I realize that, even with the deep research features and citations, it's making a bunch of incorrect inferences that likely arise from certain concepts (words, really) co-occurring in documents without actually being causally linked or otherwise fundamentally connected. On top of some strange leading sentences and arguments, this often ends up creating entirely inappropriate topic headings/sections connecting things that really shouldn't be together.

    One small example, of course, but this type of error (usually multiple errors) shows up in both Gemini and OpenAI models, even with very specific prompts and multiple turns. And it keeps happening for topics in the physical sciences and engineering fields I work in. I'm not sure one could RL hard enough to correct this sort of thing (and it is likely not worth the time and money), but perhaps my imagination is limited.

  • This is a good articulation of a real concern about the AI bull thesis.

    If a calculator worked correctly only 99% of the time, you could not use that calculator to build a bridge.

    Using AI for more than code generation is still very difficult and requires a human in the loop to verify the results. Sometimes using AI ends up being less productive because you're spending all your time debugging its outputs. It's great, but there are also a lot of questions about whether this technology will ultimately lead to the productivity gains that many think are guaranteed in the next few years. There is a non-zero chance it ends up actually hurting productivity because of all the time wasted trying to get it to produce magic results.

  • Good article. Agree that general unreliability will continue to be an issue since it's fundamental to how LLMs work. However, it would surprise me if there was still a significant gap between single-turn and multi-turn performance in 18 months. Judging by improvements in the last few frontier model releases, I think the top AI labs have finally figured out how to train for multi-turn and agentic capabilities (likely RL) and just need to scale this up.

  • MongoDB was basically "vibe coding" for RDBMSs. After the hype cycle, there will be a wasteland of unmaintainable vibe-coded products that companies will have to pump unlimited amounts of money into to maintain.

  • A few months ago I asked CGPT to create a max operating depth table for scuba diving based on various ppO2 limits and EAN gas profiles, just to test it on something I know (it's a trivially easy calculation, and the formula is readily available online). It got it wrong… multiple times… even after correction and supplying the correct formula, the table was still repeatedly wrong (it did finally output a correct table). I just tried it again, with the same result. Obviously not something I would stake my life on anyway, but if it's getting something so trivial wrong, I'm not inclined to trust it on more complex topics.
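
    For reference, the standard formula is a one-liner; here is a minimal sketch (using the common 10 msw-per-bar salt-water approximation; the function and variable names are just for illustration):

        // Maximum operating depth in metres of seawater:
        // MOD = 10 * (ppO2 limit / O2 fraction of the mix - 1)
        function maxOperatingDepthMetres(ppO2Limit: number, fO2: number): number {
          return 10 * (ppO2Limit / fO2 - 1);
        }

        // Example: EAN32 (32% O2) at a 1.4 bar ppO2 limit -> 33.75 m
        const modEan32 = maxOperatingDepthMetres(1.4, 0.32);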

  • There are jobs out there that have always been unreliable.

    A classic example is the Travel Agent. This was already a job driven to near-extinction just by Google, but LLMs are a nail in the travel agent coffin.

    The job was always fuzzy. It was always unreliable. A travel agent recommendation was never a stamp of quality or a guarantee of satisfaction.

    But now, I can ask an LLM to compare and contrast two weeks in the Seychelles with two weeks in the Caribbean, then have it come up with sample itineraries and sample budgets.

    Is it going to be accurate? No, it'll be messy and inaccurate, but sometimes a vibe check is all you ever wanted to confirm that yeah, you should blow your money on the Seychelles, or to confirm that actually, you were right to pick the Caribbean.

    Or that, actually, both are twice what you'd prefer to spend - so, dear ChatGPT, where would be more suitable?

    etc.

    When it comes down to the nitty-gritty, does it start hallucinating hotels and prices? Sure, at that point you break out TripAdvisor, etc.

    But as a basic "I don't even know where I want to go on holiday (vacation), please help?" it's fantastic.

  • LLMs can't evaluate their own output. They suggest possibilities, but can't evaluate them. Imagine an insane man who is rambling something smart but doesn't self-reflect. Evaluation is done against some framework of values that are considered true: the rules of a board game, the syntax of a language, or something else. LLMs also can't fabricate evaluation, because the latter is a rather rigid and precise model, unlike natural language. Otherwise you could just set up two LLMs questioning each other.

  • It's hard to say "never" in technology. History isn't really on your side. However, LLMs have largely proven to be good at things computers were already good at: repetitive tasks, parallel processing, and data analysis. There's nothing magical about an LLM that seems to be defeating the traditional paradigm. Increasingly I lean toward an implosion of the hype cycle for AI.

  • Unreliability doesn't matter for some people because their bar was already that low. Unfortunately this is the way of the world, and quality has suffered and will continue to suffer. LLMs mostly accelerate this problem... hopefully they get good enough to help solve it.

  • Has anyone experimented with an ensemble + synthesizer approach for reliability? I'm thinking: make n identical requests to get diverse outputs, then use a separate LLM call to synthesize/reconcile the distinct results into a final answer. Seems like it could help with the consistency issues discussed here by leveraging the natural variance in LLM outputs rather than fighting it. Any experience with this pattern?
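
    A minimal sketch of that pattern, assuming a hypothetical callLLM(prompt) wrapper around whatever chat-completion client is in use:

        // Hypothetical stand-in for your provider's chat-completion call.
        declare function callLLM(prompt: string): Promise<string>;

        async function ensembleAnswer(question: string, n = 5): Promise<string> {
          // Fire n identical requests; sampling variance yields diverse drafts.
          const drafts = await Promise.all(
            Array.from({ length: n }, () => callLLM(question))
          );

          // A separate synthesis call reconciles the drafts into one answer.
          const synthesisPrompt = [
            "Several independent answers to the same question follow.",
            `Question: ${question}`,
            ...drafts.map((d, i) => `Answer ${i + 1}:\n${d}`),
            "Write one final answer, flagging any points where the drafts disagree.",
          ].join("\n\n");

          return callLLM(synthesisPrompt);
        }

    The points where the drafts disagree are often exactly the claims worth checking by hand, so having the synthesizer flag them seems more useful than silently majority-voting.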

  • LLMs are a tool to extend human capabilities. They are not intelligent agents that can replace humans.

    Not very hard to understand, except it seems to be.

  • I'm no AI fan, but articles talking about the shortcomings of LLMs always seem to amount to complaining that forks aren't good for drinking soup.

    Don't use LLMs to do 2 + 2. Don't use LLMs to ask how many r's are in strawberry.

    For the love of God. It's not actual intelligence. This isn't hard. It just randomly spits out text. Use it for what it's good at instead. Text.

    Instead of hunting for how to do things in programming using an increasingly terrible search engine, I just ask ChatGPT. For example, this is something I've asked ChatGPT in the past:

        in typescript, I have a type called IProperty<T>, how do I create a function argument that receives a tuple of IProperty<T> of various T types and returns a tuple of the T types of the IProperty in order received?
    
    This question was such an edge case that I wasn't even sure how to word it properly, yet it actually yielded the answer I was looking for.

        function extractValues<T extends readonly IProperty<any>[]>(
          props: [...T]
        ): { [K in keyof T]: T[K] extends IProperty<infer U> ? U : never } {
          return props.map(p => p.get()) as any;
        }
    
    This doesn't look unreliable to me. It actually feels pretty useful. I just needed the [...T] there and the infer there.
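
    To make that snippet concrete, here is a quick usage sketch; the IProperty<T> shape below (a get() accessor) and the variable names are assumptions for illustration only:

        // Assumed minimal shape for IProperty<T>; substitute the real interface.
        interface IProperty<T> {
          get(): T;
        }

        declare const label: IProperty<string>;
        declare const count: IProperty<number>;

        // Thanks to the [...T] tuple inference, values is typed as [string, number].
        const values = extractValues([label, count]);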

  • I think I'm settling on a "Gell-Mann amnesia" explanation of why people are so rabidly committed to the "acceptable veracity" of LLM output. When you don't know the facts, you're easily misled by plausible-sounding analysis, and having been misled, a certain default prejudice toward existing beliefs takes over. There's a significant asymmetry of effort in belief change vs. acquisition. I think there's also an ego-protection effect here too: if I have to change my belief, then I was wrong.

    There are Socratically minded people who are more addicted to that moment of belief change, and hence overall vastly more sceptical -- but I think this attitude is extremely marginal, and it probably requires a lot of self-training to be properly inculcated.

    In any case, with LLMs, people really seem to hate the idea that their beliefs about AI and their reliance on LLM output could be systematically mistaken - all the while, when shown output in an area of their expertise, realising immediately that it's full of mistakes.

    This, of course, makes LLMs a uniquely dangerous force in the health of our social knowledge-conductive processes.

  • AI does not know what is fake or real any more than we do. It uses our shaky data to make predictions.

  • > Internally, it uses a sophisticated, multi-path strategy, approximating the sum with one heuristic while precisely determining the final digit with another. Yet, if asked to explain its calculation, the LLM describes the standard 'carry the one' algorithm taught to humans.

    So, the LLM isn't just wrong, it also lies...

  • I have been using LLM coding tools to make stuff which I had no chance of making otherwise. They are MVPs, and if anything ever got traction I am very aware that I would need to hire a real dev. For now, I am basically a PM and QA person.

    What really concerns me is that the big companies on whose tools we all rely are starting to push a lot of LLM generated code without having increased their QA.

    I mean, everybody cut QA teams in recent years. Are they about to make a comeback once big orgs realize that they are pushing out way more bugs?

    Am I way off base here?

  • Can't we make this deterministic with techniques like JAX's RNG seed?

  • Hallucinations are essentially the only thing keeping all knowledge workers from being made permanently redundant. If that doesn't make you a little concerned, then you are a fool. And the prediction of all the experts in 2010 was that what is currently happening right in front of us could never happen within a hundred years. Why are the predictions of experts more reliable now? Anyone who dismisses the risks is just a sorry fool.

  • Large language models reliably produce misinformation that appears plausible only because it mimics human language. They are dangerous toys that cannot be made into tools that are safe to use.

  • I think this misses some of the core problems, and it suggests the solutions are more straightforward than they are. We have no solutions to this, and the way we're treating it means we aren't going to come up with any.

    Problem 1: Training

    Using methods like RLHF, DPO, and the like guarantees that we train our models to be deceptive.

    This is because our metric is the Justice Potter Stewart metric: I know it when I see it. Well, you're assuming that this is accurate. The original case was about defining porn, and... I don't think it is hard to see how people disagree even on that. Go on Reddit and ask whether girls in bikinis are safe for work or not. But it gets worse. At times you'll be presented with a choice between two lies: one you know is a lie, and one you don't know is a lie. So which do you choose? Obviously the latter! This means we optimize our models to deceive us. The same is true when the choice is between a truth and a lie we do not know is a lie: they both look like truths.

    This will be true even in completely verifiable domains. The problem comes down to truth not having infinite precision. A lot of truth is contextually dependent. Things often have incredible depth, which is why we have experts. As you get more advanced, those nuances matter more and more.
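
    The "choice between two lies" above can be put as a toy sketch (purely illustrative - not any real training pipeline; the types and function here are made up):

        // A labeler can only penalize errors they can actually detect.
        type Completion = { text: string; isFalse: boolean; labelerCanTell: boolean };

        function preferred(a: Completion, b: Completion): Completion {
          const aLooksWrong = a.isFalse && a.labelerCanTell;
          const bLooksWrong = b.isFalse && b.labelerCanTell;
          if (aLooksWrong && !bLooksWrong) return b;
          if (bLooksWrong && !aLooksWrong) return a;
          return a; // tie: the label carries no signal either way
        }

        // Both completions are false, but only one lie is detectable.
        const knownLie: Completion = { text: "...", isFalse: true, labelerCanTell: true };
        const hiddenLie: Completion = { text: "...", isFalse: true, labelerCanTell: false };

        // The "chosen" sample the model is optimized toward is the undetectable lie.
        console.log(preferred(knownLie, hiddenLie) === hiddenLie); // true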

    Problem 2: Metrics and Alignment

    All metrics are proxies. No ifs, ands, or buts. Every single one. You cannot obtain direct measurements which are perfectly aligned with what you intend to measure.

    This can be easily observed with even simple forms of measurement, like measuring distance. I studied physics and worked as an (aerospace) engineer prior to coming to computing. I did experimental physics, and boy, is there a fuck ton more complexity to measuring things than you'd guess. I have a lot of rulers, calipers, micrometers and other stuff at my house. Guess what: none of them actually agree on measurements. They are all pretty close, but they differ by more than their marked precision levels. I'm not talking about my ruler with mm hatch marks being off by <1mm, but rather >1mm. RobertElderSoftware illustrates some of this in this fun video[0]. In engineering, if you send a drawing to a machinist and it doesn't have tolerances, you have not actually provided them with measurements.

    In physics, you often need to get a hell of a lot more nuanced. If you want to get into that, go find someone who works in an optics lab. Boy, does a lot of stuff come up that throws off your measurements. It seems straightforward: you're just measuring distances.

    This gets less straightforward once we talk about measuring things that aren't concrete. What's a high-fidelity image? What is a well-written sentence? What is artistic? What is a good scientific theory? None of these even has an answer, and they're all highly subjective. The result is that your precision is incredibly low. In other words, you have no idea how well you've aligned things. It is fucking hard in well-defined practical areas, but the stuff we're talking about isn't even close to well defined. I'm sorry, we need more theory. And we need it fast. Ad hoc methods will get you pretty far, but you'll quickly hit a wall if you aren't pushing the theory alongside them. The theory sits invisible in the background, but it is critical to advancement.

    We're not even close to figuring this shit out... We don't even know if it is possible! But we should figure out how to put bounds on it, because even bounding the measurements to certain levels of error provides huge value. These are certainly possible things to accomplish, but we aren't devoting enough time to them. Frankly, it seems many are dismissive. But you can't discuss alignment without understanding these basic things. It only gets more complicated, and very fast.

    [0] https://www.youtube.com/watch?v=EstiCb1gA3U

  • My experience with LLM-based chat is so different from what the article (and some friends) describe.

    I use LLM chat for a wide range of tasks including coding, writing, brainstorming, learning, etc.

    It's mostly right enough, and so my usage of it has only increased and expanded. I don't know how much less right it would need to be, or how much more often, for me to reduce my usage.

    Honestly, I think it's hard to change habits, and LLM chat, at its most useful, is attempting to replace decades-long habits.

    Doesn’t mean quality evaluation is bad. It’s what got us where we are today and what will help us get further.

    My experience is anecdotal. But I see this divide in nearly all discussions about LLM usage and adoption.