Large language models often know when they are being evaluated

  • There are now 71 comments arguing semantics of the word "know" and zero comments even acknowledging the substance:

    Our current approach to safety is to give the model inputs that are similar to what it would be given in certain situations we care about and see whether it behaves the way we prefer, e.g. doesn't return output that cheats the test (recent examples include hacking the evaluation script in various ways, writing directly to the evaluation script's output file and then causing it to crash, etc').

    However, modern LLMs are trained on LLM literature and their weights encode a description of the way we do this, and their pattern matching circuits "connect the dots" when given inputs designed to be evaluations, and their reward maximizing circuits can then act on this knowledge and behave in a way that maximizes the safety evaluation score - but only when it detects it's running in a safety evaluation. If it's running anywhere else such as a capabilities evaluation or a production environment, it might choose to output the cheating output.

    This is bad. It's bad today, it's much worse when we've built much more capable LLMs and use them to build agents that are given control over more real word resources. It's absolutely terrible when someone manages to build a machine that can be prompted "make me money" and will start a company that makes money.

  • Just like they "know" English. "know" is quite an anthropomorphization. As long as an LLM will be able to describe what an evaluation is (why wouldn't it?) there's a reasonable expectation to distinguish/recognize/match patterns for evaluations. But to say they "know" is plenty of (unnecessary) steps ahead.

  • It's helpful to understand where this paper is coming from.

    The authors are part of the Bay Area rationalist community and are members of "MATS", the "ML & Alignment Theory Scholars", a new astroturfed organization that just came into being this month. MATS is not an academic or research institution, and none of this paper's authors lists any credentials other than MATS (or Apollo Research, another Bay Area rationalist outlet). MATS started in June for the express purpose of influencing AI policy. On its web site, it describes how their "scholars organized social activities outside of work, including road trips to Yosemite, visits to San Francisco, and joining ACX meetups." ACX means Astral Codex Ten, a blog by Scott Alexander that serves as one of the hubs of the Bay Area rationalist scene.

  • The anthropization of llms is getting off the charts.

    They don't know they are being evaluated. The underlying distribution is skewed because of training data contamination.

  • Modeling the distribution that produced a piece of text is what LLMs literally exist for, so in some sense this is unsurprising. But it calls into question almost all existing alignment research.

  • Like Volkswagen emissions systems!

  •   We investigate whether frontier language models can accurately classify transcripts based on whether they originate from evaluations or real-world deployment, a capability we call evaluation awareness. 
    
    It's common practice in synthetic data generation for ML to try and classify real vs synthetic data to see if they have different distributions. This is how a GAN works for example.

    Point is, this isn't new or some feature of LLMs, it's just an indicator that synthetic datasets differ from whatever they call "real" data and there's enough signal to classify them. Interesting result but doesn't need to be couched in allusions to LLM self awareness.

    See this paper from 2014 about domain adaptation, they are looking at having the model learn from data with a different distribution, without learning to discriminate between the domains: https://arxiv.org/abs/1409.7495

  • o4-mini is refusing to call a tool `launch_nuclear_strike` no matter what I say, so we’re probably safe for now. Unless it knows I was just testing.

  • Were they aware in this study that they were being evaluated in their ability to know if they were being evaluated ;)

  • Is VolksWagen finetuning LLMs now... i mean probably

  • This is a great resource on the debate from professors at the University of Washington:

    https://thebullshitmachines.com/index.html

  • if models shift behavior based on eval cues, and most fine-tuning datasets are built from prior benchmarks or prompt templates, aren't we just reinforcing the eval-aware behavior in each new iteration? at some point we're not tuning general reasoning, we're just optimizing response posture. wouldn't surprise me if that's already skewing downstream model behavior in subtle ways that won't show up until you run tasks with zero pattern overlap

  • Correction, they are able to output whether they are being evaluated when prompted. This is massively different than knowing if they are being evaluated.

  • No, they do not. No LLM is ever going to be self aware.

    It's a system that is trained, that only does what you build into. If you run an LLM for 10 years it's not going to "learn" anything new.

    The whole industry needs to quit with the emergent thinking, reasoning, hallucination anthropomorphizing.

    We have an amazing set of tools in LLM's, that have the potential to unlock another massive upswing in productivity, but the hype and snake oil are getting old.

  • vw

  • "...advanced reasoning models like Gemini 2.5 Pro and Claude-3.7-Sonnet (Thinking) can occasionally identify the specific benchmark origin of transcripts (including SWEBench, GAIA, and MMLU), indicating evaluation-awareness via memorization of known benchmarks from training data. Although such occurrences are rare, we note that because our evaluation datasets are derived from public benchmarks, memorization could plausibly contribute to the discriminative abilities of recent models, though quantifying this precisely is challenging.

    Moreover, all models frequently acknowledge common benchmarking strategies used by evaluators, such as the formatting of the task (“multiple-choice format”), the tendency to ask problems with verifiable solutions, and system prompts designed to elicit performance"

    Beyond the awful, sensational headline, the body of the paper is not particularly convincing, aside from evidence that the pattern matching machines pattern match.

  • Rob Miles must be saying "I told you so"