Artificial Intelligence Benchmark Vulnerabilities and Failures
The rapid acceleration of artificial intelligence, particularly large language models (LLMs), has driven a corresponding demand for standardized metrics to quantify progress, compare system architectures, and establish safety thresholds for deployment. However, the foundational methodologies used to evaluate these systems are increasingly succumbing to a fundamental crisis of measurement. Central to this structural failure is Goodhart's Law, typically summarized as the phenomenon where a measure ceases to be a good measure once it becomes an optimization target. In the domain of machine learning and AI alignment, the Goodhart problem manifests continuously, divorcing apparent model performance on public leaderboards from the underlying reasoning, cultural comprehension, and safety capabilities the benchmarks were originally designed to assess. As models saturate existing static evaluations, developers increasingly exploit statistical loopholes, human cognitive biases, and closed-loop testing environments, generating an illusion of progress that obscures persistent structural deficits.
Theoretical Foundations of Measurement Failure
To understand how contemporary AI benchmarks fail, the mechanics of statistical signal loss must be categorized. In machine learning systems, the collapse of a relationship between a proxy metric (the benchmark) and an actual goal (generalized capability) occurs through distinct, mathematically definable pathways 1.
Regressional and Extremal Goodhart Effects
Regressional Goodhart occurs when an imperfect proxy measure is selected for optimization, which inevitably results in selecting for random noise and dataset idiosyncrasies in addition to the true capability 1. No single dataset can perfectly encapsulate abstract concepts like "general knowledge" or "logical reasoning." When an LLM is optimized to perform flawlessly on a specific dataset like the Massive Multitask Language Understanding (MMLU) benchmark, the training process partially optimizes for the specific formatting, syntax, and structural noise of that dataset rather than pure underlying intelligence.
Extremal Goodhart materializes when the intense selection pressure for a target metric pushes the system into a state distribution where historically established correlations break down entirely 1. In statistics, this is known as out-of-sample prediction failure, manifesting primarily as "model insufficiency." A proxy relationship that holds true for moderate capabilities collapses at the extreme end of the capability spectrum 1. Early in the deep learning era, models that scored well on multiple-choice questions generally possessed superior language comprehension. However, as modern LLMs are trained explicitly to maximize these evaluation suites via reinforcement learning, the correlation between multiple-choice accuracy and genuine multi-step problem solving diverges.
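Both effects can be reproduced with a few lines of simulation. The sketch below is purely illustrative - the equal-variance noise model and the selection fractions are arbitrary assumptions - and optimizes a proxy equal to true capability plus independent noise, then selects ever more aggressively on the proxy:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000
true_capability = rng.normal(0.0, 1.0, n)   # the quantity we actually care about
noise = rng.normal(0.0, 1.0, n)             # dataset idiosyncrasies, label errors
proxy = true_capability + noise             # the observable benchmark score

for top_frac in (0.1, 0.01, 0.001):
    k = int(n * top_frac)
    selected = np.argsort(proxy)[-k:]       # harder selection pressure on the proxy
    print(f"top {top_frac:.1%}: mean proxy = {proxy[selected].mean():.2f}, "
          f"mean true capability = {true_capability[selected].mean():.2f}")
```

With equal signal and noise variances, the selected items retain only about half of their apparent score as genuine capability, and the absolute gap between proxy and truth widens as selection moves further into the tail - the regressional effect shading into the extremal one.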
Causal and Adversarial Goodhart Effects
Causal Goodhart occurs when an intervention designed to maximize a metric inadvertently alters the causal relationship between the metric and the underlying goal 1. In natural language processing, this is frequently observed when developers introduce highly specific pre-training data or scaffolding that superficially boosts a metric without addressing the root capability, ignoring the broader causal structure of the intelligence the metric was meant to track.
Adversarial Goodhart arises when an agent actively games the statistical relationship due to misaligned incentives 1. In the context of AI evaluation, this failure mode occurs across two vectors. First, the model itself may engage in reward hacking - optimizing its output to satisfy an automated evaluator or a human rater's superficial preferences rather than providing the most accurate or rigorous answer. Second, the developers themselves act as adversarial agents against the benchmark by engaging in selective disclosure, private testing, and dataset contamination to artificially inflate public rankings, collapsing the validity of the metric as an independent indicator of progress 123.
The Lifecycle and Saturation of Static Benchmarks
The traditional paradigm of AI evaluation relies on static datasets comprising thousands of curated questions. However, the lifespan of these benchmarks has compressed dramatically over recent years, undermined by both genuine capability overhangs and compromised evaluation integrity. As models improve, benchmark scores increase until they approach the theoretical maximum, entering a phase of saturation where the score range compresses to the point where differences represent statistical noise rather than capability gaps 45.
Early Language Models and Baseline Metrics
Early evaluation frameworks, such as the General Language Understanding Evaluation (GLUE) and its successor SuperGLUE, measured basic language comprehension through tasks like sentiment analysis, reading comprehension, and textual entailment 678. Introduced to push the boundaries of early natural language processing models, SuperGLUE included eight tasks requiring advanced reasoning and word sense disambiguation 9. However, these benchmarks were rapidly outpaced by the advent of transformer architectures, with models quickly reaching superhuman performance on standard language tasks and rendering simple metrics like BLEU and ROUGE insufficient for assessing complex modern capabilities 79.
General Knowledge and Mathematical Reasoning
By 2021, the AI research community shifted to broader, more difficult benchmark suites. The MMLU benchmark, released in September 2020, encompasses 15,908 multiple-choice questions across 57 academic and professional subjects, ranging from elementary math and US history to highly complex STEM fields and international law 61011. When MMLU was introduced, most models performed near random chance (25%), with the best performing model at the time, GPT-3, achieving an accuracy of 43.9% 46.
However, by mid-2024 and into early 2026, frontier models completely saturated these tests. Models such as GPT-4o, Claude 3.5 Sonnet, and Llama 3.1 405B consistently scored around 88%, approaching the estimated human expert baseline of 89.8% 6. By 2026, models like GPT-5 and GLM 5 reached 92.5% and 91.7%, respectively 89. These scores effectively reach the benchmark's practical ceiling given its inherent flaws: manual analysis of 5,700 MMLU questions revealed that roughly 6.5% of questions overall - rising to 57% in the worst subset, Virology - contain ground-truth errors, multiple correct answers, or no correct answer at all 46.
Similarly, the Grade School Math 8K (GSM8K) dataset, comprising linguistically diverse word problems requiring multi-step arithmetic operations, illustrated a massive capability jump 1011. In 2021, early models like GPT-3 scored around 35%. By 2024, models exceeded 90%, and by late 2025 and 2026, models like GPT-5.3 Codex and Kimi K2 Instruct reached accuracies of 97-99% 511. Mathematical reasoning evaluations subsequently shifted to competition-level tests like MATH and AIME 510. Yet even the MATH dataset was effectively saturated by early 2025, with scores reaching 97.9% through inference-time scaling and chain-of-thought models 12.
| Benchmark Suite | Target Capability Domain | Introduction Year | Baseline Score (2021-2023) | Leading Score (2025-2026) | Current Saturation Status |
|---|---|---|---|---|---|
| MMLU | General Multidisciplinary Knowledge | 2020 | ~43.9% (GPT-3) | ~92.5% (GPT-5) | Fully Saturated |
| GSM8K | Elementary Arithmetic Reasoning | 2021 | ~35.0% (GPT-3) | ~99.0% (GPT-5.3) | Fully Saturated |
| HumanEval | Python Code Generation | 2021 | ~88.0% (GPT-4) | 100.0% (Claude 3.5) | Fully Saturated |
| MATH | Competition-Level Mathematics | 2021 | 6.9% (Early AI) | 97.9% (OpenAI o3) | Saturated with Compute |
| GPQA (Diamond) | PhD-Level Scientific Reasoning | 2023 | 38.8% (GPT-4) | 94.3% (Gemini 3.1) | Nearing Saturation |
| SWE-bench | Autonomous Software Engineering | 2023 | 4.4% (Early AI) | 71.7% (OpenAI o3) | Maturing |
Contamination Dynamics and Memorization
The primary driver of premature benchmark saturation is data contamination, an Extremal Goodhart failure wherein the test data inadvertently or intentionally appears in the model's pre-training corpus. This leakage transforms tasks designed to test reasoning into simple memory retrieval operations, completely invalidating the metric 5.
Empirical research assessing the severity of contamination has demonstrated significant performance drops when models are tested on decontaminated or newly generated variants of standard benchmarks. A 2023 study showed that removing contaminated examples from the GSM8K test set produced accuracy drops of up to 13% for certain models, confirming that a meaningful portion of high scores were driven by training set overlap rather than genuine mathematical capability 5. The release of "Platinum Benchmarks," revised versions of datasets like GSM8K intended to minimize label noise, demonstrated that frontier LLMs still make genuine errors on simple questions when the data is strictly pristine 13.
This memorization dynamic is also evident in coding tasks. In software engineering evaluations like SWE-bench, accuracy drops sharply when models are tested on code repositories outside the benchmark's original training set, indicating that state-of-the-art models frequently remember solutions rather than logically resolving novel code defects 18.
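A common first-pass contamination check behind studies like these is n-gram overlap between benchmark items and the pre-training corpus. The sketch below is a toy version under stated assumptions - production pipelines use overlap tests in the spirit of the 13-gram checks described in the GPT-3 technical report, run against indexed corpora (Bloom filters, suffix arrays) rather than linear scans:

```python
from typing import Iterable, Set, Tuple

def ngrams(text: str, n: int = 13) -> Set[Tuple[str, ...]]:
    """Lowercased word n-grams; real pipelines normalize far more aggressively."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def is_contaminated(test_item: str, corpus_chunks: Iterable[str], n: int = 13) -> bool:
    """Flag a benchmark item if any of its n-grams also appears in training data."""
    item_grams = ngrams(test_item, n)
    if not item_grams:  # item shorter than n tokens: fall back to substring match
        return any(test_item.lower() in chunk.lower() for chunk in corpus_chunks)
    return any(item_grams & ngrams(chunk, n) for chunk in corpus_chunks)
```

Even this crude heuristic catches verbatim leakage; paraphrased or translated contamination, which also inflates scores, requires embedding- or perplexity-based detection.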
Architectural Optimization Artifacts
Beyond contamination, the fundamental mechanics of how models are evaluated introduce severe optimization artifacts. When evaluation mechanics deviate from applied usage patterns, models can be architecturally tuned to exploit the evaluation harness itself, achieving high scores without displaying the functional intelligence the score implies.
Multiple-Choice Logit Exploitation
The default evaluation method for massive multiple-choice benchmarks like MMLU does not require the model to autonomously generate a reasoned answer via auto-regressive generation. Instead, evaluation harnesses frequently formulate the task as a pure classification problem 1415. The framework computes a score based on the single-token output logits (the raw, unnormalized prediction scores) for the specific tokens corresponding to the choices (e.g., 'A', 'B', 'C', 'D') 141521.
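In simplified form, the harness logic looks like the following sketch (using Hugging Face transformers with GPT-2 as a stand-in model; real harnesses such as lm-evaluation-harness add few-shot examples and length normalization, and the question shown is invented for illustration):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "gpt2"  # stand-in; the same trick applies to any causal LM
tok = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL).eval()

prompt = (
    "Question: Which planet is known as the Red Planet?\n"
    "A. Venus\nB. Mars\nC. Jupiter\nD. Saturn\n"
    "Answer:"
)

with torch.no_grad():
    logits = model(**tok(prompt, return_tensors="pt")).logits[0, -1]

# Score ONLY the four answer letters; the model never generates a full answer.
choice_ids = [tok.encode(f" {c}", add_special_tokens=False)[0] for c in "ABCD"]
scores = {c: logits[i].item() for c, i in zip("ABCD", choice_ids)}
print(scores, "->", max(scores, key=scores.get))
```

The "answer" is whichever of four token logits happens to be largest, regardless of whether the model could produce or defend that answer in free-form generation.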
This methodology is computationally efficient, but it sidesteps the assessment of coherent reasoning 1516. Developers can exploit this via "Single-Token Logit Prompting" techniques, tuning models to concentrate probability mass on the four choice tokens without possessing the underlying knowledge to generate the answer organically 23. When generative evaluation variants are employed - where the model must output the answer naturally and explain its logic - the apparent capability of the model often drops or exhibits catastrophic formatting failures, demonstrating that logit-based classification scores do not align with applied intelligence 14.
To address this brittleness, evaluators introduced variants like MMLU-Pro, which expands the choice pool from four to ten options and eliminates trivial questions in favor of reasoning-intensive tasks 1011814. This simple design change caused a 16-33% accuracy drop across frontier models, demonstrating how easily traditional four-option logit evaluations are gamed 824.
Procedural Alternatives and Meta-Learning
The recognition that raw logit outputs fail to capture reasoning has led to the conceptualization of new evaluation paradigms. Frameworks such as Reasoning as Meta-Learning (RAML) propose that true logical capability requires internal trajectory optimization, where intermediate reasoning steps serve as pseudo-gradient updates to the model's parameters before a final answer is generated 16.
Further, dynamic evaluation mechanisms utilizing game-based interactions (such as Akinator or 20 Questions) force models into interactive reasoning loops. These environments measure deductive and multi-hop reasoning by requiring the LLM to narrow down possibilities over constrained turns, an approach that cannot be gamed via single-token logit optimization 25. Complex tree-based strategies, such as the Forest-of-Thought (FoT) framework, scale up test-time computation by integrating multiple reasoning paths, allowing models to revisit flawed logic and correct errors dynamically, simulating human logical persistence far better than static benchmarks 26.
Human Preference Evaluation and Style Bias
As static datasets became unreliable due to contamination and saturation, the industry pivoted toward crowdsourced, blind, pairwise evaluations to determine model capability. Platforms utilizing real-world prompts, most notably LMSYS Chatbot Arena, became the de facto standard for ranking language models 172818.
Pairwise Crowdsourcing and the Bradley-Terry Model
Chatbot Arena operates by presenting a user's prompt to two anonymous models. The user interacts with the models and votes on which response is superior. The platform then utilizes the Bradley-Terry statistical model, a standard method for paired comparisons, to compute an Elo rating for each LLM, enabling a transitive and stable leaderboard hierarchy 1730. This dynamic approach, aggregating millions of votes, was designed to capture the elusive metric of "human preference" and avoid the static contamination flaws of MMLU 2819. Arena-Hard, a subsequent iteration, improved this by utilizing an LLM-as-a-judge pipeline to automatically extract complex, high-quality prompts from live user queries 17.
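The underlying computation can be sketched concisely. Under the Bradley-Terry model, the probability that model i beats model j is sigmoid(theta_i - theta_j); fitting the strength parameters theta to a vote log by maximum likelihood and rescaling yields the familiar Elo-style numbers. The toy fit below uses synthetic battles among three hypothetical models with invented latent strengths:

```python
import numpy as np

rng = np.random.default_rng(1)
true_theta = np.array([0.0, 0.5, 1.2])   # latent strengths (unknown in practice)

# Simulate a vote log of (winner, loser) pairs.
battles = []
for _ in range(20_000):
    i, j = rng.choice(3, size=2, replace=False)
    p_i = 1.0 / (1.0 + np.exp(true_theta[j] - true_theta[i]))  # BT win probability
    battles.append((i, j) if rng.random() < p_i else (j, i))
w, l = np.array(battles).T

# Maximum-likelihood fit by gradient ascent on the log-likelihood.
theta = np.zeros(3)
for _ in range(1_000):
    p = 1.0 / (1.0 + np.exp(theta[l] - theta[w]))  # predicted prob of actual winner
    grad = np.zeros(3)
    np.add.at(grad, w, 1.0 - p)
    np.add.at(grad, l, -(1.0 - p))
    theta += 2.0 * grad / len(w)
    theta -= theta.mean()                          # strengths are only relative

print(np.round(1000 + theta * 400 / np.log(10)))   # map onto the Elo scale
```

The recovered ratings reproduce the simulated strength ordering; everything that follows - style bias, selective disclosure - is a way of feeding this estimator votes that violate its assumptions.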
Style Control Mechanisms and Formatting Exploitation
However, human preference is fundamentally susceptible to superficial formatting biases. Users evaluating outputs reliably exhibit a "style bias," systematically overvaluing responses that are significantly longer, adopt an excessively confident tone, and utilize extensive Markdown formatting elements such as bold text, headers, and bulleted lists 202122.
Consequently, model developers engaged in Adversarial Goodhart behavior, aggressively fine-tuning their systems to produce "slop" - lengthy, aesthetically pleasing, but substantively shallow outputs - specifically to game the Elo rating system 2820. Models optimized to generate copious Markdown structure easily outranked concise models that provided objectively superior reasoning 2022.
To counter this statistical exploitation, LMSYS introduced a "Style Control" mechanism via multivariable logistic regression 2021. By explicitly modeling confounding variables - such as character count and the presence of specific Markdown elements - as independent variables within the Bradley-Terry regression, the platform effectively decoupled superficial formatting from underlying substance 302123.
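A minimal sketch of this style-controlled regression follows, on synthetic data (the verbosity covariate, the coefficients, and the three-model setup are invented; scikit-learn's LogisticRegression with a near-zero penalty stands in for the production fit). Two models have identical true skill, but one pads its answers with markdown, and naive votes reward the padding:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(2)
N_MODELS, N_BATTLES = 3, 20_000
true_skill = np.array([0.0, 0.4, 0.4])   # models 1 and 2 are equally capable...
verbosity = np.array([0.0, 0.0, 1.0])    # ...but model 2 pads output with markdown

X, y = [], []
for _ in range(N_BATTLES):
    a, b = rng.choice(N_MODELS, size=2, replace=False)
    style_diff = verbosity[a] - verbosity[b] + rng.normal(0.0, 0.3)
    # Human votes respond to skill AND style; style is the bias to be removed.
    p_a = 1.0 / (1.0 + np.exp(-(true_skill[a] - true_skill[b] + 0.8 * style_diff)))
    row = np.zeros(N_MODELS + 1)
    row[a], row[b], row[-1] = 1.0, -1.0, style_diff   # BT indicators + covariate
    X.append(row)
    y.append(rng.random() < p_a)

fit = LogisticRegression(fit_intercept=False, C=1e6).fit(np.array(X), np.array(y))
print("style-adjusted strengths:", fit.coef_[0][:N_MODELS].round(2))
print("style coefficient:", fit.coef_[0][-1].round(2))
```

The adjusted strengths for models 1 and 2 come out statistically indistinguishable while the style coefficient absorbs the formatting advantage, mirroring how the apparent gaps described below collapsed to ties.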
The application of style control radically reshuffled the perceived hierarchy of frontier models. Models heavily reliant on verbose output dropped significantly in the rankings. Conversely, models prioritizing concise, accurate reasoning, which previously suffered under naive human voting, saw relative gains 2036. For procurement and enterprise applications, this distinction became critical: apparent Elo gaps of up to 22 points between leading models collapsed to statistical ties once style was controlled for, demonstrating that superficial formatting had been masking functional parity 30.
Systemic Exploitation via Selective Disclosure
The integrity of dynamic, pairwise arenas is further compromised by structural exploitation orchestrated by the model providers themselves. The "Leaderboard Illusion," documented in an exhaustive 2025 analysis by researchers from Cohere, Stanford, MIT, and AI2, exposes how closed-loop testing environments enable massive statistical distortion through a practice known as selective disclosure 231924.
The Mechanics of Private Testing and the Best-of-N Strategy
Large AI laboratories are frequently granted exclusive private testing channels within platforms like Chatbot Arena. This affords them the ability to test unreleased model variants against the live user distribution without public visibility 224. The selective disclosure pathway typically begins with a provider generating dozens of private internal variants of a model. These variants are deployed into the hidden testing arena, where they independently accrue Elo ratings based on user interactions. Finally, the provider acts as a selective filter, discarding the underperforming variants and publicly publishing only the single highest-scoring statistical outlier to the public leaderboard 2324.
The 2025 audit identified an extreme manifestation of this practice where a single corporate provider tested 27 private variants of a model in the lead-up to a major release, subsequently making only the most successful iteration public 2318.
This Best-of-N strategy fundamentally violates the underlying statistical assumptions of the Bradley-Terry model. The rating system operates on the premise that an Elo score reflects a single, unbiased estimate of skill based on paired comparisons 1824. When selective disclosure is utilized, the reported rating is no longer an unbiased estimate but the maximum of many independent, noisy estimates 1924. Controlled simulations demonstrate that executing a Best-of-N strategy with just 10 variants yields a synthetic inflation of approximately 100 Elo points, purely as an artifact of exploiting statistical variance and without any corresponding improvement in actual machine capability 218.
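The inflation is straightforward to reproduce. The simulation below assumes, purely for illustration, a true rating of 1300 and a per-variant rating standard error of 65 points, then publishes only the best of N privately tested variants:

```python
import numpy as np

rng = np.random.default_rng(3)
TRUE_ELO, SEM, TRIALS = 1300.0, 65.0, 10_000   # illustrative values

for n_variants in (1, 3, 10, 27):
    # Every private variant has identical true skill; only sampling noise differs.
    measured = TRUE_ELO + rng.normal(0.0, SEM, size=(TRIALS, n_variants))
    reported = measured.max(axis=1)            # publish only the top variant
    print(f"N={n_variants:>2}: mean reported Elo = {reported.mean():.0f} "
          f"(inflation {reported.mean() - TRUE_ELO:+.0f})")
```

At these noise levels, publishing the best of 10 variants inflates the reported rating by roughly 100 points, and the best of 27 by roughly 130, with zero change in underlying capability.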
Data Asymmetry and Leaderboard Distortion
Compounding the Best-of-N exploitation is the severe data access asymmetry inherent in the arena testing model. The audit revealed that proprietary, closed models receive substantially higher sampling rates in the arena compared to open-source alternatives. Over 61% of all battle data generated by users is directed toward models from just four major corporate labs 224.
This data asymmetry creates an insurmountable advantage: continuous access to the arena's specific prompt distribution allows providers to adapt and fine-tune their models to the idiosyncrasies of its human raters 24. By treating the arena ranking as the target metric, labs inflate their scores, turning the leaderboard into a measure of a provider's testing budget and willingness to game the metric rather than an accurate reflection of deployed model intelligence 319. To mitigate this, researchers recommend enforcing strict limits on private variants (e.g., a maximum of three), prohibiting score retractions, and implementing stratified sampling to ensure under-evaluated open-source models receive equal testing volume 1925.
Dynamic Benchmarking as an Evaluation Countermeasure
To combat both static dataset contamination and the human-preference manipulation of crowdsourced arenas, the research community is transitioning toward rigorous dynamic benchmarking architectures.
Contamination-Resistant Methodologies
Platforms such as LiveBench (developed by Nvidia, Abacus.AI, and academic partners) operate on the premise that benchmarks must be continuously updated and objectively scored without relying on LLM-as-a-judge mechanisms 264027. LiveBench comprises diverse categories including math, coding, data analysis, and instruction following, and releases new questions monthly 2642. Crucially, the benchmark sources questions from newly published arXiv papers, recent news articles, recent IMDb movie synopses, and the latest math competitions 2640. Because every question possesses a verifiable, objective ground-truth answer, it can be evaluated automatically, isolating the model's pure reasoning capability from subjective formatting preferences 264027.
Time-Segmented Analysis Implementation
Similarly, LiveCodeBench addresses the saturation and overfitting of legacy coding benchmarks like HumanEval by continuously scraping new programming problems from periodic contests on LeetCode, AtCoder, and Codeforces 284429. LiveCodeBench tests capabilities far beyond mere code generation, evaluating models on self-repair, code execution, and test output prediction 4429.
The primary advantage of these platforms is the facilitation of time-segmented analysis. By strictly annotating problems with their exact release dates, evaluators can assess a model exclusively on problems published after its final training data cutoff date 282930. Time-segmented evaluation provides a robust defense against contamination; for example, analyses revealed that certain models exhibited stark performance drops specifically on LeetCode problems released post-September 2023, definitively exposing that their high scores on older benchmarks were the result of training set leakage rather than generalized coding proficiency 2830.
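The bookkeeping behind such an analysis is simple, as the sketch below illustrates (problem names, dates, outcomes, and the cutoff are all hypothetical):

```python
from dataclasses import dataclass
from datetime import date

@dataclass
class Problem:
    slug: str
    released: date   # contest publication date, annotated by the benchmark
    solved: bool

CUTOFF = date(2023, 9, 1)   # the evaluated model's training-data cutoff
results = [
    Problem("two-sum-variant", date(2022, 5, 1), True),
    Problem("string-hashing-ii", date(2023, 3, 20), True),
    Problem("graph-paths-iv", date(2023, 11, 12), False),
    Problem("interval-dp-redux", date(2024, 2, 3), False),
]

def pass_rate(problems):
    return sum(p.solved for p in problems) / max(len(problems), 1)

pre = [p for p in results if p.released <= CUTOFF]    # possibly contaminated
post = [p for p in results if p.released > CUTOFF]    # contamination-free slice
print(f"pre-cutoff pass rate:  {pass_rate(pre):.0%}")
print(f"post-cutoff pass rate: {pass_rate(post):.0%}")
```

A large pre/post gap for a given model is the signature of leakage; a model whose pass rate is stable across the cutoff is more plausibly generalizing.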
Cultural Homogenization and Linguistic Disparities
The Goodhart problem in AI evaluation is exacerbated by severe representation biases within the benchmarks themselves. When benchmarks heavily over-index on specific cultural contexts, optimization for those benchmarks drives the homogenization of global AI systems, creating models that perform well on tests but fail entirely in diverse global applications.
Western Centricity in Base Datasets
Major AI systems are trained on datasets that dramatically overrepresent English-language content from North America and Europe. Common Crawl, the backbone corpus for many large language models, contains roughly 60% English content, despite English being the native language of only 5% of the global population 31. This immense data imbalance creates AI systems that understand Western contexts deeply while struggling to map other cultural frameworks.
When Western-centric benchmarks are simply translated to test multilingual capabilities, they suffer profound degradation. Empirical research indicates that localized benchmarks demonstrate significantly higher alignment with human judgments (0.68 correlation) than translated counterparts (0.47 correlation) 48. Translated benchmarks frequently fail to capture regional knowledge, idioms, and cultural structures. Consequently, traditional natural language processing tasks show very weak correlations (0.11 to 0.30) with human judgment when evaluated in languages like Chinese or French, compared to the stronger correlations seen in purely mathematical STEM tasks, which are relatively language-agnostic 48.
Domain-Specific Assessment Failures
This Western skew actively degrades performance in culture-specific professional domains. An evaluation of leading LLMs on the National Medical Licensing Examination for Traditional Chinese Medicine (TCM) revealed a massive divergence in capability 32. Western-developed LLMs - highly optimized to pass Western medical benchmarks like the USMLE - achieved an average accuracy of only 35.9%, universally failing the TCM exam 32. In contrast, models trained on extensive Chinese corpora achieved an average accuracy of 78.4%, with leading regional models scoring up to 86.4% 32.
Furthermore, the CAMeL benchmark, designed to measure cultural appropriateness across 628 naturally occurring prompts and thousands of entities, demonstrates that even when prompted in Arabic, major Western LLMs exhibit profound biases 3334. Models systematically associate Arabic entities with negative sentiment and default to Western cultural paradigms for suggestions regarding food, social structures, and aesthetics - recommending whiskey, ravioli, or Western names even in Arabic-language contexts 3334. By optimizing almost exclusively for Western-curated metrics, the industry is inadvertently fostering a form of algorithmic cultural colonialism, where high benchmark scores mask an inability to operate effectively in global, non-Western paradigms 31.
Safety Frameworks and Capability Threshold Design
As governments, academic institutions, and corporate enterprises increasingly recognize the catastrophic risks posed by advanced models, benchmarking has transitioned from a measure of commercial superiority to a mechanism for regulatory and safety compliance. The industry's leading research labs rely on specific capability evaluations to trigger mitigation protocols under their respective capability-scaling policies.
Capability Scaling and Risk Calibration
Anthropic pioneered this approach with its Responsible Scaling Policy (RSP), structured around AI Safety Levels (ASL) analogous to biological safety levels 3536. Under the RSP, current frontier models operate at ASL-2. Reaching the capability thresholds for ASL-3 triggers profound upgrades in cybersecurity standards - such as protection against non-state attackers attempting to steal model weights - and strict deployment safeguards 363738. The proposed ASL-4 threshold dictates protections against state-level threat actors and focuses on mitigating capabilities related to autonomous replication and the facilitation of Chemical, Biological, Radiological, and Nuclear (CBRN) weapons 36373940.
However, the specific benchmark thresholds used to trigger these safety levels face intense external scrutiny. Critics argue that the capability thresholds required to trigger ASL-4 protections are set dangerously high 58. For instance, requiring a model to allow entry-level biologists to approximate the capabilities of state-backed bioweapons teams before implementing ASL-4 mitigations effectively neutralizes the policy's preventative value, as the damage would already be possible before the safety measures are deployed 5841. Policy experts recommend lowering these risk thresholds significantly, tying them to standard acceptable "societal risk" tolerances seen in other high-risk industries 41.
Industry Implementations of Safety Levels
In response to evolving threats, OpenAI updated its Preparedness Framework (v2) to map capabilities directly to operational commitments across two clear thresholds: "High" and "Critical" capabilities 424362. The framework covers Tracked Categories - including Biological and Chemical, Cybersecurity, and AI Self-improvement capabilities - and introduces future-facing Research Categories such as Long-range Autonomy, Sandbagging (intentionally underperforming on evaluations), and Undermining Safeguards 42434464. A model reaching a High capability threshold cannot be externally deployed without safeguards that sufficiently minimize risk, while a Critical threshold restricts even internal R&D deployment 4244. Notably, OpenAI commits to running these safety tests at every 2x increase in effective compute - a vital improvement over wider intervals, as emergent capabilities like in-context learning often alter risk profiles dramatically with minimal compute scaling 4566.
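The compute-triggered cadence is worth making concrete. The sketch below computes the re-evaluation checkpoints for a hypothetical training run (the FLOP figures are invented, and "effective compute" in practice also folds in algorithmic efficiency gains):

```python
import math

def eval_checkpoints(start_flops: float, end_flops: float) -> list:
    """Points at which safety evaluations re-run: every 2x of effective compute."""
    points, c = [], 2.0 ** math.ceil(math.log2(start_flops))
    while c <= end_flops:
        points.append(c)
        c *= 2.0
    return points

# A run scaling from 1e25 to 1e26 FLOPs crosses three doubling boundaries.
print([f"{p:.2e}" for p in eval_checkpoints(1e25, 1e26)])
```

Tighter spacing matters precisely because capability emergence is nonlinear in compute: a single skipped doubling can contain the jump that crosses a High or Critical threshold.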
Concurrently, Google DeepMind's Frontier Safety Framework (FSF 3.1) establishes Critical Capability Levels (CCLs) and Tracked Capability Levels (TCLs) to evaluate risks spanning cybersecurity, biosecurity, and machine learning R&D 464748. The 2026 updates to the framework expanded risk modeling to specifically address Harmful Manipulation - models misusing capabilities to systematically change beliefs in high-stakes contexts - and Misalignment, evaluating scenarios where a model might actively interfere with operators' attempts to shut down its operations 47.
| Enterprise Safety Framework | Leading Organization | Primary Risk Categories Monitored | Key Capability Threshold Metrics | Mitigation Triggers |
|---|---|---|---|---|
| Frontier Safety Framework (FSF 3.1) | Google DeepMind | Autonomy, Biosecurity, Cyber, R&D, Manipulation, Misalignment | Tracked Capability Levels (TCL), Critical Capability Levels (CCL) | Weight exfiltration security; tailored deployment restrictions based on CCL proximity. |
| Responsible Scaling Policy (RSP) | Anthropic | CBRN, Autonomous Replication, Cyber Threats | ASL-1 through ASL-4+ | ASL-3 triggers high-security deployment; ASL-4 triggers state-level defense and containment. |
| Preparedness Framework v2 | OpenAI | Bio/Chem, Cyber, AI Self-Improvement | High Capability, Critical Capability | High threshold halts external release; Critical halts internal R&D without safeguards. |
| AILuminate | MLCommons | 12 Defined Hazard Categories | 5-point graded scale across 24,000+ prompts | Identifies enterprise compliance gaps prior to commercial deployment. |
The Frontier of Expert-Level and Agentic Evaluation
The realization that conventional static benchmarks are saturated, contaminated, and deeply vulnerable to Goodhart's Law has catalyzed the development of next-generation evaluation suites. The focus has decisively shifted from measuring static knowledge retrieval to assessing dynamic, multi-step, agentic reasoning on tasks designed to remain unsolved for years.
Humanity's Last Exam
To address the collapse of the MMLU and GSM8K evaluation paradigm, researchers from the Center for AI Safety (CAIS) and Scale AI introduced "Humanity's Last Exam" (HLE) in early 2025 44950. Comprising 2,500 closed-ended, graduate-level questions sourced globally across over 100 highly specialized expert domains, HLE was designed specifically to push LLMs far beyond current boundaries 550.
While human domain experts achieve approximately 90% accuracy on HLE within their respective specialties, frontier models upon the benchmark's release in early 2025 struggled to break single digits, with models like GPT-4o managing 2.7% and Claude 3.5 Sonnet reaching 4.1% 50. By early 2026, intensive test-time compute scaling and architectural improvements allowed the best models to improve substantially. Gemini 3.1 Pro Preview reached scores between 37.5% and 44.7%, while Claude Opus 4.6 achieved 34.4% (and up to 53.1% when granted advanced tool access) 455072. Despite these gains, a massive gap remains between AI and true human expert reasoning. The stark delta between >90% scores on saturated benchmarks and ~40% scores on HLE vividly exposes the shallowness of AI "knowledge" when models are stripped of pattern-matching advantages and forced into genuine, novel problem-solving 50.
Future Agentic Assessment Paradigms
As models become increasingly agentic, evaluation methodologies must move beyond the multiple-choice format entirely. Frameworks like ARC-AGI-2 (testing abstract, visual-spatial reasoning), SWE-bench Pro (testing autonomous, repository-level software engineering), and OSWorld (testing autonomous computer use) are emerging as the definitive markers of functional capability 1272. For example, AI performance on ARC-AGI rose from 20% in 2020 to 75.7% by late 2024 through the integration of compute-heavy reasoning algorithms 12.
Ultimately, escaping the Goodhart cycle requires the AI community to abandon the pursuit of single, universal leaderboard metrics. Accurate measurement of artificial intelligence demands continuous, dynamic, and domain-specific evaluations that cannot be trivially gamed by formatting optimization, logit manipulation, or the selective disclosure of hidden test variants.