Technology / Wed, 27 May 2026 orfonline.org

The Geopolitics of AI Benchmarks

Modern AI benchmarks are sets of standardised metrics used to validate model performance on various parameters. For example, SuperGLUE is a benchmark used to test a model’s Natural Language Understanding (NLU), i.e., how well it understands context, intent, and entities in a conversation. Similarly, LMArena (formerly Chatbot Arena) is a platform where humans vote on their preferred AI model responses, with the votes aggregated into Elo-style ratings. Other examples include Massive Multitask Language Understanding (MMLU), which evaluates general knowledge capabilities across various subjects and levels.
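
To make the idea of an Elo-style leaderboard concrete, the following is a minimal sketch of the classic Elo update applied to pairwise human votes. It is not LMArena's actual pipeline: the model names, the starting rating of 1000, and the K-factor of 32 are illustrative assumptions.

```python
def expected_score(rating_a: float, rating_b: float) -> float:
    """Probability that A is preferred over B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def update_elo(ratings: dict, winner: str, loser: str, k: float = 32.0) -> None:
    """Shift rating points from loser to winner after one human vote.

    k (the K-factor) is an assumed value; it controls how far one vote
    moves the ratings.
    """
    surprise = 1.0 - expected_score(ratings[winner], ratings[loser])
    ratings[winner] += k * surprise
    ratings[loser] -= k * surprise


# Hypothetical models and votes, listed as (preferred, rejected) pairs.
ratings = {"model_a": 1000.0, "model_b": 1000.0}
votes = [("model_a", "model_b"), ("model_a", "model_b"), ("model_b", "model_a")]
for winner, loser in votes:
    update_elo(ratings, winner, loser)

print(ratings)
```

Because each update only moves points from one model to the other, a leaderboard built this way is purely relative: a rating reflects who was preferred over whom by the people who voted, not an absolute measure of capability.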

Today, most frontier benchmark leaderboards are dominated by American models developed by firms such as OpenAI, Anthropic, Google, and Meta. However, several open-weight Chinese models are emerging as competitive alternatives. Further, new benchmarks continue to emerge as models saturate existing ones, in some cases surpassing ‘human-level’ performance. These strong results are being met with record investment, with about US$285.9 billion invested by United States (US) industry in 2026.

The Trump administration’s “America’s AI Action Plan” explicitly frames AI as a geopolitical race in which the United States must achieve global dominance in artificial intelligence, while emphasising the need to build an AI evaluation ecosystem. China has also emphasised the desire to embed Chinese perspectives, data, and values into its open-source AI systems. This implies that such models are likely to be judged on China-specific benchmarks, which could export the ideologies and biases embedded in them to broader Asian, Latin American, and African markets.

The Role of Benchmarks in Standardising AI Evaluation

By standardising evaluation criteria, competing models can be assessed on similar grounds and shared baseline assumptions, allowing for greater transparency, especially in developer communities. Further, performance on these open-source benchmarks increasingly drives downstream decision-making and even national strategic planning. Many governments have adopted national strategies for artificial intelligence (AI) after assessing AI capabilities across different use cases, especially considering the implications for national security.

It is thus in the broader interest of all stakeholders to improve and track benchmark performance and devise new methodologies to ensure a strategic advantage. Reports show that benchmark performance is now closely tied to perceptions of frontier leadership among governments and industry actors. Similarly, benchmarks such as MMLU, HumanEval, and SWE-bench are increasingly cited in product launches and technical reports by leading firms. However, some of these benchmarks are criticised for being compromised by contamination (when the training data for AI models inadvertently contains benchmark questions) and memorisation (when models reproduce previously seen answers rather than reasoning towards them).
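
One common way to screen for contamination is to check how many word sequences (n-grams) from a benchmark question also appear verbatim in a training document. The sketch below illustrates that idea under simplifying assumptions: the function names, the whitespace tokenisation, the 5-word window, and the sample texts are all hypothetical, and real decontamination pipelines operate over far larger corpora with more careful text normalisation.

```python
def ngrams(text: str, n: int) -> set:
    """Return the set of lowercase word n-grams in a text."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def overlap_ratio(benchmark_item: str, training_doc: str, n: int = 5) -> float:
    """Share of the benchmark item's n-grams found verbatim in the document.

    A value near 1.0 suggests the item was leaked into the training data;
    a value near 0.0 suggests no verbatim overlap.
    """
    item_grams = ngrams(benchmark_item, n)
    if not item_grams:
        return 0.0  # item is shorter than n words, nothing to compare
    return len(item_grams & ngrams(training_doc, n)) / len(item_grams)


# Hypothetical benchmark question and two hypothetical training documents.
question = "which planet in the solar system has the largest number of known moons"
leaked_doc = "quiz dump: " + question + " answer: saturn"
clean_doc = "the solar system formed about 4.6 billion years ago from a cloud of gas"

print(overlap_ratio(question, leaked_doc))
print(overlap_ratio(question, clean_doc))
```

A check like this only catches verbatim leakage. Paraphrased or translated benchmark questions would pass undetected, which is one reason contamination is difficult to rule out from the outside.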

Benchmarks are also widely used in regulatory contexts to assess a model’s safety, ethics, and reliability. For example, the EU AI Act requires providers of general-purpose AI models that pose systemic risk to conduct adversarial testing and to report serious incidents, among other obligations. Adversarial testing is especially important since it tests how vulnerable models are and how easily they can be manipulated. An attacker might use different techniques to extract sensitive user information from models, such as healthcare records or preferences, depending on the data they were trained on. This ecosystem of AI evaluation also directly informs discourse at the governmental level. For example, India’s AI governance guidelines, which have emphasised safe and trusted AI, emerged after recognising the potential harm to vulnerable groups such as children, along with social and gender minorities.

Current safety benchmarks often reduce complex questions of fairness, risk, and discrimination to fixed metrics, but these cannot guarantee long-term safety or security. Moreover, they are increasingly designed for short-term compliance rather than sustained risk management. A real-world example is COMPL-AI, a benchmarking framework that translates broad legal obligations into practical technical tests for AI models. It assesses whether dominant models meet baseline safety and fairness requirements.

Global Tools of Strategic Influence

The dominance of American AI standards and evaluation practices has made them the default template for international adoption. NIST’s AI Risk Management Framework is explicitly designed to align with international standards, and global policy debates increasingly treat standards development as a central component of AI governance. Further, AI safety institutes (AISIs) have also been established in multiple countries, indicating that countries are prioritising national interests and security, with evaluation systems and safety frameworks playing a central role.

Yet without stronger domestic or regional governance frameworks, many countries may rely on evaluation systems shaped by US-based firms, universities, and standards bodies. Research shows that widely used LLMs often reflect Western or English-speaking cultural values, while India-specific studies find caste and religious stereotypes that are poorly captured by conventional Western fairness benchmarks. The result is that imported evaluation frameworks may also export their vulnerabilities and entrenched biases.

While most leading American frontier models remain largely closed-source, the benchmarks used to evaluate them are publicly accessible. Frontier firms do not release full model weights, training data, or reproducible training pipelines for their most capable systems. Thus, most external researchers and regulators in other countries must rely on company disclosures or third-party audits rather than full technical inspection. In practice, this allows American firms to maintain influence over what capabilities are prioritised and measured while preserving competitive advantages.

This environment raises the question of evaluation sovereignty. Countries have begun to discuss data and AI sovereignty, but little attention has been paid to who defines the tests through which AI systems are judged. Evaluation sovereignty means ensuring that domestic institutions can design, audit, and validate benchmarks that reflect local languages, legal norms, social and security risks, and development priorities. To achieve this, robust compliance frameworks, safety metrics, and independent auditing mechanisms should be developed over the long term.

It is important to understand that auditing is a continuous process to ensure compliance with governance frameworks, while benchmarking is currently a series of specific metrics used to evaluate performance before widespread deployment. Further, because of leaderboard gaming, where companies train and adjust models to perform better on popular benchmarks rather than in real-world deployment, updating benchmarks and improving standardisation is increasingly important.

Figure 1: Current issues of benchmarking practices

Source: European Commission Joint Research Centre (JRC)

Problems with Current Benchmarks

Current AI benchmarking remains fragmented and often does not provide users with a complete picture. Most benchmarks are isolated, task-specific tests, each with its own metric, making cross-model comparison difficult and sometimes giving a false sense of objectivity. Moreover, due to the increased crossover between academic AI research and frontier AI firms, the boundary between independent evaluation and industry strategy is becoming increasingly porous. A Stanford report found that nearly 90 percent of notable AI models in 2024 came from industry. Benchmark use is also highly uneven across model releases. For example, 63.2 percent of highlighted benchmarks are used by only one model builder, suggesting that many benchmarks do not become shared standards across the field. Instead, companies often select and describe benchmarks in ways that support their own performance narrative.

There is also vagueness around what benchmarks actually measure. High scores on general knowledge tests or coding benchmarks may not always mean a model can reason well or work safely in real-world settings. They may simply show that the model is good at recognising patterns or has seen similar test questions before. Further, there are reports suggesting cherry-picking of certain benchmarks and inflated performance claims. A real-world example is Meta’s Llama 4 launch, where the company was criticised after submitting a specially optimised version of Llama 4 Maverick to LMArena/Chatbot Arena, while the publicly released model reportedly performed worse than the version advertised on the leaderboard.

Benchmark scores rarely explain what performance means in real-world settings, where models interact with changing users, institutions, incentives, and socio-cultural contexts. Diversity is another structural problem: benchmark design is concentrated among researchers at elite universities and firms, raising concerns about whose languages, values, and use cases define the boundaries. Finally, dominant benchmarks generally use static task formats, whereas real human-AI interaction is dynamic, context-dependent, and influenced by multiple factors.

There is a disciplinary imbalance in how AI evaluation is designed. Most dominant benchmarks are built by computer scientists and machine-learning researchers, with far less systematic involvement from the social sciences. One example is Weinberg’s “Rethinking Fairness”, which argues that dominant machine-learning fairness approaches often reduce fairness to narrow mathematical metrics, while neglecting broader questions of historical injustice. Recent work therefore calls for more holistic, interactive, and deployment-sensitive evaluation ecosystems rather than narrow leaderboards that can be manipulated.

The Way Forward

In light of the growing intersection of AI safety concerns, strategic competitiveness, and the need for inclusive adoption practices, benchmarks are powerful determinants that directly and indirectly shape policy and investment. It is therefore important that these systems are not dominated by discourse from particular geopolitical blocs. Furthermore, robust independent bodies must come forward to democratise this evaluation process, ensuring that countries embarking on their AI journeys are not compelled to adopt pre-existing dominant evaluation methods. A pioneering set of guidelines—periodically updated across both industry and academia and capable of casting a critical eye on frontier models—is crucial.

Ishita Deshmukh is a Research Intern at Observer Research Foundation.

Acknowledgement

The author acknowledges the use of ChatGPT 5.5 for sourcing links to two references. It was also used for language refinement and minor editorial assistance in select sections.

The views expressed above belong to the author(s).