Artificial intelligence researchers have discovered a troubling blind spot: the tests they use to measure AI progress are becoming obsolete. Popular benchmarks like MMLU (Massive Multitask Language Understanding) now see state-of-the-art AI models achieving over 90% accuracy, making it nearly impossible to distinguish between genuinely advanced systems and those that have simply memorized their way to high scores. This saturation has created a measurement crisis that threatens to obscure how close, or how far, AI systems actually are from human expert-level reasoning.

To address this gap, researchers have introduced Humanity's Last Exam (HLE), a new benchmark designed to measure AI capabilities at the frontier of human knowledge. The benchmark consists of 2,500 challenging questions spanning over a hundred academic subjects, developed collaboratively by nearly 1,000 subject matter experts affiliated with more than 500 institutions across 50 countries. Unlike existing benchmarks, HLE is specifically engineered to resist the shortcuts that allow AI models to game their scores.

Why Are Current AI Benchmarks Becoming Useless?

The problem is straightforward: when AI models achieve 90% or higher accuracy on a benchmark, researchers lose the ability to measure meaningful differences between systems. It's like trying to rank students when everyone scores 95% on an exam. The test no longer tells you who actually understands the material and who has merely found a way to game it. This saturation affects not just research progress but also policymaking, since governments and organizations need accurate measures of AI capabilities to make informed decisions about deployment and regulation.

HLE was designed to solve this problem by creating questions that cannot be quickly answered through internet searches or database lookups. Each question has been vetted by domain experts to ensure it requires genuine reasoning rather than pattern matching. The benchmark includes multiple question formats to prevent gaming: exact-match questions, where models must provide a precise answer, and multiple-choice questions with five or more options (a minimal scoring sketch follows the feature list below).

How Does HLE Test AI Differently Than Existing Benchmarks?

- Expert-Level Difficulty: Questions typically require graduate-level expertise or test knowledge of highly specific topics, such as precise historical details, specialized terminology, or local customs that cannot be easily retrieved from common sources.
- Multi-Modal Format: Approximately 14% of questions require AI models to understand both text and images together, testing whether systems can integrate information across different types of input.
- Rigorous Quality Control: Each question undergoes a multi-stage review process, including initial feedback from graduate-level reviewers and approval from expert organizers, ensuring questions are unambiguous and have verifiable answers.
- Emphasis on Deep Reasoning: The benchmark emphasizes world-class mathematics problems designed to test deep reasoning skills that apply across multiple academic areas, rather than surface-level knowledge recall.
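The two answer formats mentioned above lend themselves to automated checking. The sketch below is a minimal Python illustration with hypothetical helper names, not the benchmark's actual grading pipeline; it only shows the idea of comparing a model's response against a single verifiable reference answer.

```python
import re


def normalize(text: str) -> str:
    """Lowercase, trim, and collapse whitespace so trivial formatting
    differences are not counted as wrong answers."""
    return re.sub(r"\s+", " ", text.strip().lower())


def score_exact_match(model_answer: str, reference_answer: str) -> bool:
    """Exact-match item: the model's final answer must equal the
    reference answer after light normalization."""
    return normalize(model_answer) == normalize(reference_answer)


def score_multiple_choice(model_choice: str, correct_choice: str) -> bool:
    """Multiple-choice item: compare the selected option letter
    (e.g. 'C') against the answer key."""
    return normalize(model_choice) == normalize(correct_choice)


# Hypothetical usage on toy items
print(score_exact_match("  42 ", "42"))   # True
print(score_multiple_choice("c", "C"))    # True
print(score_multiple_choice("B", "C"))    # False
```

In practice, grading free-form scientific answers is harder than this sketch suggests, which is one reason each HLE question is reviewed to be unambiguous and to have a verifiable answer.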
The results from testing frontier AI models on HLE reveal a stark reality: these systems perform far worse on genuinely difficult questions than their high scores on older benchmarks suggest. State-of-the-art large language models (LLMs), AI systems trained on vast amounts of text to understand and generate human language, uniformly demonstrate low accuracy on HLE, highlighting a marked gap between current capabilities and expert-level academic performance.

What Do the Results Actually Show About AI Capabilities?

Beyond simply scoring lower, the HLE results reveal something more troubling about current AI systems: they often provide incorrect answers with high confidence rather than acknowledging uncertainty. Most models exhibit root mean square (RMS) calibration errors above 70%, meaning they are frequently wrong while appearing certain; a sketch of how this metric is computed appears at the end of this article. This is particularly dangerous in real-world applications where users rely on AI systems to know the limits of their own knowledge.

The benchmark's public release includes the 2,500 questions, while a private held-out test set is maintained to prevent models from being trained specifically to perform well on HLE. This approach ensures that the benchmark will remain useful for measuring progress over time, even as AI systems improve. The researchers have made HLE publicly available at lastexam.ai so the research community can use it as a common reference point for assessing AI capabilities.

The creation of HLE represents a significant shift in how the AI research community approaches measurement. Rather than accepting benchmarks that have become too easy, researchers are building tools that can scale with AI progress. This matters because, as AI systems approach human expert performance in many domains, precise measurement of their capabilities and limitations becomes essential for informing research directions, governance decisions, and public understanding of what these systems can actually do.

"To inform research and policymaking upon a clear understanding of model capabilities, we publicly release HLE at https://lastexam.ai," write the researchers behind Humanity's Last Exam in Nature.

The implications extend beyond academic research. Companies developing AI systems, regulators evaluating their safety, and organizations deciding whether to deploy these tools all need accurate measures of what AI can and cannot do. HLE provides a more honest assessment than existing benchmarks, revealing that despite impressive performance on older tests, current AI systems still fall significantly short of expert-level reasoning on genuinely difficult, closed-ended academic questions.
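For readers who want to see what the calibration figure quoted above measures, here is a minimal sketch of a binned RMS calibration error in Python. It is an illustrative implementation of the standard binned definition under assumed inputs (per-question confidences and correctness flags), not the exact evaluation code used for HLE.

```python
import numpy as np


def rms_calibration_error(confidences, correct, n_bins: int = 10) -> float:
    """Binned root mean square (RMS) calibration error.

    confidences: model-reported probability that each answer is correct (0 to 1)
    correct:     1 if the answer was actually correct, else 0

    Predictions are grouped into equal-width confidence bins; within each bin
    the gap between average confidence and actual accuracy is taken, and the
    gaps are combined as a bin-size-weighted root mean square.
    """
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    total, n = 0.0, len(confidences)
    for i, (lo, hi) in enumerate(zip(edges[:-1], edges[1:])):
        # Include the right edge only in the last bin.
        if i == n_bins - 1:
            mask = (confidences >= lo) & (confidences <= hi)
        else:
            mask = (confidences >= lo) & (confidences < hi)
        if mask.any():
            gap = confidences[mask].mean() - correct[mask].mean()
            total += (mask.sum() / n) * gap ** 2
    return float(np.sqrt(total))


# Hypothetical example: a model that answers with high confidence but is usually wrong.
conf = [0.95, 0.90, 0.92, 0.97, 0.88]
hits = [0, 1, 0, 0, 0]
print(f"RMS calibration error: {rms_calibration_error(conf, hits):.2f}")  # about 0.73
```

A well-calibrated model run through the same function would produce an error near zero, because its stated confidence would track how often it is actually right.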