
Humanity's Last Exam

Bioeffects Seen

Wang X, Zhao X, Xu J, Li M, Sun B, Gao A, Zhang L, Wu S, Liu X, Zou D, Li Z, Dong G, Zhang C, Wang C · 2025


Advanced AI systems perform poorly on expert-level academic tests, highlighting their limitations in complex scientific analysis, including EMF health research.

Plain English Summary

Summary written for general audiences

Researchers created Humanity's Last Exam (HLE), a challenging new benchmark of 2,500 expert-level questions across dozens of subjects, to test advanced AI systems. Current state-of-the-art AI models performed poorly on these difficult academic questions, revealing significant gaps between AI capabilities and human expert knowledge. The benchmark offers a more accurate measure of AI limitations than existing tests, on which AI now scores over 90%.

Why This Matters

While this study focuses on AI benchmarking rather than EMF health effects directly, it highlights a critical issue for EMF research: the growing reliance on AI systems to analyze complex scientific data and make health assessments. Current AI models struggle with expert-level knowledge across scientific domains, which should give us pause when considering AI-generated health advice or risk assessments about EMF exposure. Even the most advanced AI systems have significant knowledge gaps, particularly in specialized fields like bioelectromagnetics, where a nuanced understanding of biological mechanisms is essential. For you, this means human expertise remains irreplaceable in evaluating EMF health research, interpreting study limitations, and making informed decisions about exposure reduction strategies.

Exposure Information

Not applicable. This study evaluates AI systems on academic questions; no EMF exposure was involved or quantified.

Cite This Study
Wang X, Zhao X, Xu J, Li M, Sun B, Gao A, Zhang L, Wu S, Liu X, Zou D, Li Z, Dong G, Zhang C, Wang C (2025). Humanity's Last Exam. doi:10.1038/s41586-025-09962-4
BibTeX
@article{wang2025humanityslastexam,
  author = {Wang, X. and Zhao, X. and Xu, J. and Li, M. and Sun, B. and Gao, A. and Zhang, L. and Wu, S. and Liu, X. and Zou, D. and Li, Z. and Dong, G. and Zhang, C. and Wang, C.},
  title = {Humanity's Last Exam},
  year = {2025},
  doi = {10.1038/s41586-025-09962-4}
}

Quick Questions About This Study

What is Humanity's Last Exam?
Humanity's Last Exam (HLE) contains 2,500 questions spanning dozens of academic subjects, including mathematics, the humanities, and the natural sciences. Each question requires expert-level knowledge and has an unambiguous, verifiable answer that cannot be quickly found through internet searches.

How is HLE different from existing benchmarks?
Unlike existing benchmarks, on which AI achieves over 90% accuracy, HLE is designed at the frontier of human expert knowledge. It uses multi-modal questions requiring deep understanding rather than pattern recognition, revealing significant gaps in current AI capabilities.

How did AI models perform?
State-of-the-art large language models demonstrated low accuracy and poor calibration on HLE questions (a rough sketch of these two metrics appears after these questions). This poor performance highlights a marked gap between current AI capabilities and expert human knowledge in closed-ended academic domains.

Why was a new benchmark needed?
Popular AI benchmarks have become too easy, with models achieving over 90% accuracy on tests like MMLU (Massive Multitask Language Understanding). HLE was created to provide a meaningful measure of cutting-edge AI capabilities and limitations.

Where can the benchmark be accessed?
The complete Humanity's Last Exam benchmark is publicly available at https://lastexam.ai. Researchers and policymakers can use this resource to better understand current AI model capabilities and inform development decisions.
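
The "poor calibration" result above refers to a mismatch between a model's stated confidence and its actual accuracy. As a rough illustration only, here is a minimal Python sketch of the two metrics, assuming each graded response is a record with a boolean correct flag and a stated confidence between 0 and 1; the record format and the 10-bin RMS scheme are illustrative assumptions, not the paper's official grading pipeline.

# Minimal sketch (illustrative only): accuracy and RMS calibration error
# for graded benchmark responses. The record format and the 10-bin scheme
# are assumptions, not the official HLE grading pipeline.

def accuracy(results):
    # Fraction of responses graded correct.
    return sum(r["correct"] for r in results) / len(results)

def rms_calibration_error(results, n_bins=10):
    # Bin responses by the model's stated confidence (0-1), then compare
    # each bin's mean confidence with its empirical accuracy.
    bins = [[] for _ in range(n_bins)]
    for r in results:
        idx = min(int(r["confidence"] * n_bins), n_bins - 1)
        bins[idx].append(r)
    total, weighted_sq_err = len(results), 0.0
    for b in bins:
        if not b:
            continue
        mean_conf = sum(r["confidence"] for r in b) / len(b)
        bin_acc = sum(r["correct"] for r in b) / len(b)
        weighted_sq_err += (len(b) / total) * (mean_conf - bin_acc) ** 2
    return weighted_sq_err ** 0.5

# An overconfident model: high stated confidence, mostly wrong answers.
results = [
    {"correct": False, "confidence": 0.95},
    {"correct": False, "confidence": 0.90},
    {"correct": True, "confidence": 0.85},
    {"correct": False, "confidence": 0.80},
]
print(f"accuracy: {accuracy(results):.2f}")                            # 0.25
print(f"RMS calibration error: {rms_calibration_error(results):.2f}")  # ~0.69

A well-calibrated model that answers correctly only 25% of the time would report confidences near 0.25, driving this error toward zero; the gap reported in the study means models were both often wrong and confident while being wrong.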