Note: This study found no significant biological effects under its experimental conditions. We include all studies for scientific completeness.
Whole Body / General · 311 citations
Wang X, Zhao X, Xu J, Li M, Sun B, Gao A, Zhang L, Wu S, Liu X, Zou D, Li Z, Dong G, Zhang C, Wang C
No Effects Found
Authors not listed · 2025
AI systems show major knowledge gaps on expert-level questions, raising concerns about the reliability of AI-generated health information.
Plain English Summary
Summary written for general audiences
This study introduces a new academic benchmark called 'Humanity's Last Exam' designed to test advanced AI language models on expert-level questions across multiple subjects. The researchers found that current state-of-the-art AI systems perform poorly on these challenging questions, revealing significant gaps between AI capabilities and human expert knowledge.
Cite This Study
Unknown (2025). Wang X, Zhao X, Xu J, Li M, Sun B, Gao A, Zhang L, Wu S, Liu X, Zou D, Li Z, Dong G, Zhang C, Wang C.
Show BibTeX
@article{wang_x_zhao_x_xu_j_li_m_sun_b_gao_a_zhang_l_wu_s_liu_x_zou_d_li_z_dong_g_zhang_c_wang_c_ce3557,
author = {Unknown},
title = {Wang X, Zhao X, Xu J, Li M, Sun B, Gao A, Zhang L, Wu S, Liu X, Zou D, Li Z, Dong G, Zhang C, Wang C},
year = {2025},
doi = {10.1038/s41586-025-09962-4},
}
Quick Questions About This Study
What does this study test?
It tests AI language models on 2,500 expert-level academic questions across mathematics, humanities, and the natural sciences that cannot be quickly answered through internet searches but have clear, verifiable solutions.
What were the key findings?
State-of-the-art AI models demonstrate low accuracy and poor calibration on the benchmark, showing significant gaps between current AI capabilities and expert human knowledge.
Why does this matter for health research?
Poor AI performance on complex topics raises concerns about the reliability of AI-generated health information, especially for nuanced subjects like EMF research where accurate interpretation is crucial.
How is this different from other AI benchmarks?
Unlike popular benchmarks where AI achieves over 90% accuracy, this expert-level benchmark reveals true capability gaps through questions requiring deep subject-matter expertise rather than pattern recognition.
Is this relevant to how AI handles scientific information?
Yes, by testing AI performance on expert-level academic content, it provides insight into whether AI systems can reliably process and interpret complex scientific information across disciplines.
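The "low accuracy and poor calibration" finding mentioned above refers to two standard benchmark metrics: the fraction of questions answered correctly, and the gap between a model's stated confidence and its actual accuracy. As an illustration only (the data and function names below are hypothetical sketches, not taken from the study), both can be computed like this:

```python
# Sketch of two standard benchmark metrics: accuracy and
# expected calibration error (ECE). All data here is hypothetical.

def accuracy(results):
    """Fraction of correct answers. results: list of (confidence, is_correct)."""
    return sum(correct for _, correct in results) / len(results)

def expected_calibration_error(results, n_bins=10):
    """Average gap between stated confidence and actual accuracy,
    computed per confidence bin and weighted by bin size."""
    bins = [[] for _ in range(n_bins)]
    for conf, correct in results:
        idx = min(int(conf * n_bins), n_bins - 1)  # clamp conf=1.0 into last bin
        bins[idx].append((conf, correct))
    ece = 0.0
    for b in bins:
        if not b:
            continue
        avg_conf = sum(c for c, _ in b) / len(b)
        avg_acc = sum(x for _, x in b) / len(b)
        ece += (len(b) / len(results)) * abs(avg_conf - avg_acc)
    return ece

# Hypothetical model outputs: (stated confidence, answered correctly?)
# High confidence paired with wrong answers yields a large ECE,
# which is what "poor calibration" means.
results = [(0.95, 0), (0.90, 0), (0.80, 1), (0.99, 0), (0.60, 1)]
print(accuracy(results))                                   # 0.4
print(round(expected_calibration_error(results), 3))       # 0.688
```

A well-calibrated model that is 90% confident should be right about 90% of the time; the benchmark's finding is that current models report high confidence even when wrong, which is precisely the failure mode that makes AI-generated health answers hard to trust.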