
Humanity's Last Exam

Bioeffects Seen

Wang X, Zhao X, Xu J, Li M, Sun B, Gao A, Zhang L, Wu S, Liu X, Zou D, Li Z, Dong G, Zhang C, Wang C · 2025


Advanced AI systems perform poorly on expert-level academic tests, highlighting their limitations in complex scientific analysis, including EMF health research.

Plain English Summary

Summary written for general audiences

Researchers created Humanity's Last Exam (HLE), a challenging new benchmark of 2,500 expert-level questions across dozens of subjects, to test advanced AI systems. Current state-of-the-art AI models performed poorly on these difficult academic questions, revealing significant gaps between AI capabilities and human expert knowledge. The benchmark offers a more accurate measure of AI limitations than existing tests, on which AI now scores over 90%.

Why This Matters

While this study focuses on AI benchmarking rather than EMF health effects directly, it highlights a critical issue for EMF research: the growing reliance on AI systems to analyze complex scientific data and make health assessments. Current AI models struggle with expert-level knowledge across scientific domains, which should give us pause when considering AI-generated health advice or risk assessments about EMF exposure. Even the most advanced AI systems have significant knowledge gaps, particularly in specialized fields like bioelectromagnetics, where a nuanced understanding of biological mechanisms is essential. For you, this means human expertise remains irreplaceable in evaluating EMF health research, interpreting study limitations, and making informed decisions about exposure reduction strategies.

Exposure Information

Not applicable. This study evaluates AI systems on academic questions; no EMF exposure was involved or quantified.

Cite This Study
Wang X, Zhao X, Xu J, Li M, Sun B, Gao A, Zhang L, Wu S, Liu X, Zou D, Li Z, Dong G, Zhang C, Wang C (2025). Humanity's Last Exam. doi:10.1038/s41586-025-09962-4
BibTeX
@article{wang2025humanityslastexam,
  author = {Wang, X. and Zhao, X. and Xu, J. and Li, M. and Sun, B. and Gao, A. and Zhang, L. and Wu, S. and Liu, X. and Zou, D. and Li, Z. and Dong, G. and Zhang, C. and Wang, C.},
  title = {Humanity's Last Exam},
  year = {2025},
  doi = {10.1038/s41586-025-09962-4}
}

Quick Questions About This Study

What is Humanity's Last Exam?
Humanity's Last Exam (HLE) contains 2,500 questions spanning dozens of academic subjects, including mathematics, the humanities, and the natural sciences. Each question requires expert-level knowledge and has an unambiguous, verifiable answer that cannot be quickly found through internet searches.

How is HLE different from existing benchmarks?
Unlike existing benchmarks, on which AI achieves over 90% accuracy, HLE is designed at the frontier of human expert knowledge. It uses multi-modal questions requiring deep understanding rather than pattern recognition, revealing significant gaps in current AI capabilities.

How did AI models perform?
State-of-the-art large language models demonstrated low accuracy and poor calibration on HLE questions (a rough sketch of these two metrics appears after these questions). This poor performance highlights a marked gap between current AI capabilities and expert human knowledge in closed-ended academic domains.

Why was a new benchmark needed?
Popular AI benchmarks have become too easy, with models achieving over 90% accuracy on tests like MMLU (Massive Multitask Language Understanding). HLE was created to provide a meaningful measure of cutting-edge AI capabilities and limitations.

Where can the benchmark be accessed?
The complete Humanity's Last Exam benchmark is publicly available at https://lastexam.ai. Researchers and policymakers can use this resource to better understand current AI model capabilities and inform development decisions.
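
The "poor calibration" result above refers to a mismatch between a model's stated confidence and its actual accuracy. As a rough illustration only, here is a minimal Python sketch of the two metrics, assuming each graded response is a record with a boolean correct flag and a stated confidence between 0 and 1; the record format and the 10-bin RMS scheme are illustrative assumptions, not the paper's official grading pipeline.

# Minimal sketch (illustrative only): accuracy and RMS calibration error
# for graded benchmark responses. The record format and the 10-bin scheme
# are assumptions, not the official HLE grading pipeline.

def accuracy(results):
    # Fraction of responses graded correct.
    return sum(r["correct"] for r in results) / len(results)

def rms_calibration_error(results, n_bins=10):
    # Bin responses by the model's stated confidence (0-1), then compare
    # each bin's mean confidence with its empirical accuracy.
    bins = [[] for _ in range(n_bins)]
    for r in results:
        idx = min(int(r["confidence"] * n_bins), n_bins - 1)
        bins[idx].append(r)
    total, weighted_sq_err = len(results), 0.0
    for b in bins:
        if not b:
            continue
        mean_conf = sum(r["confidence"] for r in b) / len(b)
        bin_acc = sum(r["correct"] for r in b) / len(b)
        weighted_sq_err += (len(b) / total) * (mean_conf - bin_acc) ** 2
    return weighted_sq_err ** 0.5

# An overconfident model: high stated confidence, mostly wrong answers.
results = [
    {"correct": False, "confidence": 0.95},
    {"correct": False, "confidence": 0.90},
    {"correct": True, "confidence": 0.85},
    {"correct": False, "confidence": 0.80},
]
print(f"accuracy: {accuracy(results):.2f}")                            # 0.25
print(f"RMS calibration error: {rms_calibration_error(results):.2f}")  # ~0.69

A well-calibrated model that answers correctly only 25% of the time would report confidences near 0.25, driving this error toward zero; the gap reported in the study means models were both often wrong and confident while being wrong.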