Note: This study found no significant biological effects under its experimental conditions. We include all studies for scientific completeness.

Wang X, Zhao X, Xu J, Li M, Sun B, Gao A, Zhang L, Wu S, Liu X, Zou D, Li Z, Dong G, Zhang C, Wang C

No Effects Found

Authors not listed · 2025

AI systems show major knowledge gaps on expert-level questions, raising concerns about the reliability of AI-generated health information.

Plain English Summary

Summary written for general audiences

This study introduces a new academic benchmark called 'Humanity's Last Exam' designed to test advanced AI language models on expert-level questions across multiple subjects. The researchers found that current state-of-the-art AI systems perform poorly on these challenging questions, revealing significant gaps between AI capabilities and human expert knowledge.

Cite This Study
Unknown (2025). Wang X, Zhao X, Xu J, Li M, Sun B, Gao A, Zhang L, Wu S, Liu X, Zou D, Li Z, Dong G, Zhang C, Wang C.
@article{wang_x_zhao_x_xu_j_li_m_sun_b_gao_a_zhang_l_wu_s_liu_x_zou_d_li_z_dong_g_zhang_c_wang_c_ce3557,
  author = {Unknown},
  title = {Wang X, Zhao X, Xu J, Li M, Sun B, Gao A, Zhang L, Wu S, Liu X, Zou D, Li Z, Dong G, Zhang C, Wang C},
  year = {2025},
  doi = {10.1038/s41586-025-09962-4}
}

Quick Questions About This Study

What does this study actually test?
It tests AI language models on 2,500 expert-level academic questions across mathematics, the humanities, and the natural sciences — questions that cannot be quickly answered through internet searches but have clear, verifiable solutions.

What were the main findings?
State-of-the-art AI models demonstrate low accuracy and poor calibration on the benchmark, showing significant gaps between current AI capabilities and expert human knowledge.

Why does this matter for health information?
Poor AI performance on complex topics raises concerns about the reliability of AI-generated health information, especially for nuanced subjects like EMF research where accurate interpretation is crucial.

How is this different from other AI benchmarks?
Unlike popular benchmarks on which AI achieves over 90% accuracy, this expert-level benchmark reveals true capability gaps through questions requiring deep subject-matter expertise rather than pattern recognition.

Is this relevant to evaluating scientific evidence?
Yes. By testing AI performance on expert-level academic content, it provides insight into whether AI systems can reliably process and interpret complex scientific information across disciplines.
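To make "low accuracy and poor calibration" concrete, here is a minimal sketch of how such scores are typically computed: accuracy is the fraction of questions answered correctly, and calibration error measures the gap between a model's stated confidence and how often it is actually right. The functions and the sample data below are illustrative assumptions, not the study's actual scoring code or results.

```python
def accuracy(results):
    """Fraction of questions answered correctly.

    `results` is a list of (confidence, correct) pairs, where
    confidence is a float in [0, 1] and correct is a bool.
    """
    return sum(correct for _, correct in results) / len(results)

def expected_calibration_error(results, n_bins=10):
    """A standard calibration metric: group answers into confidence
    bins, then average the gap between mean confidence and actual
    accuracy in each bin, weighted by bin size."""
    bins = [[] for _ in range(n_bins)]
    for conf, correct in results:
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, correct))
    ece = 0.0
    for b in bins:
        if not b:
            continue
        avg_conf = sum(c for c, _ in b) / len(b)
        avg_acc = sum(corr for _, corr in b) / len(b)
        ece += (len(b) / len(results)) * abs(avg_conf - avg_acc)
    return ece

# Hypothetical results: a model that is highly confident but mostly wrong
# scores low on accuracy AND high on calibration error.
results = [(0.95, False), (0.90, False), (0.85, True), (0.99, False)]
print(accuracy(results))                    # low: only 1 of 4 correct
print(expected_calibration_error(results))  # large gap = poor calibration
```

A well-calibrated model that answered the same questions with low confidence would score near zero on the second metric even with the same accuracy; the benchmark's finding is that current models fail on both counts.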