Large Language Models Fail at Reproducing Physics Experiments

A Peking University preprint tested whether LLMs can replicate experimental physics results. They can't. The models fail at the sequential reasoning and the interpretation of precision measurements that physics requires. This exposes a gap between LLMs' fluency at pattern-matching text and their ability to ground abstract knowledge in verifiable physical outcomes, a problem that affects both scientific peer review and AI agents making real-world decisions. The finding suggests that scaling parameters alone won't close this gap. Models may need different training approaches that reward reproducibility and constraint satisfaction rather than plausible-sounding next tokens.