News Publishers Block Wayback Machine to Starve AI Training

Major outlets including the New York Times, CNN, and The Guardian are using robots.txt directives to stop the Internet Archive's crawler from archiving their pages, directly targeting the historical corpus that AI companies have relied on for training data. Publishers are moving from legal posturing to technical enforcement: rather than waiting for litigation outcomes, they are actively degrading the information commons that enabled the current AI boom. The shift exposes a real constraint on AI development: when training data sources dry up through coordinated publisher action rather than natural scarcity, models built on historical web text become harder to improve. That pressure could accelerate the race toward licensed data partnerships and proprietary training datasets.
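
To make the mechanism concrete, here is a minimal sketch of how such a block can be detected. It assumes the relevant user-agent token is `ia_archiver`, the crawler name historically associated with the Wayback Machine; the publisher-side entry would simply pair `User-agent: ia_archiver` with `Disallow: /`. The check below uses Python's standard-library `urllib.robotparser` to test whether a live site's robots.txt forbids that agent. Site list and token are illustrative assumptions, not a claim about any outlet's current file.

```python
# Sketch: does a site's robots.txt disallow the Internet Archive's crawler?
# Assumes the relevant token is "ia_archiver" (hypothetical for any given
# publisher; actual files may name a different agent or none at all).
#
# The publisher-side rule being tested for would look like:
#   User-agent: ia_archiver
#   Disallow: /
from urllib import robotparser

def blocks_wayback(site: str, agent: str = "ia_archiver") -> bool:
    """Return True if the site's robots.txt forbids `agent` from
    fetching the homepage."""
    parser = robotparser.RobotFileParser()
    parser.set_url(f"https://{site}/robots.txt")
    parser.read()  # fetches and parses the live robots.txt
    return not parser.can_fetch(agent, f"https://{site}/")

if __name__ == "__main__":
    # Illustrative targets drawn from the outlets named above.
    for site in ("nytimes.com", "cnn.com", "theguardian.com"):
        print(f"{site} blocks {'ia_archiver'}: {blocks_wayback(site)}")
```

Note that `RobotFileParser` treats a 401/403 response as "disallow all," so a site that hides its robots.txt behind authentication will also register as blocking; a production checker would distinguish those cases.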