
In large language model collapse, there are generally three sources of errors: the model itself, the way the model is trained, and the data — or lack thereof — that the model is trained on.
Andriy Onufriyenko/Getty Images
Asked ChatGPT anything lately? Talked with a customer service chatbot? Read the results of Google’s “AI Overviews” summary feature?
If you’ve used the Internet lately, chances are, you’ve been consuming content created by a large language model.
Large language models, like DeepSeek-R1 or OpenAI’s ChatGPT, are kind of like the predictive text feature in your phone on steroids. In order for them to “learn” how to write, these models are trained on millions of examples of human-written text.
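To make the “predictive text on steroids” idea concrete, here is a minimal sketch of next-word prediction: a toy bigram counter, far simpler than the neural networks inside real large language models, that learns from a few example sentences and then writes by sampling a likely next word, one word at a time. The corpus and helper names here are invented for illustration.

```python
import random
from collections import defaultdict, Counter

# Tiny stand-in for "millions of examples of human-written text".
corpus = [
    "the cat sat on the mat",
    "the dog sat on the rug",
    "the cat chased the dog",
]

# Count, for each word, which words followed it in training (a bigram model).
following = defaultdict(Counter)
for sentence in corpus:
    words = sentence.split()
    for prev, nxt in zip(words, words[1:]):
        following[prev][nxt] += 1

def predict_next(word):
    """Sample the next word in proportion to how often it followed `word`."""
    counts = following[word]
    choices, weights = zip(*counts.items())
    return random.choices(choices, weights=weights)[0]

# Generate a short continuation, like a phone keyboard suggesting word after word.
word, output = "the", ["the"]
for _ in range(5):
    if not following[word]:
        break  # dead end: this word never appeared mid-sentence in training
    word = predict_next(word)
    output.append(word)
print(" ".join(output))
```

The toy model can only ever recombine patterns it saw during training, which is why the quality and variety of the training data matter so much.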
In the past, this training usually involved having the models read the whole Internet. But nowadays — thanks in part to these large language models themselves — a lot of content on the Internet is written by generative AI.
That means that AI models trained now may consume their own synthetic content — and suffer the consequences.
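A toy statistical analogue of that collapse (an illustrative sketch under invented assumptions, not the researchers’ actual experiments): treat a “model” as a normal distribution fitted to data, and train each new generation only on samples produced by the generation before it.

```python
import random
import statistics

# Each "model" here is just a Gaussian (mean, standard deviation) fitted
# to data. Generation 0 is fit to "human-written" data; every later
# generation is trained only on the previous generation's synthetic output.
random.seed(42)

mean, std = 0.0, 1.0   # generation 0: the original distribution
n = 20                 # small training sets make the effect visible quickly

for generation in range(1, 51):
    synthetic = [random.gauss(mean, std) for _ in range(n)]  # model writes its own training data
    mean = statistics.fmean(synthetic)                       # the next model fits that output alone
    std = statistics.stdev(synthetic)
    if generation % 10 == 0:
        print(f"generation {generation:2d}: mean={mean:+.3f}, std={std:.3f}")
```

Run it a few times: the fitted spread tends to shrink across generations, so rare “tail” behavior disappears first. That narrowing is the statistical analogue of an AI-trained-on-AI model gradually forgetting uncommon facts and unusual phrasings.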
View the AI-generated images mentioned in this episode.
Have another topic in artificial intelligence you want us to cover? Let us know by emailing shortwave@npr.org!
Listen to Short Wave on Spotify and Apple Podcasts.
Listen to every episode of Short Wave sponsor-free and support our work at NPR by signing up for Short Wave+ at plus.npr.org/shortwave.
This episode was produced by Hannah Chinn. It was edited by our showrunner, Rebecca Ramirez. The audio engineer was Jimmy Keeley.