
Article: Will AI Feed Itself to Death?



Photo Source: PickPik


Suppose ChatGPT and other large language models (LLMs) depend on human-written content collected from the internet to learn and grow. What happens when the internet fills up with more and more text generated by the LLMs themselves? And if every successive round of model development uses more machine-generated text, will future ChatGPTs produce answers further and further from reality? Will the large language models feed themselves to death?

Computer science researchers from the Universities of Oxford and Cambridge and the University of Toronto studied the effects of synthetic, or AI-generated, information on models like ChatGPT. In their paper titled “The Curse of Recursion: Training on Generated Data Makes Models Forget,” the authors introduce the concept of “Model Collapse.” (arxiv.com) Model collapse refers to the loss of truth in the information that feeds the model, a loss that compounds over time as each successive model consumes more AI-generated and less human-made content.

The process begins when a broad body of information, such as the total content of the internet, is used to train a model such as ChatGPT. Over time, ChatGPT and other LLMs add content to the internet in the form of AI-generated news, stories, reviews, social media messages, and so on. That growing body of AI-generated content then contributes to the training of next-generation LLMs. The internet text gathered through 2021 to train the first ChatGPT was almost entirely human-generated. Estimates suggest that the internet currently holds between 30 and 60 billion web pages, which puts the number of words in the trillions. (worldwidewebsize.com) As more people and organizations use LLMs like Bard and ChatGPT to generate content, human-generated content will make up a smaller and smaller share of the internet. The Living Library, an organization that studies technology, estimates that 90% of the content on the internet will be AI-generated by 2026. (thelivinglib.org)

According to the model collapse theory, LLMs will forget the underlying truth in human-generated data as subsequent models also scrape the web for training data, because each later scrape will contain more and more AI-generated data. This compounding dilution of the truth will make the LLMs collapse into ineffectiveness, forgetting the truth in favor of false beliefs.

One type of model collapse, called “catastrophic forgetting,” occurs when a model receives continuous updates of information but forgets the original truth. For example, suppose the initial input to the model is “Dogs have four legs and have coevolved with humans over the past 30,000 years into many breeds.” The model, however, outputs, “Dogs have legs and a tail that has changed in length over the years.” The output has lost important information from the original, such as the number of legs and the coevolution with humans over the past 30,000 years. Such forgetting will lead the model astray and produce less reliable information.

Large language models have displayed remarkable capabilities in human-understandable question answering, summarizing information, and composing essays and news articles. Current LLMs, such as ChatGPT and Bard, have consumed large swaths of the internet to learn to predict the correct answer to a question. However, researchers have found that as LLMs add more AI-generated information to the internet, the models face the specter of model collapse. The models need a regular supply of human-generated content, not AI content, to continue improving. This suggests that as people use LLMs to make writing more effortless, the future health of LLMs falls into jeopardy. These opposing forces, people leaning on LLMs to write while LLMs lean on people's creativity and content to work, create a strange dependence of machines on people in order to flourish.
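
The compounding loss described above can be illustrated with a toy simulation. The sketch below is not the method used in the paper; it is a minimal stand-in that assumes a "model" is nothing more than a table of word frequencies. Each generation is trained only on text sampled from the previous generation. The names `train`, `generate`, and `simulate`, the 50-word vocabulary, and the corpus size are all hypothetical choices made for illustration.

```python
import random
from collections import Counter


def train(corpus):
    """'Train' a toy model: estimate word frequencies from a corpus."""
    counts = Counter(corpus)
    total = len(corpus)
    return {word: n / total for word, n in counts.items()}


def generate(model, n_words, rng):
    """Sample a new corpus from the model's estimated frequencies."""
    words = list(model)
    weights = [model[w] for w in words]
    return rng.choices(words, weights=weights, k=n_words)


def simulate(generations=30, corpus_size=200, seed=0):
    """Return the number of distinct words each generation's model knows."""
    rng = random.Random(seed)
    # Hypothetical "human" text: a few common words and many rare ones.
    vocab = [f"w{i}" for i in range(50)]
    human_weights = [1 / (i + 1) for i in range(50)]
    corpus = rng.choices(vocab, weights=human_weights, k=corpus_size)

    vocab_sizes = []
    for _ in range(generations):
        model = train(corpus)                        # learn from current text
        vocab_sizes.append(len(model))               # how many words it knows
        corpus = generate(model, corpus_size, rng)   # next corpus is model output
    return vocab_sizes


sizes = simulate()
print(sizes)
```

In this toy setting, a word that fails to appear in one generation's text has zero estimated frequency and can never come back, so the count of known words can only stay flat or fall. Rare words tend to disappear first, which loosely mirrors the idea that details found only in the original human text are the first to be forgotten.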




Dr. Smith’s career in scientific and information research spans the areas of bioinformatics, artificial intelligence, toxicology, and chemistry. He has published a number of peer-reviewed scientific papers. He has worked over the past seventeen years developing advanced analytics, machine learning, and knowledge management tools to enable research and support high-level decision making. Tim completed his Ph.D. in Toxicology at Cornell University and a Bachelor of Science in chemistry from the University of Washington.


His book is available on Amazon in paperback and Kindle formats.






 
 
 
