The rise of generative artificial intelligence (AI) models, like ChatGPT and Stable Diffusion, has led to an explosion of high-quality content created by AI. While this technology has democratized creativity, it has also raised concerns about the effect of AI-generated content on future AI models.
Researchers from Oxford, Cambridge, Imperial College London, and the University of Toronto investigated this issue and found that models trained on data created by earlier generative models develop irreversible defects. These defects compound with each generation, causing later models to misperceive the underlying data distribution they were meant to learn.
This problem, known as model collapse, can be seen as a form of data poisoning in which a model's own outputs pollute the training data of its successors. The issue is likely to grow more acute as access to human-generated data becomes scarcer and more expensive.
The researchers demonstrated the effect on three kinds of models: a Gaussian mixture model, a variational autoencoder, and a large language model. When each model was trained on data produced by its predecessors, the learned distribution drifted with every generation until it bore little resemblance to the original data.
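The flavor of this experiment can be reproduced with a toy sketch (this is an illustration of the generational-training loop, not the researchers' actual code): repeatedly fit a single Gaussian to samples drawn from the previous fit. Because each "generation" estimates its parameters from a finite sample, the fitted variance follows a multiplicative random walk with a negative drift in log space, so the distribution narrows and wanders away from the original.

```python
import numpy as np

rng = np.random.default_rng(0)

def collapse_demo(n_samples=100, generations=2000, mu=0.0, sigma=1.0):
    """Toy stand-in for generational training on self-generated data.

    Each 'model' is just the sample mean and std of data generated by
    its predecessor. Over many generations the estimated std shrinks
    and the mean drifts: a miniature analogue of model collapse.
    """
    sigmas = [sigma]
    for _ in range(generations):
        data = rng.normal(mu, sigma, n_samples)  # "generate" with the current model
        mu, sigma = data.mean(), data.std()      # "train" the next model on that data
        sigmas.append(sigma)
    return sigmas

sigmas = collapse_demo()
print(f"initial std: {sigmas[0]:.3f}, final std: {sigmas[-1]:.3g}")
```

With only 100 samples per generation, the fitted standard deviation collapses by orders of magnitude over 2,000 rounds; the low-probability tails of the original distribution are the first thing to vanish.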
The researchers then tested their hypothesis on OPT-125m, a small version of Meta's open-source LLM. Successive generations increasingly produced samples that the original model would itself have generated with high probability, while rarer outputs faded away. Even so, the researchers found that models trained on LLM-generated data could still learn some of the underlying task.
The researchers suggest that measures be taken to preserve access to the original human-generated data over time, though it remains unclear how to track and filter LLM-generated content at scale. Tech companies will need to compete for high-quality, human-generated data to maintain an advantage in building top-performing AI models.
In conclusion, while generative AI models have expanded the possibilities of creative output, the effects of AI-generated content on subsequent models must be taken into account. Access to high-quality, human-generated data remains crucial to preserving the integrity of future AI models.