Generative AI tools such as ChatGPT may soon run out of text to train on, according to Stuart Russell, an artificial intelligence expert and professor at UC Berkeley. Russell explained that these tools, which learn how to respond by ingesting vast amounts of text, are approaching the limits of the digital text available to them. That shortfall could change how generative AI developers gather data and train their models in the future.
Russell’s observations add to recent scrutiny of the data harvesting practices that OpenAI and other generative AI developers use to train language models. Those practices have drawn criticism from creators concerned about unauthorized replication of their work and from social media executives unhappy with the unrestricted use of their platforms’ data. Russell’s warning, however, points to a different vulnerability: a scarcity of text with which to build the training datasets in the first place.
A study by Epoch, a group of AI researchers, estimated that machine learning datasets will likely exhaust the supply of high-quality language data before 2026. High-quality language data includes sources such as books, news articles, scientific papers, Wikipedia, and filtered web content. The language models behind today’s popular generative AI tools have been trained on enormous amounts of published text drawn from digital news platforms, social media sites, and other online sources, and that scraping has already prompted pushback: Elon Musk, for example, limited access to Twitter, citing concerns over data scraping.
Russell suggested that OpenAI in particular has likely supplemented public language data with private archives of text to develop GPT-4, its most advanced AI model to date, though he noted that OpenAI has not disclosed exactly which datasets GPT-4 was trained on.
In recent weeks, OpenAI has faced several lawsuits alleging that datasets containing personal data and copyrighted material were used to train ChatGPT. One suit, filed by 16 unnamed plaintiffs, accuses the company of using sensitive information such as private conversations and medical records. Comedian Sarah Silverman and two other authors have also sued for copyright infringement, citing ChatGPT’s ability to produce accurate summaries of their work as evidence that it was used in training without permission. OpenAI has not publicly commented on these legal challenges.
Despite the concerns and legal complications, Russell maintained that AI will eventually replace humans in many jobs centered on language tasks. Even so, the potential shortage of high-quality language data could pose a significant challenge for developing and training future generative AI tools.
It remains to be seen how generative AI developers will address this issue and adapt their training methods in the face of limited text availability. As the AI field continues to evolve, the focus on responsible data collection practices and the development of ethical guidelines will undoubtedly become even more crucial.