A recent study by researchers at the University of California, Berkeley examined OpenAI’s ChatGPT and its GPT-4 model and found evidence that the models had memorized text from copyrighted books, which suggests that those books were part of the training data. The study, “Speak, Memory: An Archaeology of Books Known to ChatGPT/GPT-4,” was authored by Kent Chang, Mackenzie Cramer, Sandeep Soni, and David Bamman.
The researchers determined that ChatGPT and GPT-4 had “memorized” texts from a variety of genres, particularly science fiction and fantasy. The models appeared to know far less about other kinds of writing, such as Global Anglophone works, works in the Black Book Interactive Project, and Black Caucus American Library Association award winners. David Bamman, associate professor in the School of Information at Berkeley, summarized the paper by stating, “Takeaways: open models are good; popular texts are probably not good barometers of model performance; with the bias toward sci-fi/fantasy, we should be thinking about whose narrative experiences are encoded in these models, and how that influences other behaviors.”
Because OpenAI has not disclosed its training data, the question of whether the texts in question truly exist in the model’s training set cannot be answered definitively. The research team therefore proposes that training data be made publicly available so that model behavior is more transparent.
Regarding copyright implications, Santa Clara University law professor Tyler Ochoa expects that lawsuits will likely arise if generated texts are too similar to those originating from copyrighted sources. He remarked that the same considerations apply to image generators.
Margaret Mitchell, AI researcher and chief ethics scientist at Hugging Face, called for better data curation and proper documentation. She remarked, “I hope this work will help further advance the state of the art in responsible data curation.”
OpenAI is an artificial intelligence research company whose major investors include Microsoft. It specializes in deep learning, the branch of AI behind most recent advances in language and image generation, and its stated goal is to develop artificial general intelligence. ChatGPT is OpenAI’s AI-driven chatbot, which understands natural language and generates automated responses. It launched in November 2022 on the GPT-3.5 model, and GPT-4 was made available to paying subscribers in March 2023. GPT-4 is the fourth generation of OpenAI’s large language model series. It is a proprietary model trained on a very large body of text, and OpenAI has not disclosed the contents of that training data. It is capable of generating human-like dialogue and has prompted a new wave of AI-driven chatbot technology.