Bestselling authors Mona Awad and Paul Tremblay have filed a lawsuit against OpenAI, a leading artificial intelligence (AI) company, alleging copyright infringement. The lawsuit, filed in a San Francisco federal court, claims that OpenAI used copyrighted material from the authors’ novels to train its AI chatbot, ChatGPT, without their consent.
ChatGPT, an AI-powered generative chatbot, relies on large language models that are trained on extensive amounts of text to produce human-like responses. Awad and Tremblay argue that ChatGPT’s detailed summaries of their novels (Awad’s Bunny and 13 Ways of Looking at a Fat Girl, and Tremblay’s The Cabin at the End of the World) are evidence that the books were used to train the chatbot.
The lawsuit asserts that OpenAI’s training datasets for its generative chatbots have incorporated copyrighted works, including books authored by Awad and Tremblay, without providing credit, consent, or compensation. Books are considered valuable training material for these large language models due to their high-quality writing and long-form content.
OpenAI’s GPT-1, unveiled in June 2018, was trained using BookCorpus, a dataset compiled in 2015 that contained over 7,000 unpublished books from various genres. According to the lawsuit, most of these books were protected by copyright and were copied without permission from Smashwords.com, a platform hosting free, unpublished novels.
Subsequent iterations of OpenAI’s large language models, such as GPT-3, were trained using even larger quantities of copyrighted books. OpenAI’s paper released in July 2020 indicated that 15% of the training dataset for GPT-3 came from two internet-based book corpora referred to as Books1 and Books2. The lawsuit estimates that Books1 comprises around 63,000 titles, while Books2 includes approximately 294,000 titles.
The lawsuit alleges that the OpenAI Language Models are themselves infringing derivative works because they cannot function without the information extracted from the plaintiffs’ novels and other works, and because that information is retained within the models without the authors’ permission. This, the authors claim, violates their exclusive rights under the Copyright Act.
In addition to Awad and Tremblay’s lawsuit, a separate class-action suit was filed by Clarkson, a public-interest law firm, on behalf of anonymous clients. This lawsuit alleges that OpenAI extracted private information from internet users without their consent or knowledge, further fueling concerns about privacy and user consent in AI development.
These lawsuits highlight the growing tension between AI technology and copyright protection. As AI systems advance and rely on ever greater amounts of data, the use of copyrighted material for training without authorization or compensation may bring further legal challenges for companies like OpenAI. Experts predict that more lawsuits over AI and data usage are likely to follow.