The ongoing battle between copyright holders and OpenAI has taken a new turn as two novelists, Paul Tremblay and Mona Awad, have filed a lawsuit against the company in a federal court in San Francisco. The authors allege that OpenAI’s ChatGPT large language model was trained using data from their copyrighted books without their consent.
In the 16-page class action suit filed on June 28, Tremblay, the author of The Cabin at the End of the World, and Awad, the author of Bunny and 13 Ways of Looking at a Fat Girl, claim that ChatGPT is capable of generating highly accurate summaries of their literary works when prompted. They argue that this level of accuracy is only possible if the model was trained on the content of their books, which would be a violation of federal copyright law. The authors assert that OpenAI stands to profit commercially from the use of their copyrighted materials.
This lawsuit represents the first copyright-related legal claim against OpenAI, but it is unlikely to be the last. Intellectual property scholar Andres Guadamuz of the University of Sussex commented that this case could set a precedent for future claims.
The authors’ complaint references a 2018 paper by OpenAI in which the company revealed that its GPT-1 model was trained on BookCorpus, a collection of over 7,000 unique unpublished books spanning various genres. In a subsequent 2020 paper introducing GPT-3, OpenAI disclosed that 15% of its training dataset came from Books1 and Books2, internet-based book corpora that the complaint estimates comprise over 350,000 books.
Since the launch of ChatGPT, OpenAI has not publicly disclosed the specific data used to train the model or its sources. The company stated in its 2020 paper that most of the training data was scraped from the web, including archived books and Wikipedia.
The lawsuit by Tremblay and Awad highlights the emerging battle between copyright holders and AI companies that use copyrighted materials to train their models. It also adds to growing demands for damages over the unauthorized use of copyrighted works, raising questions about how plaintiffs can prove financial losses in such cases.
In previous instances, visual artists have sued AI image generators for using their artwork without permission, and music creators have emphasized the need to protect their copyrights from generative AI systems. This latest lawsuit from Tremblay and Awad further pushes regulators and courts to define the rules surrounding copyright and AI. Courts and regulators may ultimately require AI companies to disclose the sources and methods behind their training data, allowing for greater transparency in these systems.
As this legal battle unfolds, it will shape the future landscape of copyrights in the context of AI. The outcome may have significant implications for generative AI companies, potentially opening up the once-opaque workings of these systems for public scrutiny.