Title: Authors Sue OpenAI, Alleging Use of Pirated Content to Train ChatGPT
Authors Paul Tremblay and Mona Awad have filed a class action lawsuit against OpenAI, the company behind ChatGPT. They claim that their copyrighted works were used without permission to train ChatGPT, and they allege copyright infringement and violations of the Digital Millennium Copyright Act (DMCA).
The plaintiffs say they never granted OpenAI permission to use their works, yet ChatGPT can accurately summarize their writings. In their view, this shows that the books themselves must have been part of the training data. OpenAI has not revealed the specific datasets used to train ChatGPT, but an older OpenAI paper references two book corpora, Books1 and Books2, as sources. Books1 contains approximately 63,000 titles, while Books2 comprises around 294,000 titles.
However, Tremblay and Awad argue that no legitimate database offers a book collection of that size. They believe OpenAI likely relied on pirated material from shadow library websites such as Library Genesis (LibGen), Z-Library (B-ok), Sci-Hub, and Bibliotik. These websites are notorious for aggregating books and making them available for bulk download via torrents.
The complaint states: “Indeed, when ChatGPT is prompted, ChatGPT generates summaries of Plaintiffs’ copyrighted works – something only possible if ChatGPT was trained on Plaintiffs’ copyrighted works.”
Based on these allegations, the complaint claims that OpenAI infringed the plaintiffs' copyrights. The plaintiffs are seeking statutory damages, which can reach up to $150,000 per work. They are also considering additional damages for the alleged removal of copyright management information, which would violate the DMCA.
This lawsuit stands out because it accuses OpenAI of sourcing training data from pirate websites. Notably, the alleged operators of Z-Library, a shadow library that hosts millions of pirated books, are currently facing criminal prosecution by the U.S. Department of Justice.
How copyright law applies to AI remains unsettled. Governments worldwide are taking different approaches, and the U.S. Congress has so far been cautious. Rights holders, however, are not waiting for lawmakers and are pursuing their claims in court.
Although there is no direct evidence that OpenAI used pirate sites to train ChatGPT, some AI projects are known to have used pirated material in the past. AI models developed by Google and Facebook were reportedly trained on the C4 dataset, which included content from Z-Library and other pirate sites, as highlighted in a comprehensive summary by Search Engine Journal.
This lawsuit is expected to draw significant attention from both AI enthusiasts and rights holders. It could compel OpenAI to disclose details of its training data, which would be of great interest in its own right.
Even if it is established that ChatGPT was trained on pirated books, the court would still need to determine whether that use constitutes copyright infringement.
Fair use protects transformative uses of copyrighted works that do not directly compete with the original content, and several experts believe this defense may apply to AI training.
The outcome of this lawsuit is likely to shape how copyright law is applied to AI, with significant implications for both technology developers and content creators.