Title: Shadow Libraries Stir Copyright Lawsuits Against OpenAI over ChatGPT’s Training Material
Artificial intelligence company OpenAI is facing copyright infringement lawsuits over its AI chatbot, ChatGPT, filed by three writers, including Sarah Silverman. The plaintiffs allege that their copyrighted books were ingested as training material for ChatGPT without their consent.
To produce human-like responses, AI chatbots are trained on extensive datasets drawn from material across the internet. OpenAI, however, remains secretive about the specific source texts used to train its models, citing safety concerns and industry competition. Books play a crucial role in these datasets because they offer lengthy examples of high-quality writing. The lawsuit filed by Silverman alleges that much of the book data used to train ChatGPT comes from illegal shadow libraries that contain the plaintiffs’ works.
According to the lawsuit, OpenAI has disclosed that approximately 15% of the training set for GPT-3, the language model family behind the free version of ChatGPT, consists of two internet-based book collections referred to as Books1 and Books2. Clues suggest that Books1 is linked to Project Gutenberg, an online e-book library that features over 60,000 titles and is commonly used by AI researchers because its texts are free of copyright restrictions. Books2, by contrast, likely encompasses about 294,000 titles.
Most of the internet-based book corpora used in training ChatGPT are presumed to originate from shadow library websites such as Library Genesis, Z-Library, Sci-Hub, and Bibliotik. These platforms aggregate books that are out of print, difficult to access, or behind paywalls. Shadow libraries originated in Russia and gained popularity among financially constrained researchers seeking access to prohibitively expensive scholarly journals, in which individual articles were supposedly priced at up to $500.
Shadow libraries have been labeled pirate libraries because of their involvement in copyright infringement and their potential to cut into the publishing industry’s revenue. According to a 2017 study conducted by Nielsen and Digimarc, pirated books can depress legitimate book sales by up to 14%.
To combat shadow libraries, governments around the world have seized or blocked websites associated with these platforms. For example, the FBI seized several websites linked to Z-Library and charged two Russian nationals with criminal copyright infringement, wire fraud, and money laundering, and courts in France and India have ordered internet service providers to block Z-Library. Despite these efforts, shadow library websites have been able to create mirror sites after the initial takedown by the US government, as reported by Vice.
The lawsuit filed by Sarah Silverman against OpenAI over ChatGPT’s training material is not an isolated incident. Similar copyright infringement lawsuits have been brought against other generative AI companies. For instance, visual artists filed a lawsuit against Stability AI, Midjourney, and DeviantArt earlier this year. Programmers also filed a class-action lawsuit against GitHub, its parent company Microsoft Corp., and OpenAI in November, alleging that GitHub Copilot relies on widespread open-source software piracy.
In response to the mounting lawsuits, Pau Garcia, the founder of art consulting firm Domestic Data Streamers, suggested that AI companies should either train their models exclusively on material in the public domain or obtain explicit permission from artists to use their content as training data and compensate them accordingly.
Some companies are also exploring the possibility of granting artists control over the content AI models can be trained on. For example, music streaming platform Audius recently introduced a feature allowing artists to create a dedicated page for their work that anyone can use for generating AI-based tracks.
As OpenAI faces legal battles over the use of copyrighted material in training its AI models, discussions around fair use, copyright permissions, and artistic control are taking center stage in the rapidly evolving field of artificial intelligence.