Shadow libraries driving mounting copyright lawsuits against OpenAI.

Title: Shadow Libraries Stir Copyright Lawsuits Against OpenAI over ChatGPT’s Training Material

Artificial intelligence company OpenAI is facing copyright infringement lawsuits over its AI chatbot, ChatGPT, filed by three writers, including Sarah Silverman. The plaintiffs allege that their copyrighted books were ingested as training material for ChatGPT without their consent.

To produce human-like responses, AI bots are trained on extensive datasets sourced from various internet materials. OpenAI, however, remains secretive about the specific source texts used to train its models, citing safety reasons and competition in the industry. Books play a crucial role among the dataset components because they offer lengthy examples of high-quality writing. The lawsuit filed by Silverman suggests that much of the book data used in training ChatGPT comes from illegal shadow libraries that contain the works of these writers.

OpenAI has disclosed that approximately 15% of the training set for GPT-3, a predecessor of the GPT-3.5 model behind the free version of ChatGPT, consists of two internet-based book collections referred to as Books1 and Books2, as the lawsuit notes. Clues suggest that Books1 is linked to Project Gutenberg, an online e-book library featuring over 60,000 titles that AI researchers commonly use because its books are free of copyright restrictions. Books2, by contrast, likely encompasses about 294,000 titles.

Most of the internet-based book corpora used in training ChatGPT are presumed to originate from shadow library websites such as Library Genesis, Z-Library, Sci-Hub, and Bibliotik. These platforms aggregate books that are out of print, difficult to access, or behind paywalls. Originating in Russia, these shadow libraries gained popularity among financially constrained researchers seeking access to scholarly journals that were otherwise prohibitively expensive, with individual articles supposedly priced at up to $500.

Shadow libraries have drawn the label of pirate libraries due to their involvement in copyright infringement and the potential negative impact they have on the publishing industry’s revenue stream. According to a 2017 study conducted by Nielsen and Digimarc, pirated books can depress legitimate book sales by up to 14%.

To combat shadow libraries, various governments worldwide have taken actions such as seizing websites associated with these platforms. For example, the FBI seized several websites linked to Z-Library and charged two Russian nationals with criminal copyright infringement, wire fraud, and money laundering. Despite these efforts, shadow library websites have been able to create mirror sites after an initial takedown by the US government, as reported by Vice. Additionally, courts in France and India have ordered internet service providers to block Z-Library.

The lawsuit filed by Sarah Silverman against OpenAI concerning ChatGPT’s training material is not an isolated incident. Similar copyright infringement lawsuits have been brought against other generative AI companies as well. For instance, visual artists filed a lawsuit against Stability AI, Midjourney, and DeviantArt earlier this year. Furthermore, GitHub programmers initiated a class-action lawsuit against GitHub, its parent company Microsoft Corp., and OpenAI in November, alleging that GitHub Copilot relies on widespread open-source software piracy.

In response to the mounting lawsuits, Pau Garcia, the founder of art consulting firm Domestic Data Streamers, suggested that AI companies should either shift their training models to exclusively use material in the public domain or obtain explicit permission from artists to use their content as training data, with artists being compensated accordingly.

Some companies are also exploring the possibility of granting artists control over the content AI models can be trained on. For example, music streaming platform Audius recently introduced a feature allowing artists to create a dedicated page for their work that anyone can use for generating AI-based tracks.

As OpenAI faces legal battles concerning the use of copyrighted material in training its AI models, discussions around fair use, copyright permissions, and artistic control are taking center stage in the rapidly evolving field of artificial intelligence.

Frequently Asked Questions (FAQs) Related to the Above News

What is OpenAI facing lawsuits for?

OpenAI is facing copyright infringement lawsuits over its AI chatbot, ChatGPT. Three writers, including Sarah Silverman, have alleged that their copyrighted books were used without their consent as training material for ChatGPT.

How are AI bots trained?

AI bots are trained on extensive datasets sourced from various internet materials to produce human-like responses.

Why is OpenAI secretive about the specific source texts used to train its models?

OpenAI cites safety reasons and competition in the industry as the reasons for its secrecy about the specific source texts used in training its models.

What role do books play in training AI models?

Books play a crucial role in training AI models as they offer lengthy examples of high-quality writing, which helps in producing better responses from AI bots.

Which illegal libraries are alleged to be the source of copyrighted materials?

The shadow libraries believed to be the source of copyrighted material in ChatGPT's training data include Library Genesis, Z-Library, Sci-Hub, and Bibliotik.

What actions have governments taken against shadow libraries?

Governments worldwide have taken actions such as seizing websites associated with shadow library platforms. For example, the FBI seized several websites linked to Z-Library and charged two Russian nationals with criminal copyright infringement, wire fraud, and money laundering.

What are some concerns regarding shadow libraries?

Shadow libraries have been labeled as pirate libraries due to their involvement in copyright infringement and the potential negative impact they have on the revenue stream of the publishing industry.

Are there other similar lawsuits against generative AI companies?

Yes, there have been similar copyright infringement lawsuits brought against other generative AI companies, such as Stability AI, Midjourney, and DeviantArt. GitHub and OpenAI also faced a class-action lawsuit related to open-source software piracy.

What suggestions have been made to address the use of copyrighted material in AI training?

Some suggestions include shifting training models to exclusively use material in the public domain or obtaining explicit permission from artists to use their content as training data, with artists being compensated accordingly.

How are some companies exploring artist control over the content AI models are trained on?

Some companies, like music streaming platform Audius, are introducing features that allow artists to create dedicated pages for their work, which anyone can use to generate AI-based tracks.

What broader discussions are arising in the field of artificial intelligence as a result of these lawsuits?

Discussions around fair use, copyright permissions, and artistic control are coming to the forefront in the rapidly evolving field of artificial intelligence. The lawsuits against OpenAI and other companies raise important questions about the ethics and legality of training AI on copyrighted material.

Aryan Sharma
Aryan is our dedicated writer and manager for the OpenAI category. With a deep passion for artificial intelligence and its transformative potential, Aryan brings a wealth of knowledge and insights to his articles. With a knack for breaking down complex concepts into easily digestible content, he keeps our readers informed and engaged.
