OpenAI and Meta, two prominent companies in the field of artificial intelligence (AI), are facing copyright infringement complaints related to their language models. The complaints allege that OpenAI’s ChatGPT and Meta’s LLaMA were trained on copyrighted material without the authors’ consent and without crediting or compensating them.
The issue at hand revolves around the legal status of the data used to train these large language models. Both OpenAI and Meta have utilized publicly available data from the internet in their model training processes.
The complaints, filed in the US District Court for the Northern District of California under the case numbers 3:23-cv-03416 (against OpenAI) and 3:23-cv-03417 (against Meta), are brought by several plaintiffs, including actor and author Sarah Silverman. They claim that their copyrighted works were used as training material for the LLaMA and ChatGPT models.
According to the complaint against OpenAI, a significant portion of the company’s training datasets consists of copyrighted works, including books written by the plaintiffs. The complaint alleges that OpenAI copied these works without obtaining consent, providing credit, or offering compensation. It further argues that the summaries of the plaintiffs’ books that ChatGPT produces would only be possible if the model had been trained on those books.
The complaint against Meta is similar, but explicitly mentions ‘shadow libraries’ accessible through torrent systems. Meta’s LLaMA paper states that the model was trained on Project Gutenberg, an online repository of books that have entered the public domain, and on the Books3 section of The Pile. The complainants take issue with the source of the Books3 dataset, which was derived from a shadow library website called Bibliotik. That website hosts copyrighted material, and it is this provenance that underpins the complaint against Meta.
ITPro reached out to both Meta and OpenAI for comments, but as of now, neither organization has responded.
OpenAI has faced legal scrutiny before over the content used to train its models, and similar cases are already progressing through the court system.
Businesses using generative AI tools built on models trained on publicly available content face two primary issues. The first is the risk that the output of tools such as ChatGPT may contain falsehoods or infringe intellectual property rights. The infringement concern is the one at the heart of the complaints against OpenAI and Meta: businesses fear that their employees could unintentionally use material that was acquired illegally.
The second issue is the risk of employees entering confidential information into generative AI systems without realizing that it may become part of a language model’s training dataset and resurface elsewhere.
These concerns have led some workplaces to ban certain generative AI tools entirely or impose restrictions on their usage.
To mitigate these risks, some businesses opt for a closed approach, training generative AI models only on internal datasets. Open-source alternatives exist, and established vendors like Oracle have ventured into this space by allowing customers to train specific models on their own data, in theory avoiding the copyright challenges faced by OpenAI and Meta.