Last week’s copyright infringement lawsuit by The New York Times (NYT) against ChatGPT-maker OpenAI has opened another battlefront between big technology companies and news publishers. NYT has questioned the very essence of how large language models (LLMs), on which tools such as ChatGPT are built, are trained. The development could have broader ramifications for news media firms and for how courts value their content’s contribution to training language models. Even as most generative AI companies are dealing with copyright issues post facto, Apple has opened commercial discussions with publishers to sign multimillion-dollar deals to use licensed content.
In the lawsuit, The New York Times alleges that OpenAI and its largest investor, Microsoft, used millions of articles published by the news organization to train chatbots, accusing them of wide-scale copying. The lawsuit claims that OpenAI’s chatbots now compete with the media platform as a source of information. It further states that data from Google and Wikipedia, contained in the largest dataset scraped from the internet, the one compiled by the non-profit web crawler Common Crawl, was used in part to train the GPT-3 models. The New York Times argues that OpenAI’s generative AI tools can generate output that recites Times content verbatim, closely summarizes it, and mimics its expressive style. The lawsuit asserts that OpenAI’s tools undermine and damage The Times’s relationship with its readers and deprive The Times of subscription, licensing, advertising, and affiliate revenue because its paywall has effectively been breached, directly affecting its business model.
Meanwhile, Apple has adopted a different approach. The tech giant has reportedly approached media companies, including Condé Nast, NBC News, and IAC, with multiyear deals worth at least $50 million to license their archives of news articles. Apple is seeking permission to use content before training its generative AI models, unlike other platforms, which have approached publishers for deals only after training their models. This strategy has received positive feedback from executives at publishing firms, although some concerns remain about the terms Apple has offered.
The issue of copyright infringement in the context of training LLMs and natural language processing engines is multifaceted. Many generative AI companies use a process called web scraping to gather data from the internet and feed it into their LLMs. OpenAI, for instance, has been accused of scraping over 300 billion words from the internet without user consent. The New York Times says it raised concerns with OpenAI and Microsoft about the use of its material, but no resolution was reached, which led to the filing of the lawsuit.
Numerous copyright lawsuits have been filed by music labels, authors, and now news publishers, seeking to address the question of fair use and the extent to which copyrighted material can be used in training LLMs. These lawsuits will test copyright laws in different jurisdictions, with courts weighing factors such as the amount of original material used, the purpose and commercial nature of the use, the value of the copyrighted material, and the impact of its use.
In response to potential copyright suits, OpenAI announced a copyright shield in November, offering indemnification to its enterprise users against such claims arising from the use of ChatGPT. Microsoft, Google, and Amazon have also introduced similar shields. However, these measures do not resolve the underlying legal issues and debates surrounding the use of copyrighted material in training AI models.
As the legal battles continue, publishers and technology companies are taking different approaches. While some publishers, like The New York Times, are pursuing lawsuits, some technology companies, like Apple, are seeking pre-emptive licensing agreements. The outcome of these cases will have significant implications not only for news publishers but also for every technology company that relies on large language models for its AI applications.