Recent research suggests that two language models, ChatGPT and the more recent GPT-4, have memorised details from vast numbers of copyrighted books. This raises questions about the legality of how such large language models (LLMs) are created. Both systems were developed by the company OpenAI and trained on huge amounts of data, but it is unclear exactly which texts make up that training data.
To answer this question, David Bamman from the University of California, Berkeley, and his colleagues tested whether the models could reliably distinguish passages from well-known copyrighted books from passages in lesser-known works. The models were probed with passages from famous books such as the Harry Potter series and A Game of Thrones, as well as a selection of lesser-known novels and poems. The test revealed that the models had memorised the contents of the copyrighted books, indicating that those books were part of their training data.
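Loosely speaking, a memorisation probe of this kind can be framed as a cloze test: mask a distinctive word, such as a character name, in a passage and check whether the model can fill it back in, something it is far more likely to manage for text it saw during training. The sketch below is a toy illustration under that assumption; the helper names are hypothetical, and `toy_guesser` stands in for a real model call, which the actual study would make via an LLM API.

```python
def make_cloze(passage: str, name: str, mask: str = "[MASK]") -> str:
    """Replace one character name with a mask token."""
    return passage.replace(name, mask, 1)

def cloze_accuracy(examples, guess_fn) -> float:
    """Fraction of masked names the guesser reproduces exactly."""
    correct = 0
    for passage, name in examples:
        prompt = make_cloze(passage, name)
        if guess_fn(prompt).strip() == name:
            correct += 1
    return correct / len(examples)

# Stand-in for an LLM call; a real probe would query the model here.
def toy_guesser(prompt: str) -> str:
    return "Hermione" if "wand" in prompt else "Alice"

examples = [
    ("Hermione raised her wand and spoke the charm.", "Hermione"),
    ("Alice walked down the quiet lane at dusk.", "Alice"),
]
print(cloze_accuracy(examples, toy_guesser))  # → 1.0
```

A high cloze accuracy on copyrighted passages, compared with near-chance accuracy on obscure ones, is what would indicate memorisation rather than general language ability.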
This raises several issues, not least the potential copyright infringement involved. It implies that OpenAI used passages from these books without permission, making it difficult to see how the company could remain compliant with copyright law. What's more, the discovery could set a troubling precedent as LLMs trained on vast amounts of copyright-protected material become more common.
OpenAI, based in San Francisco in the United States, was co-founded by entrepreneur Elon Musk. The company is dedicated to advancing artificial intelligence, and its stated mission is to ensure that artificial intelligence benefits all of humanity. Co-founder and current CEO Sam Altman is an experienced leader in technology and innovation.
David Bamman is a computer scientist from UC Berkeley whose work focuses on natural language processing and the use of artificial intelligence in digital humanities, publishing and media. He has published numerous research papers and books on artificial intelligence and natural language processing.