Major Battle Erupts Over Copyright and Generative AI as Meta Withholds Training Data for Llama AI Model
A heated battle is brewing between publishers and big tech companies over generative artificial intelligence (AI) and copyright. Publishers are demanding compensation for the use of their work in training large language models, but tech giants like Meta (formerly Facebook) are reluctant to pay. In an attempt to sidestep the controversy, Meta has taken the unusual approach of not disclosing the specific data used to train its AI model, Llama 2.
In a research paper released on Tuesday, Meta’s researchers provided minimal information about the training data, simply stating that it consisted of a new mix of publicly available online data. This departure from the standard practice of openness within the AI industry has raised eyebrows. Previous research papers on AI models, like the original Transformer research paper, have shared detailed information about the training data used.
Disclosure of specific training data is crucial for researchers to understand and trace the outputs of AI models. This transparency allows for accountability when errors or other issues arise, enabling researchers to rectify them. The original LLaMA research paper, published when Meta released the first version of the model in February, listed all the training data sources in detail, including books and the vast Common Crawl dataset, an extensive collection of internet data.
So, what has changed in the past five months? Publishers and content creators have become aware that their work is being used to train these AI models without their permission. Consequently, numerous lawsuits challenging the rights of tech companies to use copyrighted material for AI model training have emerged. Celebrities like Sarah Silverman have joined the legal battle against the unauthorized use of their work.
Tech companies are fully aware of the risks. Microsoft, a backer of OpenAI, acknowledged the potential dangers in its recent quarterly SEC filing, citing the possibility of legal liability under new legislation regulating AI. Intellectual property, including copyright, plays a significant role in this context. Google, another AI leader, argues by contrast that using public information to develop new beneficial uses aligns with US law, an argument that could prove valid in court.
In this landscape, Meta seems to prefer maintaining secrecy about the data it uses until the legal situation becomes clearer. However, it is important to note that there may be other reasons for Meta’s reticence. Sharon Zhou, CEO of Lamini AI, has suggested various theories regarding Meta’s decision.
In response to queries about the lack of data transparency, a Meta spokesperson emphasized that developers would still have access to model weights and starting code for pretrained and conversational fine-tuned versions, as well as responsible use resources. While keeping the data mixes undisclosed for competitive reasons, Meta claims that its internal Privacy Review process ensures responsible data usage and reflects evolving societal expectations. The company says it remains committed to the responsible and ethical development of its generative AI products.
As the debate continues, it is evident that the use of copyrighted material to train AI models raises significant legal questions. Going forward, it will be crucial to strike a balance between the interests of publishers and the development of innovative technologies, ensuring that regulations align with ethical considerations and the expectations of creators. Only time will tell how this copyright drama surrounding generative AI unfolds.