Major technology companies, including Apple and Nvidia, have come under fire for training their artificial intelligence models on YouTube content without permission from its creators. An investigation by Proof News and Wired found that these companies, along with Anthropic and others, used a dataset called YouTube Subtitles, which contains transcripts from approximately 175,000 videos across 48,000 channels, without the creators’ knowledge.
The YouTube Subtitles dataset, developed by EleutherAI as part of a larger collection called the Pile, consists of text from video subtitles, often with translations into other languages. Although EleutherAI’s stated goal is to democratize access to AI development, major tech firms have also used the dataset to train their own models. Apple, for example, used the Pile to train its OpenELM model, and a Salesforce AI model released two years ago also relied on it.
The dataset draws on a wide range of YouTube channels spanning news, education, and entertainment, including popular creators such as MrBeast and Marques Brownlee. Notably, some of the videos in the dataset have been deleted by their creators, raising concerns about unauthorized use of content and the lack of compensation.
The dataset was collected by using YouTube’s API to download subtitles automatically, which raises questions about compliance with YouTube’s terms of service; those terms explicitly prohibit automated scraping of video content. The revelation has sparked outrage among content creators, who were surprised to learn that their work had been used to train AI models without their consent.
EleutherAI has not commented on the matter, but the ethics of training AI models on content used without permission have become a point of contention. As the legal and regulatory landscape for AI continues to evolve, the discovery underscores the need to balance technological innovation with ethical responsibility in the industry.