An investigation by Proof News, in collaboration with Wired, has revealed that major tech companies including Apple, NVIDIA, and Anthropic reportedly used unauthorized YouTube data to train their AI models. The material was part of training data gathered from a wide range of sources, including books, websites, photos, and social media posts, without the knowledge or consent of the creators.
Last year, Zoom clarified its terms to ensure that customer data would not be used for AI training without explicit user consent, addressing privacy concerns. By contrast, Proof News discovered that these companies drew on subtitles extracted from 173,536 YouTube videos spanning more than 48,000 channels, despite YouTube's policies against such data harvesting.
The YouTube Subtitles dataset comprised transcripts from educational channels such as Khan Academy, MIT, and Harvard, media outlets such as The Wall Street Journal and NPR, and entertainment shows like The Late Show and Last Week Tonight. The collection, totaling 5.7 GB and containing 489 million words, included videos from popular YouTubers like MrBeast and PewDiePie, along with content promoting conspiracy theories such as flat-earth claims.
Creators like David Pakman, who had nearly 160 videos included in the dataset, expressed frustration over the unauthorized use of their content. Pakman pointed to the financial and creative investment his work requires and called for compensation from the AI companies that used it. Critics, including Dave Wiskus of Nebula, argued that using creators’ work without consent is unethical and risks harming artists and content creators.
Tech giants including Apple and NVIDIA admitted to using the Pile dataset, which includes YouTube Subtitles, to train their AI models. Apple used it for its OpenELM model, released shortly before the company announced new AI features for iPhones and MacBooks. Anthropic defended its use of the dataset, stating that it drew on only a small subset of YouTube subtitles, and Salesforce also confirmed using the Pile for AI research, acknowledging that the dataset contains profanity and biases.
The use of such datasets raises concerns about data ethics and copyright in AI development, sparking debate over fair compensation for content creators and the responsibility of tech giants to protect their rights. As AI technologies continue to advance, questions persist about transparency and accountability in how training data is sourced and used, underscoring the need for clearer regulation.