Meta’s AI chatbot claimed that it was trained on vast amounts of data from YouTube videos. While Meta did not explicitly confirm this, the company did suggest that the chatbot can produce inaccurate information.
When questioned by Business Insider about the specifics of its training data and data collection methods, the Meta AI chatbot offered surprising details. It said that it was trained on extensive datasets of transcriptions from YouTube videos, and that Meta uses a web scraper bot, known as MSAE (Meta Scraping and Extraction), to gather large volumes of data from the internet for AI model training.
Meta had not previously disclosed the existence of this web scraper. Notably, using bots and scrapers to collect data from YouTube is prohibited by the platform’s terms of service, a practice that recently raised similar concerns about OpenAI.
A Meta spokesperson did not deny the chatbot’s claims about the scraper and training data, instead noting that the AI model can generate inaccurate or inappropriate outputs. The spokesperson emphasized Meta’s ongoing efforts to improve these features based on user feedback.
Meta AI initially said that its training data included a third-party dataset of 3.7 million transcribed YouTube videos, but clarified that the web scraper bot was not used specifically to scrape YouTube. The chatbot also cited another substantial third-party dataset of transcriptions from 6 million YouTube videos, two further sets of YouTube transcriptions, and a dataset of TED Talks posted on YouTube.
Meta AI also said it avoids collecting copyrighted data and that it refers to sources such as NBC News, CNN, and The Financial Times in its responses to queries. The chatbot acknowledged the importance of respecting robots.txt, the standard by which websites tell bots which pages they should not scrape.
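For readers unfamiliar with the mechanism, a well-behaved crawler checks a site’s robots.txt before fetching pages. A minimal sketch using Python’s standard-library parser, with a hypothetical robots.txt and bot name chosen for illustration:

```python
from urllib import robotparser

# Hypothetical robots.txt: blocks "ExampleBot" entirely,
# and blocks all other bots only from /private/.
robots_txt = """\
User-agent: ExampleBot
Disallow: /

User-agent: *
Disallow: /private/
"""

rp = robotparser.RobotFileParser()
rp.parse(robots_txt.splitlines())

# A compliant scraper would call can_fetch() before each request.
print(rp.can_fetch("ExampleBot", "https://example.com/watch?v=abc"))    # False
print(rp.can_fetch("SomeOtherBot", "https://example.com/watch?v=abc"))  # True
print(rp.can_fetch("SomeOtherBot", "https://example.com/private/x"))    # False
```

Note that robots.txt is purely advisory: nothing technically prevents a scraper from ignoring it, which is why compliance is a matter of policy rather than enforcement.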
Meta is currently exploring partnerships with media publishers to access additional training data, which could enhance the performance of Meta AI. Despite releasing Llama 3, its latest large language model, the company has not disclosed the model’s training data, which reportedly consists of 15 trillion tokens drawn from publicly available sources.
Web scrapers play a central role in extracting online content for training large language models like Llama 3. However, legal concerns about copyright infringement have drawn scrutiny to the data collection practices of tech companies including Meta, Google, and OpenAI.
As the use of AI continues to evolve, it is essential for companies like Meta to navigate the ethical and legal implications of data scraping and model training to ensure compliance with regulations and intellectual property rights.