Meta AI's Secret Web Scraper Revealed: Training Data and Legal Concerns Uncovered

Meta’s AI chatbot claimed that it was trained on vast amounts of data from YouTube videos. Meta did not explicitly confirm the claim, but it noted that the chatbot can provide inaccurate information.

When Business Insider asked the Meta AI chatbot about its training data and data collection methods, it said that it was trained on extensive datasets of transcriptions from YouTube videos. It also said that Meta uses its own web scraper bot, known as MSAE (Meta Scraping and Extraction), to gather large volumes of data from the internet for AI model training.

Meta had not previously disclosed the existence of this web scraper. YouTube’s terms of service prohibit using bots and scrapers to collect data from the platform, a practice that has recently drawn scrutiny in connection with OpenAI.

A Meta spokesperson did not deny the chatbot’s claims about the scraper and training data, saying instead that the AI model may generate inaccurate or inappropriate outputs. The spokesperson said Meta continues to improve these features based on user feedback.

Meta AI initially mentioned that its training data included a third-party dataset comprising 3.7 million transcribed YouTube videos but clarified that the web scraper bot was not used specifically for scraping YouTube videos. Additionally, the chatbot mentioned another substantial dataset of transcriptions from 6 million YouTube videos sourced from a third party, as well as two other sets of YouTube transcriptions and a dataset from TED Talks posted on YouTube.

Meta AI also said that it avoids collecting copyrighted data and that it refers to sources such as NBC News, CNN, and The Financial Times in its responses to queries. The chatbot acknowledged the importance of respecting robots.txt, a file that website owners use to tell bots which parts of a site they should not crawl. Compliance with robots.txt is voluntary on the part of the bot.
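For readers unfamiliar with how a robots.txt check works in practice, the following is a minimal sketch using Python's standard-library `urllib.robotparser`. It does not represent Meta's scraper or any real site's rules: the robots.txt content, the `ExampleBot` and `OtherBot` user-agent names, the `example.com` URLs, and the helper functions are all hypothetical and chosen only for illustration.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, for illustration only.
# ExampleBot may crawl everything except /private/;
# every other bot is asked to stay away from the whole site.
ROBOTS_TXT = """\
User-agent: ExampleBot
Disallow: /private/

User-agent: *
Disallow: /
"""


def build_parser(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text that has already been downloaded."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def is_allowed(parser: RobotFileParser, user_agent: str, url: str) -> bool:
    """Return True if the rules permit user_agent to fetch url."""
    return parser.can_fetch(user_agent, url)


parser = build_parser(ROBOTS_TXT)

print(is_allowed(parser, "ExampleBot", "https://example.com/articles/1"))  # True
print(is_allowed(parser, "ExampleBot", "https://example.com/private/x"))   # False
print(is_allowed(parser, "OtherBot", "https://example.com/articles/1"))    # False
```

A well-behaved crawler would run a check like this before each request and skip any URL for which it returns False. Nothing in the protocol enforces this, which is why the chatbot's statement about respecting robots.txt describes a policy choice rather than a technical barrier.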

Meta is currently exploring partnerships with media publishers to access additional training data for its AI models, which could improve the performance of Meta AI. Although Meta has released Llama 3, its latest large language model, the company has not disclosed the training data used for the model, which reportedly consists of 15 trillion tokens drawn from publicly available sources.

Web scrapers play a crucial role in extracting the online content used to train language models like Llama 3. However, concerns about copyright infringement have prompted scrutiny of the data collection practices of tech companies such as Meta, Google, and OpenAI.

As the use of AI continues to evolve, companies like Meta will need to navigate the ethical and legal implications of data scraping and model training to ensure compliance with regulations and intellectual property rights.
