AI Training Data in the Limelight – Even I’m Surprised


The general public now has a peek into the world of AI, particularly the data that goes into training AI technologies. Last week, the Washington Post highlighted Google’s C4 data set, more formally known as the Colossal Clean Crawled Corpus, which consists of 15 million websites. These websites range from proprietary to personal to offensive, and have been used to train high-profile models like Google’s T5 and Meta’s LLaMA. The underlying data was scraped by the nonprofit CommonCrawl in April 2019, and the process is not foolproof.

As an experienced writer, I’m not surprised that many of my own works are included in the AI training data set. Having reported on data analytics, digital advertising technology and GPS tracking for many years, I cannot deny the vast number of data sources out there.

The scale of data used to train AI technologies is immense. OpenAI’s ChatGPT and Google’s Bard, for example, draw on vast swaths of the World Wide Web, which contains considerable amounts of private information about individuals, giving rise to privacy concerns. It’s not publicly known exactly how the information is used or whether it’s stored in big databases.

In 2021, Stella Rose Biderman, Executive Director of EleutherAI, pointed out that Google’s C4 data set was lower-quality than Eleuther’s Pile data, which contains PubMed, Wikipedia, and GitHub. Prakash, founder and CEO of Together, also noted that AI models learn from, but do not reproduce, the data, which in his view makes it technically fair use.

The ‘yuck’ factor of my work being included in the AI dataset is still strong. It’s like I’m helping to train a goose that might later try to take my place. I feel especially uncomfortable knowing that the world of AI is discussing and incorporating my works.


As a solution, Michael Wooldridge, a professor of computer science at the University of Oxford, suggested that AI researchers should make their data sets publicly available and open to criticism so that they can be monitored. This could help reduce both the spread of misinformation and privacy risks.

VentureBeat is a media outlet founded with the goal of giving readers an unbiased, honest look into the world of technology and business innovation. It provides insights from executives and business insiders, enabling readers to make informed decisions. VentureBeat has more than 10 million tokens included in the AI training data set for LLMs.

