Uncovering the Hidden Websites that Make AI Like ChatGPT Seem Intelligent


Over the past four months, AI chatbots have surged in popularity, and some can now carry on conversations and write sophisticated term papers. Although these bots appear to be thinking, they are actually mimicking speech: the programs are trained by processing enormous amounts of text gathered from web sources. To understand the data used to make AI chatbots talk, The Washington Post set out to analyze one of these datasets, known as C4.

Big tech companies are highly secretive about what goes into their AI chatbots. To protect users from offensive and inappropriate content, they use a filter known as the List of Dirty, Naughty, Obscene, and Otherwise Bad Words, which includes 402 English words as well as an emoji. While the filter is supposed to remove racial slurs and obscenities, it sometimes removes LGBT content as well. It also failed to remove some distressing content, such as anti-trans and anti-government websites. Even material promoting QAnon and the “pizzagate” conspiracy theory was present in the C4 dataset.
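The blocklist filtering described above can be illustrated with a minimal sketch. It assumes that a page is discarded whenever any of its words exactly matches a blocklist entry; the word list, the `keep_page` and `filter_pages` helpers, and the sample pages are all hypothetical placeholders, not the actual list or the pipeline used to build C4.

```python
import re

# Hypothetical stand-in for a blocklist; the real list has about 400 entries.
BLOCKLIST = {"badword", "slur"}

def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z']+", text.lower())

def keep_page(text, blocklist=BLOCKLIST):
    """Return True if none of the page's tokens appears in the blocklist."""
    return not any(token in blocklist for token in tokenize(text))

def filter_pages(pages, blocklist=BLOCKLIST):
    """Keep only the pages that pass the blocklist check."""
    return [page for page in pages if keep_page(page, blocklist)]

pages = [
    "A recipe for lemon cake.",
    "This page uses badword as an insult.",
    "A support forum explaining why badword is hurtful.",
]
kept = filter_pages(pages)
print(kept)
```

In this sketch the third page is dropped even though it discusses the term rather than using it abusively, which is one way a blunt word filter can remove benign content. Exact matching also means that variants not on the list pass straight through, so objectionable material can still survive.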

Generally, the data used to train AI chatbots is a snapshot of websites from a particular period of time. The scrape behind C4 was performed in April 2019 by the nonprofit CommonCrawl, and companies use this data to fine-tune models and protect users from unwanted content. However, research has shown that a lot can still get through: the dataset was found to contain hundreds of pornographic websites and more than 72,000 instances of “swastika.”


AI chatbot models such as GPT-3 consume an overwhelming amount of data. For example, GPT-3’s training data includes all of English-language Wikipedia, novels written by unpublished authors, and text from Reddit links highly rated by users.

Companies don’t typically disclose what their AI chatbots are consuming, and the potential for those chatbots to use private, copyrighted, and offensive content underscores the need for transparency in this field. Recent regulatory changes and ongoing efforts to hold tech companies accountable for their AI bots aim to give users more assurance that their personal data is protected.

