In recent months, Artificial Intelligence chatbots have become increasingly popular for tasks ranging from writing complex essays to holding fluid conversations. Although these chatbots convincingly mimic human language, they are incapable of truly understanding the meaning behind their words. This limitation stems from how they are built: the underlying AI models learn statistical patterns from vast amounts of text scraped from the web.
To explore the data powering these AI chatbots, The Washington Post recently investigated one of these training data sets. The analysis, conducted in collaboration with the Allen Institute for Artificial Intelligence, found that the data used to train chatbots came from a wide variety of websites, potentially including offensive or privacy-sensitive platforms. After examining more than 15.1 million websites in Google's C4 (Colossal Clean Crawled Corpus), the researchers determined that most of the data came from familiar industries such as journalism, entertainment, software development, medicine, and content creation.
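The core of such an analysis is tallying crawled pages by the domain they came from. A minimal sketch of that step is below; the URLs used here are purely illustrative stand-ins (the actual investigation ran over the full C4 snapshot), and the helper name `tally_domains` is an assumption, not code from the study.

```python
from collections import Counter
from urllib.parse import urlparse

def tally_domains(urls):
    """Count how many crawled pages come from each host domain."""
    counts = Counter()
    for url in urls:
        # netloc is the host portion of the URL, e.g. "en.wikipedia.org"
        host = urlparse(url).netloc.lower()
        if host:
            counts[host] += 1
    return counts

# Illustrative sample only -- the real corpus spans millions of sites.
sample = [
    "https://patents.google.com/patent/US123",
    "https://en.wikipedia.org/wiki/Chatbot",
    "https://patents.google.com/patent/US456",
    "https://www.scribd.com/document/1",
]
top = tally_domains(sample).most_common(1)
print(top)  # the most frequently crawled domain in the sample
```

Aggregating by host like this is what produces rankings such as the top-three list discussed below.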
The top three websites in the entire data set were patents.google.com, wikipedia.org, and scribd.com. In addition, twenty-seven sites on the list had been identified by the US government as piracy and counterfeit markets. The findings have caused a stir among privacy advocates because the corpus includes sites hosting private voter registration databases. Furthermore, chatbot engines can unknowingly spread misinformation and propaganda if the data used to train them is unreliable or inauthentic.
The data set also contained websites related to faith, which made up around five percent of the content. Of the top twenty religious sites, fourteen were Christian, two were Jewish, one was Muslim, and one each was dedicated to Mormonism and Jehovah's Witnesses. Such skewed sourcing raises the risk of bias in the resulting language models: one study published in Nature found that GPT-3, a predecessor of ChatGPT, completed prompts about Muslims with violent language in 66 percent of cases.
The Allen Institute for Artificial Intelligence, the research institute that contributed to The Washington Post's investigation, was founded by Microsoft co-founder Paul Allen. As chatbots continue to gain prominence and adoption, it is important to keep researching the data sources behind them in order to better protect users against misinformation, propaganda, and bias.