Recent research conducted by the Washington Post and the Allen Institute for AI found that BeInCrypto was included in a dataset used to train Artificial Intelligence (AI) models like ChatGPT. The dataset in question, C4, consists of text from websites that was “scraped” to build language models. C4, which stands for Colossal Clean Crawled Corpus, has been used to help large language models, such as the one behind ChatGPT, mimic human speech.
The Washington Post used data from web analytics company Similarweb to rank the top 10 million websites included in the dataset. The ranking showed that the top three contributors were patents.google.com, wikipedia.org and scribd.com, a subscription-based digital library. News organizations such as the Guardian, New York Times, Forbes, LA Times and Huffington Post also featured heavily among the dataset’s sources.
The researchers also found websites such as Instructables, an online platform for DIY instructions and tutorials. They even found 27 sites identified by the US government as markets for piracy and counterfeiting.
C4 was first scraped in 2019 by the non-profit CommonCrawl and, as such, is free to use and analyze. Despite its popularity for training AI language models, its use has proven contentious in the sectors most at risk from AI, mainly because content creators are not paid when their data is used for AI training. This concern recently led to a copyright lawsuit filed against the makers of the Midjourney and Stable Diffusion AI image tools for scraping artwork without the artists’ consent.
In conclusion, the Washington Post and the Allen Institute for AI identified BeInCrypto as one of the websites that contributed to the C4 dataset used to improve AI technology like ChatGPT. C4, which stands for Colossal Clean Crawled Corpus, has been popular for training AI language models to mimic human speech. Nevertheless, its use has become increasingly controversial because content creators receive no compensation.