Digital news publishers around the world, including in India, are taking measures to shield their content from powerful web crawlers such as OpenAI’s GPTBot. These AI crawlers collect data from websites to train artificial intelligence models. A recent report by benchmarking agency AltIndex.com finds that 28% of the top 50 news sites globally have blocked at least one AI crawler from accessing their content. The sites doing the blocking include CNN, the New York Times, the Daily Mail, Reuters, and Bloomberg.
AI companies use crawlers to gather data for training their models and for generating the information their chatbots return. Because original content is a publisher’s competitive advantage, many news websites are wary of handing it over to AI crawlers. The rise of large language models and generative AI has sharpened these concerns among news sites, publishers, and other intellectual property holders. With no clear regulatory rules yet governing AI’s use of copyrighted material, some news websites have taken matters into their own hands.
The situation intensified when OpenAI, backed by Microsoft, launched its GPTBot crawler to collect data for improving its language models. Although OpenAI said that paywalled content would be excluded, many high-profile news sites blocked GPTBot. AltIndex.com’s research indicates that by the end of last month, 28% of the top 50 news sites worldwide had blocked at least one AI crawler. The share varies by region: 24% of the leading news sites in the United States and one-third of the top news sites in India block AI crawlers.
In India, members of the Digital News Publishers Association (DNPA), which includes prominent publishers like India Today Group, HT Group, Times Group, and more, have already restricted OpenAI’s access to their sites. Not all news sites have taken action, but among those that have, GPTBot is the most frequently blocked crawler. According to the statistics, 22% of the top 50 news sites block GPTBot, with notable names like Bloomberg, Reuters, Business Insider, the Washington Post, the New York Times, and CNN leading the list.
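The report does not say how these sites block crawlers. One common mechanism, and the one OpenAI documents for GPTBot, is a rule in the site's robots.txt file. The sketch below assumes that mechanism and uses Python's standard urllib.robotparser to check which crawlers a given robots.txt turns away. The sample robots.txt, the example.com URL, the crawler list, and the helper name blocked_crawlers are all illustrative assumptions, not data from the report.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt of a news site that blocks GPTBot only.
SAMPLE_ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: *
Allow: /
"""


def blocked_crawlers(robots_txt, crawlers, url):
    """Return the crawlers that robots_txt forbids from fetching url."""
    parser = RobotFileParser()
    # parse() takes the file as a sequence of lines, so no network call is made.
    parser.parse(robots_txt.splitlines())
    return [name for name in crawlers if not parser.can_fetch(name, url)]


if __name__ == "__main__":
    # Illustrative user-agent names; the article itself names only GPTBot.
    crawlers = ["GPTBot", "CCBot", "Googlebot"]
    story_url = "https://example.com/news/story"
    print(blocked_crawlers(SAMPLE_ROBOTS_TXT, crawlers, story_url))
```

A robots.txt rule is a request rather than an enforcement mechanism, so it only works against crawlers that choose to honor it.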
Aware of the concerns, India’s Ministry of Information and Broadcasting and Ministry of Electronics and Information Technology are working to address the issue. The new Digital India Act aims to incorporate changes that secure revenue and copyright provisions for news publishers. Jurisdictions such as Australia, Canada, and the EU have already taken steps to regulate AI and its impact on news content.
The draft of the Digital India Act is ready and will be released soon, as announced by Rajeev Chandrasekhar, the Union Minister of State for Electronics and Information Technology. It is expected to consider technological advancements and provide a regulatory framework for AI in the country.
Overall, news publishers globally and in India are taking steps to protect their content from AI crawlers, reflecting growing concern over how AI companies collect their data. As the use of generative AI expands, striking a balance between innovation and safeguarding intellectual property rights becomes crucial.