At least 15% of the top 100 websites and 7% of the top 1,000 websites are currently blocking OpenAI’s GPTBot, a web crawler introduced on August 7, according to a new analysis. The study, conducted by AI content and plagiarism detection service Originality.ai, found that the percentage of websites blocking GPTBot is increasing by approximately 5% each week. In total, 69 of the 1,000 most popular websites globally already block GPTBot.
The decision to block GPTBot is likely motivated by concerns that OpenAI scrapes website data to train its AI models without compensating publishers, and that ChatGPT does not cite or link to its sources. Popular websites that have moved to prevent OpenAI from accessing their content include Amazon, Quora, The New York Times, and Shutterstock.
However, many of the sites blocking GPTBot are not blocking CCBot, Common Crawl’s web crawler. Common Crawl provides training data not only to OpenAI but also to Google and other organizations. It appears that some websites are comfortable with their content being used for training as long as it is collected through Common Crawl.
The Originality.ai analysis further reveals that at least 62 of the top 1,000 websites have blocked CCBot, showing that some websites are taking a more cautious approach by blocking both GPTBot and CCBot. These include popular websites like Shutterstock, Reuters, and Good Housekeeping.
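Blocks of this kind are declared in a site's robots.txt file, so they can be checked programmatically. The following is a minimal sketch, not the method Originality.ai used: it relies on Python's standard `urllib.robotparser`, and the sample robots.txt, the `blocked_agents` helper, and the `example.com` URL are all hypothetical, chosen only for illustration.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt that blocks both crawlers named in the article
# while leaving every other user agent unrestricted.
SAMPLE_ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: *
Allow: /
"""


def blocked_agents(robots_txt, agents=("GPTBot", "CCBot"),
                   url="https://example.com/"):
    """Return the agents that robots_txt forbids from fetching url."""
    parser = RobotFileParser()
    # parse() takes the file's lines, so no network request is made here.
    parser.parse(robots_txt.splitlines())
    return [agent for agent in agents if not parser.can_fetch(agent, url)]


if __name__ == "__main__":
    print(blocked_agents(SAMPLE_ROBOTS_TXT))
```

A site that lists only GPTBot in its robots.txt would be reported as blocking GPTBot but not CCBot, which is the pattern the analysis describes for many sites.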
It is essential to acknowledge the limitations of the analysis, as 241 robots.txt files out of the 1,000 websites were not inspected as part of the study. This serves as a reminder that the reported numbers should be considered a minimum rather than an exact figure.
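To see why the figures are a floor, the arithmetic can be worked through directly. This is a rough sketch under a stated assumption: that all 69 GPTBot blocks were found among the robots.txt files that were inspected, and that nothing is known about the 241 that were not. The function name `blocking_share_bounds` is hypothetical.

```python
def blocking_share_bounds(blocked, total, uninspected):
    """Return (lower bound, share among inspected sites, upper bound).

    Assumes every observed block came from an inspected site and that
    uninspected sites could fall either way.
    """
    inspected = total - uninspected
    lower = blocked / total                   # no uninspected site blocks
    among_inspected = blocked / inspected     # share of sites actually checked
    upper = (blocked + uninspected) / total   # every uninspected site blocks
    return lower, among_inspected, upper


if __name__ == "__main__":
    # Figures reported in the article: 69 blocks, 1,000 sites, 241 uninspected.
    lower, among_inspected, upper = blocking_share_bounds(69, 1000, 241)
    print(f"{lower:.1%} {among_inspected:.1%} {upper:.1%}")
```

Under that assumption, the true share of top 1,000 sites blocking GPTBot lies somewhere between the reported figure and a much higher ceiling, with the share among inspected sites sitting slightly above the headline number.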
Website owners and SEO professionals are also weighing the potential impact on search engine optimization (SEO) as they decide whether to block GPTBot from accessing their websites. The question has prompted wider discussion about AI systems scraping website data, with some websites opting to block GPTBot over concerns about unauthorized use of their content.
As the use of AI systems continues to grow, it is crucial to strike a balance between accessing valuable content for training models and respecting the rights and concerns of website owners. The decisions websites are making to block or allow GPTBot and CCBot highlight the need for clearer guidelines and protocols in this area.
In conclusion, the Originality.ai analysis indicates that a significant number of top websites are blocking OpenAI’s GPTBot, motivated by concerns about data scraping and the lack of citation. While some websites block both GPTBot and CCBot, others still allow their content to be collected through Common Crawl. The debate raises important questions about the ethics and regulation of scraping website data for AI training. Stakeholders will need to work together on clearer guidelines that address these concerns and support a fair, mutually beneficial relationship between AI developers and website owners.