New Research Reveals Biases and Vulnerabilities in ChatGPT Detectors
Artificial intelligence (AI) has become an integral part of our lives, quietly powering social media, scientific research, advertising, and other industries. But the arrival of OpenAI’s ChatGPT has set off a race in education, with students using the chatbot to cheat by generating convincingly human-like essays. In response, teachers have turned to detection software to catch plagiarists in the act.
A recent study conducted by Stanford University, published in the journal Patterns, delved into the reliability of generative AI detectors in distinguishing between human-written and AI-generated texts. The surprising discovery was that some of the most popular GPT detectors, specifically designed to identify AI-generated text, consistently misclassified essays written by non-native English speakers as being AI-generated. This highlights the limitations and biases that users of such detectors should be aware of.
To investigate this further, the researchers compared 91 essays written by Chinese TOEFL (Test of English as a Foreign Language) participants with 88 essays penned by eighth-grade students in the United States. These essays were then run through seven off-the-shelf GPT detectors, including OpenAI’s detector and GPTZero. The findings indicated that only 5.1% of the US student essays were classified as AI-generated. However, a staggering 61% of the TOEFL essays written by non-native English speakers were misclassified. In fact, one particular detector flagged 97.8% of the TOEFL essays as AI-generated.
Interestingly, all seven detectors unanimously misclassified 18 of the 91 TOEFL essays as AI-generated. On closer examination of these 18 essays, the researchers found that low text perplexity was the likely culprit. Perplexity measures how predictable a piece of text is to a language model: the more conventional the word choices and sentence structures, the lower the score. Because large language models themselves tend to produce low-perplexity text, detectors treat a low score as a sign of machine authorship. Non-native English writers, who typically draw on a smaller vocabulary and simpler grammar, produce low-perplexity prose for the same reason, and the detectors mistake it for AI-generated writing.
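The commercial detectors in the study are closed systems, but the underlying perplexity signal is easy to reproduce. The sketch below is purely illustrative, assuming GPT-2 via the Hugging Face transformers library as the scoring model and two made-up example sentences; it is not the detectors’ actual implementation. It computes perplexity as the exponential of the average per-token loss, so plainer, more predictable prose scores lower.

```python
# Illustrative only: scores text with GPT-2 perplexity, the kind of signal
# many GPT detectors rely on. The study's detectors are proprietary; this
# is an assumption-laden stand-in, not their actual code.
import math
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

def perplexity(text: str) -> float:
    """Perplexity = exp(mean negative log-likelihood per token)."""
    enc = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        # Passing labels makes the model return the mean cross-entropy loss.
        loss = model(enc.input_ids, labels=enc.input_ids).loss
    return math.exp(loss.item())

# Hypothetical example sentences, not taken from the study's essays.
simple = "I like school. School is good. I learn many things in school."
ornate = "The seminar's labyrinthine syllabus rewarded patience with flashes of insight."
print(perplexity(simple))   # lower score: predictable wording, more likely to be flagged
print(perplexity(ornate))   # higher score: varied wording, more likely to pass as human
```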
In essence, the more verbose and literary the text, the less likely it is to be classified as AI-generated. This bias carries real consequences for non-native English speakers, who could be penalized in job applications or school exams after being incorrectly flagged as submitting AI-generated work.
To further investigate the issue, the researchers conducted a second experiment that employed AI-generated text and assessed whether detection software could correctly identify it. They used ChatGPT to generate responses to the 2022-2023 US college admission essay prompts and ran these essays through the seven detectors. On average, the detectors successfully identified AI-generated essays 70% of the time. However, to challenge the detectors, the researchers prompted ChatGPT to enhance the essays by incorporating literary language. Astonishingly, the detectors struggled to classify these essays correctly, identifying them as AI-generated only 3.3% of the time. Similar results were observed when ChatGPT was tasked with composing scientific abstracts.
James Zou, co-author of the study and a biomedical data scientist at Stanford University, said he was surprised both by how poorly the commercial detectors handled text written by non-native English speakers and by how easily they were fooled by GPT itself. Ironically, this weakness may push non-native English speakers to increasingly run their own work through ChatGPT so that it reads like native prose and escapes being flagged.
This research raises a crucial question: What is the utility of detectors if they are easily deceived and misclassify human text as AI-generated? Zou suggests that a potential solution to strengthen detectors could be comparing multiple writings on the same topic, including both human and AI responses, followed by clustering analysis. This approach may offer a more robust and equitable detection mechanism.
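The paper does not spell out how such a comparison would work in practice. Purely as an illustration of the clustering idea, the sketch below uses scikit-learn with invented placeholder essays: responses to the same prompt are embedded with TF-IDF and grouped with k-means, on the hope that human and AI answers fall into different stylistic clusters. Everything here, from the essay texts to the choice of TF-IDF and k-means, is an assumption for demonstration, not the authors’ method.

```python
# Hypothetical illustration of clustering-based detection; not from the paper.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

# Placeholder essays on the same prompt; in practice these would be the
# pooled human and AI responses being compared.
essays = [
    "My summer job taught me patience and how to listen.",
    "Working at the diner, I learned that small kindnesses matter.",
    "Throughout my formative experiences, I cultivated resilience and growth.",
    "My journey fostered profound personal growth and enduring resilience.",
]

# Embed the essays with TF-IDF and split them into two clusters by style.
vectors = TfidfVectorizer().fit_transform(essays)
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(vectors)
print(labels)  # essays grouped by stylistic similarity, e.g. [0 0 1 1]
```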
While the detectors’ flaws are evident, they might have untapped potential. The researchers speculate that if a GPT detector can be designed to detect overused phrases and structures, it could encourage greater creativity and originality in writing.
Thus far, the race between generative AI models like ChatGPT and detection software has resembled the Wild West, with improvements in AI being matched by advances in detectors, all with limited oversight. The researchers emphasize the need for further research and assert that all stakeholders affected by generative AI models should be involved in discussions surrounding their appropriate usage.
Until then, the researchers strongly caution against deploying GPT detectors in evaluative or educational settings, especially when assessing the work of non-native English speakers.