In a reminder of the limits of artificial intelligence (AI), OpenAI's ChatGPT has failed a practice test created by the American College of Gastroenterology (ACG). When tested on questions from the ACG's 2021 and 2022 multiple-choice self-assessment tests, both the GPT-3.5 and GPT-4 versions of the chatbot fell short of the 70% passing grade.
The tests were conducted by Arvind Trindade, MD, of Northwell Health's Feinstein Institutes for Medical Research in Manhasset, New York, and his colleagues. Questions from the assessments were copied and pasted directly into ChatGPT, which then generated a response and an explanation; from these, the authors selected the corresponding answer choice.
GPT-3.5 scored 65.1% (296 of 455 questions) and GPT-4 scored 62.4% (284 of 455), both below the 70% grade required to pass the exam. The scores were also lower than expected, prompting the study's authors to argue that the bar for clinical use should be set higher still.
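For readers who want to check the arithmetic, the minimal Python sketch below (not part of the study) reproduces the reported percentages from the raw counts and compares them against the 70% threshold:

```python
# Minimal sketch (not from the study): reproducing the reported pass/fail
# arithmetic, assuming a 455-question exam and a 70% passing grade.
PASSING_GRADE = 0.70
TOTAL_QUESTIONS = 455

correct_answers = {"GPT-3.5": 296, "GPT-4": 284}  # counts reported in the study

for model, correct in correct_answers.items():
    score = correct / TOTAL_QUESTIONS
    verdict = "pass" if score >= PASSING_GRADE else "fail"
    print(f"{model}: {score:.1%} -> {verdict}")

# Output:
# GPT-3.5: 65.1% -> fail
# GPT-4: 62.4% -> fail
```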
Recent papers have shown ChatGPT passing other medical assessments, but Dr. Trindade argued that this does not mean the technology is ready for the clinic. Medical professionals, he said, should think about how to optimize such tools rather than rely on them for clinical decisions, and he noted that the medical community should hold them to a much higher standard, such as an accuracy threshold of 95% or more.
Google researchers have developed their own medically tuned AI model, Med-PaLM, which achieved 67.6% accuracy on US Medical Licensing Exam-style questions, surpassing the commonly cited passing threshold. An updated version, Med-PaLM 2, reached 85% accuracy and performed at an "expert" physician level.
AI chatbots such as ChatGPT have also been found to beat physicians in answering patient-generated questions. During a blind evaluation, the AI chatbot’s responses were preferred over real-physician answers 75% of the time.
While research into AI performance on medical credentialing tests has shown rapid progress, it is also a reminder that AI is still far from consistently providing accurate, reliable advice. Medical professionals should weigh all available sources of information when making decisions and should continue to prioritize human expertise over artificial intelligence.