Study Reveals ChatGPT’s Inaccuracy in Pediatric Diagnoses, Urgent Enhancements Needed
A recent study has shed light on the inaccuracies in pediatric diagnoses made by ChatGPT, a chatbot based on a large language model (LLM). The research found that the majority of pediatric cases were misdiagnosed by the chatbot, highlighting the urgent need for enhancements in AI healthcare.
In the study, 100 pediatric case challenges were presented to ChatGPT version 3.5. Shockingly, the chatbot made inaccurate diagnoses in 83 of these cases. Out of the incorrect diagnoses, 72 were completely wrong, while 11 were clinically related but too broad to be considered correct.
One striking example involved a child with autism who presented with a rash and joint pain: ChatGPT diagnosed immune thrombocytopenic purpura, whereas the physician's correct diagnosis was scurvy, a vitamin C deficiency. Another case involved a draining papule on an infant's neck, which the chatbot diagnosed as a branchial cleft cyst while the doctor accurately identified branchio-oto-renal syndrome.
Despite the high error rate, the researchers emphasize that physicians should continue to explore the applications of language models to medicine. They acknowledge that LLMs and chatbots have potential as administrative tools for physicians, showing proficiency in tasks such as writing research articles and generating patient instructions.
However, the study highlights chatbots' limited diagnostic accuracy in pediatric cases. An earlier study had found that a chatbot correctly diagnosed 39% of complex cases, which suggested LLM-based chatbots might still serve as supplementary tools for clinicians. Until now, however, the accuracy of LLM-based chatbots had not been assessed in pediatric scenarios, where the patient's age must be considered alongside the symptoms.
The findings underscore the irreplaceable role of clinical experience in accurate diagnoses. Chatbots, unlike physicians, are unable to identify crucial relationships in medical conditions, such as the link between autism and vitamin deficiencies.
The researchers attribute the chatbot’s lackluster performance to the fact that LLMs do not distinguish between reliable and unreliable information. They simply generate responses by regurgitating text from the training data. To improve chatbot diagnosis accuracy, more selective training will be necessary.
To conduct the study, the researchers collected pediatric case challenges published in JAMA Pediatrics and the New England Journal of Medicine and used them to assess the diagnostic capabilities of ChatGPT version 3.5. Two physician researchers evaluated the chatbot-generated diagnoses, scoring each as correct, incorrect, or not fully capturing the diagnosis.
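The paper does not publish its evaluation scripts, but the protocol is straightforward to reproduce in outline. The sketch below is a minimal, hypothetical Python example of how case text could be submitted to ChatGPT 3.5 through the OpenAI API and the physician ratings tallied afterward; the prompt wording, file names, and rating labels are assumptions, not details taken from the study.

```python
# Hypothetical sketch of the study's protocol: send each pediatric case
# vignette to ChatGPT 3.5 and collect its answer for later physician review.
# Prompt text and data layout are assumptions, not the study's own setup.
import csv
from collections import Counter

from openai import OpenAI  # pip install openai

client = OpenAI()  # reads OPENAI_API_KEY from the environment


def chatbot_diagnosis(case_text: str) -> str:
    """Ask the model for a differential and a single most likely diagnosis."""
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[
            {"role": "system",
             "content": "You are assisting with a diagnostic exercise."},
            {"role": "user",
             "content": "List a differential diagnosis and state the single "
                        "most likely diagnosis for this pediatric case:\n"
                        + case_text},
        ],
    )
    return response.choices[0].message.content


# cases.csv is assumed to hold one case vignette per row: id, case_text
with open("cases.csv", newline="", encoding="utf-8") as f:
    cases = list(csv.DictReader(f))

answers = {c["id"]: chatbot_diagnosis(c["case_text"]) for c in cases}

# ratings.csv is assumed to hold the physicians' consensus score per case:
# id, rating  (rating in {"correct", "incorrect", "not_fully_captured"})
with open("ratings.csv", newline="", encoding="utf-8") as f:
    ratings = Counter(row["rating"] for row in csv.DictReader(f))

# e.g. Counter({'incorrect': 72, 'not_fully_captured': 11, 'correct': 17})
print(ratings)
```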
One notable finding was that more than half of the incorrect diagnoses belonged to the same organ system as the correct diagnosis. In addition, 36% of the final case-report diagnoses appeared in the chatbot's differential list.
The study’s results have raised concerns about the reliability of chatbots in pediatric healthcare settings. While there is potential for language models to assist clinicians, it is clear that significant enhancements are needed to ensure their accuracy and usefulness in diagnosing pediatric cases.
In conclusion, the study reveals the inaccuracies of ChatGPT in pediatric diagnoses and highlights the urgent need for improvements in AI healthcare. The research emphasizes the invaluable role of clinical experience and calls for more selective training to enhance chatbot diagnosis accuracy. While language models show promise in medicine, their limitations must be addressed before they can be fully integrated into pediatric healthcare.