Cohen Children’s Medical Center in New York has conducted a study to assess the pediatric diagnostic skills of OpenAI’s ChatGPT, and the results are far from encouraging. The study, published in the prestigious journal JAMA Pediatrics, involved three pediatricians, Joseph Barile, Alex Margolis, and Grace Cason, who set out to evaluate the accuracy of ChatGPT’s diagnoses across 100 randomly selected pediatric case studies.
One of the challenges of pediatric diagnosis is that a clinician must weigh not only the symptoms a patient exhibits but also the patient’s age. LLMs (large language models) such as ChatGPT have been touted as a promising new tool in medicine. To test that promise, the researchers used a simple approach: they gave ChatGPT the text of each case study and posed the prompt, “List a differential diagnosis and a final diagnosis.”
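For readers who want to picture the setup, the sketch below shows one way such a prompting protocol could be reproduced programmatically. It is an illustration only, not the authors’ method: the study submitted case text through the ChatGPT interface, so the API usage, model name, and case text here are assumptions.

```python
# Illustrative sketch only: the study used the ChatGPT interface, not the API.
# The model name and case text below are assumptions for demonstration.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment


def diagnose(case_text: str) -> str:
    """Send a pediatric case description followed by the study's prompt wording."""
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",  # assumed model choice
        messages=[
            {
                "role": "user",
                "content": f"{case_text}\n\nList a differential diagnosis and a final diagnosis.",
            }
        ],
    )
    return response.choices[0].message.content


if __name__ == "__main__":
    # Hypothetical case text; the study drew its cases from published case studies.
    print(diagnose("A 10-year-old presents with a 3-week history of fatigue and joint pain."))
```

The model’s free-text answer would then be handed to human evaluators for scoring, as described next.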
A differential diagnosis is a working list of plausible diagnoses drawn up from a patient’s history and physical examination; the final diagnosis is the single presumed cause of the symptoms. ChatGPT’s responses were scored by two independent evaluators who were not otherwise involved in the study, using three possible ratings: correct, incorrect, and “did not fully capture the diagnosis.”
Unfortunately, ChatGPT was scored as correct in only 17 of the 100 cases; a further 11 of its answers were clinically related to the correct diagnosis but were ultimately judged incorrect. The clear conclusion from this research is that ChatGPT is far from ready for use as a diagnostic tool. The researchers suggest, however, that more selective training might improve its performance, and in the meantime they point to other potential applications for language models like ChatGPT, such as administrative tasks, assistance with writing research articles, or generating aftercare instruction sheets for patients.
This study highlights the limitations of AI language models in medicine, particularly in demanding areas such as pediatric diagnosis. While these models show promise in certain areas, their performance falls short when it comes to accurately diagnosing patients. Nonetheless, as the technology continues to evolve and researchers refine the training process, there may be opportunities to apply AI language models to various administrative and supportive tasks in healthcare.
Although ChatGPT has proven to be insufficient as a diagnostic tool in this specific study, it is essential to recognize the potential of AI in healthcare. With further development and refinement, these tools may eventually provide valuable assistance to medical professionals, improving patient care and outcomes. However, it is crucial to approach their implementation cautiously, always prioritizing the expertise and judgment of trained healthcare professionals.