In a recent study hosted on the medRxiv preprint server, experts in the United States sought to analyze the performance of three general-purpose large language models (ChatGPT, GPT-4, and Google Bard) on higher-order questions related to the American Board of Neurological Surgery (ABNS) oral board examination. This type of examination is taken by doctors past residency and contains a difficult set of questions relating to neurosurgical indications and decision-making. By varying the questions, researchers collected data to understand the accuracy of the language models and the differences between them.
The study found that GPT-4 ranked highest in terms of accuracy, scoring 82.6%. Compared to ChatGPT and Google Bard, GPT-4 offered greater accuracy, especially in questions concerning the spine, where its accuracy was 90.5% as opposed to 64.3%. Google Bard returned correct answers 44.2% of the time and showed lower accuracy in almost all categories. In addition, GPT-4 showed lower rates of hallucination, which is when a model presents false or fabricated information as if it were true. The results of the study show that more trust needs to be put into LLMs and that rigorous tests should be conducted.
Neha Mathur is a researcher who worked on this study and posted it to the medRxiv preprint server for publication. Neha is currently researching and writing about advancements in artificial intelligence and its impact on medicine. She has published several research papers on the subject, taking a particular interest in LLM systems and their integration into clinical decision-making processes.
Lily Ramsey, LLM, provided the review for the article. She is a research law associate whose work focuses on technology law and the regulatory frameworks associated with the use of AI-based systems in different industries. In her recent work, Lily has sought to identify new opportunities to use human-computer interaction (HCI) to its full potential in such industries.
The article is an important piece as it demonstrates the current potential of these language models. These models are able to process text with considerably greater accuracy than that of humans and eliminate the tedious process of taking multiple-choice exams with medical imaging data. Neurosurgical trainees would greatly benefit from the convenience of using LLM systems to prepare for the board exams, and AI chatbots can offer more accurate information that is tailored to their needs.