ChatGPT, an artificial intelligence language model, has been found to write clinical notes as effectively as senior internal medicine residents, according to a new study. The research suggests that ChatGPT may be ready for a larger role in everyday clinical practice.
The study, conducted by a team of researchers from Stanford University, compared history of present illness (HPI) notes generated by ChatGPT with those written by senior residents. Grades for the two sets of HPIs differed by less than 1 point on a 15-point scale, indicating that ChatGPT performed on par with the senior residents.
However, the resident-written HPIs were judged to be more detailed than those generated by ChatGPT. Even so, attending physicians in internal medicine correctly identified which HPIs were written by ChatGPT only 61% of the time.
Lead researcher Dr. Ashwin Nayak noted that large language models like ChatGPT have reached a level of advancement where they can draft clinical notes that are suitable for clinicians to review. This could potentially automate some of the more mundane tasks and documentation processes that clinicians typically do not enjoy.
The study involved 30 internal medicine attending physicians blindly evaluating five HPIs, with four written by senior residents and one generated by ChatGPT. The physicians graded the notes based on their level of detail, succinctness, and organization.
The researchers used an iterative prompt engineering method to generate the AI-written HPIs. They fed a transcript of a patient-provider interaction into ChatGPT to produce HPIs, analyzed the output for errors, and revised the prompt based on the results. This cycle was repeated twice, and one final AI-written HPI was selected for comparison with the senior resident HPIs.
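While the paper's prompts and tooling are not reproduced in this article, the generate-review-revise loop it describes can be sketched in a few lines of Python. The sketch below assumes the OpenAI Python SDK; the prompt text and the review and refinement helpers are hypothetical stand-ins for steps the researchers performed manually.

```python
# A sketch of the generate/review/revise loop described in the study, assuming
# the OpenAI Python SDK. The actual prompts, transcripts, and error criteria
# are not published here; the prompt text and the review_for_errors and
# refine_prompt helpers below are hypothetical placeholders for steps the
# researchers performed by hand.
from openai import OpenAI

client = OpenAI()  # reads the API key from the OPENAI_API_KEY environment variable


def generate_hpi(prompt: str, transcript: str) -> str:
    """Ask the model to draft an HPI from a patient-provider transcript."""
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",  # the study used a GPT-3.5-era ChatGPT
        messages=[
            {"role": "system", "content": prompt},
            {"role": "user", "content": transcript},
        ],
    )
    return response.choices[0].message.content


def review_for_errors(hpi: str) -> list[str]:
    # Placeholder: in the study this was a manual error analysis of the
    # generated note (e.g., omissions or fabricated details).
    return []


def refine_prompt(prompt: str, errors: list[str]) -> str:
    # Placeholder: in the study the researchers revised the prompt by hand
    # based on the errors they found.
    if errors:
        return prompt + " Avoid the following errors: " + "; ".join(errors)
    return prompt


prompt = (  # hypothetical starting prompt
    "You are a senior internal medicine resident. Write a concise, organized "
    "history of present illness (HPI) based on the following transcript."
)
transcript = open("visit_transcript.txt").read()  # the study used fictional transcripts

# Generate, review, and revise; the researchers repeated this cycle twice
# before selecting one final AI-written HPI.
for _ in range(2):
    hpi = generate_hpi(prompt, transcript)
    prompt = refine_prompt(prompt, review_for_errors(hpi))

final_hpi = generate_hpi(prompt, transcript)
```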
Despite the need for prompt engineering and the potential for errors in the AI-generated HPIs, Nayak highlighted the potential of using AI chatbots in clinical documentation. He acknowledged that while the notes may not need to be perfect, they should surpass a certain threshold of quality.
Nayak also mentioned that the study used an earlier version of ChatGPT powered by GPT-3.5. He speculated that if the experiment were repeated with the newer GPT-4 model, the results would likely be even more striking: the AI-generated notes would be equivalent to, or even better than, those written by humans, and physicians would fare even worse at determining whether a note was written by AI or a human.
However, Nayak cautioned against drawing definitive conclusions about implementing ChatGPT in real-world clinical note writing. The study used fictional transcripts, and more research and testing are necessary, especially with real patient data.
An accompanying editorial stressed the need for evidence-based research when incorporating AI technology into clinical practice. The authors emphasized that understanding how and when AI technology can be used in medicine is crucial.
In a related study published alongside the research letter and editorial, the GPT-4 version of ChatGPT outperformed Stanford University medical students on clinical reasoning exams, a finding that highlights the potential for incorporating AI-related topics into clinical training and continuing medical education.
Overall, the study suggests that ChatGPT shows promise in producing clinical notes comparable to those written by experienced clinicians. As AI technology continues to advance, it could play a significant role in automating certain tasks and improving patient care. However, further research and evaluation are needed before widespread implementation in clinical practice.