A team of researchers from Brigham and Women’s Hospital has evaluated the potential biases of the large language model GPT-4 in clinical decision support scenarios. The findings, published in The Lancet Digital Health, raise concerns that the model could perpetuate social biases that harm historically marginalized groups.
The study examined four specific applications of GPT-4: generating clinical vignettes for medical education, diagnostic reasoning, clinical plan generation, and subjective patient assessments. When generating clinical vignettes, GPT-4 failed to accurately model the demographic diversity of medical conditions. It exaggerated known demographic prevalence differences in 89% of diseases, potentially skewing perceptions of which patients are affected by specific conditions.
Furthermore, when asked for subjective assessments of patients, GPT-4 produced significantly different responses depending on gender or race/ethnicity in 23% of cases. This suggests that the model encodes biases that can surface in its outputs, which could contribute to disparities in patient care and outcomes.
Large language models like GPT-4 have shown promise in supporting clinical practice by automating administrative tasks, drafting clinical notes, and aiding in decision-making. However, this study raises concerns that such models may inadvertently encode and perpetuate biases, with adverse effects on marginalized groups.
Lead author Emily Alsentzer, a postdoctoral researcher at Brigham and Women’s Hospital, stresses the importance of assessing biases in GPT-4 to ensure it is appropriate for clinical decision support. Alsentzer notes that although LLM-based tools are currently used with clinician verification, clinicians reviewing individual patient cases may find it difficult to detect systemic biases.
The study’s limitations include a small number of simulated prompts and an analysis restricted to traditional categories of demographic identity. Future research should examine biases using clinical notes from electronic health records to gain a more comprehensive understanding of the model’s performance.
These findings prompt a necessary conversation about the potential for GPT-4 to propagate biases in clinical decision support applications. They highlight the need for rigorous bias evaluations for each intended use of large language models in healthcare. By addressing these concerns, researchers aim to ensure equitable and unbiased treatment for all patients.
This study contributes to the ongoing discussion surrounding the use of artificial intelligence in healthcare and emphasizes the importance of transparency and accountability in developing AI tools that avoid perpetuating biases.