Author(s)
Ashraf Nawari 1
Jamal Zahir 2
Sonal Kumar 2
Lovingly Ocampo 3
Olivia Opara 2
Hassan Ahmad 2
Benjamin Crawford 4
Brian Feeley, MD (faculty) 1
Affiliation(s)
1Department of Orthopaedic Surgery, University of California, San Francisco, CA; 2Department of Surgery, Ross University School of Medicine; 3College of Osteopathic Medicine, Touro Middletown, NY; 4Department of Orthopaedic Surgery, St Mary's Medical Center San Francisco Orthopaedic Residency Program
Abstract:
Background:
The rapid improvement of generative artificial intelligence (AI) models in medical domains, including answering board-style questions, warrants further investigation of their utility and accuracy in answering orthopaedic surgery written board questions. Previous studies have analyzed the performance of ChatGPT alone on board exams, but a head-to-head analysis of multiple current AI models has yet to be performed. Hence, the objective of this study was to compare the utility and accuracy of various large language models (LLMs) in answering Orthopaedic Surgery In-Training Exam (OITE) written board questions, both with one another and with orthopaedic surgery residents.
Methods:
A complete set of questions from the 2022 OITE was entered into various LLMs, and the results were scored and compared against the performance of orthopaedic surgery residents nationally. Results were analyzed by overall performance and by question type. Type A questions tested knowledge and recall of facts; Type B questions involved diagnosis and analysis of information; and Type C questions focused on the evaluation and management of diseases, requiring knowledge and reasoning to develop treatment plans.
Results:
Google Gemini was the most accurate tool, answering 69.9% of questions correctly. It also outperformed ChatGPT and Claude on Type A (76.9%) and Type C (67.4%) questions, whereas Claude performed best on Type B questions (70.7%). Questions without images were answered more accurately than those with images (65.9% vs. 34.1%). All LLMs performed above the average of a first-year orthopaedic surgery intern, and the performance of Google Gemini and Claude approached that of fourth- and fifth-year orthopaedic surgery residents.