Author(s)
Sasha Severin (student)1
Mehdi Boostani, MD2
Lana Salloum (student)3
Zahidul Islam (student)1
Szabolcs Bozsanyi, MD, PhD2
Joshua Arbesman, MD4
Alicia Goldenberg, MD2
Gyorgy Paragh, MD, PhD2
Affiliation(s)
1 New York Medical College, Valhalla, NY; 2 Department of Dermatology, Roswell Park Comprehensive Cancer Center, Buffalo, NY; 3 Albert Einstein College of Medicine, New York, NY; 4 Cleveland Clinic, Cleveland, OH
Abstract:
Skin cancer is among the most prevalent malignancies worldwide, and early, accurate diagnosis of skin lesions is crucial for effective treatment and optimal patient outcomes. With the increasing accessibility of artificial intelligence (AI) and the widespread use of electronic patient portals, patients are increasingly likely to turn to AI to interpret and clarify their results. This underscores the importance of assessing whether large language models (LLMs) can provide accurate interpretations of pathology reports, and whether they serve as a beneficial tool for the public or contribute to misinformation and patient distress. This study evaluated the performance of five LLMs (ChatGPT-4o, ChatGPT o1-mini, Gemini 1.5 Flash, Meta Llama 3.1, and Copilot) in determining the malignant potential of a lesion, providing surgical recommendations, and estimating scar size when prompted with a pathology report and corresponding follow-up questions. A total of 41 pathology reports, each corresponding to one of seven lesion types, were used to analyze model performance across different pathological presentations. The LLMs achieved relatively high accuracy in determining malignant potential but struggled with surgical recommendations and scar size estimation. For example, overall accuracy in determining malignant potential was 98.04%, compared with 67.8% for surgical recommendations. Furthermore, LLM accuracy clearly depended on the lesion type presented, as demonstrated by a tendency to recommend further excision based solely on the presence of positive margins. Moreover, when assessing atypical lesions, the LLMs often failed to interpret complex histological details, resulting in occasional misclassifications that could prove hazardous. While AI has the potential to support patient education, these results suggest that LLMs require further refinement in pathology report interpretation to improve their contextual understanding and diagnostic reliability.
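As a purely illustrative sketch (not the study's actual analysis code), per-task accuracies of the kind reported above could be tabulated from graded model responses as follows; all function names, field names, and example records here are hypothetical and assume each response has already been judged correct or incorrect against the pathology report.

    # Hypothetical illustration: tabulating per-task accuracy from graded LLM responses.
    # Records and names are invented for demonstration and do not come from the study.
    from collections import defaultdict

    # Each record: (model, task, graded_correct) judged against the pathology report.
    graded_responses = [
        ("ChatGPT-4o", "malignant_potential", True),
        ("ChatGPT-4o", "surgical_recommendation", False),
        ("Gemini 1.5 Flash", "malignant_potential", True),
        ("Gemini 1.5 Flash", "scar_size", False),
        ("Meta Llama 3.1", "surgical_recommendation", True),
    ]

    def accuracy_by_task(records):
        """Return {task: fraction of responses graded correct}."""
        totals, correct = defaultdict(int), defaultdict(int)
        for _model, task, is_correct in records:
            totals[task] += 1
            correct[task] += int(is_correct)
        return {task: correct[task] / totals[task] for task in totals}

    if __name__ == "__main__":
        for task, acc in accuracy_by_task(graded_responses).items():
            print(f"{task}: {acc:.1%}")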