

Current Artificial Intelligence Based Chatbots May Produce Inaccurate and Potentially Harmful Information for Patients With Aortic Disease

Tinelli, Giovanni
2024

Abstract

The world has witnessed the diffusion of artificial intelligence (AI) based chatbots, computer programs designed to simulate a conversation with a human being. AI chatbots are rapidly evolving thanks to significant investments in the digital world.1 This technology is here to stay and must be acknowledged for the good and the bad. Many individuals already trust it for accurate information in various fields, including health, where reliability is crucial. However, the same innovative and revolutionary tool can be deleterious if used improperly, and users must be aware of the possibility of encountering inaccurate information.2–4

A web based community of people with an interest in aortic disease (Think Tank Aorta), led by a former aortic patient (T.S.), is pursuing the project of interrogating two commonly used, free of charge chatbots, namely ChatGPT 3.0 (OpenAI, San Francisco, CA, USA) and Bing (Microsoft, Redmond, WA, USA), to evaluate the accuracy of replies to common questions that an aortic patient might ask (chatAortaAI).5 The chatAortaAI project is run on a completely voluntary basis and includes patients, former patients, family members, and healthcare professionals, as well as any other individual with an interest in aortic disease. The team is composed of 22 individuals from 11 countries who guarantee their continued interest in the project, and its members are fluent in a total of 10 languages.

In the pilot study (April 1 – May 31, 2023), each team member interrogated the chatbots with questions that an aortic patient, seeking information on the disease, might ask. It should be noted that the same question was asked more than once, with identical or similar wording, in the same or different sessions, and also in different languages. Three sets of questions were asked: non-medical (BASIC), aortic (AORTA), and complex aortic disease (ADVANCED). The total number of questions across the three levels was 42, including seven test questions for the AI and 35 questions to evaluate the accuracy of AI responses. The form BASIC included questions 1–17 (https://docs.google.com/forms/d/e/1FAIpQLSeRBZrU97p71CT_KIx5vjFqrRPgUclMpU7s4sDS5kN7nY0CXg/viewform?pli=1), the form AORTA included questions 18–29 (https://docs.google.com/forms/d/e/1FAIpQLSfENIb9n_Tao9Ox59xjnKdjsDshBWy0O7Wsuj9lA0Ec_cqtIw/viewform), and the form ADVANCED included questions 30–42 (https://docs.google.com/forms/d/e/1FAIpQLScSoG-GaOT7M1cSJLPrAwbs11oF4NFGZOgnnO2hfVgsJmFhqQ/viewform). The form BASIC was completed by 22 team members, the form AORTA by 20, and the form ADVANCED by 19.

The first item analysed was the accuracy of replies to questions asked in English, as evaluated by two experienced academic vascular surgeons (G.M. and G.T.). Based on their medical expertise and current guidelines,6 they rated the accuracy of answers on a two point Likert scale as accurate (addresses all aspects of the question and provides additional information or context beyond what was expected) or inaccurate (addresses some aspects of the question but significant parts are missing or incomplete). The second item analysed was the consistency of the replies to questions asked in the same language. Quite surprisingly, the exact same question led to an array of different replies. Finally, the accuracy of the replies dropped when the questions were asked in a language other than English, with several erratic replies.
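To make the rating protocol concrete, the sketch below models a single reviewer rating in Python. It is only an illustration of the two point scale described above; the class and field names are hypothetical, not the project's actual data schema.

```python
from dataclasses import dataclass
from enum import Enum

class Rating(Enum):
    """Two point Likert scale, with the definitions used by the reviewers."""
    ACCURATE = ("addresses all aspects of the question and provides "
                "additional information or context beyond what was expected")
    INACCURATE = ("addresses some aspects of the question but significant "
                  "parts are missing or incomplete")

@dataclass
class RatedReply:
    chatbot: str       # "ChatGPT 3.0" or "Bing Chat"
    question_id: int   # 1-42, spanning the BASIC / AORTA / ADVANCED forms
    language: str      # language in which the question was asked
    session: int       # the same question may be repeated across sessions
    rating: Rating     # assigned per the definitions above

# Hypothetical example record:
example = RatedReply("ChatGPT 3.0", question_id=18, language="English",
                     session=1, rating=Rating.INACCURATE)
```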
The overall analysis showed that replies were inaccurate for 111/279 (39.8%) of the English questions and 178/308 (57.8%) of the questions in the other tested languages. Across all tests, ChatGPT responses were inaccurate in 44.4% (63/142) of the English questions and 65.2% (103/158) of the questions in the other tested languages. Bing Chat responses were inaccurate in 35.0% (48/137) of English questions and 50.0% (75/150) of questions in the other tested languages (Fig. 1). This pilot study showed that chatbots provide more accurate responses in English than in the other tested languages.

The functionality of AI chatbots may cause harm by producing misleading or inaccurate content, thereby eliciting concerns around medical misinformation. Thus, AI chatbots should be used ethically and responsibly, considering their potential risks. The language algorithms interrogated are undergoing rapid development, and it is the mission of this team to continue monitoring them for changes and improvements and to design more structured studies. For the time being, the present findings should be shared with both the medical community and the lay public, so that they can be warned against the potential risks of seeking medical information in the aortic field through this source.
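For readers who want to check the arithmetic, here is a minimal Python sketch (not the study's actual analysis code) that recomputes the pooled inaccuracy rates from the per chatbot counts reported above:

```python
from collections import Counter

# (chatbot, language group) -> (inaccurate replies, total questions),
# counts as reported in the text above.
counts = {
    ("ChatGPT 3.0", "English"): (63, 142),
    ("ChatGPT 3.0", "other languages"): (103, 158),
    ("Bing Chat", "English"): (48, 137),
    ("Bing Chat", "other languages"): (75, 150),
}

inaccurate = Counter()
total = Counter()
for (bot, lang), (bad, n) in counts.items():
    print(f"{bot}, {lang}: {bad}/{n} = {100 * bad / n:.1f}% inaccurate")
    inaccurate[lang] += bad
    total[lang] += n

# Pooled across both chatbots: reproduces 111/279 (39.8%) for English
# and 178/308 (57.8%) for the other tested languages.
for lang in total:
    print(f"Overall, {lang}: {inaccurate[lang]}/{total[lang]} "
          f"= {100 * inaccurate[lang] / total[lang]:.1f}% inaccurate")
```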
English
Melissano G., Tinelli G., Soderlund T. Current Artificial Intelligence Based Chatbots May Produce Inaccurate and Potentially Harmful Information for Patients With Aortic Disease. European Journal of Vascular and Endovascular Surgery 2024;67(4):683-684. doi:10.1016/j.ejvs.2023.10.042. https://hdl.handle.net/10807/314549