Tobia Giovanni, P., Patarnello, S., Masciocchi, C., Nero, C., Passarotti, M. C., Moretti, G., Marchetti, A., Arcuri, G., Lilli, L., Privacy in Italian Clinical Reports: A NLP-Based Anonymization Approach, in 2025 IEEE 13th International Conference on Healthcare Informatics (ICHI), (Rende, 18-21 June 2025), IEEE Computer Society, Los Alamitos 2025: 630-635. [10.1109/ICHI64645.2025.00077] [https://hdl.handle.net/10807/320516]
Privacy in Italian Clinical Reports: A NLP-Based Anonymization Approach
Masciocchi, Carlotta; Nero, Camilla; Passarotti, Marco Carlo; Moretti, Giovanni; Marchetti, Antonio; Arcuri, Giovanni; Lilli, Livia
2025
Abstract
Data sharing is essential to the advancement of scientific and technological knowledge. However, legislation such as the General Data Protection Regulation (GDPR) in Europe and the Health Insurance Portability and Accountability Act (HIPAA) in the United States imposes significant restrictions on the dissemination of personal data within the healthcare sector. As a result, the development of reliable, automated methods for anonymizing clinical documents has become a key area of research.

This study presents a Natural Language Processing (NLP) approach to anonymizing Italian clinical reports, which protects patient privacy by identifying and masking personally identifiable information. The research employs BERT-based Named Entity Recognition (NER) models, fine-tuned on the healthcare domain. The dataset, consisting of 1000 discharge letters from the Gemelli Hospital of Rome and 100 synthetically generated reports, was annotated with critical protected health information (PHI) categories. The study compares different tagging schemes and loss functions, addressing class imbalance. The results demonstrate that a pre-trained model designed to recognize personally identifiable information in general texts can be effectively adapted and specialized to detect PHI in clinical reports so that they can be anonymized.

This work underscores the challenges of handling unbalanced datasets, the over-representation of non-PHI tokens, and interclass ambiguities. It contributes a novel transformer-based model specialized in Italian clinical text and provides a framework for clinical text anonymization designed to ensure compliance with privacy standards such as GDPR while preserving the utility of data for research.
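The masking step described in the abstract can be illustrated with a minimal sketch. It assumes that a token-level NER model has already produced one BIO tag per token; the function name `mask_phi`, the label names `NAME` and `DATE`, and the `[LABEL]` placeholder format are hypothetical choices made for illustration and are not taken from the paper.

```python
from typing import List, Optional


def mask_phi(tokens: List[str], tags: List[str]) -> str:
    """Replace each tagged entity span with a single [LABEL] placeholder.

    `tags` are assumed to follow the BIO scheme ("B-NAME", "I-NAME", "O").
    The label set is hypothetical, not the paper's annotation schema.
    """
    if len(tokens) != len(tags):
        raise ValueError("tokens and tags must have the same length")

    out: List[str] = []
    prev_label: Optional[str] = None  # label of the span we are inside, if any
    for token, tag in zip(tokens, tags):
        if tag == "O":
            # Non-PHI token: keep it unchanged and close any open span.
            out.append(token)
            prev_label = None
            continue
        prefix, _, label = tag.partition("-")
        # A new span starts on "B-", or on an "I-" tag that does not
        # continue the label of the previous token.
        if prefix == "B" or label != prev_label:
            out.append(f"[{label}]")
        prev_label = label
    return " ".join(out)


# Invented example sentence; the tags stand in for model predictions.
tokens = ["Il", "paziente", "Mario", "Rossi", "è", "stato", "dimesso",
          "il", "12", "marzo", "2024", "."]
tags = ["O", "O", "B-NAME", "I-NAME", "O", "O", "O",
        "O", "B-DATE", "I-DATE", "I-DATE", "O"]
print(mask_phi(tokens, tags))
# Il paziente [NAME] è stato dimesso il [DATE] .
```

Collapsing a whole span into one placeholder, rather than masking token by token, avoids leaking the number of tokens in a name or date. Whether the paper's pipeline does the same is not stated in the abstract.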



