TY - JOUR
T1 - Enhancing radiology report generation through pre-trained language models
AU - Leonardi, Giorgio
AU - Portinale, Luigi
AU - Santomauro, Andrea
N1 - Publisher Copyright:
© Springer-Verlag GmbH Germany, part of Springer Nature 2024.
PY - 2024
Y1 - 2024
N2 - In the healthcare field, the ability to integrate and process data from various modalities, such as medical images, clinical notes, and patient records, plays a central role in enabling Artificial Intelligence models to provide more informed answers. This raises the demand for models that can integrate information available in different forms (e.g., text and images), such as multi-modal Transformers, sophisticated architectures able to process and fuse information across different modalities. Moreover, the scarcity of large datasets in several healthcare domains poses the challenge of properly exploiting pre-trained models, with the additional aim of minimizing the required computational resources. This paper presents a solution to the problem of generating narrative (free-text) radiology reports from an input X-ray image. The proposed architecture integrates a pre-trained image encoder with a pre-trained large language model based on a Transformer architecture. We fine-tuned the resulting multi-modal architecture on a publicly available dataset (MIMIC-CXR) and evaluated variants of the architecture with different image encoders (CheXNet and ViT) as well as different positional encoding methods. We report the results obtained in terms of BERTScore, a metric that has emerged as significant for evaluating the quality of generated text, as well as BLEU and ROUGE, standard n-gram-based text quality measures. We also show how different positional encoding methods may influence the attention map on the original X-ray image. Finally, we report an evaluation by expert radiologists based on the number of “errors” they found in the generated reports.
AB - In the healthcare field, the ability to integrate and process data from various modalities, such as medical images, clinical notes, and patient records, plays a central role in enabling Artificial Intelligence models to provide more informed answers. This raises the demand for models that can integrate information available in different forms (e.g., text and images), such as multi-modal Transformers, sophisticated architectures able to process and fuse information across different modalities. Moreover, the scarcity of large datasets in several healthcare domains poses the challenge of properly exploiting pre-trained models, with the additional aim of minimizing the required computational resources. This paper presents a solution to the problem of generating narrative (free-text) radiology reports from an input X-ray image. The proposed architecture integrates a pre-trained image encoder with a pre-trained large language model based on a Transformer architecture. We fine-tuned the resulting multi-modal architecture on a publicly available dataset (MIMIC-CXR) and evaluated variants of the architecture with different image encoders (CheXNet and ViT) as well as different positional encoding methods. We report the results obtained in terms of BERTScore, a metric that has emerged as significant for evaluating the quality of generated text, as well as BLEU and ROUGE, standard n-gram-based text quality measures. We also show how different positional encoding methods may influence the attention map on the original X-ray image. Finally, we report an evaluation by expert radiologists based on the number of “errors” they found in the generated reports.
KW - Automated radiology report generation
KW - Large language models
KW - Multi-modal machine learning
KW - Transformers
UR - http://www.scopus.com/inward/record.url?scp=85212486041&partnerID=8YFLogxK
U2 - 10.1007/s13748-024-00358-5
DO - 10.1007/s13748-024-00358-5
M3 - Article
SN - 2192-6352
JO - Progress in Artificial Intelligence
JF - Progress in Artificial Intelligence
M1 - 101273
ER -