https://doi.org/10.1140/epjs/s11734-024-01447-1
Regular Article
A transformer-based model for next disease prediction using electronic health records
1 Mathematical Robotics Science Division, Sirius University, Olympic Ave., 1., 354340, Sirius, Sirius Federal Territory, Russian Federation
2 Mathematics and Mechanics Faculty, Saint Petersburg State University, Universitetskaya Emb. 7/9, 199034, St Petersburg, Russian Federation
3 Institute of Problems in Mechanical Engineering, V.O., Bolshoy pr. 61, 199178, St. Petersburg, Russian Federation
a nikolai.makarov.sc@gmail.com
Received: 20 September 2024
Accepted: 12 December 2024
Published online: 7 January 2025
This work proposes a machine learning-based algorithm for predicting the diseases a patient may develop in the future, using historical Electronic Health Record (EHR) data for model training. An extensive search and analysis of publicly available datasets containing the relevant medical information was conducted. An encoder–decoder model is proposed to predict future diseases from past disease history, trained with a two-stage scheme. In the first stage, the encoder part of the model is pre-trained in an unsupervised manner. In the second stage, the entire model is fine-tuned on the target sequence-to-sequence task. Our solution uses the diseases detected during a patient’s current examination to forecast potential future ailments. The Bidirectional Encoder Representations from Transformers (BERT) architecture serves as the base version of the encoder, and seven alternative encoder versions are also explored, some of which outperform BERT in terms of validation metrics. Two encoder–decoder architectures are proposed: one uses a fully connected neural network as its decoder, while the other uses a decoder based on a decoder-only transformer architecture. On holdout data, the first approach yields precision and recall of 46.32% and 17.80%, respectively. The second approach achieves precision and recall of 24.98% and 46.17%, respectively. This suggests that both approaches accomplish the task, with the first favoring precision and the second favoring recall, and that the choice between them should depend on the specific diagnostic requirements of the disease in question.
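The precision and recall trade-off reported above can be illustrated with a minimal sketch. It assumes that each model outputs a set of disease codes per patient and that the metrics are micro-averaged over all patients; the abstract does not state the averaging scheme, so this is an assumption. The function name `micro_precision_recall` and the toy codes are hypothetical and do not come from the paper.

```python
from typing import Iterable, Set, Tuple


def micro_precision_recall(
    predicted: Iterable[Set[str]],
    actual: Iterable[Set[str]],
) -> Tuple[float, float]:
    """Micro-averaged precision and recall over per-patient sets of disease codes.

    Assumption: micro-averaging; the paper may use a different scheme.
    """
    tp = fp = fn = 0
    for pred, true in zip(predicted, actual):
        tp += len(pred & true)  # predicted and later observed
        fp += len(pred - true)  # predicted but never observed
        fn += len(true - pred)  # observed but not predicted
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Hypothetical toy data: the codes below are illustrative only.
actual = [{"I10", "E11", "J45"}, {"K21", "M54"}]

# A model that predicts few diseases per patient.
conservative = [{"I10"}, {"K21", "N18"}]

# A model that predicts many diseases per patient.
liberal = [{"I10", "E11", "N18", "F32"}, {"K21", "M54", "J45", "E78"}]

print("conservative:", micro_precision_recall(conservative, actual))
print("liberal:", micro_precision_recall(liberal, actual))
```

In this toy setting the conservative predictor reaches higher precision at lower recall, and the liberal predictor shows the opposite pattern, which mirrors the qualitative difference between the two decoders described in the abstract.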
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
© The Author(s), under exclusive licence to EDP Sciences, Springer-Verlag GmbH Germany, part of Springer Nature 2024