A transformer-based model for next disease prediction using electronic health records

Nikolai Makarov; Mikhail Lipkovich

doi:10.1140/epjs/s11734-024-01447-1

Special Topics

Artificial Intelligence and Complex Networks meet Natural Sciences

Eur. Phys. J. Spec. Top.
https://doi.org/10.1140/epjs/s11734-024-01447-1

Regular Article

Nikolai Makarov¹,^a and Mikhail Lipkovich²,³

¹ Mathematical Robotics Science Division, Sirius University, Olympic Ave. 1, 354340, Sirius, Sirius Federal Territory, Russian Federation
² Mathematics and Mechanics Faculty, Saint Petersburg State University, Universitetskaya Emb. 7/9, 199034, St. Petersburg, Russian Federation
³ Institute of Problems in Mechanical Engineering, V.O., Bolshoy pr. 61, 199178, St. Petersburg, Russian Federation

^a nikolai.makarov.sc@gmail.com

Received: 20 September 2024
Accepted: 12 December 2024
Published online: 7 January 2025

Abstract

This work proposes a machine learning algorithm that predicts diseases a patient may develop in the future, trained on historical Electronic Health Record (EHR) data. An extensive search and analysis of publicly available datasets containing the relevant medical information was conducted. An encoder–decoder model is proposed for predicting future diseases from past disease history, using a two-stage training scheme. In the first stage, the encoder is pre-trained in an unsupervised manner. In the second stage, the entire model is fine-tuned on the target sequence-to-sequence task. The model takes the diseases detected during a patient’s current examination as input and forecasts potential future ailments. The Bidirectional Encoder Representations from Transformers (BERT) architecture serves as the baseline encoder; seven alternative encoders are also explored, some of which outperform BERT on validation metrics. Two encoder–decoder architectures are proposed: one uses a fully connected neural network as its decoder, and the other uses a decoder based on a decoder-only transformer architecture. On holdout data, the first approach yields precision and recall of 46.32% and 17.80%, respectively, and the second achieves 24.98% and 46.17%. This suggests that both approaches are viable, with the first favoring precision and the second favoring recall, so the choice between them should depend on the specific diagnostic requirements of the disease in question.
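The precision and recall figures above compare a set of predicted diseases against the diseases a patient later develops. The following is a minimal Python sketch of how such metrics can be computed, not the authors' evaluation code. It assumes micro-averaging over per-patient sets of diagnosis codes; the abstract does not state the averaging scheme, and the function names and toy codes are hypothetical.

```python
from typing import Iterable, Set, Tuple


def micro_precision_recall(
    predicted: Iterable[Set[str]],
    actual: Iterable[Set[str]],
) -> Tuple[float, float]:
    """Micro-averaged precision and recall over per-patient code sets.

    Assumption: counts are pooled across patients before dividing.
    """
    tp = fp = fn = 0
    for pred, true in zip(predicted, actual):
        tp += len(pred & true)  # predicted and later observed
        fp += len(pred - true)  # predicted but not observed
        fn += len(true - pred)  # observed but not predicted
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


def f1(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (0.0 if both are zero)."""
    total = precision + recall
    return 2 * precision * recall / total if total else 0.0


# Hypothetical toy data: diagnosis codes for three patients.
predicted = [{"I10", "E11"}, {"J45"}, set()]
actual = [{"I10"}, {"J45", "I10"}, {"E11"}]

p, r = micro_precision_recall(predicted, actual)
print(f"precision={p:.3f} recall={r:.3f} f1={f1(p, r):.3f}")
```

A high-precision model such as the first approach makes fewer false alarms, while a high-recall model such as the second misses fewer true future diagnoses. The `f1` helper is one common single-number summary of that trade-off; it is not a metric reported in the abstract.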

© The Author(s), under exclusive licence to EDP Sciences, Springer-Verlag GmbH Germany, part of Springer Nature 2024

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
