https://doi.org/10.1140/epjs/s11734-025-01586-z
Regular Article
A deep learning approach for strengthening person identification in face-based authentication systems using visual speech recognition
Department of Computer Applications, National Institute of Technology, Tiruchirappalli 620015, Tamil Nadu, India
Received: 23 January 2025
Accepted: 14 March 2025
Published online: 25 March 2025
Identity verification is essential in both an individual's personal and professional life. It confirms a person's identity for various services and establishes their legitimacy as an employee within an organization. As cybercrime grows more sophisticated, ensuring robust and secure personal authentication has become a critical challenge. Existing face-based authentication systems typically employ deep learning models for user verification. However, these systems are susceptible to various attacks, such as presentation attacks, 3D mask attacks, and adversarial attacks, that deceive the models by manipulating digital representations of human faces. Although various liveness detection techniques have been proposed to combat face spoofing in face-based authentication systems, these systems remain vulnerable and can be exploited by sophisticated techniques. To counteract face spoofing in a face-based authentication system, we propose an advanced liveness detection technique using Visual Speech Recognition (VSR). The proposed VSR model is designed to integrate seamlessly with face-based authentication systems, forming a dual authentication framework for enhanced liveness detection. The VSR model decodes silently pronounced speech from video into a textual representation by analyzing unique, unforgeable lip motion patterns. Effective liveness detection with VSR requires a highly accurate VSR system. The proposed work employs an encoder-decoder architecture to extract more robust features from lip motion. The encoder combines a three-dimensional convolutional neural network (3D-CNN) with a fusion of bidirectional gated recurrent units and bidirectional long short-term memory (BiGRU-BiLSTM) to effectively capture spatio-temporal patterns from lip movement. The decoder integrates Multi-Head Attention (MHA) with BiGRU-BiLSTM to focus on relevant features and enhance contextual understanding for more accurate text prediction. The proposed VSR system achieved a word error rate (WER) of 0.79%, a significant reduction in error rate that outperforms existing VSR models.
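The abstract describes the encoder-decoder pipeline only at a high level. The sketch below is a minimal, hypothetical PyTorch rendering of that description: a 3D-CNN front end feeding a fused BiGRU-BiLSTM encoder, and a decoder that applies multi-head attention before another BiGRU-BiLSTM stack and a character classifier. All layer sizes, the grayscale 64x64 lip-crop input, the self-attention formulation, and the CTC-style output head are our assumptions for illustration, not details taken from the paper.

```python
# Hypothetical sketch of the described 3D-CNN + BiGRU-BiLSTM encoder and
# MHA + BiGRU-BiLSTM decoder. Dimensions and vocabulary are illustrative
# assumptions, not the authors' published configuration.
import torch
import torch.nn as nn

class VSREncoder(nn.Module):
    """3D-CNN front end followed by fused BiGRU-BiLSTM recurrent layers."""
    def __init__(self, hidden=256):
        super().__init__()
        # 3D convolution captures spatio-temporal lip-motion features from
        # a (batch, channels, frames, height, width) video tensor.
        self.conv3d = nn.Sequential(
            nn.Conv3d(1, 32, kernel_size=(3, 5, 5), padding=(1, 2, 2)),
            nn.ReLU(),
            nn.AdaptiveAvgPool3d((None, 4, 4)),  # keep T, pool space to 4x4
        )
        self.bigru = nn.GRU(32 * 4 * 4, hidden, bidirectional=True, batch_first=True)
        self.bilstm = nn.LSTM(2 * hidden, hidden, bidirectional=True, batch_first=True)

    def forward(self, video):                 # video: (B, 1, T, H, W)
        f = self.conv3d(video)                # (B, 32, T, 4, 4)
        b, c, t, h, w = f.shape
        f = f.permute(0, 2, 1, 3, 4).reshape(b, t, c * h * w)  # (B, T, 512)
        f, _ = self.bigru(f)
        f, _ = self.bilstm(f)                 # (B, T, 2*hidden)
        return f

class VSRDecoder(nn.Module):
    """Multi-head attention over encoder features, then BiGRU-BiLSTM and a classifier."""
    def __init__(self, feat_dim=512, hidden=256, vocab_size=28, heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(feat_dim, heads, batch_first=True)
        self.bigru = nn.GRU(feat_dim, hidden, bidirectional=True, batch_first=True)
        self.bilstm = nn.LSTM(2 * hidden, hidden, bidirectional=True, batch_first=True)
        self.fc = nn.Linear(2 * hidden, vocab_size)  # e.g. characters + CTC blank

    def forward(self, enc):
        ctx, _ = self.attn(enc, enc, enc)     # attend over encoder time steps
        out, _ = self.bigru(ctx)
        out, _ = self.bilstm(out)
        return self.fc(out)                   # per-frame logits, CTC-decodable

# Usage: a 75-frame grayscale lip-region clip at 64x64 resolution.
video = torch.randn(2, 1, 75, 64, 64)
logits = VSRDecoder()(VSREncoder()(video))    # (2, 75, 28)
```

Under these assumptions the per-frame logits would be trained with a CTC loss and greedy- or beam-decoded into text, whose agreement with the prompted passphrase then serves as the liveness signal in the dual authentication framework.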
© The Author(s), under exclusive licence to EDP Sciences, Springer-Verlag GmbH Germany, part of Springer Nature 2025
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.