2025 IEEE International Conference on Cyber Security and Resilience

Full Program

Summary:

This paper presents a method for detecting Deepfake videos. The proposed model, ResNet Vision Transformer (ResViT), combines two complementary components: a Convolutional Neural Network (CNN) based on the ResNet50 architecture for feature extraction and a Vision Transformer (ViT) for classification. The CNN captures spatial features from video frames, which the ViT then analyzes with attention mechanisms to distinguish authentic videos from manipulated ones. We evaluated ResViT on two benchmark datasets, the Deepfake Detection Challenge (DFDC) dataset and FaceForensics++. The model achieved 97.1% accuracy on DFDC, demonstrating its effectiveness in Deepfake detection, and accuracies of 86.8%, 75.1%, 75.5%, and 94.9% on the FaceForensics++ subsets (Face2Face, FaceSwap, NeuralTextures, and DeepFakes, respectively), underscoring its robustness across different manipulation methods. These results highlight ResViT as a reliable approach to Deepfake video detection.
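The CNN-plus-Transformer pipeline summarized above can be sketched in PyTorch as follows. This is a minimal, hypothetical illustration of the general design (a convolutional backbone producing a feature map that is flattened into a token sequence for a Transformer encoder with a classification token), not the authors' implementation: the backbone here is a two-layer stand-in for ResNet50, and all layer sizes are illustrative.

```python
import torch
import torch.nn as nn

class ResViTSketch(nn.Module):
    """Illustrative CNN + Transformer-encoder classifier in the spirit of
    the ResViT pipeline. Hyperparameters are assumptions, not the paper's."""

    def __init__(self, embed_dim=256, num_heads=4, num_layers=2, num_classes=2):
        super().__init__()
        # Stand-in CNN backbone; a real implementation would use a
        # pretrained ResNet50 (e.g. torchvision.models.resnet50).
        self.backbone = nn.Sequential(
            nn.Conv2d(3, 64, kernel_size=7, stride=2, padding=3),
            nn.ReLU(),
            nn.Conv2d(64, embed_dim, kernel_size=3, stride=2, padding=1),
            nn.ReLU(),
        )
        # Learnable classification token prepended to the patch sequence.
        self.cls_token = nn.Parameter(torch.zeros(1, 1, embed_dim))
        enc_layer = nn.TransformerEncoderLayer(
            d_model=embed_dim, nhead=num_heads, batch_first=True
        )
        self.encoder = nn.TransformerEncoder(enc_layer, num_layers=num_layers)
        self.head = nn.Linear(embed_dim, num_classes)  # real vs. fake logits

    def forward(self, frames):                 # frames: (B, 3, H, W)
        feats = self.backbone(frames)          # (B, C, H', W') spatial features
        seq = feats.flatten(2).transpose(1, 2) # (B, H'*W', C) token sequence
        cls = self.cls_token.expand(seq.size(0), -1, -1)
        seq = torch.cat([cls, seq], dim=1)     # prepend CLS token
        out = self.encoder(seq)                # attention over spatial tokens
        return self.head(out[:, 0])            # classify from the CLS token

model = ResViTSketch()
logits = model(torch.randn(2, 3, 64, 64))      # batch of two 64x64 frames
print(logits.shape)                            # torch.Size([2, 2])
```

In this design, the CNN supplies local spatial features while the Transformer's self-attention relates patches across the whole frame, matching the abstract's division of labor between feature extraction and classification.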

Author(s):

Anahita Aria    
Department of Electrical and Computer Engineering, Kharazmi University,
Iran

Seyedeh Leili Mirtaheri    
Department of Informatics, Modeling, Electronics and System Engineering (DIMES), University of Calabria
Italy

Seyyed Amir Asghari    
Department of Electrical and Computer Engineering, Kharazmi University,
Iran

Reza Shahbazian    
Department of Informatics, Modeling, Electronics and System Engineering (DIMES), University of Calabria
Italy

Andrea Pugliese    
Department of Informatics, Modeling, Electronics and System Engineering (DIMES), University of Calabria
Italy



Copyright © 2025 SUMMIT-TEC GROUP LTD