Summary:
This paper presents a method for detecting Deepfake videos. The proposed model, ResNet Vision Transformer (ResViT), combines two complementary components: a Convolutional Neural Network (CNN) based on the ResNet50 architecture for feature extraction and a Vision Transformer (ViT) for classification. The CNN captures spatial characteristics from video frames, which the ViT then analyzes with attention mechanisms to differentiate authentic from altered videos. We evaluated ResViT on two benchmark datasets, the Deepfake Detection Challenge (DFDC) dataset and the FaceForensics++ dataset. The model achieved an accuracy of 97.1% on the DFDC dataset, demonstrating its efficacy in Deepfake detection, and accuracies of 86.8%, 75.1%, 75.5%, and 94.9% on the FaceForensics++ subsets (Face2Face, FaceSwap, NeuralTextures, and DeepFakes, respectively), underscoring its robustness across different manipulation methods. These results highlight the promise of ResViT as a reliable approach to Deepfake video detection.
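To make the described pipeline concrete, the following PyTorch sketch shows how a ResNet50 backbone can feed a Transformer encoder for binary real/fake frame classification. The layer sizes (embed_dim=768, 6 encoder layers, 8 heads), the token layout, and the class name ResViTSketch are illustrative assumptions, not the authors' implementation.

# A minimal sketch of the CNN-plus-ViT pipeline described in the summary.
# All hyperparameters here are assumptions for illustration only.
import torch
import torch.nn as nn
from torchvision.models import resnet50

class ResViTSketch(nn.Module):
    def __init__(self, embed_dim=768, depth=6, heads=8, num_classes=2):
        super().__init__()
        backbone = resnet50(weights=None)
        # Keep ResNet50 up to its last conv stage: [B, 2048, 7, 7] for 224x224 frames.
        self.cnn = nn.Sequential(*list(backbone.children())[:-2])
        self.proj = nn.Linear(2048, embed_dim)          # map CNN channels to token dim
        self.cls_token = nn.Parameter(torch.zeros(1, 1, embed_dim))
        self.pos_embed = nn.Parameter(torch.zeros(1, 50, embed_dim))  # 49 patches + CLS
        layer = nn.TransformerEncoderLayer(embed_dim, heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, depth)
        self.head = nn.Linear(embed_dim, num_classes)   # real vs. fake logits

    def forward(self, frames):                          # frames: [B, 3, 224, 224]
        feats = self.cnn(frames)                        # [B, 2048, 7, 7]
        tokens = feats.flatten(2).transpose(1, 2)       # [B, 49, 2048]
        tokens = self.proj(tokens)                      # [B, 49, embed_dim]
        cls = self.cls_token.expand(tokens.size(0), -1, -1)
        x = torch.cat([cls, tokens], dim=1) + self.pos_embed
        x = self.encoder(x)                             # self-attention over patch tokens
        return self.head(x[:, 0])                       # classify from the CLS token

logits = ResViTSketch()(torch.randn(2, 3, 224, 224))    # -> [2, 2]

Treating the 7x7 CNN feature map as 49 tokens lets the attention layers relate distant facial regions, which is one plausible reading of the complementarity between feature extraction and classification described above.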
Author(s):
Anahita Aria
Department of Electrical and Computer Engineering, Kharazmi University
Iran
Seyedeh Leili Mirtaheri
Department of Informatics, Modeling, Electronics and System Engineering (DIMES), University of Calabria
Italy
Seyyed Amir Asghari
Department of Electrical and Computer Engineering, Kharazmi University
Iran
Reza Shahbazian
Department of Informatics, Modeling, Electronics and System Engineering (DIMES), University of Calabria
Italy
Andrea Pugliese
Department of Informatics, Modeling, Electronics and System Engineering (DIMES), University of Calabria
Italy