

Poster

Common Sense Reasoning for Deep Fake Detection

Yue Zhang · Ben Colman · Xiao Guo · Ali Shahriyari · Gaurav Bharaj

Thu 3 Oct 7:30 a.m. PDT — 9:30 a.m. PDT

Abstract:

State-of-the-art deepfake detection approaches rely on image-based features extracted via neural networks. While these approaches, trained in a supervised manner, extract features indicative of fakes, they may fall short in representing unnatural 'non-physical' semantic facial attributes -- blurry hairlines, double eyebrows, rigid eye pupils, or unnatural skin shading. Such facial attributes, however, are easily perceived by humans, who rely on common sense to discern the authenticity of an image. Furthermore, image-based feature extraction methods that provide visual explanations via saliency maps can be hard for humans to interpret. To address these challenges, we frame deepfake detection as a Deepfake Detection VQA (DD-VQA) task and model human intuition by providing textual explanations that describe common sense reasons for labeling an image as real or fake. We introduce a new annotated dataset and propose a Vision and Language Transformer-based framework for the DD-VQA task. We also incorporate a text- and image-aware feature alignment formulation to enhance multi-modal representation learning. As a result, we improve upon existing deepfake detection models by integrating our learned vision representations, which reason over common sense knowledge from the DD-VQA task. We evaluate our method on both deepfake detection performance and the quality of the generated explanations. Our empirical results show that incorporating textual explanations into the deepfake detection task benefits detection performance, generalization ability, and language-based interpretability.
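As an illustration of the text- and image-aware feature alignment idea, the sketch below shows a symmetric contrastive (InfoNCE-style) objective that pulls paired image and text embeddings together. The module name, projection dimensions, and temperature are illustrative assumptions, not the authors' exact formulation.

import torch
import torch.nn as nn
import torch.nn.functional as F

class TextImageAlignment(nn.Module):
    """Hypothetical sketch of a text-image feature alignment loss."""

    def __init__(self, img_dim=768, txt_dim=768, proj_dim=256, temperature=0.07):
        super().__init__()
        self.img_proj = nn.Linear(img_dim, proj_dim)  # project image features
        self.txt_proj = nn.Linear(txt_dim, proj_dim)  # project text features
        self.temperature = temperature

    def forward(self, img_feats, txt_feats):
        # img_feats: (B, img_dim), txt_feats: (B, txt_dim); pairs aligned by batch index
        z_img = F.normalize(self.img_proj(img_feats), dim=-1)
        z_txt = F.normalize(self.txt_proj(txt_feats), dim=-1)
        logits = z_img @ z_txt.t() / self.temperature  # (B, B) similarity matrix
        targets = torch.arange(logits.size(0), device=logits.device)
        # Symmetric loss: image-to-text and text-to-image directions
        loss_i2t = F.cross_entropy(logits, targets)
        loss_t2i = F.cross_entropy(logits.t(), targets)
        return (loss_i2t + loss_t2i) / 2

In practice, such an alignment term would be added to the VQA answer-generation loss so that the vision backbone's representations are shaped by the common sense textual explanations.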
