

Poster

Self-Supervised Video Copy Localization with Regional Token Representation

Minlong Lu · Yichen Lu · Siwei Nie · Xudong Yang · Xiaobo Zhang

Strong Double Blind: this paper was not made available on public preprint services during the review process.
Wed 2 Oct 1:30 a.m. PDT — 3:30 a.m. PDT

Abstract:

The task of video copy localization aims to find the start and end timestamps of all copied segments within a pair of untrimmed videos. Recent approaches typically extract frame-level features, generate a frame-to-frame similarity map for the video pair, and apply learned detectors to identify the distinctive patterns in the similarity map that localize the copied segments. These methods have two major limitations. First, they often rely on a single feature per frame, which is inadequate for capturing local information in typical video copy editing scenarios such as picture-in-picture. Second, training the detectors requires a large amount of human-annotated data, which is expensive and time-consuming to acquire. In this paper, we propose a self-supervised video copy localization framework to tackle these issues. We incorporate a Regional Token into the Vision Transformer, which learns to focus on local regions within each frame via an asymmetric training procedure. We further propose a novel strategy that leverages the Transitivity Property to generate copied video pairs automatically, which facilitates training the detector. Extensive experiments and visualizations demonstrate the effectiveness of the proposed approach, which outperforms the state of the art without using any human-annotated data.
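To make the Regional Token idea concrete, below is a minimal sketch of how a learnable regional token could sit alongside the standard CLS token in a Vision Transformer, so that each frame yields both a global descriptor and a local-region descriptor. This is an illustrative assumption based only on the abstract, not the authors' implementation; the class and parameter names (RegionalTokenViT, embed_dim, the two-token readout) are hypothetical.

```python
# Hypothetical sketch: a Regional Token added next to the CLS token in a ViT.
# Not the authors' code; names and initialization choices are assumptions.
import torch
import torch.nn as nn

class RegionalTokenViT(nn.Module):
    def __init__(self, num_patches=196, embed_dim=768, depth=12, num_heads=12):
        super().__init__()
        # Two learnable tokens: CLS (global) and Regional (local focus).
        self.cls_token = nn.Parameter(torch.zeros(1, 1, embed_dim))
        self.regional_token = nn.Parameter(torch.zeros(1, 1, embed_dim))
        # Positional embeddings cover the two tokens plus all patch tokens.
        self.pos_embed = nn.Parameter(torch.zeros(1, num_patches + 2, embed_dim))
        layer = nn.TransformerEncoderLayer(embed_dim, num_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, depth)

    def forward(self, patch_embeds):
        # patch_embeds: (B, num_patches, embed_dim), e.g. from a conv patchifier.
        B = patch_embeds.size(0)
        cls = self.cls_token.expand(B, -1, -1)
        reg = self.regional_token.expand(B, -1, -1)
        x = torch.cat([cls, reg, patch_embeds], dim=1) + self.pos_embed
        x = self.encoder(x)
        # Readout: CLS output as the global frame feature, Regional Token
        # output as the local-region feature; both serve as per-frame
        # descriptors for building frame-to-frame similarity maps.
        return x[:, 0], x[:, 1]

# Usage sketch: patchify a frame into (B, 196, 768) embeddings, then
# global_feat, regional_feat = RegionalTokenViT()(patch_embeds)
```

In this sketch the Regional Token output would attend to local regions (for instance, the inset in a picture-in-picture edit), while the asymmetric training procedure described in the abstract would be applied on top of these two per-frame outputs.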
