

Poster

Rethinking Weakly-supervised Video Temporal Grounding From a Game Perspective

Xiang Fang · Zeyu Xiong · Wanlong Fang · Xiaoye Qu · Chen Chen · Jianfeng Dong · Keke Tang · Pan Zhou · Yu Cheng · Daizong Liu

Strong Double Blind review: this paper was not made available on public preprint services during the review process.
Thu 3 Oct 1:30 a.m. PDT — 3:30 a.m. PDT

Abstract:

This paper addresses the challenging task of weakly-supervised video temporal grounding. Existing approaches are generally based on a moment-candidate selection pipeline that scores pre-defined moments with contrastive learning and reconstruction paradigms. Although they have achieved significant progress, we argue that current frameworks overlook two indispensable issues: (1) Coarse-grained cross-modal learning: previous methods only capture the global video-level alignment with the query, failing to model the detailed consistency between video frames and query words that is needed to accurately ground moment boundaries. (2) Complex moment candidates: the performance of these methods relies heavily on the quality of the moment candidates, whose selection is also time-consuming and complicated. To this end, we make the first attempt to tackle this task from a novel game perspective, which learns the uncertain relationship between each frame-word pair with diverse granularity and flexible combinations for fine-grained cross-modal interaction. Specifically, we model each video frame and query word as a game player in a multivariate cooperative game and learn its contribution to the cross-modal similarity score. By quantifying the trend of frame-word cooperation within a coalition via the game-theoretic interaction, we can value all uncertain but possible correspondences between frames and words. Finally, instead of using moment proposals, we utilize the learned query-guided frame-wise scores for fine-grained moment boundary grounding. Experiments show that our method achieves superior performance on both the Charades-STA and ActivityNet Captions datasets.
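To make the game-theoretic formulation concrete, the sketch below illustrates one plausible reading of it: a coalition of frames and words has a value (here, a simple cosine similarity between pooled features, standing in for the paper's learned cross-modal similarity score), the pairwise game-theoretic interaction of a frame and a word is Monte-Carlo estimated over random coalitions, and the per-frame interactions are aggregated into query-guided frame-wise scores that are thresholded for proposal-free boundary grounding. All function names (coalition_value, pair_interaction, query_guided_frame_scores, ground_moment), the coalition value, and the decoding rule are hypothetical stand-ins, not the authors' implementation.

```python
import torch
import torch.nn.functional as F

def coalition_value(frame_feats, word_feats, frame_mask, word_mask):
    """Hypothetical coalition value v(S): cosine similarity between the mean
    features of the frames and words currently in the coalition. A stand-in
    for the paper's learned cross-modal similarity score."""
    f = (frame_feats * frame_mask.unsqueeze(-1)).sum(0) / frame_mask.sum().clamp(min=1)
    w = (word_feats * word_mask.unsqueeze(-1)).sum(0) / word_mask.sum().clamp(min=1)
    return F.cosine_similarity(f, w, dim=0)

def pair_interaction(frame_feats, word_feats, i, j, n_samples=64):
    """Monte-Carlo estimate of the game-theoretic interaction between frame i
    and word j: E_S[v(S+i+j) - v(S+i) - v(S+j) + v(S)] over random coalitions
    S, i.e. the extra value the pair creates by cooperating."""
    T, N = frame_feats.shape[0], word_feats.shape[0]
    est = torch.zeros(())
    for _ in range(n_samples):
        fm = (torch.rand(T) < 0.5).float(); fm[i] = 0.0   # S excludes frame i
        wm = (torch.rand(N) < 0.5).float(); wm[j] = 0.0   # S excludes word j
        fi = fm.clone(); fi[i] = 1.0                      # S + frame i
        wj = wm.clone(); wj[j] = 1.0                      # S + word j
        est = est + (coalition_value(frame_feats, word_feats, fi, wj)
                     - coalition_value(frame_feats, word_feats, fi, wm)
                     - coalition_value(frame_feats, word_feats, fm, wj)
                     + coalition_value(frame_feats, word_feats, fm, wm))
    return est / n_samples

def query_guided_frame_scores(frame_feats, word_feats, n_samples=32):
    """Aggregate each frame's interactions with every query word into a
    frame-wise relevance score (no moment proposals involved)."""
    T, N = frame_feats.shape[0], word_feats.shape[0]
    scores = torch.zeros(T)
    for i in range(T):
        for j in range(N):
            scores[i] += pair_interaction(frame_feats, word_feats, i, j, n_samples)
    return scores

def ground_moment(scores):
    """Take the longest contiguous run of above-average frames as the moment;
    a deliberately simple stand-in for the paper's boundary decoding."""
    keep = scores > scores.mean()
    best, start = (0, 0), None
    for t, k in enumerate(keep.tolist()):
        if k:
            start = t if start is None else start
            if t - start >= best[1] - best[0]:
                best = (start, t)
        else:
            start = None
    return best

if __name__ == "__main__":
    torch.manual_seed(0)
    frames = torch.randn(20, 128)   # 20 video frame features (toy data)
    words = torch.randn(6, 128)     # 6 query word features (toy data)
    s = query_guided_frame_scores(frames, words, n_samples=16)
    print("moment boundaries (start, end):", ground_moment(s))
```

Note how the interaction term rewards a frame-word pair only for the similarity gain they produce jointly, beyond what each contributes alone; this is what lets the scoring value uncertain but possible correspondences at a finer granularity than video-level alignment.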
