Oral Session

Oral 1A: Scene Analysis And Understanding

Gold Room

Moderators: Serge Belongie · Kenneth Marino

Tue 1 Oct midnight PDT — 1:30 a.m. PDT

Tue 1 Oct. 0:00 - 0:10 PDT

Towards Scene Graph Anticipation

Rohith Peddi · Saksham Singh · Saurabh . · Parag Singla · Vibhav Gogate

Spatio-temporal scene graphs represent interactions in a video by decomposing scenes into individual objects and their pair-wise temporal relationships. Long-term anticipation of these fine-grained pair-wise relationships between objects is a challenging problem. To this end, we introduce the task of Scene Graph Anticipation (SGA). We adopt state-of-the-art scene graph generation methods as baselines to anticipate future pair-wise relationships between objects. In our proposed approaches, we leverage object-centric representations of relationships to reason about the observed video frames and model the evolution of relationships between objects. We take a continuous-time perspective and model the latent dynamics of evolving object interactions using concepts from Neural ODEs and Neural SDEs. We infer representations of future relationships by solving an Ordinary Differential Equation and a Stochastic Differential Equation, respectively. Extensive experimentation on the Action Genome dataset validates the efficacy of the proposed methods.
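
To make the continuous-time idea concrete, the sketch below advances a latent relationship embedding from the last observed frame to a future timestamp with a learned vector field and then decodes predicate logits. The module sizes, names, and the fixed-step Euler solver are illustrative assumptions, not the authors' implementation (which builds on Neural ODE/SDE solvers).

```python
import torch
import torch.nn as nn

class LatentRelationODE(nn.Module):
    """Toy continuous-time model of an evolving object-pair relationship."""

    def __init__(self, dim=64, num_predicates=26):
        super().__init__()
        # f(z, t): learned vector field governing the latent dynamics
        self.vector_field = nn.Sequential(
            nn.Linear(dim + 1, 128), nn.Tanh(), nn.Linear(128, dim))
        # decode the evolved latent into predicate logits
        self.classifier = nn.Linear(dim, num_predicates)

    def forward(self, z0, t0, t1, steps=20):
        """Integrate z from time t0 to t1 with a fixed-step Euler solver."""
        z, t = z0, float(t0)
        dt = (float(t1) - float(t0)) / steps
        for _ in range(steps):
            t_feat = torch.full((z.shape[0], 1), t)   # broadcast current time
            z = z + dt * self.vector_field(torch.cat([z, t_feat], dim=-1))
            t += dt
        return self.classifier(z)

# Anticipate relations 1.5 seconds past the last observed frame for 8 pairs.
model = LatentRelationODE()
z_observed = torch.randn(8, 64)          # embeddings of observed object pairs
logits = model(z_observed, 0.0, 1.5)
print(logits.shape)                      # torch.Size([8, 26])
```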

Tue 1 Oct. 0:10 - 0:20 PDT

OP-Align: Object-level and Part-level Alignment for Self-supervised Category-level Articulated Object Pose Estimation

Yuchen Che · Ryo Furukawa · Asako Kanezaki

The category-level articulated object pose estimation task focuses on pose estimation for unknown articulated objects within known categories. Despite its significance, this task remains challenging due to the varying shapes and poses of objects, expensive dataset annotation costs, and complex real-world environments. In this paper, we propose a novel self-supervised approach that leverages a single-frame point cloud to accomplish this task. Our model consistently generates a canonical reconstruction with a canonical pose and joint state for the entire input object, and it estimates an object-level pose that reduces overall pose variance as well as part-level poses that align each part of the input with its corresponding part of the reconstruction. Experimental results demonstrate that our approach achieves state-of-the-art performance among self-supervised methods and performance comparable to supervised methods. To assess the performance of our model in real-world scenarios, we also introduce a novel real-world articulated object benchmark dataset.
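
The alignment step at the heart of such pipelines can be pictured with a standard Kabsch/SVD rigid fit: given putative correspondences between an input part and its canonical counterpart, recover the rotation and translation that align them. This is a generic building block shown only as a sketch; OP-Align's actual self-supervised reconstruction and losses are not reproduced here.

```python
import numpy as np

def kabsch_rigid_fit(src, dst):
    """Least-squares rotation R and translation t with R @ src_i + t ≈ dst_i.

    src, dst: (N, 3) arrays of corresponding 3D points.
    """
    src_c, dst_c = src.mean(0), dst.mean(0)          # centroids
    H = (src - src_c).T @ (dst - dst_c)              # cross-covariance matrix
    U, _, Vt = np.linalg.svd(H)
    d = np.sign(np.linalg.det(Vt.T @ U.T))           # guard against reflections
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    t = dst_c - R @ src_c
    return R, t

# Example: recover a known rotation and offset of a random "part" point cloud.
rng = np.random.default_rng(0)
part = rng.normal(size=(100, 3))
angle = np.pi / 6
R_true = np.array([[np.cos(angle), -np.sin(angle), 0.0],
                   [np.sin(angle),  np.cos(angle), 0.0],
                   [0.0, 0.0, 1.0]])
R_est, t_est = kabsch_rigid_fit(part, part @ R_true.T + 0.5)
print(np.allclose(R_est, R_true, atol=1e-6), np.allclose(t_est, 0.5, atol=1e-6))
```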

Tue 1 Oct. 0:20 - 0:30 PDT

PDiscoFormer: Relaxing Part Discovery Constraints with Vision Transformers

Ananthu Aniraj · Cassio F. Dantas · Dino Ienco · Diego Marcos

Computer vision methods that explicitly detect object parts and reason on them are a step towards inherently interpretable models. Existing approaches that perform part discovery driven by a fine-grained classification task make very restrictive assumptions on the geometric properties of the discovered parts, namely that they should be small and compact. Although this prior is useful in some cases, in this paper we show that pre-trained transformer-based vision models, such as the self-supervised DINOv2 ViT, enable the relaxation of these constraints. In particular, we find that a total variation (TV) prior, which allows for multiple connected components of any size, substantially outperforms previous work. We test our approach on three fine-grained classification benchmarks: CUB, PartImageNet and Oxford Flowers, and compare our results to previously published methods as well as a re-implementation of the state-of-the-art method PDiscoNet with a transformer-based backbone. We consistently obtain substantial improvements across the board, both on part discovery metrics and on the downstream classification task, showing that the strong inductive biases in self-supervised ViT models call for rethinking the geometric priors that can be used for unsupervised part discovery.
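
A total variation prior on soft part-assignment maps can be written in a few lines: the sketch below penalizes the spatial gradients of per-part probability maps, which favours coherent regions without constraining their size or number of connected components. The exact normalization and weighting used in PDiscoFormer may differ; this is only an illustrative formulation.

```python
import torch

def total_variation_prior(part_maps):
    """Anisotropic total variation of per-part assignment maps.

    part_maps: (B, K, H, W) soft part-assignment probabilities.
    """
    dh = (part_maps[:, :, 1:, :] - part_maps[:, :, :-1, :]).abs().mean()
    dw = (part_maps[:, :, :, 1:] - part_maps[:, :, :, :-1]).abs().mean()
    return dh + dw

# 8 discovered parts, soft assignment per pixel (channels sum to 1).
maps = torch.rand(2, 8, 32, 32).softmax(dim=1)
print(total_variation_prior(maps))
```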

Tue 1 Oct. 0:30 - 0:40 PDT

Bi-directional Contextual Attention for 3D Dense Captioning

Minjung Kim · Hyung Suk Lim · Soonyoung Lee · Bumsoo Kim · Gunhee Kim

3D dense captioning is the task of localizing objects in a 3D scene and generating a description for each of them. Recent approaches in 3D dense captioning have attempted to incorporate contextual information by modeling relationships between object pairs or aggregating the nearest-neighbor features of an object. However, the contextual information constructed in these scenarios is limited in two aspects. First, objects have multiple positional relationships that exist across the entire global scene, not only near the object itself. Second, the task faces contradicting objectives: localization and attribute descriptions benefit from tight, object-focused features, while descriptions involving global positional relations benefit from contextualized features of the entire scene. To overcome this challenge, we introduce CSI, a transformer encoder-decoder pipeline that performs 3D dense captioning for each object with contextualized geometries (where the structural contexts relevant to each object are summarized) and contextualized objects (where the objects relevant to the summarized structural contexts are aggregated). This simple extension relieves previous methods of the contradicting objectives, enhancing localization performance while enabling the aggregation of contextual features throughout the global scene, thereby improving caption generation at the same time. Extensive experiments on the two most widely used 3D dense captioning datasets (ScanRefer and Nr3D) demonstrate that our proposed method achieves a significant improvement over prior methods.
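
The two kinds of context can be illustrated with a small two-way cross-attention module: object queries attend over scene features to summarize the structure relevant to each object, and the scene features then attend back over those summaries. The layer sizes, names, and single-layer structure below are assumptions for illustration, not the paper's architecture.

```python
import torch
import torch.nn as nn

class BidirectionalContext(nn.Module):
    """Toy two-way cross-attention between object queries and scene context."""

    def __init__(self, dim=256, heads=8):
        super().__init__()
        self.obj_to_ctx = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.ctx_to_obj = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, object_queries, scene_tokens):
        # "Contextualized geometries": each object summarizes the structural
        # context relevant to it by attending over the whole scene.
        ctx_geom, _ = self.obj_to_ctx(object_queries, scene_tokens, scene_tokens)
        # "Contextualized objects": scene tokens aggregate the objects that
        # are relevant to the summarized context.
        ctx_obj, _ = self.ctx_to_obj(scene_tokens, ctx_geom, ctx_geom)
        return ctx_geom, ctx_obj

queries = torch.randn(1, 32, 256)     # 32 object proposals
scene = torch.randn(1, 1024, 256)     # scene-level point features
geom, objs = BidirectionalContext()(queries, scene)
print(geom.shape, objs.shape)         # (1, 32, 256) (1, 1024, 256)
```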

Tue 1 Oct. 0:40 - 0:50 PDT

OmniNOCS: A unified NOCS dataset and model for 3D lifting of 2D objects

Akshay Krishnan · Abhijit Kundu · Kevis Maninis · James Hays · Matthew Brown

We propose OmniNOCS: a large-scale monocular dataset with 3D Normalized Object Coordinate Space (NOCS), object masks, and 3D bounding box annotations for indoor and outdoor scenes. OmniNOCS has 20 times more object classes and 200 times more instances than existing NOCS datasets (NOCS-Real275, Wild6D). We use OmniNOCS to train a novel, transformer-based monocular NOCS prediction model (NOCSformer) that can predict accurate NOCS, instance masks and poses from 2D object detections across diverse classes. It is the first NOCS model that can generalize to a broad range of classes when prompted with 2D boxes. We evaluate our model on the task of 3D oriented bounding box prediction, where it achieves comparable results to state-of-the-art 3D detection methods such as Cube R-CNN. Unlike other 3D detection methods, our model also provides detailed and accurate 3D object shape and segmentation. We propose a novel benchmark for the task of NOCS prediction based on OmniNOCS, which we hope will serve as a useful baseline for future work in this area. Our dataset and code are at the project website: https://omninocs.github.io
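
One common way to turn dense NOCS predictions into a metric, oriented 3D box is to fit a similarity transform between the predicted normalized coordinates and corresponding 3D points (e.g., from depth or predicted geometry), using the classic Umeyama closed-form solution. The sketch below shows that generic recipe under assumed inputs; it is not OmniNOCS's released code and the paper's monocular lifting may differ.

```python
import numpy as np

def umeyama_similarity(nocs, points):
    """Scale s, rotation R, translation t with s * R @ nocs_i + t ≈ points_i.

    nocs:   (N, 3) predicted normalized object coordinates in [-0.5, 0.5].
    points: (N, 3) corresponding 3D points for the same pixels.
    """
    mu_n, mu_p = nocs.mean(0), points.mean(0)
    X, Y = nocs - mu_n, points - mu_p
    cov = Y.T @ X / len(nocs)                    # cross-covariance
    U, D, Vt = np.linalg.svd(cov)
    S = np.eye(3)
    if np.linalg.det(U) * np.linalg.det(Vt) < 0:
        S[2, 2] = -1.0                           # avoid reflections
    R = U @ S @ Vt
    var_n = (X ** 2).sum() / len(nocs)
    s = np.trace(np.diag(D) @ S) / var_n
    t = mu_p - s * R @ mu_n
    return s, R, t

# Sanity check: recover a synthetic scale, rotation, and translation exactly.
rng = np.random.default_rng(1)
nocs = rng.uniform(-0.5, 0.5, size=(200, 3))
R_gt = np.linalg.qr(rng.normal(size=(3, 3)))[0]
if np.linalg.det(R_gt) < 0:
    R_gt[:, 0] *= -1
points = 2.0 * nocs @ R_gt.T + np.array([0.1, -0.3, 1.5])
s, R, t = umeyama_similarity(nocs, points)
print(round(float(s), 4), np.allclose(R, R_gt, atol=1e-6))   # 2.0 True
```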

Tue 1 Oct. 0:50 - 1:00 PDT

ABC Easy as 123: A Blind Counter for Exemplar-Free Multi-Class Class-agnostic Counting

Michael A Hobley · Victor Adrian Prisacariu

Class-agnostic counting methods enumerate objects of an arbitrary class, providing tremendous utility in many fields. Prior works have limited usefulness, as they require either a set of exemplars of the type to be counted or that the query image contain only a single type of object. A significant factor in these shortcomings is the lack of a dataset that properly addresses counting in settings with more than one kind of object present. To address these issues, we propose the first Multi-class, Class-Agnostic Counting dataset (MCAC) and A Blind Counter (ABC123), a method that can count multiple types of objects simultaneously without using exemplars of the counted types during training or inference. ABC123 introduces a new paradigm in which, instead of requiring exemplars to guide the enumeration, examples are found after the counting stage to help a user understand the generated outputs. We show that ABC123 outperforms contemporary methods on MCAC without needing human-in-the-loop annotations. We also show that this performance transfers to FSC-147, the standard class-agnostic counting dataset. MCAC is available at MCAC.active.vision and ABC123 is available at ABC123.active.vision.
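
Exemplar-free counting is often realized by regressing one density map per discovered object type and integrating it; candidate peaks can then be surfaced afterwards as the "examples" shown to the user. The sketch below illustrates that read-out under assumed tensor shapes and a made-up threshold; it is not ABC123's network.

```python
import torch

def counts_from_density_maps(density_maps, threshold=0.5):
    """Turn per-type density maps into counts, no exemplars required.

    density_maps: (B, K, H, W) densities for K discovered object types.
    Integrating each map gives its count; local maxima above a threshold can
    later be shown to the user as post-hoc examples of what was counted.
    """
    counts = density_maps.sum(dim=(2, 3))                       # (B, K)
    peaks = torch.nn.functional.max_pool2d(
        density_maps, kernel_size=3, stride=1, padding=1)
    example_locations = (density_maps == peaks) & (density_maps > threshold)
    return counts, example_locations

maps = torch.rand(1, 4, 64, 64) * 0.01     # 4 hypothetical object types
counts, examples = counts_from_density_maps(maps)
print(counts.shape, examples.shape)        # (1, 4) and (1, 4, 64, 64)
```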

Tue 1 Oct. 1:00 - 1:10 PDT

A Fair Ranking and New Model for Panoptic Scene Graph Generation

Julian Lorenz · Alexander Pest · Daniel Kienzle · Katja Ludwig · Rainer Lienhart

In panoptic scene graph generation (PSGG), models retrieve interactions between objects in an image that are grounded by panoptic segmentation masks. Previous evaluations on panoptic scene graphs have been subject to an erroneous evaluation protocol where multiple masks for the same object can lead to multiple relation distributions per mask-mask pair. This can be exploited to increase the final score. We correct this flaw and provide a fair ranking over a wide range of existing PSGG models. The observed scores for existing methods increase by up to 7.4 mR@50 for all two-stage methods, while dropping by up to 19.3 mR@50 for all one-stage methods, highlighting the importance of a correct evaluation. Contrary to recent publications, we show that existing two-stage methods are competitive with one-stage methods. Building on this, we introduce the Decoupled SceneFormer (DSFormer), a novel two-stage model that outperforms all existing scene graph models by a large margin of +11 mR@50 and +10 mNgR@50 on the corrected evaluation, thus setting a new SOTA. As a core design principle, DSFormer encodes subject and object masks directly into feature space.
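
The corrected protocol essentially requires that duplicate masks of the same object collapse onto a single relation distribution per object pair. Below is a minimal sketch of such a deduplication step, assuming predictions have already been matched to ground-truth panoptic segments; the paper's official evaluation code should be consulted for the exact procedure.

```python
def deduplicate_relations(predictions, mask_to_segment):
    """Keep one prediction per (subject segment, object segment, predicate).

    predictions: list of (subject_mask_id, object_mask_id, predicate, score).
    mask_to_segment: maps each predicted mask id to the ground-truth panoptic
    segment it was matched to, so duplicate masks of the same object collapse
    onto one key instead of inflating recall.
    """
    best = {}
    for subj, obj, predicate, score in predictions:
        key = (mask_to_segment[subj], mask_to_segment[obj], predicate)
        if key not in best or score > best[key][3]:
            best[key] = (subj, obj, predicate, score)
    return sorted(best.values(), key=lambda p: p[3], reverse=True)

# Two masks (0 and 1) cover the same person; only one "riding" relation survives.
preds = [(0, 2, "riding", 0.9), (1, 2, "riding", 0.8), (0, 2, "next to", 0.4)]
print(deduplicate_relations(preds, {0: "person#1", 1: "person#1", 2: "horse#1"}))
```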

Tue 1 Oct. 1:10 - 1:20 PDT

Award Candidate
Expanding Scene Graph Boundaries: Fully Open-vocabulary Scene Graph Generation via Visual-Concept Alignment and Retention

Zuyao Chen · Jinlin Wu · Zhen Lei · Zhaoxiang Zhang · Chang Wen Chen

Scene Graph Generation (SGG) offers a structured representation critical to many computer vision applications. Traditional SGG approaches, however, are limited by a closed-set assumption, restricting them to recognizing only predefined object and relation categories. To overcome this, we categorize SGG scenarios into four distinct settings based on the openness of the node and edge categories: Closed-set SGG, Open Vocabulary (object) Detection-based SGG (OvD-SGG), Open Vocabulary Relation-based SGG (OvR-SGG), and Open Vocabulary Detection + Relation-based SGG (OvD+R-SGG). While object-centric open-vocabulary SGG has been studied recently, the more challenging problem of relation-involved open-vocabulary SGG remains relatively unexplored. To fill this gap, we propose a unified framework named OvSGTR for fully open-vocabulary SGG from a holistic view. The proposed framework is an end-to-end transformer architecture that learns a visual-concept alignment for both nodes and edges, enabling the model to recognize unseen categories. For the more challenging settings of relation-involved open-vocabulary SGG, the proposed approach integrates relation-aware pre-training on image-caption data and retains visual-concept alignment through knowledge distillation. Comprehensive experimental results on the Visual Genome benchmark demonstrate the effectiveness and superiority of the proposed framework.
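
Visual-concept alignment for open-vocabulary recognition typically amounts to scoring node (or edge) embeddings against text embeddings of category names, so new categories can be added simply by embedding their names. The sketch below shows that scoring step with made-up dimensions and random tensors standing in for a real text encoder; it is not OvSGTR's training pipeline.

```python
import torch
import torch.nn.functional as F

def open_vocab_classify(node_features, concept_embeddings, temperature=0.07):
    """Score nodes (or edges) against free-form concept embeddings.

    node_features:      (N, D) visual embeddings of detected nodes or edges.
    concept_embeddings: (C, D) text embeddings of category names, which may
    include classes never seen during training.
    """
    sim = F.normalize(node_features, dim=-1) @ F.normalize(
        concept_embeddings, dim=-1).T                 # cosine similarity
    return (sim / temperature).softmax(dim=-1)        # (N, C) class scores

nodes = torch.randn(5, 512)                 # 5 detected objects
concepts = torch.randn(120, 512)            # stand-in for 120 embedded class names
probs = open_vocab_classify(nodes, concepts)
print(probs.shape)                          # torch.Size([5, 120]); rows sum to 1
```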