Oral Session
Oral 5B: Vision Applications
Auditorium
Moderators: Nicoletta Noceti · Joachim Denzler
Robust Fitting on a Gate Quantum Computer
Frances Yang · Michele Sasdelli · Tat-Jun Chin
Gate quantum computers generate significant interest due to their potential to solve certain difficult problems, such as prime factorization, in polynomial time. Computer vision researchers have long been attracted to the power of quantum computers. Robust fitting, which is fundamentally important to many computer vision pipelines, has recently been shown to be amenable to gate quantum computing. The previously proposed solution computes Boolean influence as a measure of outlyingness using the Bernstein-Vazirani quantum circuit. However, that method assumed a quantum implementation of an $\ell_\infty$ feasibility test, which has not been demonstrated. In this paper, we take a big stride towards quantum robust fitting: we propose a quantum circuit that solves the $\ell_\infty$ feasibility test in the 1D case, which allows us to demonstrate quantum robust fitting on a real gate quantum computer, the IonQ Aria, for the first time. We also show how 1D Boolean influences can be accumulated to compute Boolean influences for higher-dimensional non-linear models, which we validate experimentally on real benchmark datasets.
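For context, a minimal sketch of Boolean influence under the usual formulation (the notation, residuals $r_i(\theta)$, and tolerance $\epsilon$ here are assumptions, not taken from the paper): given $N$ data points, define a feasibility indicator over point subsets $\mathbf{z} \in \{0,1\}^N$ and the influence of point $i$ as
$$ f(\mathbf{z}) = \begin{cases} 1 & \text{if } \min_{\theta}\, \max_{i:\, z_i = 1} |r_i(\theta)| \le \epsilon,\\ 0 & \text{otherwise,} \end{cases} \qquad I_i(f) = \Pr_{\mathbf{z}}\bigl[f(\mathbf{z}) \neq f(\mathbf{z} \oplus \mathbf{e}_i)\bigr], $$
where $\mathbf{e}_i$ flips the $i$-th bit; a higher influence $I_i(f)$ indicates greater outlyingness of point $i$.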
Geospecific View Generation - Geometry-Context Aware High-resolution Ground View Inference from Satellite Views
Ningli Xu · Rongjun Qin
Predicting realistic ground views from satellite imagery in urban scenes is a challenging task due to the significant view gap between satellite and ground-view images. We propose a novel pipeline to tackle this challenge by generating geospecific views that maximally respect the weak geometry and texture available from multi-view satellite images. Unlike existing approaches that hallucinate images from cues such as partial semantics or geometry derived from overhead satellite images, our method directly predicts ground-view images at the target geolocation using a comprehensive set of information from the satellite imagery, yielding ground-level images whose resolution is boosted by a factor of ten or more. We leverage a novel building refinement method to reduce geometric distortions in satellite data at ground level, which ensures the creation of accurate conditions for view synthesis using diffusion networks. Moreover, we propose a novel geospecific prior, which prompts distribution learning of diffusion models to respect image samples that are closer to the geolocation of the predicted images. We demonstrate that our pipeline is the first to generate close-to-real and geospecific ground views based merely on satellite images. Code and data will be shared.
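As a purely hypothetical illustration of the idea behind a geolocation-aware prior (this is not the paper's geospecific prior; the weighting rule, `geo_dist`, and `tau` are assumptions), one could weight the diffusion training objective so that samples nearer the target geolocation contribute more:

```python
import torch

def geo_weighted_diffusion_loss(pred_noise, true_noise, geo_dist, tau=1.0):
    """Hypothetical sketch: weight the denoising loss by geographic proximity.

    geo_dist: per-sample distance to the target geolocation (shape: [B]).
    Closer samples receive exponentially larger weights.
    """
    weights = torch.exp(-geo_dist / tau)                              # closer -> heavier
    per_sample = ((pred_noise - true_noise) ** 2).flatten(1).mean(dim=1)
    return (weights * per_sample).mean()
```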
Language-Driven 6-DoF Grasp Detection Using Negative Prompt Guidance
Tien Toan Nguyen · Minh Nhat Nhat Vu · Baoru Huang · An Dinh Vuong · Quan Vuong · Ngan Le · Thieu Vo · Anh Nguyen
6-DoF grasp detection has been a fundamental and challenging problem in robotic vision. While previous works have focused on ensuring grasp stability, they often do not consider human intention conveyed through natural language, hindering effective collaboration between robots and users in complex 3D environments. In this paper, we present a new approach for language-driven 6-DoF grasp detection in cluttered point clouds. We first introduce Grasp-Anything-6D, a large-scale dataset for the language-driven 6-DoF grasp detection task with 1M point cloud scenes and more than 200M language-associated 3D grasp poses. We further introduce a novel diffusion model that incorporates a new negative prompt guidance learning strategy. The proposed negative prompt strategy directs the detection process toward the desired object while steering it away from unwanted ones, given the language input. Our method enables an end-to-end framework in which humans can command the robot to grasp desired objects in a cluttered scene using natural language. Extensive experimental results show the effectiveness of our method in both benchmark experiments and real-world scenarios, surpassing other baselines. In addition, we demonstrate the practicality of our approach in real-world robotic applications.
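To make the negative-prompt idea concrete, here is a minimal sketch of how negative guidance is commonly combined at diffusion sampling time, in the style of classifier-free guidance (illustrative only; the denoiser interface, embeddings, and guidance rule are assumptions, not the paper's implementation):

```python
def guided_noise(denoiser, x_t, t, pos_emb, neg_emb, scale=5.0):
    """Combine positive- and negative-prompt predictions at one sampling step.

    denoiser(x_t, t, cond) -> predicted noise tensor.
    pos_emb: embedding of the commanded object ("grasp the red mug").
    neg_emb: embedding of unwanted objects to steer away from.
    All names are illustrative placeholders.
    """
    eps_pos = denoiser(x_t, t, pos_emb)   # pull the grasp pose toward the target
    eps_neg = denoiser(x_t, t, neg_emb)   # push it away from unwanted objects
    return eps_neg + scale * (eps_pos - eps_neg)
```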
MaxMI: A Maximal Mutual Information Criterion for Manipulation Concept Discovery
Pei Zhou · Yanchao Yang
We aim to discover manipulation concepts embedded in unannotated demonstrations, recognized as key physical states. The discovered concepts can facilitate training manipulation policies and promote generalization. Current methods that rely on multimodal foundation models to derive key states usually lack accuracy and semantic consistency due to limited multimodal robot data. In contrast, we introduce an information-theoretic criterion to characterize the regularities that signify a set of physical states. We also develop a framework that trains a concept discovery network using this criterion, thus bypassing the dependence on human semantics and alleviating costly human labeling. The proposed criterion is based on the observation that key states, which deserve to be conceptualized, often admit more physical constraints than non-key states. This phenomenon can be formalized as maximizing the mutual information between the putative key state and its preceding state, i.e., Maximal Mutual Information (MaxMI). By employing MaxMI, the trained key state localization network can accurately identify states of sufficient physical significance, exhibiting reasonable semantic compatibility with human perception. Furthermore, the proposed framework produces key states that lead to concept-guided manipulation policies with higher success rates and better generalization across various robotic tasks than the baselines, verifying the effectiveness of the proposed criterion.
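In symbols, one way to read the MaxMI criterion (notation ours; the temporal offset $\delta$ and the mutual-information estimator are left abstract) is to score each time step of a demonstration by the mutual information between the putative key state and its preceding state and keep the maximizers:
$$ t^{\star} = \arg\max_{t}\; I\bigl(\mathbf{s}_{t};\, \mathbf{s}_{t-\delta}\bigr), $$
where $\mathbf{s}_t$ denotes the physical state at step $t$ and $I(\cdot\,;\cdot)$ is estimated across demonstrations.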
Align before Collaborate: Mitigating Feature Misalignment for Robust Multi-Agent Perception
Dingkang Yang · Ke Li · Dongling Xiao · Zedian Shao · Peng Sun · Liang Song
Collaborative perception has recently received widespread attention since it enhances the perception ability of autonomous vehicles via inter-agent information sharing. However, the performance of existing systems is hindered by unavoidable collaboration noise, which induces feature-level spatial misalignment in the collaborator-shared information. In this paper, we propose a model-agnostic and lightweight plugin to mitigate the feature-level misalignment issue, called dynamic feature alignment (NEAT). The merits of the NEAT plugin are threefold. First, we introduce an importance-guided query proposal to predict potential foreground regions with space-channel semantics and exclude environmental redundancies. On this basis, a deformable feature alignment is presented to explicitly align the collaborator-shared features through query-aware spatial associations, aggregating multi-grained visual cues to correct mismatches. Finally, we perform a region cross-attention reinforcement to facilitate aligned representation diffusion and achieve global semantic enhancement of the features. NEAT can be readily inserted into existing collaborative perception pipelines and significantly improves the robustness of vanilla baselines against pose errors and transmission delay. Extensive experiments on four collaborative 3D object detection datasets under noisy settings confirm that NEAT provides consistent gains for most methods with distinct structures.
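For intuition, the sketch below shows the generic mechanism of offset-based feature alignment between an ego feature map and a collaborator-shared one (this is not the NEAT plugin itself; the single-convolution offset head and normalization are simplifying assumptions):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DeformableAlign(nn.Module):
    """Minimal sketch: predict per-location 2-D offsets and resample the
    collaborator's bird's-eye-view features so they line up with the ego map."""

    def __init__(self, channels):
        super().__init__()
        self.offset_head = nn.Conv2d(2 * channels, 2, kernel_size=3, padding=1)

    def forward(self, ego_feat, collab_feat):
        # ego_feat, collab_feat: (B, C, H, W) feature maps in the ego frame
        B, _, H, W = ego_feat.shape
        offsets = self.offset_head(torch.cat([ego_feat, collab_feat], dim=1))  # (B, 2, H, W)

        # Base sampling grid in normalized [-1, 1] coordinates.
        ys, xs = torch.meshgrid(
            torch.linspace(-1, 1, H, device=ego_feat.device),
            torch.linspace(-1, 1, W, device=ego_feat.device),
            indexing="ij",
        )
        base = torch.stack([xs, ys], dim=-1).expand(B, H, W, 2)

        # Add predicted offsets (converted from pixels to normalized units) and resample.
        grid = base + offsets.permute(0, 2, 3, 1) / torch.tensor(
            [W / 2.0, H / 2.0], device=ego_feat.device)
        return F.grid_sample(collab_feat, grid, align_corners=True)
```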
Faceptor: A Generalist Model for Face Perception
Lixiong Qin · Mei Wang · Xuannan Liu · Yuhang Zhang · Wei Deng · Xiaoshuai Song · Weiran Xu · Weihong Deng
With the comprehensive research conducted on various face analysis tasks, there is growing interest among researchers in developing a unified approach to face perception. Existing methods mainly discuss unified representation and training, and lack task extensibility and application efficiency. To tackle this issue, we focus on a unified model structure, exploring a face generalist model. As an intuitive design, Naive Faceptor enables tasks with the same output shape and granularity to share the structural design of a standardized output head, achieving improved task extensibility. Furthermore, Faceptor is proposed, adopting a well-designed single-encoder dual-decoder architecture that allows task-specific queries to represent newly introduced semantics. This design enhances the unification of the model structure while improving application efficiency in terms of storage overhead. Additionally, we introduce Layer-Attention into Faceptor, enabling the model to adaptively select features from the optimal layers to perform the desired tasks. Through joint training on 13 face perception datasets, Faceptor achieves exceptional performance in facial landmark localization, face parsing, age estimation, expression recognition, binary attribute classification, and face recognition, matching or surpassing specialized methods in most tasks. Our training framework can also be applied to auxiliary supervised learning, significantly improving performance on data-sparse tasks such as age estimation and expression recognition. The code and models will be made publicly available.
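The layer-selection idea can be pictured with a small sketch: a learned softmax over per-layer encoder features, with one weight vector per task (this is only a generic illustration under those assumptions, not Faceptor's exact Layer-Attention module):

```python
import torch
import torch.nn as nn

class LayerAttention(nn.Module):
    """Sketch: let each task adaptively weight the encoder's intermediate layers."""

    def __init__(self, num_layers, num_tasks):
        super().__init__()
        # One learnable weight vector over layers per task.
        self.logits = nn.Parameter(torch.zeros(num_tasks, num_layers))

    def forward(self, layer_feats, task_id):
        # layer_feats: (L, B, N, C) features collected from the L encoder layers
        weights = torch.softmax(self.logits[task_id], dim=0)        # (L,)
        return torch.einsum("l,lbnc->bnc", weights, layer_feats)    # (B, N, C)
```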
A Geometric Distortion Immunized Deep Watermarking Framework with Robustness Generalizability
Linfeng Ma · Han Fang · Tianyi Wei · Zijin Yang · Zehua Ma · Weiming Zhang · Nenghai Yu
Robustness is the most important property of watermarking schemes. In practice, a watermarking mechanism should be robust to both geometric and non-geometric distortions. In deep learning-based watermarking frameworks, robustness can be ensured by end-to-end training with different noise layers. However, most current CNN-based watermarking frameworks, even when trained with targeted distortions, cannot adapt well to geometric distortions because of their architectural design. Since the spatial structure of traditional convolutional layers is relatively fixed, they lack the flexibility to capture the effects of geometric distortion, making it difficult to train for the corresponding robustness. To address these limitations, we propose a watermark model backbone based on the Swin Transformer and Deformable Convolutional Networks (DCN). The attention mechanism and the deformable convolutional window effectively improve feature processing flexibility, greatly enhancing robustness, especially to geometric distortions. For non-geometric distortions, aiming to improve generalizability to a broader range of distortions, we also provide a distortion-style-ensembled noise layer, consisting of an image encoder, an image decoder, and distortion-style layers that can effectively simulate the styles of different kinds of distortions. In the final watermark model training stage, we can simply train with our proposed noise layer for overall robustness. Extensive experiments show that, compared to existing state-of-the-art (SOTA) works, our method achieves better visual quality and clearly superior performance under the tested geometric distortions, with 100.00% watermark extraction accuracy in almost all cases. The relevant code will be released after review.
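For readers new to noise layers, the sketch below shows how an ensemble of simulated distortions typically plugs into end-to-end watermark training (this is a generic baseline-style noise layer under our own assumptions; the paper instead learns a distortion-style network with an image encoder, decoder, and style layers):

```python
import random
import torch
import torchvision.transforms.functional as TF

def ensemble_noise_layer(watermarked):
    """Apply one randomly chosen distortion to the watermarked image batch.

    watermarked: (B, C, H, W) tensor. Distortion pool mixes non-geometric
    (noise, blur, rescaling) and geometric (rotation) attacks.
    """
    distortions = [
        lambda x: x + 0.05 * torch.randn_like(x),                       # Gaussian noise
        lambda x: TF.gaussian_blur(x, kernel_size=5),                   # blur
        lambda x: TF.rotate(x, angle=random.uniform(-30, 30)),          # geometric: rotation
        lambda x: TF.resize(TF.resize(x, [x.shape[-2] // 2, x.shape[-1] // 2]),
                            [x.shape[-2], x.shape[-1]]),                # down/up-scaling
    ]
    return random.choice(distortions)(watermarked)
```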
COHO: Context-Sensitive City-Scale Hierarchical Urban Layout Generation
Liu He · Daniel Aliaga
The generation of large-scale urban layouts has garnered substantial interest across various disciplines. Prior methods have relied on procedural generation, which requires manual rule coding, or on deep learning, which needs abundant data; neither considers the context-sensitive nature of urban layout generation. Our approach addresses this gap by leveraging a canonical graph representation of the entire city, which facilitates scalability and captures the multi-layer semantics inherent in urban layouts. We introduce a novel graph-based masked autoencoder (GMAE) for city-scale urban layout generation. The method encodes attributed buildings, city blocks, communities, and cities into a unified graph structure, enabling self-supervised masked training of the graph autoencoder. Additionally, we employ scheduled iterative sampling for 2.5D layout generation, prioritizing the generation of important city blocks and buildings. Our approach achieves good realism, semantic consistency, and correctness across the heterogeneous urban styles of 330 US cities. Code and datasets are released at: https://github.com/Arking1995/COHO.
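As a rough illustration of self-supervised masked training on a graph (not COHO's exact objective; the encoder/decoder interfaces, zero-token masking, and MSE loss are assumptions for the sketch):

```python
import torch
import torch.nn as nn

def masked_graph_training_step(encoder, decoder, node_feats, adj, mask_ratio=0.5):
    """One masked-autoencoding step over a city graph.

    node_feats: (N, D) attributes of buildings/blocks; adj: adjacency used by
    the (user-supplied) graph encoder and decoder for message passing.
    """
    num_nodes = node_feats.shape[0]
    mask = torch.rand(num_nodes, device=node_feats.device) < mask_ratio  # nodes to hide
    corrupted = node_feats.clone()
    corrupted[mask] = 0.0                                                # mask token

    latent = encoder(corrupted, adj)        # encode the partially observed graph
    recon = decoder(latent, adj)            # reconstruct all node attributes
    return nn.functional.mse_loss(recon[mask], node_feats[mask])
```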