

Oral Session

Oral 3C: Point Clouds

Silver Room

Moderators: Yiming Wang · Christian Rupprecht

Wed 2 Oct midnight PDT — 1:30 a.m. PDT

Wed 2 Oct. 0:00 - 0:10 PDT

HGL: Hierarchical Geometry Learning for Test-time Adaptation in 3D Point Cloud Segmentation

Tianpei Zou · Sanqing Qu · Zhijun Li · Alois C. Knoll · Lianghua He · Guang Chen · Changjun Jiang

3D point cloud segmentation has received significant interest for its growing applications. However, the generalization ability of models suffers in dynamic scenarios due to the distribution shift between test and training data. To promote robustness and adaptability across diverse scenarios, test-time adaptation (TTA) has recently been introduced. Nevertheless, most existing TTA methods are developed for images, and the limited approaches applicable to point clouds ignore the inherent hierarchical geometric structures in point cloud streams, i.e., local (point-level), global (object-level), and temporal (frame-level) structures. In this paper, we delve into TTA in 3D point cloud segmentation and propose a novel Hierarchical Geometry Learning (HGL) framework. HGL comprises three complementary modules from local, global to temporal learning in a bottom-up manner. Technically, we first construct a local geometry learning module for pseudo-label generation. Next, we build prototypes from the global geometry perspective for pseudo-label fine-tuning. Furthermore, we introduce a temporal consistency regularization module to mitigate negative transfer. Extensive experiments on four datasets demonstrate the effectiveness and superiority of our HGL. Remarkably, on the SynLiDAR to SemanticKITTI task, HGL achieves an overall mIoU of 46.91%, improving over GIPSO by 3.0% and significantly reducing the required adaptation time by 80%.
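
As a rough illustration of the global, prototype-based pseudo-label fine-tuning step described above, the sketch below builds class prototypes from high-confidence points and reassigns uncertain points to their nearest prototype. It is not the authors' code; the confidence threshold and cosine-similarity assignment are assumptions made for the example.

```python
# Illustrative sketch (not the authors' code): refining per-point pseudo-labels
# with class prototypes, in the spirit of HGL's global geometry module.
import numpy as np

def refine_pseudo_labels(features, probs, conf_thresh=0.9):
    """features: (N, D) point features; probs: (N, C) softmax scores."""
    pseudo = probs.argmax(axis=1)                  # initial point-level pseudo-labels
    conf = probs.max(axis=1)
    num_classes = probs.shape[1]

    # Build one prototype per class from confident points only.
    prototypes = np.zeros((num_classes, features.shape[1]))
    for c in range(num_classes):
        mask = (pseudo == c) & (conf >= conf_thresh)
        if mask.any():
            prototypes[c] = features[mask].mean(axis=0)

    # Reassign low-confidence points to their nearest prototype (cosine similarity).
    f = features / (np.linalg.norm(features, axis=1, keepdims=True) + 1e-8)
    p = prototypes / (np.linalg.norm(prototypes, axis=1, keepdims=True) + 1e-8)
    sim = f @ p.T                                  # (N, C) similarity to each prototype
    refined = pseudo.copy()
    low_conf = conf < conf_thresh
    refined[low_conf] = sim[low_conf].argmax(axis=1)
    return refined
```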

Wed 2 Oct. 0:10 - 0:20 PDT

Award Candidate
PointLLM: Empowering Large Language Models to Understand Point Clouds

Runsen Xu · Xiaolong Wang · Tai Wang · Yilun Chen · Jiangmiao Pang · Dahua Lin

The unprecedented advancements in Large Language Models (LLMs) have shown a profound impact on natural language processing but are yet to fully embrace the realm of 3D understanding. This paper introduces PointLLM, a preliminary effort to fill this gap, empowering LLMs to understand point clouds and offering a new avenue beyond 2D data. PointLLM understands colored object point clouds with human instructions and generates contextually appropriate responses, illustrating its grasp of point clouds and common sense. Specifically, it leverages a point cloud encoder with a powerful LLM to effectively fuse geometric, appearance, and linguistic information. To overcome the scarcity of point-text instruction following data, we developed an automated data generation pipeline, collecting a large-scale dataset of more than 730K samples with 660K different objects, which facilitates the adoption of the two-stage training strategy prevalent in MLLM development. Additionally, we address the absence of appropriate benchmarks and the limitations of current evaluation metrics by proposing two novel benchmarks: Generative 3D Object Classification and 3D Object Captioning, which are supported by new, comprehensive evaluation metrics derived from human and GPT analyses. Through exploring various training strategies, we develop PointLLM, significantly surpassing 2D and 3D baselines, with a notable achievement in human-evaluated object captioning tasks where it surpasses human annotators in over 50% of the samples. Codes, datasets, and benchmarks are available at https://github.com/OpenRobotLab/PointLLM.
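
A minimal sketch of the modality fusion described above: point-encoder tokens are projected into the LLM's embedding space and prepended to the text token embeddings, so the LLM attends over both modalities. The projector, token counts, and dimensions are hypothetical placeholders, not values from the paper or its released code.

```python
# Illustrative sketch (not the released PointLLM code): fuse point tokens with text
# tokens by projecting the point tokens into the LLM embedding space.
import numpy as np

def fuse_point_and_text(point_tokens, text_token_embeds, projector_W, projector_b):
    """point_tokens: (P, D_pc) encoder outputs; text_token_embeds: (T, D_llm)."""
    point_embeds = point_tokens @ projector_W + projector_b   # (P, D_llm)
    # The fused sequence would then be fed to the LLM as its input embeddings.
    return np.concatenate([point_embeds, text_token_embeds], axis=0)

# Example with hypothetical dimensions.
rng = np.random.default_rng(0)
fused = fuse_point_and_text(
    rng.normal(size=(32, 384)),             # 32 point tokens
    rng.normal(size=(20, 4096)),            # 20 text token embeddings
    rng.normal(size=(384, 4096)), np.zeros(4096))
print(fused.shape)                           # (52, 4096)
```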

Wed 2 Oct. 0:20 - 0:30 PDT

RISurConv: Rotation Invariant Surface Attention-Augmented Convolutions for 3D Point Cloud Classification and Segmentation

Zhiyuan Zhang · Licheng Yang · Zhiyu Xiang

Despite the progress in 3D point cloud deep learning, most prior works focus on learning features that are invariant to translation and point permutation, and very limited effort has been devoted to the rotation invariance property. Several recent studies achieve rotation invariance at the cost of lower accuracies. In this work, we close this gap by proposing a novel yet effective rotation-invariant architecture for 3D point cloud classification and segmentation. Instead of traditional pointwise operations, we construct local triangle surfaces to capture more detailed surface structure, from which we extract highly expressive rotation-invariant surface properties; these are then integrated into an attention-augmented convolution operator named RISurAAConv to generate refined attention features via self-attention layers. Based on RISurAAConv, we build an effective neural network for 3D point cloud analysis that is invariant to arbitrary rotations while maintaining high accuracy. We verify the performance on various benchmarks, surpassing the previous state-of-the-art by a large margin. We achieve 95.3% (+4.3%) on ModelNet40, 92.6% (+12.3%) on ScanObjectNN, and 96.4% (+7.0%), 87.6% (+13.0%), and 88.7% (+7.7%) on the three categories of the FG3D dataset for fine-grained classification, and 81.5% (+1.0%) mIoU on ShapeNet for segmentation. The code and models will be released upon publication.
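
The rotation invariance claimed above rests on the fact that quantities derived from pairwise distances on a local triangle are unchanged by rigid rotations. The sketch below illustrates only that basic property; the paper's RISurAAConv operator extracts richer surface descriptors and adds attention-augmented convolution on top.

```python
# Illustrative sketch: rotation-invariant descriptors from a local triangle.
# Edge lengths and interior angles depend only on pairwise distances, so they are
# unchanged by any rigid rotation of the point cloud.
import numpy as np

def triangle_invariants(p, q, r):
    """p, q, r: (3,) vertices of a local triangle."""
    a, b, c = np.linalg.norm(q - r), np.linalg.norm(p - r), np.linalg.norm(p - q)
    # Interior angle at p via the law of cosines.
    cos_p = (b**2 + c**2 - a**2) / (2 * b * c + 1e-12)
    return np.array([a, b, c, np.arccos(np.clip(cos_p, -1.0, 1.0))])

# The descriptor is identical before and after applying a random orthogonal matrix
# (rotation or reflection); distances are preserved either way.
rng = np.random.default_rng(0)
pts = rng.normal(size=(3, 3))
Q, _ = np.linalg.qr(rng.normal(size=(3, 3)))
print(np.allclose(triangle_invariants(*pts), triangle_invariants(*(pts @ Q.T))))
```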

Wed 2 Oct. 0:30 - 0:40 PDT

DVLO: Deep Visual-LiDAR Odometry with Local-to-Global Feature Fusion and Bi-Directional Structure Alignment

Jiuming Liu · Dong Zhuo · Zhiheng Feng · Siting Zhu · Chensheng Peng · Zhe Liu · Hesheng Wang

Visual and LiDAR data are highly complementary: images provide fine-grained texture, while point clouds provide rich geometric information. However, it remains challenging to explore effective visual-LiDAR fusion, mainly due to the intrinsic data structure inconsistency between the two modalities: images are regular and dense, whereas LiDAR points are unordered and sparse. To address this problem, we propose a local-to-global fusion network with bi-directional structure alignment. To obtain locally fused features, we project points onto the image plane as cluster centers and cluster image pixels around each center. Image pixels are pre-organized as pseudo points for image-to-point structure alignment. Then, we convert points to pseudo images by cylindrical projection (point-to-image structure alignment) and perform adaptive global feature fusion between point features and the locally fused features. Our method achieves state-of-the-art performance on the KITTI odometry and FlyingThings3D scene flow datasets compared to both single-modal and multi-modal methods. Codes will be released upon publication.
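
A rough sketch of the image-to-point local fusion step described above: LiDAR points are projected onto the image plane with a pinhole model, and the pixels inside a small window around each projected point are gathered as cluster members. The intrinsics and window size are placeholders, not values from the paper.

```python
# Illustrative sketch: project points with a pinhole camera model and gather the
# pixel coordinates around each projected point as a local cluster.
import numpy as np

def gather_pixel_clusters(points_cam, K, image_hw, window=3):
    """points_cam: (N, 3) points in the camera frame; K: (3, 3) intrinsics."""
    H, W = image_hw
    proj = points_cam @ K.T
    uv = proj[:, :2] / proj[:, 2:3]                   # perspective division
    clusters = []
    for u, v in uv:
        u, v = int(round(u)), int(round(v))
        if 0 <= u < W and 0 <= v < H:                 # keep only in-image centers
            us = np.clip(np.arange(u - window, u + window + 1), 0, W - 1)
            vs = np.clip(np.arange(v - window, v + window + 1), 0, H - 1)
            clusters.append(np.stack(np.meshgrid(us, vs), axis=-1).reshape(-1, 2))
        else:
            clusters.append(np.empty((0, 2), dtype=int))
    return clusters
```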

Wed 2 Oct. 0:40 - 0:50 PDT

KeypointDETR: An End-to-End 3D Keypoint Detector

Hairong Jin · Yuefan Shen · Jianwen Lou · Kun Zhou · Youyi Zheng

3D keypoint detection plays a pivotal role in 3D shape analysis. The majority of prevalent methods depend on producing a shared heatmap. This approach necessitates subsequent post-processing techniques such as clustering or non-maximum suppression (NMS) to pinpoint keypoints within high-confidence regions, resulting in performance inefficiencies. To address this issue, we introduce KeypointDETR, an end-to-end 3D keypoint detection framework. KeypointDETR is predominantly trained with a bipartite matching loss, which compels the network to forecast sets of heatmaps and probabilities for potential keypoints. Each heatmap highlights one keypoint's location, and the associated probability indicates not only the presence of that specific keypoint but also its semantic consistency. Together with the bipartite matching loss, we utilize a transformer-based network architecture, which incorporates both point-wise and query-wise self-attention within the encoder and decoder, respectively. The point-wise encoder leverages self-attention mechanisms on a dynamic graph derived from the local feature space of each point, resulting in the generation of heatmap features. As a key part of our framework, the query-wise decoder not only facilitates inter-query information exchange but also captures the underlying connections among keypoints' heatmaps, positions, and semantic attributes via the cross-attention mechanism, enabling the prediction of heatmaps and probabilities. Extensive experiments conducted on the KeypointNet dataset reveal that KeypointDETR outperforms competitive baselines, demonstrating superior performance in keypoint saliency and correspondence estimation tasks.
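
The bipartite matching loss mentioned above requires a one-to-one assignment between predicted keypoint heatmaps and ground-truth keypoints. The sketch below shows a generic Hungarian matching step of the kind used in set-prediction losses; the actual cost terms and weights in the paper may differ.

```python
# Illustrative sketch: Hungarian matching between predicted heatmaps/probabilities
# and ground-truth keypoint heatmaps, as used conceptually in set-prediction losses.
import numpy as np
from scipy.optimize import linear_sum_assignment

def match_predictions(pred_heatmaps, pred_probs, gt_heatmaps):
    """pred_heatmaps: (Q, N), pred_probs: (Q,), gt_heatmaps: (K, N) with K <= Q."""
    # Cost combines heatmap dissimilarity and the negative keypoint probability.
    heat_cost = ((pred_heatmaps[:, None, :] - gt_heatmaps[None, :, :]) ** 2).mean(-1)
    cost = heat_cost - pred_probs[:, None]            # (Q, K) cost matrix
    rows, cols = linear_sum_assignment(cost)          # one prediction per GT keypoint
    return list(zip(rows.tolist(), cols.tolist()))
```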

Wed 2 Oct. 0:50 - 1:00 PDT

Rethinking Data Augmentation for Robust LiDAR Semantic Segmentation in Adverse Weather

Junsung Park · Kyungmin Kim · Hyunjung Shim

Existing LiDAR semantic segmentation methods commonly face performance declines in adverse weather conditions. Prior research has addressed this issue by simulating adverse weather or employing universal data augmentation during training. However, these methods lack a detailed analysis and understanding of how adverse weather negatively affects LiDAR semantic segmentation performance. Motivated by this issue, we characterize adverse weather in terms of several factors and conduct a toy experiment to identify the main factors causing performance degradation: (1) geometric perturbation due to refraction caused by fog or droplets in the air, and (2) point drop due to energy absorption and occlusions. Based on this analysis, we propose new strategic data augmentation techniques. Specifically, we first introduce Selective Jittering (SJ), which jitters points within a random range of depth (or angle) to mimic geometric perturbation. Additionally, we develop a Learnable Point Drop (LPD), which learns vulnerable erase patterns with a Deep Q-Learning Network to approximate the point drop phenomenon in adverse weather conditions. Without precise weather simulation, these techniques strengthen the LiDAR semantic segmentation model by exposing it to the vulnerable conditions identified by our data-centric analysis. Experimental results confirm the suitability of the proposed data augmentation methods for enhancing robustness against adverse weather conditions. Our method attains a remarkable 39.5 mIoU on the SemanticKITTI-to-SemanticSTF benchmark, surpassing the previous state-of-the-art by over 5.4 percentage points and tripling the improvement over the baseline achieved by previous methods.
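
A minimal sketch of the Selective Jittering idea described above: only the points whose depth falls in a randomly sampled range are perturbed. The range and noise scale below are placeholders, not the paper's settings, and the Learnable Point Drop module is not shown.

```python
# Illustrative sketch: jitter only points within a randomly chosen depth window,
# mimicking fog-like geometric perturbation.
import numpy as np

def selective_jitter(points, sigma=0.05, rng=None):
    """points: (N, 3) LiDAR points in the sensor frame."""
    rng = rng or np.random.default_rng()
    depth = np.linalg.norm(points[:, :3], axis=1)
    lo = rng.uniform(0.0, depth.max())                # random depth window start
    hi = lo + rng.uniform(0.1, 0.3) * depth.max()     # random window width
    mask = (depth >= lo) & (depth < hi)
    jittered = points.copy()
    jittered[mask, :3] += rng.normal(scale=sigma, size=(mask.sum(), 3))
    return jittered
```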

Wed 2 Oct. 1:00 - 1:10 PDT

RAPiD-Seg: Range-Aware Pointwise Distance Distribution Networks for 3D LiDAR Segmentation

Luis Li · Hubert P. H. Shum · Toby P Breckon

3D point clouds play a pivotal role in outdoor scene perception, especially in the context of autonomous driving. Recent advancements in 3D LiDAR segmentation often focus intensely on the spatial positioning and distribution of points for accurate segmentation. However, these methods, while robust in variable conditions, encounter challenges due to sole reliance on coordinates and point intensity, leading to poor isometric invariance and suboptimal segmentation. To tackle this challenge, our work introduces Range-Aware Pointwise Distance Distribution (RAPiD) features and the associated RAPiD-Seg architecture. Our RAPiD features exhibit rigid transformation invariance and effectively adapt to variations in point density, with a design focus on capturing the localized geometry of neighboring structures. They utilize inherent LiDAR isotropic radiation and semantic categorization for enhanced local representation and computational efficiency, while incorporating a 4D distance metric that integrates geometric and surface material reflectivity for improved semantic segmentation. To effectively embed high-dimensional RAPiD features, we propose a double-nested autoencoder structure with a novel class-aware embedding objective to encode high-dimensional features into manageable voxel-wise embeddings. Additionally, we propose RAPiD-Seg which incorporates a channel-wise attention fusion and a two-stage training strategy, further optimizing the embedding for enhanced performance and generalization. Our method outperforms contemporary LiDAR segmentation work in terms of mIoU on SemanticKITTI (76.1) and nuScenes (83.6) datasets (leaderboard rankings: 1st on both datasets).
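
Loosely in the spirit of the RAPiD features described above, the sketch below computes, for each point, a sorted distribution of neighbor distances that folds reflectivity into a 4D metric. The exact feature definition, weighting, and range-aware handling in the paper differ; alpha and k here are assumptions.

```python
# Illustrative sketch: a pointwise neighbor-distance descriptor that combines
# geometric distance with reflectivity difference into a single 4D-style metric.
import numpy as np
from scipy.spatial import cKDTree

def distance_distribution_features(xyz, reflectivity, k=8, alpha=0.5):
    """xyz: (N, 3) coordinates, reflectivity: (N,). Returns (N, k) sorted distances."""
    tree = cKDTree(xyz)
    dists, idx = tree.query(xyz, k=k + 1)             # first neighbor is the point itself
    geo = dists[:, 1:]                                 # drop the zero self-distance
    refl_diff = np.abs(reflectivity[idx[:, 1:]] - reflectivity[:, None])
    return np.sort(np.sqrt(geo**2 + alpha * refl_diff**2), axis=1)
```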

Wed 2 Oct. 1:10 - 1:20 PDT

Equi-GSPR: Equivariant SE(3) Graph Network Model for Sparse Point Cloud Registration

Xueyang Kang · Zhaoliang Luan · Kourosh Khoshelham · Bing Wang

Point cloud registration is a foundational task crucial for 3D alignment and reconstruction applications. While both traditional and learning-based registration approaches have achieved significant success, the intrinsic symmetry within input point cloud data often receives insufficient attention, which prevents the model from leveraging samples efficiently and hampers learning efficiency, leading to an increase in data size and model complexity. To address these challenges, we propose a dedicated graph neural network model with a built-in SE(3) (Special Euclidean group in 3D) equivariance property achieved through SE(3) node features and message-passing equivariance. Pairwise input feature embeddings are derived from sparsely downsampled input point clouds, each with several orders of magnitude fewer points than the raw input point clouds. These embeddings form a rigidity graph capturing spatial relationships, which is subsequently pooled into global features, passed through cross-attention mechanisms, and finally decoded into the regressed pose between the point clouds. Experiments conducted on the 3DMatch and KITTI datasets demonstrate the compelling and distinctive performance of our model compared to state-of-the-art approaches. By harnessing the equivariance properties inherent in the data, our model requires notably fewer input points during training than existing approaches relying on dense input points. Moreover, our model eliminates the need to remove or model outliers in dense input points, which simplifies the point cloud pre-processing pipeline.
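
As a minimal stand-in for the rigidity graph described above, the sketch below randomly downsamples a point cloud and builds a k-NN graph whose edges carry relative offsets. The SE(3)-equivariant node features and message passing are omitted, and the sampling scheme and neighborhood size are assumptions for the example.

```python
# Illustrative sketch: downsample a point cloud and build a k-NN graph with
# per-edge relative offsets, a simple stand-in for a rigidity graph.
import numpy as np
from scipy.spatial import cKDTree

def build_knn_graph(points, num_samples=256, k=8, rng=None):
    """points: (N, 3). Returns sampled nodes, (E, 2) edge pairs, and edge offsets."""
    rng = rng or np.random.default_rng(0)
    sel = rng.choice(len(points), size=min(num_samples, len(points)), replace=False)
    nodes = points[sel]
    _, nbrs = cKDTree(nodes).query(nodes, k=k + 1)     # k neighbors plus self
    src = np.repeat(np.arange(len(nodes)), k)
    dst = nbrs[:, 1:].reshape(-1)                      # drop the self neighbor
    edges = np.stack([src, dst], axis=1)
    rel = nodes[dst] - nodes[src]                      # relative offset per edge
    return nodes, edges, rel
```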