Self-supervised representation learning on point cloud sequences is challenging due to their complex spatio-temporal structure. Most recent attempts train point cloud sequence representation models by reconstructing point coordinates or applying frame-level contrastive learning. However, these methods do not effectively exploit temporal information or global semantics, both of which are essential components of point cloud sequences. To this end, in this paper, we propose a novel masked motion prediction and semantic contrast (M2PSC) based self-supervised representation learning framework for point cloud sequences. Specifically, it learns a representation model by integrating three pretext tasks into a single masked autoencoder framework: first, motion trajectory prediction, which enhances the model's ability to understand dynamic information in point cloud sequences; second, semantic contrast, which guides the model to better explore their global semantics; and third, appearance reconstruction, which helps capture their appearance information. In this way, our method forces the model to simultaneously encode the spatial and temporal structure of point cloud sequences. Experimental results on four benchmark datasets demonstrate the effectiveness of our method.
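The abstract does not give the concrete training objective, but a framework with these three pretext tasks is typically trained with a weighted sum of a motion-prediction loss, a contrastive loss, and a reconstruction loss. The sketch below is a hypothetical illustration of that combination (the loss choices — MSE for trajectories, InfoNCE for semantic contrast, Chamfer distance for appearance — and the weights are our assumptions, not details from the paper):

```python
import torch
import torch.nn.functional as F

def info_nce(z1, z2, tau=0.07):
    """Contrastive (InfoNCE) loss between two batches of sequence-level
    embeddings; matching rows are positives, all others negatives."""
    z1, z2 = F.normalize(z1, dim=-1), F.normalize(z2, dim=-1)
    logits = z1 @ z2.t() / tau                      # (B, B) similarity matrix
    targets = torch.arange(z1.size(0))              # positives on the diagonal
    return F.cross_entropy(logits, targets)

def chamfer(pred, gt):
    """Symmetric Chamfer distance between predicted and ground-truth
    point sets of shape (B, N, 3), a common reconstruction loss."""
    d = torch.cdist(pred, gt)                       # (B, N, N) pairwise dists
    return d.min(dim=2).values.mean() + d.min(dim=1).values.mean()

def total_loss(pred_traj, gt_traj, z1, z2, pred_pts, gt_pts,
               weights=(1.0, 1.0, 1.0)):
    """Hypothetical combined objective over the three pretext tasks."""
    l_motion = F.mse_loss(pred_traj, gt_traj)       # masked motion prediction
    l_semantic = info_nce(z1, z2)                   # semantic contrast
    l_appearance = chamfer(pred_pts, gt_pts)        # appearance reconstruction
    w_m, w_s, w_a = weights
    return w_m * l_motion + w_s * l_semantic + w_a * l_appearance

# Usage with random stand-ins for the encoder/decoder outputs:
loss = total_loss(torch.randn(2, 8, 3), torch.randn(2, 8, 3),   # trajectories
                  torch.randn(4, 16), torch.randn(4, 16),       # embeddings
                  torch.randn(2, 32, 3), torch.randn(2, 32, 3)) # point sets
```

Each term is non-negative, so the model cannot trade one task off below zero against another; the relative weights would in practice be tuned on a validation set.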