

Poster

Region-Native Visual Tokenization

Mengyu Wang · Yuyao Huang · Henghui Ding · Xinlong Wang · Tiejun Huang · Yao Zhao · Yunchao Wei · Shuicheng Yan

Strong Double Blind: This paper was not made available on public preprint services during the review process.
Fri 4 Oct 1:30 a.m. PDT — 3:30 a.m. PDT

Abstract:

We explore an innovative region-based visual token representation and present the REgion-native AutoencoDER ("Reader"). In contrast to most previous methods, which represent each image as a grid-shaped token map, "Reader" encodes each image into a sequence of region-based tokens, with each token corresponding to an object or a part of an object in the image. Specifically, "Reader" comprises both an encoder and a decoder. The encoder partitions each image into an adaptive number of arbitrarily shaped regions and encodes each region into a token. The decoder then uses this adaptive-length token sequence to reconstruct the original image. Experimental results demonstrate that this region-based token representation has two notable characteristics. First, it achieves highly efficient image encoding: "Reader" adaptively uses more regions to represent complex areas and fewer regions for simpler ones, avoiding information redundancy. Consequently, it achieves superior reconstruction fidelity compared to previous methods despite using significantly fewer tokens per image. Second, the region-based design enables manipulation of a local region without causing global changes. As a result, "Reader" inherently supports diverse image editing operations, including erasing, adding, replacing, and reshaping objects, and achieves excellent performance on the smile-transfer image editing benchmark. Code will be provided for reproducibility.
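To make the encoder/decoder interface concrete, below is a minimal PyTorch sketch of what a region-native autoencoder round trip could look like. It is not the authors' implementation: all class and parameter names (ReaderEncoder, ReaderDecoder, max_regions, etc.) are hypothetical, and it assumes soft region masks with mask-weighted pooling and a fixed upper bound on the number of regions, whereas the actual "Reader" produces an adaptive-length token sequence.

```python
import torch
import torch.nn as nn

class ReaderEncoder(nn.Module):
    """Hypothetical sketch: predicts soft region masks over patch features
    and pools the features inside each region into one token per region."""
    def __init__(self, dim=256, max_regions=32):
        super().__init__()
        self.backbone = nn.Conv2d(3, dim, kernel_size=16, stride=16)  # patch features
        self.mask_head = nn.Conv2d(dim, max_regions, kernel_size=1)   # soft region assignment

    def forward(self, image):
        feats = self.backbone(image)                              # (B, C, H, W)
        masks = self.mask_head(feats).flatten(2).softmax(dim=1)   # (B, R, HW): each location split over regions
        flat = feats.flatten(2)                                   # (B, C, HW)
        # Mask-weighted average pooling: one token per region.
        tokens = torch.einsum('brn,bcn->brc', masks, flat) / (masks.sum(-1, keepdim=True) + 1e-6)
        return tokens, masks  # near-empty regions could be pruned for an adaptive-length sequence

class ReaderDecoder(nn.Module):
    """Hypothetical sketch: scatters region tokens back through their masks
    and maps the resulting feature map to pixels."""
    def __init__(self, dim=256):
        super().__init__()
        self.to_pixels = nn.ConvTranspose2d(dim, 3, kernel_size=16, stride=16)

    def forward(self, tokens, masks, spatial=(14, 14)):
        feats = torch.einsum('brc,brn->bcn', tokens, masks)   # broadcast tokens to their locations
        feats = feats.view(tokens.size(0), -1, *spatial)      # (B, C, H, W)
        return self.to_pixels(feats)

# Usage: round-trip a batch of 224x224 images.
enc, dec = ReaderEncoder(), ReaderDecoder()
x = torch.randn(2, 3, 224, 224)
tokens, masks = enc(x)      # tokens: (2, 32, 256), one per soft region
recon = dec(tokens, masks)  # (2, 3, 224, 224)
```

Because each token is tied to a spatial mask rather than a fixed grid cell, editing one token only affects the pixels its mask covers, which is the property the abstract exploits for local edits such as erasing or replacing a single object.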
