

Oral Session

Oral 3A: Datasets And Benchmarking

Gold Room

Moderators: Juan Carlos Niebles · Jose M Alvarez

Wed 2 Oct midnight PDT — 1:30 a.m. PDT

Wed 2 Oct. 0:00 - 0:10 PDT

PetFace: A Large-Scale Dataset and Benchmark for Animal Identification

Risa Shinoda · Kaede Shiohara

Automated animal face identification plays a crucial role in monitoring behavior, conducting surveys, and finding lost animals. Despite the advances in human face identification, the lack of datasets and benchmarks in the animal domain has impeded progress. In this paper, we introduce the PetFace dataset, a comprehensive resource for animal face identification encompassing 257,484 unique individuals across 13 animal families and 319 breed categories, including both experimental and pet animals. This large-scale collection of individuals facilitates the investigation of unseen animal face verification, an area that has not been sufficiently explored in existing datasets due to their limited number of individuals. PetFace also provides fine-grained annotations such as sex, breed, color, and pattern. We provide multiple benchmarks, including re-identification for seen individuals and verification for unseen individuals. Models trained on our dataset outperform those trained on prior datasets, even for detailed breed variations and unseen animal families. Our results also indicate that there is room to improve the performance of integrated identification across multiple animal families. We hope the PetFace dataset will facilitate animal face identification and encourage the development of non-invasive automatic animal identification methods.
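
As a concrete illustration of the verification benchmark described above, the sketch below scores a pair of face crops by cosine similarity of L2-normalized embeddings; the ResNet-50 backbone, preprocessing, and threshold are placeholder assumptions, not the PetFace baselines.

```python
# Minimal sketch of unseen-individual verification: embed two face crops and
# compare them with a cosine-similarity threshold. Backbone and threshold are
# assumptions for illustration only.
import torch
import torch.nn.functional as F
import torchvision.models as models
from torchvision import transforms
from PIL import Image

# Hypothetical embedding network: an ImageNet ResNet-50 with the classifier removed.
backbone = models.resnet50(weights=models.ResNet50_Weights.DEFAULT)
backbone.fc = torch.nn.Identity()
backbone.eval()

preprocess = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])

@torch.no_grad()
def embed(path: str) -> torch.Tensor:
    x = preprocess(Image.open(path).convert("RGB")).unsqueeze(0)
    return F.normalize(backbone(x), dim=-1)  # L2-normalized embedding

def same_individual(path_a: str, path_b: str, threshold: float = 0.7) -> bool:
    # Verification: declare a match when embedding similarity exceeds the threshold.
    return F.cosine_similarity(embed(path_a), embed(path_b)).item() > threshold
```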

Wed 2 Oct. 0:10 - 0:20 PDT

UniIR: Training and Benchmarking Universal Multimodal Information Retrievers

Cong Wei · Yang Chen · Haonan Chen · Hexiang Hu · Ge Zhang · Jie Fu · Alan Ritter · Wenhu Chen

Existing information retrieval (IR) models often assume a homogeneous format, limiting their applicability to diverse user needs, such as searching for images with text descriptions, searching for a news article with a headline image, or finding a similar photo with a query image. To accommodate these diverse information-seeking demands, we introduce UniIR, a unified instruction-guided multimodal retriever capable of handling eight distinct retrieval tasks across modalities. UniIR, a single retrieval system jointly trained on ten diverse multimodal IR datasets, interprets user instructions to execute various retrieval tasks, demonstrating robust performance across existing datasets and zero-shot generalization to new tasks. Our experiments highlight that multi-task training and instruction tuning are key to UniIR's generalization ability. Additionally, we construct M-BEIR, a comprehensive multimodal retrieval benchmark, to standardize the evaluation of universal multimodal information retrieval.
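
The sketch below illustrates the general idea of instruction-guided retrieval with a CLIP-style dual encoder: the task instruction is prepended to the query before encoding, and candidates are ranked by embedding similarity. The off-the-shelf CLIP checkpoint and the simple instruction-prepending scheme are illustrative assumptions, not the UniIR architecture.

```python
# Instruction-guided retrieval sketch with an off-the-shelf CLIP dual encoder.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

@torch.no_grad()
def score_candidates(instruction: str, query_text: str,
                     candidate_images: list[Image.Image]) -> torch.Tensor:
    # Prepend the task instruction to the query, as an instruction-tuned retriever would.
    text_inputs = processor(text=[f"{instruction} {query_text}"],
                            return_tensors="pt", padding=True)
    image_inputs = processor(images=candidate_images, return_tensors="pt")
    q = model.get_text_features(**text_inputs)
    c = model.get_image_features(**image_inputs)
    q = q / q.norm(dim=-1, keepdim=True)
    c = c / c.norm(dim=-1, keepdim=True)
    return (q @ c.T).squeeze(0)  # one similarity score per candidate image

# Example: score_candidates("Retrieve a news photo matching this headline.",
#                           "Storm floods coastal town", images)
```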

Wed 2 Oct. 0:20 - 0:30 PDT

Towards Model-Agnostic Dataset Condensation by Heterogeneous Models

Jun-Yeong Moon · Jung Uk Kim · Gyeong-Moon Park

The advancement of deep learning has coincided with the proliferation of both models and available data. The surge in dataset sizes and the resulting growth in computational requirements have led to the development of Dataset Condensation (DC). While prior studies have explored generating synthetic images through methods such as distribution alignment and training trajectory tracking for more efficient model training, a significant challenge arises when employing these condensed images in practice. Notably, these condensed images tend to be specific to particular models, constraining their versatility and practicality. In response to this limitation, we introduce a novel method, Heterogeneous Model Dataset Condensation (HMDC), designed to produce universally applicable condensed images through cross-model interactions. To address the issues of gradient magnitude differences and semantic distance between heterogeneous models, we propose the Gradient Balance Module (GBM) and Mutual Distillation (MD) with a Spatial-Semantic Decomposition method. By balancing the contribution of each model and keeping their semantic meanings close, our approach overcomes the limitations of model-specific condensed images and enhances their broader utility.
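
As a rough illustration of the gradient-balancing idea, the sketch below rescales each model's gradient on the synthetic images to a common norm before combining them, so that no single architecture dominates the update. This is a simplified stand-in under stated assumptions, not the authors' GBM or Mutual Distillation implementation.

```python
# Balance gradient magnitudes from heterogeneous models when updating shared
# synthetic images (illustrative only; not the HMDC code).
import torch

def balanced_update(synthetic, models, loss_fn, targets, lr=0.1):
    # `synthetic` is a leaf tensor of synthetic images with requires_grad=True.
    grads = []
    for model in models:
        loss = loss_fn(model(synthetic), targets)
        g = torch.autograd.grad(loss, synthetic)[0]
        grads.append(g)
    # Rescale each model's gradient to a common norm so no architecture dominates.
    norms = [g.norm() + 1e-12 for g in grads]
    ref = torch.stack([n.detach() for n in norms]).mean()
    combined = sum((ref / n) * g for g, n in zip(grads, norms))
    with torch.no_grad():
        synthetic -= lr * combined
    return synthetic
```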

Wed 2 Oct. 0:30 - 0:40 PDT

Parrot Captions Teach CLIP to Spot Text

Yiqi Lin · Conghui He · Alex Jinpeng Wang · Bin Wang · Li Weijia · Mike Zheng Shou

Although CLIP serves as the foundation model in numerous vision-language applications, it suffers from a severe text-spotting bias. This bias causes CLIP models to 'parrot' the visual text embedded within images while disregarding the authentic visual semantics. We uncover that in LAION-2B, the most popular image-text dataset, the captions also densely parrot (spell out) the text embedded in images. Our analysis shows that around 50% of images contain embedded visual text and around 30% of caption words are concurrently embedded in the visual content. Based on this observation, we thoroughly inspect the different released versions of CLIP models and verify that visual text is a dominant factor in measuring LAION-style image-text similarity for these models. To examine whether these parrot captions shape the text-spotting bias, we train a series of CLIP models on LAION subsets curated by different parrot-caption-oriented criteria. We show that training with parrot captions easily induces such bias but harms the expected vision-language representation learning in CLIP models across various vision-language downstream tasks. This suggests that it is urgent to revisit either the design of CLIP-like models or the existing image-text dataset curation pipeline built on CLIP-score filtering.
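
A rough sketch of the caption/visual-text overlap measurement described above: the fraction of caption words that an OCR engine also finds inside the image. The choice of pytesseract and whitespace tokenization are assumptions for illustration, not the paper's analysis pipeline.

```python
# Estimate how strongly a caption "parrots" text embedded in the image.
import pytesseract
from PIL import Image

def parrot_ratio(image_path: str, caption: str) -> float:
    # Words recognized by OCR inside the image.
    ocr_words = set(pytesseract.image_to_string(Image.open(image_path)).lower().split())
    caption_words = [w for w in caption.lower().split() if w.isalpha()]
    if not caption_words:
        return 0.0
    # Fraction of caption words that also appear as visual text in the image.
    return sum(w in ocr_words for w in caption_words) / len(caption_words)
```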

Wed 2 Oct. 0:40 - 0:50 PDT

Towards Open-ended Visual Quality Comparison

Haoning Wu · Hanwei Zhu · Zicheng Zhang · Erli Zhang · Chaofeng Chen · Liang Liao · Chunyi Li · Annan Wang · Wenxiu Sun · Qiong Yan · Xiaohong Liu · Guangtao Zhai · Shiqi Wang · Weisi Lin

Comparative settings (e.g., pairwise choice, listwise ranking) have been adopted by a wide range of subjective studies for image quality assessment (IQA), as they inherently standardize evaluation criteria across observers and offer more clear-cut responses. In this work, we extend emerging large multi-modality models (LMMs) to further advance visual quality comparison into open-ended settings that 1) can respond to open-range questions on quality comparison and 2) can provide detailed reasoning beyond direct answers. To this end, we propose Co-Instruct. To train this first-of-its-kind open-source, open-ended visual quality comparer, we collect the Co-Instruct-562K dataset from two sources: (a) LLM-merged single-image quality descriptions and (b) GPT-4V “teacher” responses on unlabeled data. Furthermore, to better evaluate this setting, we propose MICBench, the first benchmark on multi-image comparison for LMMs. We demonstrate that Co-Instruct not only achieves on average 30% higher accuracy than state-of-the-art open-source LMMs but also outperforms GPT-4V (its teacher) on both existing quality-related benchmarks and the proposed MICBench. We will publish our datasets, training scripts, and model weights upon acceptance.
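
To make the comparative setting concrete, the snippet below turns pairwise quality choices into a listwise ranking by simple win counts; it illustrates the evaluation setting only, and the `prefer` callback (an LMM or human judge) is a hypothetical placeholder.

```python
# Convert pairwise quality choices into a listwise ranking (illustrative only).
from collections import Counter
from itertools import combinations

def rank_by_pairwise_choices(images, prefer):
    # `images` are hashable identifiers (e.g. file paths); `prefer(a, b)` returns
    # whichever image the judge considers higher quality.
    wins = Counter()
    for a, b in combinations(images, 2):
        wins[prefer(a, b)] += 1
    # Rank images by how many pairwise comparisons they won.
    return sorted(images, key=lambda img: wins[img], reverse=True)
```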

Wed 2 Oct. 0:50 - 1:00 PDT

VETRA: A Dataset for Vehicle Tracking in Aerial Imagery - New Challenges for Multi-Object Tracking

Jens Hellekes · Manuel Mühlhaus · Reza Bahmanyar · Seyed Majid Azimi · Franz Kurz

The informative power of traffic analysis can be enhanced by considering changes in both time and space. Vehicle tracking algorithms applied to drone videos provide a better overview than street-level surveillance cameras. However, existing aerial MOT datasets only cover stationary settings, leaving performance in moving-camera scenarios, which cover a considerably larger area, unknown. To fill this gap, we present VETRA, a dataset for vehicle tracking in aerial imagery that introduces heterogeneity in terms of camera movement, frame rate, and the type, size, and number of objects. When confronted with these challenges, state-of-the-art online MOT algorithms exhibit a significant drop in performance compared to their results on other benchmark datasets. Despite the performance gains achieved by our baseline method through the integration of camera motion compensation, there remains potential for improvement, particularly in situations where vehicles have similar visual appearance, undergo prolonged occlusions, or follow complex urban driving patterns. Making VETRA available to the community adds a missing building block for both testing and developing vehicle tracking algorithms for versatile real-world applications.
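
A minimal sketch of camera motion compensation in the spirit of the baseline mentioned above: estimate a frame-to-frame homography from tracked background features and warp the previous frame's boxes before association. The feature-tracking parameters and box-corner warping are illustrative assumptions, not the VETRA baseline's exact implementation.

```python
# Camera motion compensation for moving-camera MOT (illustrative sketch).
import cv2
import numpy as np

def compensate_camera_motion(prev_gray, curr_gray, boxes):
    # boxes: (N, 4) array of [x1, y1, x2, y2] in the previous frame.
    pts = cv2.goodFeaturesToTrack(prev_gray, maxCorners=500,
                                  qualityLevel=0.01, minDistance=7)
    nxt, status, _ = cv2.calcOpticalFlowPyrLK(prev_gray, curr_gray, pts, None)
    good_prev = pts[status.flatten() == 1]
    good_next = nxt[status.flatten() == 1]
    # Robustly estimate the global camera motion between the two frames.
    H, _ = cv2.findHomography(good_prev, good_next, cv2.RANSAC, 3.0)
    corners = boxes.reshape(-1, 2).astype(np.float32).reshape(-1, 1, 2)
    warped = cv2.perspectiveTransform(corners, H).reshape(-1, 4)
    return warped  # motion-compensated boxes for association in the current frame
```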

Wed 2 Oct. 1:00 - 1:10 PDT

Insect Identification in the Wild: The AMI Dataset

Aditya Jain · Fagner Cunha · Michael J Bunsen · Juan Sebastián Cañas · Léonard Pasi · Nathan Pinoy · Flemming Helsing · JoAnne Russo · Marc S Botham · Michael Sabourin · Jonathan Fréchette · Alexandre Anctil · Yacksecari Lopez · Eduardo Navarro · Filonila Pérez · Ana C Zamora · Jose Alejandro Ramirez-Silva · Jonathan Gagnon · Tom A August · Kim Bjerge · Alba Gomez Segura · Marc Belisle · Yves Basset · Kent P McFarland · David B Roy · Toke T Høye · Maxim Larrivee · David Rolnick

Insects represent half of all global biodiversity, yet many of the world's insects are disappearing, with severe implications for ecosystems and agriculture. Despite this crisis, data on insect diversity and abundance remain woefully inadequate, due to the scarcity of human experts and the lack of scalable tools for monitoring. Ecologists have started to adopt camera traps to record and study insects, and have proposed computer vision algorithms as a solution for scalable data processing. However, insect monitoring in the wild poses unique challenges that have not yet been addressed within computer vision, including the combination of long-tailed data, extremely similar classes, and significant distribution shifts. We provide the first large-scale machine learning benchmarks for fine-grained insect recognition, designed to match real-world tasks faced by ecologists. Our contributions include a curated dataset of images from citizen science platforms and museums, and an expert-annotated dataset drawn from automated camera traps across multiple continents, designed to test out-of-distribution generalization under field conditions. We train and evaluate a variety of baseline algorithms and introduce a combination of data augmentation techniques that enhance generalization across geographies and hardware setups. Code and datasets will be made publicly available.
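
The sketch below shows the kind of augmentation pipeline that targets the shifts mentioned above (lighting, sensors, and focus across camera-trap hardware); the specific transforms and magnitudes are assumptions, not the paper's recipe.

```python
# Illustrative augmentation pipeline for cross-geography / cross-hardware generalization.
from torchvision import transforms

train_transform = transforms.Compose([
    transforms.RandomResizedCrop(224, scale=(0.6, 1.0)),
    transforms.RandomHorizontalFlip(),
    # Simulate lighting and sensor differences between camera-trap deployments.
    transforms.ColorJitter(brightness=0.4, contrast=0.4, saturation=0.4, hue=0.1),
    # Simulate focus and motion blur differences across hardware setups.
    transforms.GaussianBlur(kernel_size=5, sigma=(0.1, 2.0)),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])
```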

Wed 2 Oct. 1:10 - 1:20 PDT

MarineInst: A Foundation Model for Marine Image Analysis with Instance Visual Description

Ziqiang Zheng · Yiwei Chen · Huimin Zeng · Tuan-Anh Vu · Binh-Son Hua · Sai Kit Yeung

Recent foundation models trained on a tremendous scale of data have shown great promise in a wide range of computer vision tasks and application domains. However, less attention has been paid to the marine realms, which in contrast cover the majority of our blue planet. The scarcity of labeled data is the most pressing issue, and marine photographs exhibit significantly different appearance and content from general in-air images. Using existing foundation models for marine visual analysis does not yield satisfactory performance, due not only to the data distribution shift but also to the intrinsic limitations of existing foundation models (e.g., lacking semantics, redundant mask generation, or being restricted to image-level scene understanding). In this work, we emphasize both model and data approaches for understanding marine ecosystems. We introduce MarineInst, a foundation model for the analysis of the marine realms with instance visual description, which outputs instance masks and captions for marine object instances. To train MarineInst, we acquire MarineInst20M, the largest marine image dataset to date, which contains a wide spectrum of marine images with high-quality semantic instance masks constructed from a mixture of human-annotated instance masks and model-generated instance masks produced by our automatic binary instance filtering procedure. To generate informative and detailed semantic instance captions, we use vision-language models to produce semantically rich captions at various granularities. Our model and dataset support a wide range of marine visual analysis tasks, from image-level scene understanding to regional mask-level instance understanding. More significantly, MarineInst exhibits strong generalization ability and the flexibility to support a wide range of downstream tasks with state-of-the-art performance.
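
To illustrate the instance-visual-description output format (a mask plus a caption per marine instance), the sketch below pairs an automatic mask generator with an off-the-shelf captioner on each masked crop; SAM and BLIP are stand-ins and the checkpoint path is a placeholder, so this is not MarineInst's own pipeline.

```python
# Generate class-agnostic instance masks, then caption each instance crop.
import numpy as np
from PIL import Image
from segment_anything import SamAutomaticMaskGenerator, sam_model_registry
from transformers import BlipProcessor, BlipForConditionalGeneration

sam = sam_model_registry["vit_b"](checkpoint="sam_vit_b.pth")  # placeholder checkpoint path
mask_generator = SamAutomaticMaskGenerator(sam)
processor = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-base")
captioner = BlipForConditionalGeneration.from_pretrained("Salesforce/blip-image-captioning-base")

def describe_instances(image: np.ndarray, max_instances: int = 10):
    # image: HxWx3 uint8 RGB array.
    results = []
    for m in mask_generator.generate(image)[:max_instances]:
        x, y, w, h = map(int, m["bbox"])            # crop around the predicted instance
        crop = Image.fromarray(image[y:y + h, x:x + w])
        inputs = processor(images=crop, return_tensors="pt")
        caption_ids = captioner.generate(**inputs, max_new_tokens=30)
        caption = processor.decode(caption_ids[0], skip_special_tokens=True)
        results.append((m["segmentation"], caption))
    return results  # list of (binary instance mask, instance caption)
```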