

Oral Session

Oral 2B: Recognition

Auditorium

Moderators: Jordi Pont-Tuset · Sara Beery

Tue 1 Oct 4:30 a.m. PDT — 6:30 a.m. PDT

Tue 1 Oct. 4:30 - 4:40 PDT

Award Candidate
Efficient Bias Mitigation Without Privileged Information

Mateo Espinosa Zarlenga · Sankaranarayanan · Jerone Andrews · Zohreh Shams · Mateja Jamnik · Alice Xiang

Deep neural networks trained via empirical risk minimisation often exhibit significant performance disparities across groups, particularly when group and task labels are spuriously correlated (e.g., "grassy background" and "cows"). Existing bias mitigation methods that aim to address this issue often either rely on group labels for training or validation, or require an extensive hyperparameter search. Such data and computational requirements hinder the practical deployment of these methods, especially when datasets are too large to be group-annotated, computational resources are limited, and models are trained through already complex pipelines. In this paper, we propose Targeted Augmentations for Bias Mitigation (TAB), a simple hyperparameter-free framework that leverages the entire training history of a helper model to identify spurious samples and generate a group-balanced training set from which a robust model can be trained. We show that TAB improves worst-group performance without any group information or model selection, outperforming existing methods while maintaining overall accuracy.
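
The abstract does not spell out how the helper model's training history flags spurious samples; as a rough, hypothetical illustration only, the sketch below upsamples examples that a helper model fits slowly over training, a common proxy for bias-conflicting samples. All names and the mean-loss criterion are assumptions, not the paper's exact procedure.

```python
import numpy as np

def group_balanced_indices(loss_history, keep_ratio=0.5, rng=None):
    """loss_history: (epochs, n_samples) per-sample losses of a helper model.

    Samples the helper fits quickly (low mean loss over training) are treated
    as bias-aligned; slowly-fit ones as bias-conflicting, and the two pools
    are resampled to equal size. Heuristic sketch, not the paper's criterion.
    """
    rng = rng or np.random.default_rng(0)
    mean_loss = loss_history.mean(axis=0)             # per-sample difficulty
    cutoff = np.quantile(mean_loss, 1.0 - keep_ratio)
    conflicting = np.where(mean_loss > cutoff)[0]     # presumed minority pool
    aligned = np.where(mean_loss <= cutoff)[0]        # presumed majority pool
    # Upsample the presumed minority pool to match the majority pool.
    upsampled = rng.choice(conflicting, size=len(aligned), replace=True)
    return np.concatenate([aligned, upsampled])
```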

Tue 1 Oct. 4:40 - 4:50 PDT

Fast Diffusion-Based Counterfactuals for Shortcut Removal and Generation

Nina Weng · Paraskevas Pegios · Eike Petersen · Aasa Feragen · Siavash Bigdeli

Shortcut learning occurs when a model (e.g., a cardiac disease classifier) exploits correlations between the target label and a spurious shortcut feature (e.g., a pacemaker), predicting the target from the shortcut rather than from genuinely discriminative features. This is common in medical imaging, where treatment and clinical annotations correlate with disease labels, making them easy shortcuts for predicting disease. We propose a novel method to detect and quantify the impact of potential shortcut features via diffusion-based counterfactual image generation that can synthetically remove or add shortcuts. Via a novel self-optimized masking scheme, we spatially limit the changes made, with no extra inference step, encouraging the removal of spatially constrained shortcut features while ensuring that the shortcut-free counterfactuals preserve their remaining image features to a high degree. Using these, we assess how shortcut features influence model predictions. This is enabled by our second contribution: an efficient diffusion-based counterfactual generation method with a significant inference speed-up at counterfactual quality comparable to the state of the art. We confirm our performance on two large chest X-ray datasets, a skin lesion dataset, and CelebA.
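
To make the spatial-limiting idea concrete, here is a minimal sketch of the blending step implied by a masked counterfactual: edits are applied only inside a mask, so the rest of the image is preserved exactly. In the paper the mask is self-optimized during generation; supplying it as an argument here is a simplification.

```python
import torch

def masked_counterfactual(x, x_cf, mask):
    """Blend a counterfactual x_cf into the original image x only inside
    `mask` (values in [0, 1]), keeping edits spatially constrained while
    leaving all other pixels untouched.

    x, x_cf: (B, C, H, W) tensors; mask: (B, 1, H, W).
    """
    return mask * x_cf + (1.0 - mask) * x
```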

Tue 1 Oct. 4:50 - 5:00 PDT

MobileNetV4: Universal Models for the Mobile Ecosystem

Danfeng Qin · Chas Leichner · Manolis Delakis · Marco Fornoni · Shixin Luo · Fan Yang · Weijun Wang · Colby Banbury · Chengxi Ye · Berkin Akin · Vaibhav Aggarwal · Tenghui Zhu · Daniele Moro · Andrew Howard

We present the latest generation of MobileNets, known as MobileNetV4 (MNv4), featuring universally efficient architecture designs for mobile devices. At its core, we introduce the Universal Inverted Bottleneck (UIB) search block, a unified and flexible structure that merges the Inverted Bottleneck (IB), ConvNext, Feed Forward Network (FFN), and a novel Extra Depthwise (ExtraDW) variant. Alongside UIB, we present Mobile MQA, an attention block tailored for mobile accelerators, delivering a significant 39% speedup. An optimized neural architecture search (NAS) recipe was also crafted to improve MNv4 search effectiveness. The integration of UIB, Mobile MQA, and the refined NAS recipe results in a new suite of MNv4 models that are mostly Pareto-optimal across mobile CPUs, DSPs, GPUs, and specialized accelerators like the Apple Neural Engine and Google Pixel EdgeTPU, a characteristic not found in any other models tested. Our approach emphasizes simplicity, utilizing standard components and a straightforward attention mechanism to ensure broad hardware compatibility. To further boost efficiency, we finally introduce a novel distillation technique. Enhanced by this technique, our MNv4-Hybrid-Large model delivers an impressive 87% ImageNet-1K accuracy, with a Pixel 8 EdgeTPU runtime of just 3.8 ms.
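
The unifying idea behind UIB is that two *optional* depthwise convolutions around a pointwise expand/project pair span the named variants. Below is a minimal sketch under that reading; stride handling, normalization choices, and activation placement are assumptions simplified from the actual MNv4 blocks.

```python
import torch.nn as nn

class UIB(nn.Module):
    """Sketch of a Universal Inverted Bottleneck block: enabling neither
    depthwise conv gives an FFN, the middle one an Inverted Bottleneck,
    the leading one a ConvNext-like block, and both the ExtraDW variant.
    """
    def __init__(self, dim, expand=4, start_dw=False, mid_dw=True, k=3):
        super().__init__()
        hidden = dim * expand
        layers = []
        if start_dw:  # optional depthwise before expansion (ConvNext/ExtraDW)
            layers += [nn.Conv2d(dim, dim, k, padding=k // 2, groups=dim)]
        layers += [nn.Conv2d(dim, hidden, 1), nn.BatchNorm2d(hidden), nn.ReLU()]
        if mid_dw:    # optional depthwise inside the expansion (IB/ExtraDW)
            layers += [nn.Conv2d(hidden, hidden, k, padding=k // 2, groups=hidden),
                       nn.BatchNorm2d(hidden), nn.ReLU()]
        layers += [nn.Conv2d(hidden, dim, 1), nn.BatchNorm2d(dim)]
        self.block = nn.Sequential(*layers)

    def forward(self, x):
        return x + self.block(x)  # residual connection
```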

Tue 1 Oct. 5:00 - 5:10 PDT

Momentum Auxiliary Network for Supervised Local Learning

Junhao Su · Changpeng Cai · Feiyu Zhu · Chenghao He · Xiaojie Xu · Dongzhi Guan · Chenyang Si

Deep neural networks conventionally employ end-to-end backpropagation for training, which lacks biological plausibility and triggers a locking dilemma during network parameter updates, leading to significant GPU memory use. Supervised local learning instead segments the network into multiple local blocks, each updated by an independent auxiliary network. However, these methods cannot yet replace end-to-end training due to lower accuracy: gradients only propagate within each local block, so there is no information exchange between blocks. To address this issue and establish information transfer across blocks, we propose a Momentum Auxiliary Network (MAN) that establishes a dynamic interaction mechanism. MAN leverages an exponential moving average (EMA) of the parameters from adjacent local blocks to enhance information flow. This auxiliary network, updated through EMA, helps bridge the informational gap between blocks. Nevertheless, we observe that directly applying EMA parameters has certain limitations due to feature discrepancies among local blocks. To overcome this, we introduce learnable biases, further boosting performance. We validate our method on four image classification datasets (CIFAR-10, STL-10, SVHN, ImageNet), attaining superior performance and substantial memory savings. Notably, our method reduces GPU memory usage by more than 45% on ImageNet compared to end-to-end training, while achieving higher performance. The Momentum Auxiliary Network thus offers a new perspective for supervised local learning.
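
The EMA step itself is standard; a minimal sketch follows, assuming the auxiliary network attached to block k mirrors the parameter shapes of the adjacent local block it tracks (the pairing and the handling of the learnable biases are assumptions based on the abstract).

```python
import torch

@torch.no_grad()
def ema_update(aux_net, adjacent_block, momentum=0.999):
    """Illustrative EMA step for a momentum auxiliary network: the auxiliary
    net tracks an exponential moving average of an adjacent local block's
    parameters, so gradient-free information crosses block boundaries.
    The paper's learnable biases would be trained normally, outside this
    update; here both modules are assumed to have matching parameter shapes.
    """
    for p_aux, p_adj in zip(aux_net.parameters(), adjacent_block.parameters()):
        p_aux.mul_(momentum).add_(p_adj, alpha=1.0 - momentum)
```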

Tue 1 Oct. 5:10 - 5:20 PDT

From Fake to Real: Pretraining on Balanced Synthetic Images to Prevent Spurious Correlations in Image Recognition

Maan Qraitem · Kate Saenko · Bryan Plummer

Visual recognition models are prone to learning spurious correlations induced by a biased training set where certain conditions B (e.g., Indoors) are over-represented in certain classes Y (e.g., Big Dogs). Synthetic data from off-the-shelf large-scale generative models offers a promising direction for mitigating this issue by augmenting under-represented subgroups in the real dataset. However, using a mixed distribution of real and synthetic data introduces another source of bias due to distributional differences between synthetic and real data (e.g., synthetic artifacts). As we show, prior methods that use synthetic data to resolve the model's bias toward B do not correct the model's bias toward the pair (B, G), where G denotes whether a sample is real or synthetic. Thus, the model could simply learn signals based on the pair (B, G) (e.g., Synthetic Indoors) to make predictions about Y (e.g., Big Dogs). To address this issue, we propose a simple, easy-to-implement, two-step training pipeline that we call From Fake to Real (FFR). The first step of FFR pre-trains a model on balanced synthetic data to learn robust representations across subgroups. In the second step, FFR fine-tunes the model on real data using ERM or common loss-based bias mitigation methods. By training on real and synthetic data separately, FFR does not expose the model to the statistical differences between real and synthetic data, and thus avoids bias toward the pair (B, G). Our experiments show that FFR improves worst-group accuracy over the state of the art by up to 20% across three datasets (CelebA, UTK-Face, and SpuCO Animals).
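
The key design choice is that real and synthetic samples never share a batch. A minimal sketch of that two-phase schedule is below; `opt_fn` (builds an optimizer) and `erm_step` (one ERM or loss-based bias-mitigation update) are hypothetical helpers, not FFR's API.

```python
def ffr_train(model, synth_balanced_loader, real_loader, opt_fn, erm_step):
    """Two-phase schedule in the spirit of FFR (sketch): phase 1 pretrains on
    group-balanced synthetic data only; phase 2 fine-tunes on real data only,
    so the real/synthetic distinction never appears as a usable signal.
    """
    opt = opt_fn(model)
    for batch in synth_balanced_loader:   # phase 1: balanced synthetic only
        erm_step(model, batch, opt)
    opt = opt_fn(model)                   # fresh optimizer state for phase 2
    for batch in real_loader:             # phase 2: real data only
        erm_step(model, batch, opt)
    return model
```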

Tue 1 Oct. 5:20 - 5:30 PDT

Dataset Enhancement with Instance-Level Augmentations

Orest Kupyn · Christian Rupprecht

We present a method for expanding a dataset by incorporating knowledge from the wide distribution of pre-trained latent diffusion models. Data augmentations typically incorporate inductive biases about the image formation process into training (e.g., translation, scaling, colour changes). Here, we go beyond simple pixel transformations and introduce the concept of instance-level data augmentation by repainting parts of the image at the level of object instances. The method combines a conditional diffusion model with depth and edge-map control conditioning to seamlessly repaint individual objects inside the scene, making it applicable to any segmentation or detection dataset. Used as a data augmentation method, it improves the performance and generalization of state-of-the-art salient object detection, semantic segmentation, and object detection models. By redrawing all privacy-sensitive instances (people, license plates, etc.), the method is also applicable to data anonymization. We also release fully synthetic and anonymized expansions for popular datasets: COCO, Pascal VOC, and DUTS. All datasets and the code will be released.
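
As an illustration of the compositing side of instance-level repainting, the sketch below pastes regenerated pixels back only where each instance mask is set; `repaint_fn` stands in for the paper's depth- and edge-conditioned diffusion inpainter and is purely hypothetical here.

```python
import numpy as np

def repaint_instances(image, instance_masks, repaint_fn):
    """Instance-level augmentation compositing (sketch): each object instance
    is regenerated by `repaint_fn` and pasted back inside its mask, leaving
    the rest of the scene intact.

    image: (H, W, 3) uint8; instance_masks: list of (H, W) bool arrays;
    repaint_fn(image, mask) -> (H, W, 3) repainted image (hypothetical).
    """
    out = image.copy()
    for mask in instance_masks:
        repainted = repaint_fn(out, mask)   # same-size repainted image
        out[mask] = repainted[mask]         # replace only instance pixels
    return out
```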

Tue 1 Oct. 5:30 - 5:40 PDT

Adaptive Parametric Activation

Konstantinos P Alexandridis · Jiankang Deng · Anh Nguyen · Shan Luo

The activation function plays a crucial role in model optimisation, yet the optimal choice remains unclear. For example, Sigmoid is the de facto activation in balanced classification tasks; in imbalanced classification, however, it proves inappropriate due to its bias towards frequent classes. In this work, we delve deeper into this phenomenon by performing a comprehensive statistical analysis of the classification and intermediate layers of both balanced and imbalanced networks, and we empirically show that aligning the activation function with the data distribution enhances performance in both balanced and imbalanced tasks. To this end, we propose the Adaptive Parametric Activation (APA) function, a novel and versatile activation function that unifies most common activation functions under a single formula. APA can be applied in both intermediate layers and attention layers, significantly outperforming the state of the art on several imbalanced benchmarks such as ImageNet-LT, iNaturalist2018, Places-LT, CIFAR100-LT, and LVIS, as well as balanced benchmarks such as ImageNet1K, COCO, and V3DET. Code is provided in the supplementary material.
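
To my reading, the unifying formula is APA(z) = (λ·exp(−κz) + 1)^(−1/λ) with learnable λ and κ: λ = 1 recovers the Sigmoid, and λ → 0 approaches a Gumbel-style activation. The sketch below encodes that form; the per-tensor scalar parameterisation and the clamp are my assumptions, not necessarily the paper's.

```python
import torch
import torch.nn as nn

class APA(nn.Module):
    """Adaptive Parametric Activation (sketch):
    APA(z) = (lam * exp(-kappa * z) + 1) ** (-1 / lam), lam and kappa learnable,
    letting the network adapt the activation to the class distribution.
    """
    def __init__(self, init_lambda=1.0, init_kappa=1.0):
        super().__init__()
        self.lam = nn.Parameter(torch.tensor(float(init_lambda)))
        self.kappa = nn.Parameter(torch.tensor(float(init_kappa)))

    def forward(self, z):
        lam = torch.clamp(self.lam, min=1e-4)   # keep the exponent defined
        return torch.pow(lam * torch.exp(-self.kappa * z) + 1.0, -1.0 / lam)
```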

Tue 1 Oct. 5:40 - 5:50 PDT

Relation DETR: Exploring Explicit Position Relation Prior for Object Detection

Xiuquan Hou · Meiqin Liu · Senlin Zhang · Ping Wei · Badong Chen · Xuguang Lan

This paper presents a general scheme for enhancing the convergence and performance of DETR (DEtection TRansformer). We investigate the slow convergence problem in Transformers from a new perspective, suggesting that it arises from self-attention introducing no structural bias over the inputs. To address this issue, we explore incorporating a position relation prior as an attention bias to augment object detection, after verifying its statistical significance using a proposed quantitative macroscopic correlation (MC) metric. Our approach, termed Relation-DETR, introduces an encoder that constructs position relation embeddings for progressive attention refinement, and further extends the traditional streaming pipeline of DETR into a contrastive relation pipeline to resolve the conflict between non-duplicate predictions and positive supervision. Extensive experiments on both generic and task-specific datasets demonstrate the effectiveness of our approach. Under the same configurations, Relation-DETR achieves a significant improvement (+2.0% AP) on COCO 2017 over DINO, the previous highly optimized DETR detector, and achieves state-of-the-art performance with a ResNet-50 backbone (51.7% AP at 12 epochs and 52.1% AP at 24 epochs). Moreover, Relation-DETR exhibits remarkably fast convergence, exceeding 40% AP with only 2 training epochs on COCO 2017 using the basic ResNet-50 backbone, surpassing existing DETR detectors under the same settings. Furthermore, the proposed relation encoder serves as a universal plug-and-play component, bringing clear improvements to virtually any DETR-like method. The code will be available.
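
A position relation prior used as an attention bias typically means: featurize pairwise box geometry, map it through a small network, and add the result to the attention logits before the softmax. The sketch below uses a common scale-invariant log featurisation; the exact features, MLP, and head count in Relation-DETR may differ.

```python
import torch
import torch.nn as nn

def box_relation_features(boxes, eps=1e-5):
    """Pairwise position relation features between boxes (cx, cy, w, h),
    in a scale-invariant log form. Returns (N, N, 4)."""
    cx, cy, w, h = boxes.unbind(-1)
    dx = torch.log(torch.abs(cx[:, None] - cx[None, :]) / (w[:, None] + eps) + 1.0)
    dy = torch.log(torch.abs(cy[:, None] - cy[None, :]) / (h[:, None] + eps) + 1.0)
    dw = torch.log(w[None, :] / (w[:, None] + eps))
    dh = torch.log(h[None, :] / (h[:, None] + eps))
    return torch.stack([dx, dy, dw, dh], dim=-1)

# A tiny MLP turns relation features into per-head biases that are added to
# the attention logits before softmax -- the "structural bias" the abstract
# refers to. The head count and MLP width are assumptions.
heads = 8
relation_mlp = nn.Sequential(nn.Linear(4, 16), nn.ReLU(), nn.Linear(16, heads))
boxes = torch.rand(5, 4)                           # 5 dummy query boxes
bias = relation_mlp(box_relation_features(boxes))  # (5, 5, heads)
attn_logits = torch.randn(heads, 5, 5) + bias.permute(2, 0, 1)
```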

Tue 1 Oct. 5:50 - 6:00 PDT

Projecting Points to Axes: Oriented Object Detection via Point-Axis Representation

Zeyang Zhao · Qilong Xue · Yifan Bai · Yuhang He · Xing Wei · Yihong Gong

This paper introduces the Point-Axis representation for oriented objects in aerial images, emphasizing its flexibility and geometrically intuitive nature through two key components: points and axes. 1) Points delineate the spatial extent and contours of objects, providing detailed shape descriptions. 2) Axes define the primary directionalities of objects, providing essential orientation cues for precise detection. The point-axis representation decouples location from rotation, addressing the loss-discontinuity issues commonly encountered in traditional bounding-box-based approaches. For effective optimization without additional annotations, we propose the max-projection loss to supervise point-set learning and the cross-axis loss for robust axis-representation learning. Leveraging this representation, we further present the Oriented DETR model, which seamlessly integrates the DETR framework for precise point-axis prediction and end-to-end detection. Experimental results on mainstream datasets demonstrate the effectiveness of our method, showing significant performance improvements in oriented aerial object detection tasks. The code will be released to the community.
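
The geometric primitive behind a max-projection-style objective is the signed projection of a point set onto an axis direction: the maxima on each side give the object's extent along that axis. The sketch below computes exactly that quantity; it is a geometric illustration, not the paper's loss function.

```python
import torch

def axis_projection_extent(points, axis_dir):
    """Project a point set onto a unit axis direction and return the max
    projection on each side -- the extent a max-projection-style loss
    would supervise so predicted points cover the object along the axis.

    points: (N, 2); axis_dir: (2,) unit vector.
    """
    proj = points @ axis_dir          # signed scalar projections
    return proj.max(), (-proj).max()  # extent along +axis and -axis
```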

Tue 1 Oct. 6:00 - 6:10 PDT

CLIFF: Continual Latent Diffusion for Open-Vocabulary Object Detection

Wuyang Li · Xinyu Liu · Jiayi Ma · Yixuan Yuan

Open-vocabulary object detection (OVD) utilizes image-level cues to expand the linguistic space of region proposals, thereby facilitating the detection of diverse novel classes. Recent works adapt CLIP embeddings by combinatorially minimizing the object-image and object-text discrepancies in a discriminative paradigm. However, they ignore the underlying distribution and the disagreement between the image and text objectives, leading to misaligned distributions between the vision and language sub-spaces. To address this deficiency, we explore an advanced generative paradigm with distribution perception and propose a novel diffusion-based framework, coined Continual Latent Diffusion (CLIFF), which formulates a probabilistic, continual distribution transfer among the object, image, and text latent spaces. CLIFF consists of a Variational Latent Sampler (VLS) enabling the probabilistic modeling and a Continual Diffusion Module (CDM) for the distribution transfer. Specifically, in VLS, we first establish a probabilistic object space over region proposals by estimating distribution parameters. Then, object-centric noise is sampled from the estimated distribution to generate text embeddings for OVD. To achieve this generation process, CDM conducts a short-distance object-to-image diffusion from the sampled noise to generate an image embedding as a medium, which guides the long-distance diffusion that generates the text embedding. Extensive experiments verify that CLIFF significantly surpasses state-of-the-art methods on benchmarks. The code will be released to bring insights to the community.
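
The VLS step, as described, amounts to predicting distribution parameters per region proposal and drawing object-centric noise from them; a minimal reparameterisation-style sketch is below. The single-linear head and layer sizes are assumptions, and the downstream object-to-image-to-text diffusion chain is omitted.

```python
import torch
import torch.nn as nn

class VariationalLatentSampler(nn.Module):
    """Sketch of a probabilistic object space: a head predicts a mean and
    log-variance per region proposal, and object-centric noise is drawn via
    the reparameterisation trick. In CLIFF this sample would seed the
    object -> image -> text diffusion chain (not shown here).
    """
    def __init__(self, feat_dim, latent_dim):
        super().__init__()
        self.head = nn.Linear(feat_dim, 2 * latent_dim)

    def forward(self, region_feats):                # (N, feat_dim)
        mu, logvar = self.head(region_feats).chunk(2, dim=-1)
        std = torch.exp(0.5 * logvar)
        return mu + std * torch.randn_like(std)     # reparameterised sample
```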

Tue 1 Oct. 6:10 - 6:20 PDT

On Calibration of Object Detectors: Pitfalls, Evaluation and Baselines

Selim Kuzucu · Kemal Oksuz · Jonathan Sadeghi · Puneet Dokania

Building calibrated object detectors is a crucial challenge for their reliable use in safety-critical applications. Recent approaches involve (1) designing new loss functions to obtain calibrated detectors by training them from scratch, and (2) post-hoc Temperature Scaling (TS), which learns to scale the likelihoods of a trained detector to output calibrated predictions. These approaches are then evaluated based on a combination of Detection Expected Calibration Error (D-ECE) and Average Precision. In this work, via extensive analysis and insights, we highlight that these recent evaluation frameworks and metrics, as well as the use of TS, have significant drawbacks that lead to incorrect conclusions. As a remedy, we propose a principled evaluation framework to jointly measure the calibration and accuracy of object detectors. We also tailor efficient and easy-to-use post-hoc calibration approaches, Platt Scaling and Isotonic Regression, specifically to object detection. Contrary to the common notion, our experiments show that, once designed and evaluated properly, post-hoc calibrators, which are extremely cheap to build and use, are much more powerful and effective than recent train-time calibration methods. To illustrate, D-DETR with our post-hoc Isotonic Regression calibrator outperforms the state-of-the-art Cal-DETR by more than 7 D-ECE on the COCO dataset. We also provide improved versions of Localization-aware ECE and show the efficacy of our method on these metrics as well. Code will be made public.
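
For reference, generic isotonic-regression calibration is a few lines with scikit-learn: fit a monotone map from confidence scores to empirical correctness on a held-out set. How correctness labels are assigned to detections (e.g., via IoU matching with ground truth) is the detection-specific part the paper tailors; the toy 0/1 labels below are placeholders.

```python
import numpy as np
from sklearn.isotonic import IsotonicRegression

# Toy held-out data: raw confidences and whether each detection was a
# true positive (here hard-coded for illustration only).
scores = np.array([0.95, 0.80, 0.75, 0.60, 0.40, 0.30])
correct = np.array([1, 1, 0, 1, 0, 0])

# Fit a monotone score -> probability-of-correctness map, then apply it
# to new detector confidences to obtain calibrated probabilities.
calibrator = IsotonicRegression(out_of_bounds="clip")
calibrator.fit(scores, correct)
calibrated = calibrator.predict(np.array([0.85, 0.5]))
```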