In computer vision, pedestrian detection is an important task with applications in many scenarios. However, obtaining extensive labels can be challenging, especially in crowded scenes. Current solutions to this problem often rely on massive unlabeled data and extensive training to achieve promising results. In this paper, we propose Crowd-SAM, a novel few-shot object detection and segmentation framework that utilizes SAM for pedestrian detection. Crowd-SAM combines the advantages of DINO and SAM: it first localizes the foreground and then generates dense prompts conditioned on the resulting heatmap. Specifically, we design an efficient prompt sampler to handle the dense point prompts and a Part-Whole Discrimination Network that re-scores the decoded masks and selects the proper ones. Our experiments are conducted on the CrowdHuman and Cityscapes benchmarks. On CrowdHuman, Crowd-SAM achieves 78.4 AP, which is comparable to fully supervised object detectors and state of the art among few-shot object detectors, in the 10-shot setting and with fewer than 1M learnable parameters.
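
To make the idea of heatmap-conditioned dense prompts concrete, the following is a minimal sketch, not the sampler used in Crowd-SAM. It assumes a foreground heatmap with values in [0, 1] is already available as a NumPy array, and it simply keeps the points of a regular grid that fall on foreground, subsampling them if there are too many. The function name `sample_point_prompts` and the parameters `grid_stride`, `threshold`, and `max_points` are hypothetical and chosen only for illustration.

```python
import numpy as np


def sample_point_prompts(heatmap, grid_stride=4, threshold=0.5,
                         max_points=64, seed=0):
    """Return (x, y) point prompts from a regular grid, restricted to foreground.

    Illustrative assumption: `heatmap` is a 2D array of foreground scores
    in [0, 1]; this is not the paper's prompt sampler.
    """
    h, w = heatmap.shape
    # Build a dense grid of candidate locations.
    ys, xs = np.meshgrid(np.arange(0, h, grid_stride),
                         np.arange(0, w, grid_stride),
                         indexing="ij")
    ys, xs = ys.ravel(), xs.ravel()

    # Keep only candidates whose foreground score exceeds the threshold.
    keep = heatmap[ys, xs] > threshold
    points = np.stack([xs[keep], ys[keep]], axis=1)

    # Cap the number of prompts by uniform random subsampling.
    if len(points) > max_points:
        rng = np.random.default_rng(seed)
        idx = rng.choice(len(points), size=max_points, replace=False)
        points = points[idx]
    return points


if __name__ == "__main__":
    # Toy heatmap: a single square foreground region.
    toy = np.zeros((16, 16), dtype=np.float32)
    toy[4:12, 4:12] = 1.0
    print(sample_point_prompts(toy))
```

In a full pipeline, points like these would be passed to a promptable segmenter as point prompts, and the decoded masks would then be re-scored and filtered; those stages are omitted here.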