Meta AI released CutLER, a state-of-the-art zero-shot unsupervised object detector that improves detection performance by over 2.7 times on 11 benchmark datasets spanning domains such as video frames, paintings, and sketches.
The model's simplicity makes it compatible with different object-detection architectures (e.g., Mask R-CNN) across domains. In addition, it requires far less training data and far less human labeling effort for object detection. Needing less labeled data than other models matters in an age when raw data is abundant, as it paves the way to better models without an intensive data-labeling effort.
In 2021, Meta AI released DINO, an initial self-supervised model for learning image representations, i.e., for discovering the important objects in images without supervision. This work allowed the research community to track objects in images and generate attention maps. DINO attention maps can be used as image features to perform tasks such as semantic segmentation and object detection.
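As an illustration, the following minimal sketch extracts DINO self-attention maps from the publicly released ViT-S/16 checkpoint via torch.hub; the preprocessing choices and the example image path are assumptions for demonstration purposes:

import torch
from PIL import Image
from torchvision import transforms

# Load the pre-trained DINO ViT-S/16 backbone from the official repository.
model = torch.hub.load('facebookresearch/dino:main', 'dino_vits16')
model.eval()

# ImageNet-style preprocessing; the resolution is a multiple of the 16px patch size.
preprocess = transforms.Compose([
    transforms.Resize((480, 480)),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])
img = preprocess(Image.open('example.jpg').convert('RGB')).unsqueeze(0)  # placeholder image

with torch.no_grad():
    # Self-attention of the last transformer block: one map per attention head.
    attn = model.get_last_selfattention(img)   # shape (1, heads, tokens, tokens)
    cls_attn = attn[0, :, 0, 1:]               # attention from the [CLS] token to image patches
    heads, n_patches = cls_attn.shape
    side = int(n_patches ** 0.5)
    attention_maps = cls_attn.reshape(heads, side, side)  # coarse per-head attention maps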
Using these DINO features, it is possible to build a patch-wise similarity matrix that relates the different patches of an image. Feeding this similarity matrix to Normalized Cuts, which treats image segmentation as a graph-partitioning problem, yields a single foreground object mask for the image. Once an object mask is obtained, the matrix values associated with that object are masked out and the cut is repeated to extract further object masks. This algorithm is called MaskCut.
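A simplified sketch of this idea is shown below. It solves the relaxed normalized-cut problem on a patch similarity graph and masks out each discovered object before repeating; the real MaskCut implementation adds further heuristics (e.g., for choosing which side of the cut is foreground), so treat this purely as an illustration:

import numpy as np
from scipy.linalg import eigh

def normalized_cut(similarity):
    # Relaxed normalized cut: the second-smallest generalized eigenvector of
    # (D - W) v = lambda * D v bipartitions the patch graph.
    W = similarity
    D = np.diag(W.sum(axis=1))
    _, vecs = eigh(D - W, D)
    second = vecs[:, 1]
    # Which side is "foreground" is chosen heuristically in practice;
    # here we simply threshold at the mean for illustration.
    return second > second.mean()

def maskcut_sketch(patch_features, n_objects=3):
    # Cosine similarity between L2-normalized patch features (e.g., DINO features).
    feats = patch_features / np.linalg.norm(patch_features, axis=1, keepdims=True)
    sim = np.clip(feats @ feats.T, 1e-5, None)   # keep graph weights positive

    masks, remaining = [], np.ones(len(feats), dtype=bool)
    for _ in range(n_objects):
        idx = np.where(remaining)[0]
        if len(idx) < 2:
            break
        fg = normalized_cut(sim[np.ix_(idx, idx)])
        mask = np.zeros(len(feats), dtype=bool)
        mask[idx[fg]] = True
        masks.append(mask)
        remaining &= ~mask                       # mask out the object just found
    return masks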
The next step is to train a detector of the user's choice (such as Mask R-CNN) with a loss function called DropLoss, which helps the detector find objects that MaskCut missed by exploring other image regions. Instead of penalizing every predicted region that does not overlap the pseudo ground truth, DropLoss drops the loss for predicted regions whose maximum overlap (intersection-over-union) with any MaskCut mask falls below a small threshold, which encourages the detector to explore image regions not covered by the initial masks. The detector is trained on the ImageNet dataset, using DINO to initialize the neural-network weights.
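In simplified form, the DropLoss idea can be sketched as follows; the per-region losses and IoUs are assumed to be computed elsewhere in the detector's head, and the 0.01 threshold is the value reported in the paper:

import torch

def drop_loss(per_region_loss, max_iou_with_pseudo_gt, iou_threshold=0.01):
    # per_region_loss:        (N,) vanilla detection loss for each predicted region
    # max_iou_with_pseudo_gt: (N,) each region's maximum IoU with any MaskCut mask
    # Keep the loss only for regions that overlap a pseudo ground-truth mask;
    # low-overlap predictions are not penalized, so the detector can explore new regions.
    keep = (max_iou_with_pseudo_gt > iou_threshold).float()
    return (per_region_loss * keep).sum() / keep.sum().clamp(min=1.0)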
CutLER then trains itself on the same dataset over multiple rounds of self-training: the masks predicted by the detector in one round become the pseudo ground truth for the next, refining the initial MaskCut masks, and each round is initialized with the weights of the previous one. CutLER is short for Cut and Learn, which is exactly what it does.
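The self-training stage can be summarized with the high-level sketch below; the three callables and the confidence threshold are illustrative placeholders standing in for MaskCut, detector training with DropLoss, and detector inference, not CutLER's actual API:

def cut_and_learn(image_paths, generate_initial_masks, train_detector, predict_masks,
                  init_weights='dino_pretrained', num_rounds=3, score_threshold=0.75):
    # Round 0: pseudo ground truth comes from MaskCut applied to self-supervised features.
    pseudo_masks = {p: generate_initial_masks(p) for p in image_paths}
    weights = init_weights

    for _ in range(num_rounds):
        # Each round re-trains the detector on the current pseudo-masks,
        # starting from the weights of the previous round.
        weights = train_detector(image_paths, pseudo_masks, init_weights=weights)

        # Confident predictions (which typically include objects MaskCut missed)
        # become the pseudo ground truth for the next round.
        pseudo_masks = {p: predict_masks(weights, p, score_threshold) for p in image_paths}
    return weights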
Source: Cut and Learn for Unsupervised Object Detection and Instance Segmentation
The most common metric for rating object detectors is average precision, and since the model is class-agnostic, average recall is a useful complementary metric. On these metrics, CutLER outperforms previous unsupervised state-of-the-art models (e.g., FreeSOLO) by more than a factor of two in both average precision and average recall. It also narrows the gap to supervised methods such as Mask R-CNN, moving closer to learning image representations the way humans do.
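For reference, class-agnostic average precision and average recall can be computed with the standard COCO evaluation tooling; the annotation and prediction file paths below are placeholders:

from pycocotools.coco import COCO
from pycocotools.cocoeval import COCOeval

coco_gt = COCO('annotations/instances_val2017.json')    # COCO-format ground truth (placeholder path)
coco_dt = coco_gt.loadRes('cutler_predictions.json')    # detector predictions (placeholder path)

evaluator = COCOeval(coco_gt, coco_dt, iouType='segm')  # use 'bbox' for box AP/AR
evaluator.params.useCats = 0   # class-agnostic: ignore category labels when matching

evaluator.evaluate()
evaluator.accumulate()
evaluator.summarize()          # prints AP and AR at the standard COCO IoU thresholds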
Source: Cut and Learn for Unsupervised Object Detection and Instance Segmentation
Meta AI released CutLER on GitHub. To train and evaluate the model, try the following command:
python train_net.py --num-gpus 8 \
--config-file model_zoo/configs/COCO-Semisupervised/cascade_mask_rcnn_R_50_FPN_{K}perc.yaml \
MODEL.WEIGHTS /path/to/cutler_pretrained_model
If you want to play around with the model's visualizations, check out the Google Colab here.