Got only a few seconds? Here's the tl;dr: We present UDOS for open-world instance segmentation, leveraging weak supervision from unsupervised bottom-up segmentation algorithms like selective search. We first predict part masks corresponding to objects, then use affinity-based grouping and refinement modules to predict full instance masks for both seen and unseen objects in an image. We achieve significantly better results than existing SOTA on open-world instance segmentation across several datasets!
What is Open-world Instance Segmentation (OWIS)? In OWIS, we need to detect and predict object masks for both seen (annotated) and unseen (unannotated) object classes in an image. For example, MS-COCO has mask annotations for only 80 object categories, but at test time we need to predict masks for objects beyond these 80 categories. Effective open-world models are key to deploying visual perception systems in the real world.
What's wrong with Mask R-CNN? Nothing much. But instance segmentation frameworks such as Mask R-CNN often couple recognition and segmentation so closely that they are unable to segment objects not labeled in the training data. This problem is exacerbated when these frameworks are trained with non-exhaustive annotations like MS-COCO, where out-of-taxonomy objects are treated as negatives (background). Any prediction made on these objects is penalized by the top-down supervision. For example, in this picture, many classes like hat, plates and paper are missed because they are not part of the VOC taxonomy.
What does UDOS do? We make two key contributions in UDOS. First, we observe that unsupervised bottom-up segmentation algorithms such as Selective Search, which were the de-facto methods for proposal generation before the tsunami called deep learning, are inherently class-agnostic and well suited to the problem of open-world detection. They provide far more coverage (in terms of objects) in the image, since they rely only on low-level cues such as shape, size, color, texture and brightness between pixels to generate candidate object masks. So we generate candidate object masks for all images in MS-COCO and add them as augmented supervision to train the top-down network, in addition to the masks already annotated for a subset of objects.
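To make the augmentation step concrete, here is a minimal, hypothetical sketch of combining annotated instances with class-agnostic bottom-up proposals. The function names, the box-level (rather than mask-level) overlap check, and the 0.5 IoU threshold are our illustrative assumptions, not the exact recipe from the paper.

```python
# Hypothetical sketch: augment top-down supervision with bottom-up proposals.
# `gt_boxes` are annotated instances; `proposals` come from an unsupervised
# method like Selective Search. Boxes are (x1, y1, x2, y2) tuples.

def box_iou(a, b):
    """Intersection-over-union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area = lambda box: (box[2] - box[0]) * (box[3] - box[1])
    union = area(a) + area(b) - inter
    return inter / union if union > 0 else 0.0

def augment_supervision(gt_boxes, proposals, iou_thresh=0.5):
    """Keep bottom-up proposals that do not duplicate an annotated instance,
    and return them alongside the ground truth as extra training targets."""
    extra = [p for p in proposals
             if all(box_iou(p, g) < iou_thresh for g in gt_boxes)]
    return gt_boxes + extra
```

In this sketch, proposals that mostly re-cover an annotated object are dropped, while proposals over unannotated regions survive and become additional (noisy) supervision for the top-down network.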
So can this solve the problem? Not quite. Look at the figure below: these proposal generation methods (MCG here) are unsupervised (which is good, because they have no class bias) but they are also prone to over-segmentation (which is bad, since they can only predict parts of objects and not whole objects - like head, torso and hands separately instead of the person as a whole). So our second contribution is to design a grouping method and a refinement module that cluster these part masks into larger masks corresponding to whole objects. The grouping module agglomeratively clusters the part masks based on pairwise affinities, to solve the following objective.
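One plausible way to formalize this grouping objective (our own notation, written from the description below rather than taken from the paper): given part masks $\{m_1, \dots, m_n\}$ with pairwise affinities $A(m_i, m_j)$, find the partition $\mathcal{P}$ that maximizes the average within-group affinity,

```latex
\max_{\mathcal{P}} \; \frac{1}{|\mathcal{P}|} \sum_{S \in \mathcal{P}}
\frac{1}{|S|\,(|S|-1)} \sum_{\substack{m_i, m_j \in S \\ i \neq j}} A(m_i, m_j)
```

where the inner term is understood to vanish (or be skipped) for singleton groups $|S| = 1$.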
Put in words: given a set of part masks along with their pairwise affinity scores, our clustering algorithm produces a partition of the masks that maximizes the average affinity within each resulting group. So all the parts of a body might be merged into a single person mask, the legs and body of a chair might be merged into a single mask for the chair, and so on. Our refinement module then further refines the merged masks and predicts an objectness score for each of the predicted proposals. Now, equipped with richer supervision and effective modules for clustering and refining part masks, UDOS is much better at predicting both seen and unseen objects at test time.
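The grouping step can be sketched as a greedy agglomerative merge. The adjacency-based affinity function and the merge threshold `tau` below are our illustrative assumptions; the paper's actual affinity is learned, and this toy version only captures the "merge the most affine pair until nothing exceeds the threshold" structure.

```python
# Illustrative sketch of affinity-based agglomerative grouping of part masks.
# Masks are represented as sets of (row, col) pixel coordinates.

def affinity(a, b):
    """Toy affinity: fraction of pixels in the smaller mask that touch the
    other mask (4-connectivity). High values suggest the parts belong together."""
    small, big = (a, b) if len(a) <= len(b) else (b, a)
    touching = sum(1 for (r, c) in small
                   if any((r + dr, c + dc) in big
                          for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1))))
    return touching / len(small)

def group_parts(masks, tau=0.2):
    """Greedy agglomerative clustering: repeatedly merge the most affine
    pair of masks until no pair's affinity exceeds tau."""
    masks = [set(m) for m in masks]
    while True:
        best, pair = tau, None
        for i in range(len(masks)):
            for j in range(i + 1, len(masks)):
                score = affinity(masks[i], masks[j])
                if score > best:
                    best, pair = score, (i, j)
        if pair is None:
            return masks
        i, j = pair
        masks[i] |= masks.pop(j)  # merge the best pair in place
```

For example, two touching part masks get merged into one object mask, while an isolated mask far away stays its own group; in UDOS the merged outputs would then go through the refinement module for cleanup and objectness scoring.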
How does UDOS compare to SOTA? UDOS performs much better than several SOTA methods for OWIS, such as OLN, LDET and GGN. It beats GGN by 3% Box-AR100 and 2% Mask-AR100 on the competitive setting of VOC to NonVOC transfer within the COCO dataset. In the cross-dataset setting, UDOS outperforms previous works on several datasets like UVO, ADE20K and OpenImages. Please refer to our paper for additional ablations and qualitative results!
The masks in the last two columns correspond to true positives with respect to the ground truth. Note that classes such as pot, van, elephant, and auto-rickshaw do not belong to any of the training VOC categories. Also note that the merged outputs might be noisy due to imperfections in the initial part-mask supervision, which are corrected by our refinement layer.
We show open-world instance segmentation results using UDOS on a wide range of experiments covering both cross-category and cross-dataset settings.
We achieve new SOTA on the cross-category setting on COCO using 20 VOC classes for training and testing on 60 NonVOC classes.
Even on a more realistic setting of cross-dataset evaluation, we achieve SOTA or comparable to SOTA on all datasets.
@article{kalluri2023udos,
  author  = {Kalluri, Tarun and Wang, Weiyao and Wang, Heng and Chandraker, Manmohan and Torresani, Lorenzo and Tran, Du},
  title   = {Open-world Instance Segmentation: Top-down Learning with Bottom-up Supervision},
  journal = {arXiv},
  year    = {2023},
}