Promising or Elusive? Unsupervised Object Segmentation
from Real-world Single Images

NeurIPS 2022

Yafei Yang, Bo Yang

vLAR Group, The Hong Kong Polytechnic University
teaser.png


Abstract


In this paper, we study the problem of unsupervised object segmentation from single images. We do not introduce a new algorithm, but systematically investigate the effectiveness of existing unsupervised models on challenging real-world images. We first introduce four complexity factors to quantitatively measure the distributions of object- and scene-level biases in appearance and geometry for datasets with human annotations. With the aid of these factors, we empirically find that, not surprisingly, existing unsupervised models catastrophically fail to segment generic objects in real-world images, although they can easily achieve excellent performance on numerous simple synthetic datasets, due to the vast gap in objectness biases between synthetic and real images. By conducting extensive experiments on multiple groups of ablated real-world datasets, we ultimately find that the key factors underlying the colossal failure of existing unsupervised models on real-world images are the challenging distributions of object- and scene-level biases in appearance and geometry. Because of this, the inductive biases introduced in existing unsupervised models can hardly capture the diverse object distributions. Our research results suggest that future work should exploit more explicit objectness biases in the network design.

Unsupervised Segmentation Performance


Synthetic datasets

training

dSprites dSprites/train/0_input_image.png
Tetris Tetris/train/0_input_image.png
CLEVR CLEVR/train/1_input_image.png
dSprites/train/0_gt_mask.png
Tetris/train/0_gt_mask.png
CLEVR/train/1_gt_mask.png
dSprites/train/IODINE/train_result.gif
Tetris/train/IODINE/train_result.gif
CLEVR/train/IODINE/sample/train_result.gif
dSprites/test/SlotAtt/train_result_05.gif
Tetris/test/SlotAtt/train_result_05.gif
CLEVR/test/SlotAtt/sample 1/train_result_05.gif

testing

dSprites dSprites/test/0_input_image.png
Tetris Tetris/test/0_input_image.png
CLEVR CLEVR/test/0_input_image.png
dSprites/test/0_gt_mask.png
Tetris/test/0_gt_mask.png
CLEVR/test/0_gt_mask.png
dSprites/test/IODINE/test_result.gif
Tetris/test/IODINE/test_result.gif
CLEVR/test/IODINE/test_result.gif
dSprites/test/SlotAtt/test_result_05.gif
Tetris/test/SlotAtt/test_result_05.gif
CLEVR/test/SlotAtt/test_result_05.gif

Real-world datasets

training

YCB YCB/train/0_input_image.png
ScanNet ScanNet/train/3_input_image.png
COCO COCO/train/0_input_image.png
YCB/train/0_gt_mask.png
ScanNet/train/3_gt_mask.png
COCO/train/0_gt_mask.png
YCB/train/IODINE/train_result.gif
ScanNet/train/IODINE/sample 3/train_result.gif
COCO/train/IODINE/sample 1/train_result.gif
YCB/test/SlotAtt/train_result_05.gif
ScanNet/test/SlotAtt/sample 3/train_result_05.gif
COCO/test/SlotAtt/sample 1/train_result_05.gif

testing

YCB YCB/test/0_input_image.png
ScanNet ScanNet/test/0_input_image.png
COCO COCO/test/0_input_image.png
YCB/test/0_gt_mask.png
ScanNet/test/0_gt_mask.png
COCO/test/0_gt_mask.png
YCB/test/IODINE/test_result.gif
ScanNet/test/IODINE/test_result.gif
COCO/test/IODINE/test_result.gif
YCB/test/SlotAtt/test_result_05.gif
ScanNet/test/SlotAtt/test_result_05.gif
COCO/test/SlotAtt/test_result_05.gif
* First row: input images. Second row: ground-truth (GT) object masks. Third row: results from IODINE. Last row: results from Slot Attention (SlotAtt).

Complexity Factors


Object Color Gradient

Object Shape Concavity

object_color_gradient.png
object_shape_concavity.png

Given an RGB image, we first convert it to grayscale, then compute its gradient in the horizontal and vertical directions. To avoid contamination from the background, we discard the gradient on the object boundary. The final score is the average gradient over the object's interior.
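The steps above can be sketched roughly as follows. This is only an illustration under stated assumptions, not the authors' implementation: the function name `object_color_gradient` is hypothetical, the grayscale weights are the common BT.601 luminance weights, gradients are plain finite differences, and "removing the boundary" is approximated by eroding the mask by one pixel. The paper's exact filter and normalization may differ.

```python
import numpy as np

def object_color_gradient(image, mask):
    """Mean gradient magnitude over the interior pixels of one object.

    image: (H, W, 3) float array in [0, 1]; mask: (H, W) boolean object mask.
    """
    # Grayscale conversion with BT.601 luminance weights (an assumption).
    gray = image @ np.array([0.299, 0.587, 0.114])

    # Finite-difference gradients along rows (vertical) and columns (horizontal).
    gy, gx = np.gradient(gray)
    magnitude = np.hypot(gx, gy)

    # One-pixel erosion: keep pixels whose 4 neighbours are all inside the mask,
    # so that no gradient sample straddles the object/background boundary.
    padded = np.pad(mask, 1, constant_values=False)
    inner = (mask
             & padded[:-2, 1:-1] & padded[2:, 1:-1]
             & padded[1:-1, :-2] & padded[1:-1, 2:])

    if not inner.any():
        return 0.0  # object too thin to have an interior
    return float(magnitude[inner].mean())
```

With this sketch, a flat-colored object scores 0 regardless of how strongly it contrasts with the background, which is the behaviour the boundary removal is meant to guarantee.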

Given the binary mask of an object, we first find the smallest convex polygon that encloses it (its convex hull). The factor value is computed as 1 - (area of object) / (area of convex polygon).
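A minimal sketch of this computation, assuming each mask pixel is treated as a unit square so that both areas are measured in pixels. The helper names (`convex_hull`, `polygon_area`, `object_shape_concavity`) are hypothetical, and the hull is built with Andrew's monotone chain rather than whatever routine the authors used.

```python
import numpy as np

def convex_hull(points):
    """Andrew's monotone chain; returns hull vertices in counter-clockwise order."""
    pts = sorted(set(map(tuple, points)))
    if len(pts) <= 2:
        return pts

    def cross(o, a, b):
        return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])

    def half(seq):
        chain = []
        for p in seq:
            # Pop points that would make a clockwise or collinear turn.
            while len(chain) >= 2 and cross(chain[-2], chain[-1], p) <= 0:
                chain.pop()
            chain.append(p)
        return chain[:-1]

    return half(pts) + half(pts[::-1])

def polygon_area(vertices):
    """Polygon area via the shoelace formula."""
    x = np.array([v[0] for v in vertices], dtype=float)
    y = np.array([v[1] for v in vertices], dtype=float)
    return 0.5 * abs(np.dot(x, np.roll(y, -1)) - np.dot(y, np.roll(x, -1)))

def object_shape_concavity(mask):
    """1 - area(object) / area(convex hull) for a boolean (H, W) mask."""
    rows, cols = np.nonzero(mask)
    if rows.size == 0:
        return 0.0
    # Use the four corners of every object pixel so the hull encloses whole pixels.
    corners = np.concatenate([
        np.stack([cols + dx, rows + dy], axis=1)
        for dx in (0, 1) for dy in (0, 1)
    ])
    hull_area = polygon_area(convex_hull(corners.tolist()))
    return float(1.0 - rows.size / hull_area)
```

Under this convention a rectangle scores exactly 0, and the score grows toward 1 as more of the hull is empty.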

Inter-object Color Similarity

Inter-object Shape Variation

inter_object_color_similarity.png
inter_object_shape_variation.png

Given an image consisting of multiple objects, we first calculate the average RGB color of each object. We then average the Euclidean distance in RGB space over all pairs of objects. The factor value is computed as 1 - (normalized average distance).
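As a rough sketch of this factor: the function name `inter_object_color_similarity` is hypothetical, colors are assumed to lie in [0, 1], and the normalizer is assumed to be the largest possible RGB distance, sqrt(3). The text above does not state which normalizer is actually used, so treat that choice as a placeholder.

```python
from itertools import combinations

import numpy as np

def inter_object_color_similarity(image, masks):
    """image: (H, W, 3) floats in [0, 1]; masks: list of non-empty (H, W) boolean masks."""
    if len(masks) < 2:
        return 1.0  # convention of this sketch: a lone object has no pair to differ from

    # Average RGB color of each object.
    means = [image[m].mean(axis=0) for m in masks]

    # Euclidean distance between every pair of mean colors.
    dists = [np.linalg.norm(a - b) for a, b in combinations(means, 2)]

    # Normalize by the RGB cube diagonal (an assumption), then invert.
    max_dist = np.sqrt(3.0)
    return float(1.0 - np.mean(dists) / max_dist)
```

Two objects with identical mean color give a similarity of 1, while a pure black object next to a pure white one gives 0.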

We calculate the diagonal length of the bounding box of each object. The average variation of these diagonals across objects is normalized to give the final factor value.
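One possible reading of this factor is sketched below. The description leaves both the notion of "variation" and the normalizer open, so this sketch assumes the mean absolute difference between bounding-box diagonals over all object pairs, divided by the image diagonal. The function name `inter_object_shape_variation` and both of those choices are assumptions, not the authors' definition.

```python
from itertools import combinations

import numpy as np

def inter_object_shape_variation(masks):
    """masks: list of non-empty (H, W) boolean masks from the same image."""
    h, w = masks[0].shape
    image_diag = np.hypot(h, w)

    # Bounding-box diagonal of each object, in pixels.
    diags = []
    for m in masks:
        rows, cols = np.nonzero(m)
        box_h = rows.max() - rows.min() + 1
        box_w = cols.max() - cols.min() + 1
        diags.append(np.hypot(box_h, box_w))

    if len(diags) < 2:
        return 0.0  # convention of this sketch: one object has no scale variation

    # Mean absolute pairwise difference, normalized by the image diagonal (assumed).
    diffs = [abs(a - b) for a, b in combinations(diags, 2)]
    return float(np.mean(diffs) / image_diag)
```

Objects of equal size yield 0; mixing very small and very large objects pushes the value toward 1.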

Ablations


C: Single Color Ablation

S: Convex Shape Ablation

C_single_color_ablation.png
S_convex_shape_ablation.png

Remove the color gradient inside each object such that: Object Color Gradient is effectively reduced; Inter-object Color Similarity remains similar.
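One simple way to realize such an ablation is to paint every object with its own mean color, sketched below. This is an assumption about how the gradient could be removed, not a description of the authors' pipeline, and the function name `single_color_ablation` is hypothetical. Because each object keeps its mean RGB color, the pairwise color distances between objects are unchanged, which matches the stated goal for Inter-object Color Similarity.

```python
import numpy as np

def single_color_ablation(image, masks):
    """Return a copy of `image` where each object is filled with its mean color.

    image: (H, W, 3) float array; masks: list of (H, W) boolean object masks.
    """
    out = image.copy()
    for m in masks:
        # Broadcasting assigns the same mean RGB triple to every object pixel.
        out[m] = image[m].mean(axis=0)
    return out
```

Background pixels are left untouched, and the input array is not modified in place.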

Make the shape of each object convex such that: Object Shape Concavity is effectively reduced; Inter-object Shape Variation remains similar.
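On the mask side, making a shape convex can be sketched as filling the object's convex hull. The code below is an illustrative assumption only: the names `convex_shape_ablation` and `_convex_hull` are hypothetical, a pixel is filled when its centre lies inside or on the hull of the object's pixel corners, and how the newly covered pixels are colored in the image is not addressed here. Since the hull has the same bounding box as the original object, the bounding-box diagonal is preserved.

```python
import numpy as np

def _cross(o, a, b):
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])

def _convex_hull(points):
    """Andrew's monotone chain; vertices in counter-clockwise order."""
    pts = sorted(set(points))
    if len(pts) <= 2:
        return pts

    def half(seq):
        chain = []
        for p in seq:
            while len(chain) >= 2 and _cross(chain[-2], chain[-1], p) <= 0:
                chain.pop()
            chain.append(p)
        return chain[:-1]

    return half(pts) + half(pts[::-1])

def convex_shape_ablation(mask):
    """Return a boolean mask covering the convex hull of the input (H, W) mask."""
    rows, cols = np.nonzero(mask)
    if rows.size == 0:
        return mask.astype(bool)

    # Hull of all pixel corners, in (x, y) = (column, row) coordinates.
    corners = {(int(c) + dx, int(r) + dy)
               for r, c in zip(rows, cols) for dx in (0, 1) for dy in (0, 1)}
    hull = _convex_hull(corners)

    # Test every pixel centre inside the bounding box against each hull edge.
    r0, r1 = rows.min(), rows.max() + 1
    c0, c1 = cols.min(), cols.max() + 1
    ys, xs = np.mgrid[r0:r1, c0:c1]
    cx, cy = xs + 0.5, ys + 0.5
    inside = np.ones(cx.shape, dtype=bool)
    for (x0, y0), (x1, y1) in zip(hull, hull[1:] + hull[:1]):
        # For a counter-clockwise polygon, interior points give a non-negative cross product.
        inside &= (x1 - x0) * (cy - y0) - (y1 - y0) * (cx - x0) >= 0

    out = np.zeros(mask.shape, dtype=bool)
    out[r0:r1, c0:c1] = inside
    return out
```

An already convex mask such as a rectangle passes through unchanged, while the notch of an L-shaped mask is partly filled in.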

T: Texture Replaced Ablation

U: Uniform Scale Ablation

inter_object_color_similarity.png
inter_object_shape_variation.png

Replace the texture of every object with a distinctive one such that: Object Color Gradient remains similar; Inter-object Color Similarity is effectively reduced.

Rescale all objects to a uniform scale such that: Object Shape Concavity remains similar; Inter-object Shape Variation is effectively reduced.

Qualitative Results from Ablation


Full Ablation

YCB

YCB-CSTU YCB_CSTU/test/0_input_image.png
GT mask YCB_CSTU/test/0_gt_mask.png
IODINE YCB_CSTU/test/IODINE/test_result.gif
SlotAtt YCB_CSTU/test/SlotAtt/test_result_05.gif

ScanNet

ScanNet-CSTU ScanNet_CSTU/test/0_input_image.png
GT mask ScanNet_CSTU/test/0_gt_mask.png
IODINE ScanNet_CSTU/test/IODINE/test_result.gif
SlotAtt ScanNet_CSTU/test/SlotAtt/test_result_075.gif

COCO

COCO-CSTU COCO_CSTU/test/0_input_image.png
GT mask COCO_CSTU/test/0_gt_mask.png
IODINE COCO_CSTU/test/IODINE/test_result.gif
SlotAtt COCO_CSTU/test/SlotAtt/test_result_05.gif

Object-level Ablation

C: Single Color Ablation

YCB-C YCB_C/test/0_input_image.png
ScanNet-C ScanNet_C/test/0_input_image.png
COCO-C COCO_C/test/0_input_image.png
YCB_C/test/0_gt_mask.png
ScanNet_C/test/0_gt_mask.png
COCO_C/test/1_gt_mask.png
YCB_C/test/IODINE/test_result.gif
ScanNet_C/test/IODINE/test_result.gif
COCO_C/test/IODINE/sample/test_result.gif
YCB_C/test/SlotAtt/test_result_05.gif
ScanNet_C/test/SlotAtt/test_result_05.gif
COCO_C/test/SlotAtt/test_result_05.gif

S: Convex Shape Ablation

YCB-S YCB_S/test/0_input_image.png
ScanNet-S ScanNet_S/test/3_input_image.png
COCO-S COCO_S/test/0_input_image.png
YCB_S/test/0_gt_mask.png
ScanNet_S/test/0_gt_mask.png
COCO_S/test/0_gt_mask.png
YCB_S/test/IODINE/test_result.gif
ScanNet/test/IODINE/test_result.gif
COCO/test/IODINE/test_result.gif
/YCB_S/test/SlotAtt/test_result_05.gif
ScanNet_S/test/SlotAtt/test_result_05.gif
COCO_S/test/SlotAtt/test_result_05.gif

Scene-level Ablation

T: Texture Replaced Ablation

YCB-T YCB_T/test/0_input_image.png
ScanNet-T ScanNet_T/test/0_input_image.png
COCO-T COCO_T/test/0_input_image.png
YCB_T/test/0_gt_mask.png
ScanNet_T/test/0_gt_mask.png
COCO_T/test/0_gt_mask.png
YCB_T/test/IODINE/test_result.gif
ScanNet_T/test/IODINE/test_result.gif
COCO_T/test/IODINE/test_result.gif
YCB_T/test/SlotAtt/test_result_05.gif
ScanNet_T/test/SlotAtt/test_result_05.gif
COCO_T/test/SlotAtt/test_result_05.gif

U: Uniform Scale Ablation

YCB-U YCB_U/test/0_input_image.png
ScanNet-U ScanNet_U/test_U/0_input_image.png
COCO-U COCO_U/test/0_input_image.png
YCB_U/test/0_gt_mask.png
ScanNet_U/test/0_gt_mask.png
COCO_U/test/0_gt_mask.png
YCB_U/test/IODINE/test_result.gif
ScanNet_U/test/IODINE/test_result.gif
COCO_U/test/IODINE/test_result.gif
YCB_U/test/SlotAtt/test_result_05.gif
ScanNet_U/test/SlotAtt/test_result_05.gif
COCO_U/test/SlotAtt/test_result_05.gif

Quantitative Results from Ablation


Complexity Factor Distributions

complexity_factor_results.png

Quantitative Segmentation Performance

experiments_quantitative.png

Video


Short Demo (40s)

Long presentation (11min)

BibTeX

If you find this work useful for your research, please cite:
@inproceedings{yang2022,
  title={{Promising or Elusive? Unsupervised Object Segmentation from Real-world Single Images}},
  author={Yang, Yafei and Yang, Bo},
  booktitle={NeurIPS},
  year={2022},
}

© This page takes inspiration from http://imagine.enpc.fr/~monniert/DTIClustering/.