Unsupervised Object Segmentation

Abstract

In this paper, we study the problem of unsupervised object segmentation from single images. We do not introduce a new algorithm; instead, we systematically investigate the effectiveness of existing unsupervised models on challenging real-world images. We first introduce four complexity factors to quantitatively measure the distributions of object- and scene-level biases in appearance and geometry for datasets with human annotations. With the aid of these factors, we empirically find that, not surprisingly, existing unsupervised models catastrophically fail to segment generic objects in real-world images, although they can easily achieve excellent performance on numerous simple synthetic datasets, due to the vast gap in objectness biases between synthetic and real images. By conducting extensive experiments on multiple groups of ablated real-world datasets, we ultimately find that the key factors underlying the failure of existing unsupervised models on real-world images are the challenging distributions of object- and scene-level biases in appearance and geometry. Because of this, the inductive biases introduced in existing unsupervised models can hardly capture the diverse object distributions. Our results suggest that future work should exploit more explicit objectness biases in network design.

Unsupervised Segmentation Performance

Synthetic datasets

training

dSprites

Tetris

CLEVR

testing

dSprites

Tetris

CLEVR

Real-world datasets

training

YCB

ScanNet

COCO

testing

YCB

ScanNet

COCO

* The first row shows input images, the second row shows ground-truth (GT) object masks, the third row shows results from IODINE, and the last row shows results from SlotAtt.

Complexity Factors

Object Color Gradient

Object Shape Concavity

To compute Object Color Gradient, we first convert the RGB image to grayscale and then calculate its horizontal and vertical gradients. To avoid the influence of the background, we discard the gradients on the object boundary. The final score is the average of the remaining inner gradients.
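The Object Color Gradient steps above can be sketched in plain Python. This is only an illustrative sketch, not the authors' implementation: the BT.601 grayscale weights, the central-difference gradient, the use of the gradient magnitude, and the 4-neighbour rule for dropping boundary pixels are all assumptions, and the function names are hypothetical. Images are nested lists of (r, g, b) tuples to keep the example dependency-free.

```python
import math


def to_gray(rgb):
    # Assumed grayscale conversion: ITU-R BT.601 luma weights.
    return [[0.299 * r + 0.587 * g + 0.114 * b for (r, g, b) in row] for row in rgb]


def inner_pixels(mask):
    """Object pixels whose four neighbours are also object pixels.

    Dropping every other object pixel is the assumed way of removing
    the gradient on the object boundary.
    """
    h, w = len(mask), len(mask[0])
    return [
        (y, x)
        for y in range(1, h - 1)
        for x in range(1, w - 1)
        if mask[y][x] and mask[y - 1][x] and mask[y + 1][x]
        and mask[y][x - 1] and mask[y][x + 1]
    ]


def object_color_gradient(rgb, mask):
    """Average gradient magnitude over the inner pixels of one object."""
    gray = to_gray(rgb)
    pixels = inner_pixels(mask)
    if not pixels:
        return 0.0  # object too thin to have inner pixels
    total = 0.0
    for y, x in pixels:
        gx = (gray[y][x + 1] - gray[y][x - 1]) / 2.0  # horizontal gradient
        gy = (gray[y + 1][x] - gray[y - 1][x]) / 2.0  # vertical gradient
        total += math.hypot(gx, gy)
    return total / len(pixels)
```

Because every inner pixel has all four neighbours inside the object, the central differences never read background pixels, which is the point of discarding the boundary.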

To compute Object Shape Concavity, we take the binary mask of an object and find the smallest convex polygon that encloses it. The factor value is computed as 1 - (area of object / area of convex polygon).
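A minimal sketch of the Object Shape Concavity computation follows. It is not the authors' code: the hull is built with Andrew's monotone chain over pixel corners (so that a filled rectangle has concavity exactly 0), and its area is taken with the shoelace formula rather than by rasterizing a convex mask. All function names are hypothetical.

```python
def convex_hull(points):
    """Andrew's monotone chain; returns hull vertices in counter-clockwise order."""
    pts = sorted(set(points))
    if len(pts) <= 2:
        return pts

    def cross(o, a, b):
        return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])

    lower, upper = [], []
    for p in pts:
        while len(lower) >= 2 and cross(lower[-2], lower[-1], p) <= 0:
            lower.pop()
        lower.append(p)
    for p in reversed(pts):
        while len(upper) >= 2 and cross(upper[-2], upper[-1], p) <= 0:
            upper.pop()
        upper.append(p)
    return lower[:-1] + upper[:-1]


def polygon_area(vertices):
    """Shoelace formula for a simple polygon."""
    n = len(vertices)
    twice_area = 0.0
    for i in range(n):
        x1, y1 = vertices[i]
        x2, y2 = vertices[(i + 1) % n]
        twice_area += x1 * y2 - x2 * y1
    return abs(twice_area) / 2.0


def shape_concavity(mask):
    """1 - object area / convex hull area for one binary mask."""
    corners, object_area = [], 0
    for y, row in enumerate(mask):
        for x, inside in enumerate(row):
            if inside:
                object_area += 1  # each pixel is a unit square
                corners += [(x, y), (x + 1, y), (x, y + 1), (x + 1, y + 1)]
    if object_area == 0:
        return 0.0
    return 1.0 - object_area / polygon_area(convex_hull(corners))
```

For example, a three-pixel L-shape `[[1, 1], [1, 0]]` has object area 3 and hull area 3.5, giving a concavity of 1/7, while any filled rectangle gives 0.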

Inter-object Color Similarity

Inter-object Shape Variation

To compute Inter-object Color Similarity for an image consisting of multiple objects, we first calculate the average RGB color of each object. We then average the Euclidean distance in RGB space over all pairs of objects. The factor value is computed as 1 - normalized average distance.
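The Inter-object Color Similarity computation can be sketched as below. The text does not say how the average distance is normalized; this sketch assumes division by the largest possible distance in 8-bit RGB space, 255 * sqrt(3). The function names are hypothetical, and each mask is assumed to be non-empty.

```python
import math
from itertools import combinations

# Assumed normalizer: distance between black and white in 8-bit RGB.
MAX_RGB_DISTANCE = 255.0 * math.sqrt(3.0)


def mean_color(rgb, mask):
    """Average (r, g, b) over the pixels selected by a non-empty mask."""
    pixels = [rgb[y][x] for y, row in enumerate(mask) for x, v in enumerate(row) if v]
    n = len(pixels)
    return tuple(sum(p[c] for p in pixels) / n for c in range(3))


def inter_object_color_similarity(rgb, masks):
    """1 - normalized average pairwise distance between object mean colors."""
    colors = [mean_color(rgb, m) for m in masks]
    pairs = list(combinations(colors, 2))
    if not pairs:
        raise ValueError("need at least two objects")
    avg_distance = sum(math.dist(a, b) for a, b in pairs) / len(pairs)
    return 1.0 - avg_distance / MAX_RGB_DISTANCE
```

Under this normalization, two objects with identical mean colors give a similarity of 1, and a black object next to a white object gives a similarity of approximately 0.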

To compute Inter-object Shape Variation, we calculate the diagonal length of the bounding box of each object. The average variation of these diagonals is normalized to give the final factor value.
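One possible reading of the Inter-object Shape Variation computation is sketched below. The text leaves both "variation" and the normalization unspecified, so this sketch makes two labeled assumptions: variation is the mean absolute difference between the diagonals of every pair of objects, and it is normalized by the image diagonal. Function names are hypothetical, and masks are assumed to be non-empty and to share the image size.

```python
import math
from itertools import combinations


def bbox_diagonal(mask):
    """Diagonal length (in pixels) of the tight bounding box of a non-empty mask."""
    coords = [(x, y) for y, row in enumerate(mask) for x, v in enumerate(row) if v]
    xs = [x for x, _ in coords]
    ys = [y for _, y in coords]
    return math.hypot(max(xs) - min(xs) + 1, max(ys) - min(ys) + 1)


def inter_object_shape_variation(masks):
    """Assumed definition: mean pairwise |diagonal difference| / image diagonal."""
    height, width = len(masks[0]), len(masks[0][0])
    diagonals = [bbox_diagonal(m) for m in masks]
    pairs = list(combinations(diagonals, 2))
    if not pairs:
        raise ValueError("need at least two objects")
    avg_difference = sum(abs(a - b) for a, b in pairs) / len(pairs)
    return avg_difference / math.hypot(width, height)
```

With this reading, a scene whose objects all have the same bounding-box diagonal scores 0, and the score grows as object scales diverge.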

Ablations

C: Single Color Ablation

S: Convex Shape Ablation

Single Color Ablation (C) removes the color gradient inside each object such that Object Color Gradient is effectively reduced while Inter-object Color Similarity remains similar.
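A minimal sketch of how a Single Color Ablation could be applied is given below. It assumes that each object is filled with its own mean RGB color, which is one simple way to remove the inner gradient while leaving the per-object average color, and hence Inter-object Color Similarity, unchanged. The actual procedure used to build the ablated datasets may differ, and the function name is hypothetical.

```python
def single_color_ablation(rgb, masks):
    """Return a copy of the image with each masked object filled by its mean color.

    rgb   : H x W nested list of (r, g, b) tuples
    masks : list of H x W nested lists of booleans, one per object
    """
    out = [list(row) for row in rgb]  # copy rows so the input is not modified
    for mask in masks:
        pixels = [(y, x) for y, row in enumerate(mask) for x, v in enumerate(row) if v]
        if not pixels:
            continue  # skip empty masks
        n = len(pixels)
        mean = tuple(sum(rgb[y][x][c] for y, x in pixels) / n for c in range(3))
        for y, x in pixels:
            out[y][x] = mean  # flat color inside the object
    return out
```

Background pixels are left untouched, so only the appearance inside annotated objects changes.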

Convex Shape Ablation (S) makes the shape of each object convex such that Object Shape Concavity is effectively reduced while Inter-object Shape Variation remains similar.

T: Texture Replaced Ablation

U: Uniform Scale Ablation

Texture Replaced Ablation (T) replaces the texture of every object with a distinctive one such that Object Color Gradient remains similar while Inter-object Color Similarity is effectively reduced.

Uniform Scale Ablation (U) rescales all objects such that Object Shape Concavity remains similar while Inter-object Shape Variation is effectively reduced.

Qualitative Results from Ablation

Full Ablation

YCB

YCB-CSTU

GT mask

IODINE

SlotAtt

ScanNet

ScanNet-CSTU

GT mask

IODINE

SlotAtt

COCO

COCO-CSTU

GT mask

IODINE

SlotAtt

Object-level Ablation

C: Single Color Ablation

YCB-C

ScanNet-C

COCO-C

S: Convex Shape Ablation

YCB-S

ScanNet-S

COCO-S

Scene-level Ablation

T: Texture Replaced Ablation

YCB-T

ScanNet-T

COCO-T

U: Uniform Scale Ablation

YCB-U

ScanNet-U

COCO-U

Quantitative Results from Ablation

Complexity Factor Distributions

Quantitative Segmentation Performance

Video

Short Demo (40s)

Long presentation (11min)

BibTeX

If you find this work useful for your research, please cite:

@inproceedings{yang2022,
  title={{Promising or Elusive? Unsupervised Object Segmentation from Real-world Single Images}},
  author={Yang, Yafei and Yang, Bo},
  booktitle={NeurIPS},
  year={2022},
}

Promising or Elusive? Unsupervised Object Segmentation
from Real-world Single Images

NeurIPS 2022

Yafei Yang Bo Yang
