Multi-granularity Interaction Simulation for Unsupervised Interactive Segmentation

Kehan Li1,3, Yian Zhao1,3, Zhennan Wang2, Zesen Cheng1,3, Peng Jin1,3, Xiangyang Ji4, Li Yuan1,2,3, Chang Liu4 *, Jie Chen1,2,3 *
1School of Electronic and Computer Engineering, Peking University 2Peng Cheng Laboratory 3AI for Science (AI4S)-Preferred Program, Peking University 4Department of Automation and BNRist, Tsinghua University

Abstract

Interactive segmentation enables users to segment objects as needed by providing cues, which brings human-computer interaction to many fields, such as image editing and medical image analysis. Typically, training deep models requires massive and expensive pixel-level annotations, since object-oriented interactions are simulated from manually labeled object masks.

In this work, we reveal that informative interactions can be simulated through semantically consistent yet diverse region exploration in an unsupervised paradigm. Concretely, we introduce a Multi-granularity Interaction Simulation (MIS) approach to open up a promising direction for unsupervised interactive segmentation.

Drawing on the high-quality dense features produced by recent self-supervised models, we propose to gradually merge patches or regions with similar features into more extensive regions, so that every merged region serves as a semantically meaningful multi-granularity proposal. By randomly sampling these proposals and simulating possible interactions based on them, we provide meaningful interactions at multiple granularities that teach the model to understand interactions. Without any annotation, our MIS significantly outperforms non-deep-learning unsupervised methods and is even comparable with some previous supervised deep methods.

Method

We first use a self-supervised ViT to produce semantic features for each patch of an image, then gradually merge the two patches or regions with the most similar features to form a new region. All regions generated during the merging process constitute semantically meaningful proposals at multiple granularities.
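The merging step can be sketched as a greedy agglomerative procedure over patch features. The snippet below is a simplified illustration rather than the released implementation: it assumes L2-normalized patch features from a self-supervised ViT (e.g., DINO), restricts merges to spatially adjacent regions, and represents a region's feature as the normalized mean of its patch features; all names and details here are illustrative.

```python
import numpy as np

def build_merging_tree(feats, grid_h, grid_w):
    """Greedily merge the most similar pair of adjacent regions until one
    region remains; every merge record is a multi-granularity proposal.

    feats: (grid_h * grid_w, d) array of L2-normalized patch features.
    """
    n = grid_h * grid_w
    members = {i: [i] for i in range(n)}             # region id -> patch indices
    region_feat = {i: feats[i] for i in range(n)}    # region id -> region feature
    adj = {i: set() for i in range(n)}               # 4-connectivity on the patch grid
    for r in range(grid_h):
        for c in range(grid_w):
            i = r * grid_w + c
            if c + 1 < grid_w:
                adj[i].add(i + 1); adj[i + 1].add(i)
            if r + 1 < grid_h:
                adj[i].add(i + grid_w); adj[i + grid_w].add(i)

    tree, next_id = [], n
    while len(members) > 1:
        # find the adjacent pair with the highest cosine similarity
        (a, b), _ = max(
            (((a, b), float(region_feat[a] @ region_feat[b]))
             for a in members for b in adj[a] if a < b),
            key=lambda t: t[1],
        )
        patches = members.pop(a) + members.pop(b)
        members[next_id] = patches
        f = feats[patches].mean(axis=0)
        region_feat[next_id] = f / (np.linalg.norm(f) + 1e-8)
        adj[next_id] = (adj.pop(a) | adj.pop(b)) - {a, b}
        for x in adj[next_id]:                       # rewire neighbors to the new region
            adj[x].discard(a); adj[x].discard(b); adj[x].add(next_id)
        tree.append({"id": next_id, "children": (a, b), "patches": patches})
        next_id += 1
    return tree
```

The naive pair search above is quadratic per merge and kept only for clarity; a priority queue over candidate pairs would make the procedure practical for real patch grids.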

For implementation, MIS consists of two stages. In the pre-processing stage, we parse the semantics of each patch in the image using a pre-trained ViT, apply the gradual merging strategy to produce semantically consistent region proposals at multiple granularities, and store them efficiently in a tree structure. In the training stage, given an image, we randomly select proposals from the corresponding merging tree in a top-down manner and simulate possible clicks based on them, thus providing the model with informative interactions.
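As a rough illustration of the training-stage sampling, the sketch below walks the merging tree from the root, descending into a random child with a fixed probability (`descend_prob` is a hypothetical hyperparameter, not taken from the paper), and then simulates a positive click biased toward the interior of the sampled region via a distance transform, a common click-simulation heuristic in interactive segmentation.

```python
import random
from dataclasses import dataclass
from typing import Tuple

import numpy as np
from scipy.ndimage import distance_transform_edt

@dataclass
class Region:
    patches: list            # patch indices covered by this region
    children: Tuple = ()     # empty tuple for leaf patches

def sample_proposal(root: Region, descend_prob: float = 0.7) -> Region:
    """Top-down sampling: starting at the root, descend into a random child
    with probability `descend_prob`, otherwise stop. Shallow stops yield
    coarse proposals; deep descents yield fine-grained ones."""
    node = root
    while node.children and random.random() < descend_prob:
        node = random.choice(node.children)
    return node

def simulate_click(mask: np.ndarray) -> Tuple[int, int]:
    """Simulate a positive click inside a binary region mask, biased toward
    the region's interior (assumes the mask is non-empty)."""
    dist = distance_transform_edt(mask)
    ys, xs = np.nonzero(dist >= 0.5 * dist.max())   # keep well-interior pixels
    i = random.randrange(len(ys))
    return int(ys[i]), int(xs[i])                   # (row, col) of the click
```

During training, the sampled proposal's mask would serve as the pseudo ground truth for the simulated click, so the model learns to map interactions at any granularity to the corresponding region.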

Demo (Coming Soon)

Qualitative Results

Merging Process


Multi-granularity Proposals


Interactive Segmentation