Mask Grounding for Referring Image Segmentation

Abstract

Referring Image Segmentation (RIS) is a challenging task that requires an algorithm to segment objects referred to by free-form language expressions. Despite significant progress in recent years, most state-of-the-art (SOTA) methods still suffer from a considerable language-image modality gap at the pixel and word level.

These methods generally 1) rely on sentence-level language features for language-image alignment and 2) lack explicit training supervision for fine-grained visual grounding. Consequently, they exhibit weak object-level correspondence between visual and language features. Without well-grounded features, prior methods struggle to understand complex expressions that require strong reasoning over relationships among multiple objects, especially when dealing with rarely used or ambiguous clauses.

To tackle this challenge, we introduce a novel Mask Grounding auxiliary task that significantly improves visual grounding within language features by explicitly teaching the model to learn fine-grained correspondence between masked textual tokens and their matching visual objects. Mask Grounding can be applied directly to prior RIS methods and consistently brings improvements. Furthermore, to holistically address the modality gap, we also design a cross-modal alignment loss and an accompanying alignment module. These additions work synergistically with Mask Grounding.

With all these techniques, our comprehensive approach culminates in MagNet (Mask-grounded Network), an architecture that significantly outperforms prior art on three key benchmarks (RefCOCO, RefCOCO+ and G-Ref), demonstrating our method’s effectiveness in addressing current limitations of RIS algorithms. Our code and pre-trained weights will be released.

Method

In this figure, we show an overview of Mask Grounding. To perform this task, we first use an MLP-based Mask Encoder to encode the center coordinates of segmentation masks. Then, we randomly mask textual tokens in the language inputs before extracting their features. Finally, we pass the encoded language, image and mask features to a Transformer-based Masked Token Predictor, which predicts the masked tokens and is trained with a cross-entropy loss. As our word class list, we use the large-scale BERT vocabulary, which is generally accepted to have open-vocabulary capability.
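The input-preparation steps described above can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the helper names `mask_center` and `mask_tokens`, the normalization of coordinates by image size, the fallback for empty masks, and the default masking probability of 0.15 are all hypothetical choices that the text does not specify. The Mask Encoder, feature extractors and Masked Token Predictor are omitted.

```python
import random

MASK_TOKEN = "[MASK]"  # mask token used by the BERT vocabulary


def mask_center(mask):
    """Return the (x, y) centroid of a binary mask given as rows of 0/1.

    Assumption: coordinates are divided by image width/height so they
    fall in [0, 1); the source does not say how they are scaled.
    """
    height, width = len(mask), len(mask[0])
    xs = [x for row in mask for x, v in enumerate(row) if v]
    ys = [y for y, row in enumerate(mask) for v in row if v]
    if not xs:
        # Assumption: fall back to the image center for an empty mask.
        return (0.5, 0.5)
    return (sum(xs) / len(xs) / width, sum(ys) / len(ys) / height)


def mask_tokens(tokens, mask_prob=0.15, rng=None):
    """Randomly replace tokens with [MASK].

    Returns the masked sequence and a dict mapping each masked position
    to its original token (the cross-entropy prediction target).
    mask_prob=0.15 is an assumed default, not a value from the source.
    """
    rng = rng or random.Random()
    masked, targets = [], {}
    for i, tok in enumerate(tokens):
        if rng.random() < mask_prob:
            masked.append(MASK_TOKEN)
            targets[i] = tok
        else:
            masked.append(tok)
    return masked, targets


# Usage: a 4x4 mask with a 2x2 object, and a short referring expression.
example_mask = [
    [0, 0, 0, 0],
    [0, 1, 1, 0],
    [0, 1, 1, 0],
    [0, 0, 0, 0],
]
center = mask_center(example_mask)
masked, targets = mask_tokens(
    ["the", "man", "on", "the", "left"], rng=random.Random(0)
)
```

In a full pipeline, `center` would be fed to the MLP-based Mask Encoder and `masked` to the language encoder, with `targets` supplying the labels for the masked token prediction loss.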

Main Results

In this table, we compare our method with other leading RIS methods using the oIoU metric and show that it achieves multiple new state-of-the-art results. Single dataset refers to strictly following the predefined train/test splits of the original RefCOCO, RefCOCO+ and G-Ref datasets. Multiple datasets refers to combining the train splits from these 3 datasets with test images removed to prevent data leakage. Extra datasets refers to using additional data beyond RefCOCO, RefCOCO+ and G-Ref. † indicates models that use extra datasets. ‡ indicates that our model only uses multiple datasets. Bold indicates best.
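For readers unfamiliar with the metric, oIoU (overall IoU) is commonly computed as the total intersection divided by the total union, both accumulated over all test samples, rather than as an average of per-sample IoU scores. The sketch below illustrates that definition in plain Python; the function name `overall_iou` and the representation of masks as nested lists of 0/1 are assumptions made for illustration and are not taken from the authors' evaluation code.

```python
def overall_iou(predictions, ground_truths):
    """Compute oIoU: cumulative intersection over cumulative union.

    Each prediction and ground truth is a binary mask given as rows of
    0/1 with matching shapes. Larger objects contribute more pixels and
    therefore weigh more heavily than under a per-sample mean IoU.
    """
    total_intersection = 0
    total_union = 0
    for pred, gt in zip(predictions, ground_truths):
        for pred_row, gt_row in zip(pred, gt):
            for p, g in zip(pred_row, gt_row):
                if p and g:
                    total_intersection += 1
                if p or g:
                    total_union += 1
    # Avoid division by zero when every mask is empty.
    return total_intersection / total_union if total_union else 0.0


# Usage: two 2x2 samples. Per-sample IoUs are 1/2 and 1/1, but oIoU pools
# the pixel counts: (1 + 1) intersection pixels over (2 + 1) union pixels.
preds = [[[1, 1], [0, 0]], [[1, 0], [0, 0]]]
gts = [[[1, 0], [0, 0]], [[1, 0], [0, 0]]]
score = overall_iou(preds, gts)
```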

Visualizations

In this figure, we show some visualizations of our network's predictions. Compared to LAVT, one of the state-of-the-art methods, our method performs much better in various complex scenarios, suggesting an impressive capability to reason about complex visual-object relationships.

BibTeX

@inproceedings{chng2023mask,
  author    = {Chng, Yong Xien and Zheng, Henry and Han, Yizeng and Qiu, Xuchong and Huang, Gao},
  title     = {Mask Grounding for Referring Image Segmentation},
  booktitle = {CVPR},
  year      = {2024},
}