Open-Vocabulary Panoptic Segmentation with Text-to-Image Diffusion Models

CVPR 2023 Highlight

Jiarui Xu¹*, Sifei Liu²†, Arash Vahdat²†, Wonmin Byeon², Xiaolong Wang¹, Shalini De Mello²

¹University of California San Diego, ²NVIDIA

(* work done during an internship at NVIDIA, † equal contribution)

Segment and categorize any object, even ones not seen during training

Abstract

We present ODISE: Open-vocabulary DIffusion-based panoptic SEgmentation, which unifies pre-trained text-image diffusion and discriminative models to perform open-vocabulary panoptic segmentation. Text-to-image diffusion models have shown the remarkable capability of generating high-quality images with diverse open-vocabulary language descriptions. This demonstrates that their internal representation space is highly correlated with open concepts in the real world. Text-image discriminative models like CLIP, on the other hand, are good at classifying images into open-vocabulary labels. We propose to leverage the frozen representations of both these models to perform panoptic segmentation of any category in the wild. Our approach outperforms the previous state of the art by significant margins on both open-vocabulary panoptic and semantic segmentation tasks. In particular, with COCO training only, our method achieves 23.4 PQ and 30.0 mIoU on the ADE20K dataset, an absolute improvement of 8.3 PQ and 7.9 mIoU over the previous state of the art.
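
As a rough illustration of what "classifying into open-vocabulary labels" means for a CLIP-style discriminative model, the sketch below scores one region embedding against text embeddings of arbitrary label names using cosine similarity followed by a softmax. It is a generic toy, not the ODISE implementation: the 3-dimensional vectors stand in for real encoder outputs, and the function names and the `temperature` value are assumptions made for this example.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def classify_open_vocab(region_embedding, text_embeddings, temperature=0.07):
    """Return (best_label, probabilities) over an arbitrary label set.

    text_embeddings maps label name -> vector. The temperature is a
    hypothetical value chosen for illustration only.
    """
    labels = list(text_embeddings)
    logits = [cosine(region_embedding, text_embeddings[name]) / temperature
              for name in labels]
    # Numerically stable softmax over the similarity logits.
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    probs = {name: e / total for name, e in zip(labels, exps)}
    best = max(probs, key=probs.get)
    return best, probs

# Toy vectors standing in for text-encoder outputs (assumed, not real data).
text_embeddings = {
    "cat": [1.0, 0.0, 0.0],
    "dog": [0.0, 1.0, 0.0],
    "lamp": [0.0, 0.0, 1.0],
}
region_embedding = [0.9, 0.1, 0.0]

label, probs = classify_open_vocab(region_embedding, text_embeddings)
print(label, round(probs[label], 3))
```

Because the label set is just a dictionary of text embeddings, new categories can be added at inference time without retraining, which is the property the abstract relies on.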

Qualitative Results

To demonstrate open-vocabulary recognition capabilities, we merge the category names of LVIS, COCO, and ADE20K and perform open-vocabulary inference directly over the resulting \({\sim}1.5k\) classes.
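
One plausible way to build such a merged vocabulary is sketched below. The page does not say how overlapping names are handled, so the case-insensitive de-duplication here is an assumption, as are the function name `merge_vocabularies` and the tiny placeholder lists (the real datasets contribute far more names).

```python
def merge_vocabularies(*category_lists):
    """Merge category-name lists into one vocabulary.

    Duplicates are dropped case-insensitively and first-seen order is kept.
    This de-duplication rule is an assumption made for the sketch.
    """
    seen = set()
    merged = []
    for names in category_lists:
        for name in names:
            cleaned = name.strip()
            key = cleaned.lower()
            if key and key not in seen:
                seen.add(key)
                merged.append(cleaned)
    return merged

# Placeholder excerpts, not the actual dataset category lists.
lvis = ["person", "aerosol can", "car"]
coco = ["person", "bicycle", "car"]
ade20k = ["wall", "person", "bicycle"]

vocab = merge_vocabularies(lvis, coco, ade20k)
print(len(vocab), vocab)
```

The merged list can then serve as the label set for inference, with each name encoded by the text encoder as a candidate class.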