Open-Vocabulary Panoptic Segmentation with Text-to-Image Diffusion Models
J. Xu, S. Liu, et al.
Humans can recognize a virtually limitless set of categories and understand fine-grained distinctions and relationships within scenes. The goal of open-vocabulary recognition is to approach this ability by enabling models to recognize categories beyond those seen during training. However, few works provide a unified framework that parses all object instances and scene semantics simultaneously (panoptic segmentation).

Existing open-vocabulary approaches primarily rely on text-image discriminative models (e.g., CLIP) trained on Internet-scale data, which are strong at classification but weak at the spatial and relational understanding needed for scene-level parsing. Meanwhile, text-to-image diffusion models trained on large-scale image-text pairs have revolutionized image synthesis; they use cross-attention between text embeddings and internal visual features, suggesting that those internal features are semantically rich and spatially differentiated. Preliminary clustering of diffusion features indeed reveals semantically distinct and localized groups, motivating the use of Internet-scale text-to-image diffusion models as universal open-vocabulary panoptic segmentation learners.

ODISE (Open-vocabulary DIffusion-based panoptic SEgmentation) integrates frozen large-scale text-to-image diffusion and text-image discriminative models. The diffusion model provides dense, semantically rich visual features; a mask generator predicts class-agnostic masks; and a mask classification module uses text embeddings to assign open-vocabulary labels, trained with either category labels or image captions. At inference, ODISE fuses predictions from the diffusion and discriminative models.
Contributions: (1) First to explore large-scale text-to-image diffusion models for open-vocabulary segmentation; (2) A novel pipeline leveraging diffusion and discriminative models for open-vocabulary panoptic segmentation; (3) State-of-the-art results across multiple open-vocabulary recognition benchmarks.
Related Work: Panoptic segmentation combines instance and semantic segmentation but typically assumes a closed vocabulary limited to training categories. Prior open-vocabulary segmentation works mainly target either instance or semantic segmentation and rely on large-scale discriminative pretraining (e.g., classification or contrastive image-text learning); the concurrent MaskCLIP uses CLIP features. However, discriminative internal representations are sub-optimal for segmentation compared to those derived from diffusion models. Generative models (GANs or diffusion) have been used for semantic segmentation in small, closed vocabularies, often requiring image synthesis or few-shot annotations. DDPM-Seg demonstrated strong label-efficient segmentation using internal generative features, but prior efforts mainly target small/closed vocabularies and semantic (not panoptic) segmentation. ODISE differs by targeting open-vocabulary panoptic segmentation, leveraging the internal features of Internet-scale text-to-image diffusion models and unifying them with discriminative models for open-set label assignment.
Problem Definition: Train on base categories C_train (possibly different from test categories C_test; C_test may include unseen categories). Training data provides binary panoptic mask annotations for each category instance in an image, and either mask-level category labels or an image-level caption. At test time, neither labels nor captions are available; only the names of test categories C_test are provided.
Method Overview: ODISE consists of (1) a frozen text-to-image diffusion model feature extractor augmented with an implicit captioner; (2) a mask generator predicting class-agnostic masks and mask embeddings; (3) a mask classification module that assigns open-vocabulary labels by comparing mask embeddings with text embeddings of category names. Training uses mask annotations (class-agnostic) plus either label supervision or image-caption grounding. Inference uses both diffusion features and a text-image discriminative model to classify masks into C_test, with scores fused by geometric mean.
Text-to-Image Diffusion Model and Feature Extraction: Use a UNet-based text-to-image diffusion model (e.g., Stable Diffusion) trained on large-scale image-text pairs. The UNet contains convolutional and attention blocks with cross-attention between text embeddings and visual features, yielding semantically rich feature maps suitable for segmentation. For an input image x and caption s, a noisy image x_t is formed using a predefined noise schedule; features are extracted as f = UNet(x_t, T(s)), where T is a frozen text encoder. Relying on explicit captions, however, is problematic because most images lack them.
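The forward-noising step that produces x_t can be sketched as below; this is a minimal numpy illustration with a toy cosine-style schedule (the actual schedule and UNet are the diffusion model's own, and the `add_noise` helper and sizes here are assumptions for illustration).

```python
import numpy as np

def add_noise(x, t, alpha_bars, rng):
    """Forward-diffuse a clean image x to timestep t:
    x_t = sqrt(alpha_bar_t) * x + sqrt(1 - alpha_bar_t) * eps."""
    eps = rng.standard_normal(x.shape)
    a = alpha_bars[t]
    return np.sqrt(a) * x + np.sqrt(1.0 - a) * eps

# Toy schedule: alpha_bar decreases from 1 (t=0, no noise) toward 0.
T = 1000
alpha_bars = np.cos(0.5 * np.pi * np.arange(T) / T) ** 2

rng = np.random.default_rng(0)
x = rng.standard_normal((3, 8, 8))                       # stand-in for an RGB image
x0 = add_noise(x, t=0, alpha_bars=alpha_bars, rng=rng)   # t=0: x_0 == x
x500 = add_noise(x, t=500, alpha_bars=alpha_bars, rng=rng)

# Features would then be f = UNet(x_t, text_embedding); since ODISE
# defaults to t=0, the frozen UNet effectively sees the clean image.
```

Because alpha_bar(0) = 1, the t=0 default means feature extraction adds no noise at all, which is consistent with the timestep ablation later in the summary.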
Implicit Captioner: To avoid explicit captions, an implicit captioner is trained to produce a text embedding directly from the image. A frozen image encoder V (from a text-image discriminative model like CLIP) encodes x; a learned MLP projects V(x) into an implicit text embedding fed into the diffusion UNet. During ODISE training, only the MLP is updated; UNet and V are frozen. The final diffusion features are f = UNet(x_t, MLP(V(x))).
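A minimal sketch of the implicit captioner, using numpy and a two-layer MLP; all sizes (and the use of a ReLU MLP with random weights) are illustrative assumptions, not the paper's exact architecture.

```python
import numpy as np

rng = np.random.default_rng(0)

def mlp(v, W1, b1, W2, b2):
    """Project a frozen image embedding V(x) into the diffusion model's
    text-embedding space; the output acts as the 'implicit caption'."""
    h = np.maximum(0.0, v @ W1 + b1)   # ReLU hidden layer
    return h @ W2 + b2

d_img, d_hid, n_tok, d_txt = 512, 256, 4, 768   # toy sizes (assumed)
W1 = 0.02 * rng.standard_normal((d_img, d_hid))
b1 = np.zeros(d_hid)
W2 = 0.02 * rng.standard_normal((d_hid, n_tok * d_txt))
b2 = np.zeros(n_tok * d_txt)

v = rng.standard_normal(d_img)                  # stand-in for CLIP's V(x)
implicit_caption = mlp(v, W1, b1, W2, b2).reshape(n_tok, d_txt)

# This (n_tok, d_txt) tensor replaces T(s) as UNet conditioning:
# f = UNet(x_t, implicit_caption). Only the MLP weights are trained;
# the image encoder V and the UNet stay frozen.
```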
Mask Generator: Given diffusion features f, a mask generator outputs N class-agnostic binary masks {m_i} and corresponding mask embeddings {z_i}. The architecture is flexible (box-based or direct mask-based). ODISE uses a direct segmentation model (e.g., Mask2Former). Predicted masks are supervised as class-agnostic via pixel-wise binary cross-entropy against ground-truth masks. Mask embeddings are obtained via masked pooling on feature maps (or ROI pooling for box-based methods).
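The masked pooling that turns a predicted mask into its embedding z_i can be sketched as follows (a generic weighted average over the feature map; shapes are toy values):

```python
import numpy as np

def mask_pooling(features, mask, eps=1e-6):
    """Average feature vectors over a (soft) binary mask.
    features: (C, H, W); mask: (H, W) with values in [0, 1].
    Returns a (C,) mask embedding."""
    w = mask[None, :, :]
    return (features * w).sum(axis=(1, 2)) / (w.sum() + eps)

rng = np.random.default_rng(0)
f = rng.standard_normal((16, 4, 4))      # toy diffusion feature map
m = np.zeros((4, 4))
m[:2, :2] = 1.0                          # mask covering one quadrant
z = mask_pooling(f, m)                   # mask embedding z_i
```

For a hard binary mask this reduces to the mean feature inside the mask, which is why the resulting embedding lives in the same space as the feature map and can be compared against text embeddings.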
Mask Classification: Assign open-vocabulary labels using a text-image discriminative model with image encoder V and text encoder T. Two supervision options during training:
- Category Label Supervision: With known mask labels y_i ∈ C_train, compute text embeddings T(C_train) for category names. For each mask embedding z_i, compute class probabilities via softmax over dot products z_i·T(c_k)/τ, and optimize cross-entropy.
- Image Caption Supervision (Grounding): Without mask labels, extract nouns from the image caption s^(m) to form C_word. Define a grounding similarity g(x^(m), s^(n)) that encourages each noun to be grounded by some predicted mask via the soft assignment p(z_i, C_word), and train with a bidirectional contrastive grounding loss over the batch using a learnable temperature τ. Training minimizes either L_C or L_G plus the mask loss, with Hungarian matching for mask-to-ground-truth assignment.
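The category-label branch above can be sketched in numpy: softmax over z_i · T(c_k)/τ followed by cross-entropy. The text embeddings here are random stand-ins for T(C_train), and the sizes and τ value are illustrative.

```python
import numpy as np

def classify(z, text_embs, tau=0.07):
    """Class probabilities for one mask embedding z against category
    text embeddings: softmax over z . T(c_k) / tau."""
    logits = text_embs @ z / tau
    logits -= logits.max()               # numerical stability
    p = np.exp(logits)
    return p / p.sum()

def cross_entropy(p, y):
    """Cross-entropy loss for ground-truth class index y."""
    return -np.log(p[y] + 1e-12)

rng = np.random.default_rng(0)
K, d = 5, 32                             # toy: 5 training classes
text_embs = rng.standard_normal((K, d))
text_embs /= np.linalg.norm(text_embs, axis=1, keepdims=True)

# A mask embedding lying near the class-2 text embedding.
z = text_embs[2] + 0.1 * rng.standard_normal(d)
z /= np.linalg.norm(z)

p = classify(z, text_embs)               # p(z_i, C_train)
loss = cross_entropy(p, y=2)
```

The same p(z_i, ·) soft assignment is what the caption-supervised grounding loss reuses, just with noun embeddings C_word in place of C_train.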
Open-Vocabulary Inference: For a test image, the implicit captioner produces an implicit text embedding; the diffusion UNet extracts features; the mask generator predicts masks {m_i}. For classification, compute p(z_i, C_test) via similarity to T(C_test) and select the top category per mask. To strengthen classification, pool features from the discriminative image encoder V over each predicted mask (ẑ_i = MaskPooling(V(x), m_i)) and compute p(ẑ_i, C_test). Fuse predictions by geometric mean: p_final ∝ p(z_i, C_test)^λ · p(ẑ_i, C_test)^(1−λ), with fixed λ.
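The geometric-mean fusion is a one-liner; this sketch uses toy three-class distributions, and the λ value here is an illustrative assumption (the summary only says λ is fixed).

```python
import numpy as np

def fuse(p_diff, p_disc, lam=0.65):
    """Geometric-mean fusion of the diffusion-branch and
    discriminative-branch class distributions:
    p_final ∝ p_diff^lam * p_disc^(1 - lam)."""
    p = (p_diff ** lam) * (p_disc ** (1.0 - lam))
    return p / p.sum()

p_diff = np.array([0.70, 0.20, 0.10])    # toy p(z_i, C_test)
p_disc = np.array([0.30, 0.60, 0.10])    # toy p(ẑ_i, C_test)
p_final = fuse(p_diff, p_disc)
label = int(p_final.argmax())            # predicted category per mask
```

The geometric mean rewards classes on which both branches agree, so a class scored highly by only one branch is down-weighted relative to a simple average.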
Implementation: Use Stable Diffusion (trained on LAION subset) for diffusion features, forming a feature pyramid from selected UNet blocks; default diffusion step t=0. Use CLIP for V and T. Use Mask2Former as mask generator, producing N=100 masks. Train 90k iterations on COCO panoptic masks with images at 1024², large-scale jittering, batch size 64, AdamW (lr=1e−4, wd=0.05). For caption supervision, use K_word=8 nouns per image; captions sampled from COCO Captions.
Evaluation: Single checkpoint evaluated for open-vocabulary panoptic, instance, and semantic segmentation on ADE20K; semantic segmentation on Pascal (PC-59/459, PAS-21), and additional analyses on Cityscapes and Mapillary (supplement). Metrics: PQ, mAP (things), mIoU (and SQ, RQ in supplement). Masks for semantic segmentation are merged per category. Speed: 1.26 FPS on V100 (1024²), 11.9 GB; 28.1M trainable parameters, 1,493.8M frozen.
- ODISE achieves state-of-the-art open-vocabulary panoptic and semantic segmentation. With COCO-only training, it reaches 23.4 PQ and approximately 30.0 mIoU on ADE20K, improving over the prior SOTA (MaskCLIP) by +8.3 PQ and +7.9 mIoU.
- Table 1 (Open-vocabulary panoptic segmentation): On ADE20K, ODISE obtains PQ up to 23.4 (mAP 13.9, mIoU 28.7) or PQ 22.6 (mAP 14.4, mIoU 29.9), surpassing MaskCLIP’s PQ 15.1. On COCO, ODISE reports PQ 55.4 (mAP 46.0, mIoU 65.2) or PQ 45.6 (mAP 38.4, mIoU 52.4), using a single checkpoint.
- Table 2 (Open-vocabulary semantic segmentation): ODISE outperforms state-of-the-art methods across ADE20K (A-150, A-847) and Pascal Context (PC-59, PC-459) and Pascal VOC (PAS-21). Reported gains include: +7.6 mIoU on A-150, +4.7 on A-847, +4.8 on PC-459 with caption supervision; and +6.2 on A-150, +4.5 on PC-459 with label supervision compared to next best.
- Ablations show diffusion features outperform discriminative features for open-vocabulary segmentation even when trained on similarly scaled data (e.g., ODISE vs. CLIP(H) trained on LAION). For example (Table 3): On ADE20K, ODISE achieves PQ 23.3, mAP 13.0, mIoU 29.2 vs CLIP(H) PQ 21.2, mAP 10.8, mIoU 28.1.
- Implicit captioner improves feature quality: implicit or explicit captions outperform empty text; the implicit captioner generalizes better across datasets than off-the-shelf captioners.
- Diffusion timestep analysis: best results at t=0; performance degrades as t increases. Concatenating multiple timesteps offers no practical gain relative to cost; learning t converges near zero.
- Fusion of diffusion- and discriminative-based classification improves accuracy over either alone; diffusion-only already surpasses prior work, and fusion yields further gains (Table 6).
- Additional tasks in supplement: ODISE improves open-world instance segmentation (e.g., AR@100: 57.7 on UVO, 30.3 on ADE20K) and open-vocabulary object detection on LVIS (e.g., mAP and mAP_r improvements over MaskCLIP), and generalizes to Cityscapes and Mapillary Vistas with notable margins over CLIP(H)-based variants.
- Efficiency: 28.1M trainable parameters (1.8% of full model), inference at 1.26 FPS on V100 with mask pooling providing 3× speedup vs. bbox cropping while maintaining PQ.
The findings support the hypothesis that internal representations learned by large-scale text-to-image diffusion models are semantically rich, spatially well-differentiated, and better aligned with open-world visual concepts than those from discriminative pretraining alone. By leveraging these diffusion features for dense mask generation and combining them with text-image discriminative models for open-vocabulary classification, ODISE achieves a unified, scalable approach to panoptic segmentation across unseen categories. The implicit captioner addresses the practical challenge of missing captions by producing effective text embeddings directly from images, enabling consistent feature extraction and strong generalization across datasets. Ablations validate design choices: using diffusion features (t=0), implicit captioning, and fusing discriminative classifications all contribute to performance. The improvements across benchmarks and tasks indicate that diffusion-based representations can enhance scene-level understanding and instance discovery in open-world settings.
ODISE demonstrates that frozen internal representations from large-scale text-to-image diffusion models can be effectively harnessed for open-vocabulary panoptic segmentation. By unifying diffusion-derived dense features with discriminative text-image models for open-set classification—and introducing an implicit captioner to avoid reliance on explicit captions—ODISE establishes new state-of-the-art results on multiple open-vocabulary panoptic and semantic segmentation benchmarks. This work highlights diffusion models’ potential beyond image synthesis, suggesting further exploration of their internal representations for other recognition tasks. Future work includes refining category definitions and prompts, extending to additional modalities and tasks, and improving robustness and fairness in the presence of dataset biases.
Ambiguities and non-exclusive category definitions in current datasets (e.g., ADE20K: “tower” vs. “building”) can affect evaluation and lead to misclassifications; better prompt engineering can mitigate but not fully resolve this. The diffusion model used is pre-trained on web-crawled image-text pairs and may inherit biases from the data despite filtering. ODISE’s computational demands (large frozen backbones, training time, and memory) may limit accessibility; although only a small fraction of parameters are trainable, inference still requires substantial compute. The approach depends on the quality of text prompts and noun extraction for caption-supervised training, which may be dataset-dependent.