IEEE IGARSS 2026

Mitigating Long-Tail Bias via Prompt-Controlled Diffusion Augmentation

SyntheticGen turns augmentation into controllable data design. Instead of adding more samples at random, it generates the missing semantic compositions in the right domain: a ratio-conditioned D3PM first synthesizes semantic layouts, then a ControlNet-guided latent diffusion model renders domain-consistent satellite images. We show that better segmentation does not come only from better architectures; it can also come from deliberately generating the training distribution that real data is missing.

1 University of Peradeniya, Sri Lanka  |  2 Johns Hopkins University, USA

Abstract

Long-tailed class imbalance remains a major obstacle in semantic segmentation of high-resolution remote-sensing imagery, where frequent classes dominate optimization and rare classes are systematically under-segmented. The problem becomes harder under domain shift: LoveDA explicitly separates Urban and Rural scenes, whose appearance statistics and class frequencies differ substantially. SyntheticGen addresses both challenges with a prompt-controlled diffusion augmentation framework that generates paired label-image samples with explicit control over semantic composition and domain. A domain-aware, masked, ratio-conditioned discrete diffusion model first synthesizes semantic layouts that satisfy class-ratio targets while preserving realistic co-occurrence structure, and a ControlNet-guided latent diffusion model then renders photorealistic, domain-consistent images from those layouts. When mixed with real data, the resulting synthetic pairs improve multiple segmentation backbones, especially on minority and mid-tail classes and in cross-domain evaluation, showing that better downstream segmentation can come from adding the right samples in the right proportions.

Core Contributions

When semantic labels, class imbalance, and appearance shift coexist, the training distribution can be deliberately reshaped instead of passively inherited.

01

Controllable augmentation

Stage A operates directly on semantic maps and can target selected class ratios, turning augmentation from random synthesis into controlled distribution shaping.

02

Domain-aware generation

Layouts are generated and rendered with Urban/Rural awareness, so the synthetic samples match the appearance statistics of the domain where data is lacking.

03

Backbone-agnostic gains

The synthetic set adds 894 Rural and 1,106 Urban image-label pairs and improves all reported in-domain mIoU scores as well as both domain-transfer directions.

Architecture

Two-stage prompt-controlled generation

SyntheticGen prompt-controlled inference pipeline.
A user prompt is parsed into a domain and class-ratio targets. Stage A samples a semantic layout, then Stage B renders a photorealistic satellite image using the sampled layout as spatial guidance.
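The prompt-to-condition step described above can be sketched as a small parser. This is an illustrative reconstruction, not the paper's implementation: the `parse_prompt` helper, the regex, and the class-name set are our assumptions, following the LoveDA class vocabulary.

```python
import re

# LoveDA land-cover classes that may appear as ratio constraints
# (assumed vocabulary; background is never constrained directly)
CLASSES = {"building", "road", "water", "barren", "forest", "agriculture"}

def parse_prompt(prompt: str):
    """Extract a domain and sparse class-ratio targets from a prompt.

    Returns (domain, ratios), where ratios maps class name -> fraction.
    Classes not mentioned in the prompt stay unconstrained, matching
    the sparse ratio-control mode described in the method.
    """
    text = prompt.lower()
    domain = "urban" if "urban" in text else "rural"
    ratios = {}
    # Match patterns like "30% road" or "20% water"
    for pct, name in re.findall(r"(\d+)%\s+([a-z]+)", text):
        if name in CLASSES:
            ratios[name] = int(pct) / 100.0
    return domain, ratios
```

For example, the first gallery prompt below ("... a rural area with 30% road, 20% water, and 10% forest") would parse to the domain `rural` with targets `{"road": 0.3, "water": 0.2, "forest": 0.1}`, leaving the remaining classes free.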

Method in 30 seconds

  1. Represent each label map as a class-ratio vector over valid pixels, excluding ignored pixels.
  2. Downsample the semantic map to a 256 by 256 one-hot layout and train a D3PM denoiser over categorical labels.
  3. Condition the denoiser on a masked ratio target and a learnable Urban/Rural domain embedding so the model supports full or sparse ratio control.
  4. Train Stage A with a variational diffusion term, a 0.5-weighted denoising cross-entropy term, and a ratio-matching loss that emphasizes constrained classes.
  5. Train Stage B with a latent diffusion noise-prediction objective, using one-hot layouts through ControlNet and domain and ratio prompts through CLIP text embeddings.
  6. Build the final synthetic corpus with a greedy enrichment loop that repeatedly targets the most underrepresented non-background classes in each domain.
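Steps 1 and 2 above can be sketched in NumPy. This is a minimal illustration under stated assumptions: the ignore index (255) and the seven-class count follow common LoveDA conventions, but the helper names and the nearest-neighbour downsampling choice are ours.

```python
import numpy as np

NUM_CLASSES = 7     # LoveDA: background + 6 land-cover classes (assumed)
IGNORE_INDEX = 255  # pixels excluded from the ratio computation (assumed)

def class_ratio_vector(label_map: np.ndarray) -> np.ndarray:
    """Per-class pixel fractions computed over valid (non-ignored) pixels."""
    valid = label_map[label_map != IGNORE_INDEX]
    counts = np.bincount(valid.ravel(), minlength=NUM_CLASSES)
    return counts / max(valid.size, 1)

def to_onehot_layout(label_map: np.ndarray, size: int = 256) -> np.ndarray:
    """Nearest-neighbour downsample to size x size, then one-hot encode."""
    h, w = label_map.shape
    rows = (np.arange(size) * h) // size
    cols = (np.arange(size) * w) // size
    small = label_map[np.ix_(rows, cols)]
    small = np.where(small == IGNORE_INDEX, 0, small)  # fold ignore into background
    return np.eye(NUM_CLASSES, dtype=np.float32)[small]  # shape (size, size, C)
```

The ratio vector sums to one over valid pixels, which is what lets Stage A treat a prompt's percentages as direct targets on this vector.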
Stage A ratio- and domain-conditioned D3PM layout generator.
Stage A: a U-Net denoiser predicts categorical logits from a noisy layout under masked ratio and domain conditioning.
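The ratio-matching term in step 4 can be illustrated as a masked penalty between achieved and target class ratios. This is our own simplification: the paper states only that constrained classes are emphasized, so the L1 form, the mask semantics, and the `constrained_weight` value here are assumptions.

```python
import numpy as np

def ratio_matching_loss(achieved, target, mask, constrained_weight=2.0):
    """Masked ratio loss: only classes flagged in `mask` contribute,
    and those constrained classes are weighted by `constrained_weight`.
    Unconstrained classes (mask == 0) are ignored entirely, matching
    the sparse ratio-control setting."""
    achieved, target, mask = map(np.asarray, (achieved, target, mask))
    err = np.abs(achieved - target)
    weights = np.where(mask.astype(bool), constrained_weight, 0.0)
    return float((weights * err).sum())
```

In training, this term would be added to the variational diffusion term and the 0.5-weighted denoising cross-entropy mentioned in step 4, with `achieved` computed from the denoiser's predicted layout.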
Stage B layout-guided latent diffusion image generator.
Stage B: ControlNet injects layout-conditioned features into Stable Diffusion, while FiLM gates regulate residual strength.
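The greedy enrichment loop in step 6 can be sketched with toy accounting. Everything here is hypothetical scaffolding: the `boost` bookkeeping stands in for actually generating and accepting a synthetic pair, and in the real pipeline the chosen class would become a ratio constraint fed to Stage A.

```python
def greedy_enrichment(class_counts, n_samples, background=0, boost=1000):
    """Repeatedly pick the most underrepresented non-background class
    in a domain and record it as the next generation target. Each
    accepted synthetic sample is assumed to add `boost` pixels of its
    target class, so the rarest class changes as the loop progresses."""
    counts = dict(class_counts)
    targets = []
    for _ in range(n_samples):
        candidates = {c: n for c, n in counts.items() if c != background}
        rarest = min(candidates, key=candidates.get)
        targets.append(rarest)
        counts[rarest] += boost  # assumed effect of one synthetic sample
    return targets
```

Run once per domain, this loop yields a per-domain generation schedule, which is consistent with the paper's separate Rural (894) and Urban (1,106) synthetic pair counts.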

Benchmarks

Original vs. Original+Synthetic training

In-domain evaluation

Domain generalization

Each pair compares the same segmentation backbone trained on LoveDA Original data and LoveDA Original+Synthetic data.

In-domain gains

All five backbones improve in mIoU. U-Net shows the largest jump, from 39.77 to 51.36 mIoU, while AerialFormer reaches the highest reported in-domain mIoU among the tested models at 54.26.

Cross-domain gains

For Rural-to-Urban transfer, FactSeg improves from 39.98 to 50.45 mIoU and HRNet improves from 43.95 to 53.79, showing that targeted synthesis reduces domain-specific shortcuts.

Generated examples

Prompt-controlled synthetic results

Generated synthetic result for a rural prompt with road, water, and forest ratio constraints.
A high-resolution satellite image of a rural area with 30% road, 20% water, and 10% forest.
Generated synthetic result for a rural prompt with a road ratio constraint.
A high-resolution remote-sensing image of a rural area with 5% road.
Generated synthetic result for a rural prompt with forest and barren ratio constraints.
A satellite image of a rural area with 30% forest and 30% barren.
Generated synthetic result for a rural prompt with water and forest ratio constraints.
A satellite image of a rural area with 20% water and 10% forest.
Generated synthetic result for a rural prompt with a building ratio constraint.
A high-resolution remote-sensing image of a rural area with 30% building.
Generated synthetic result for an urban prompt with a building ratio constraint.
A high-resolution satellite image of an urban area with 30% building.

Core takeaway

Better data can be deliberately generated

SyntheticGen shows that diffusion augmentation is most useful for long-tailed segmentation when it is controllable. The framework does not simply add more images; it asks for domain-specific semantic compositions that the real training set lacks, checks whether generated candidates satisfy those constraints, and uses the accepted pairs to improve segmentation models under imbalance and domain shift.

The current evidence is on LoveDA, but the broader research message is more general: when the data distribution is the bottleneck, controllable generation can be used to reshape that distribution rather than only reweighting the loss or changing the backbone.

Citation

BibTeX

@article{wijenayake2026mitigatinglongtailbiaspromptcontrolled,
      title={Mitigating Long-Tail Bias via Prompt-Controlled Diffusion Augmentation},
      author={Buddhi Wijenayake and Nichula Wasalathilake and Roshan Godaliyadda and Vijitha Herath and Parakrama Ekanayake and Vishal M. Patel},
      year={2026},
      eprint={2602.04749},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2602.04749},
}