VietFashion: Benchmarking Sketch–Text Composed Image Retrieval for Cultural Outfits

Published in the ACM International Conference on Multimedia Retrieval (ICMR), 2026

Overview

This project presents a pipeline that combines SANA-ControlNet for sketch-conditioned image synthesis with Qwen2.5 3B Instruct for caption generation. It targets a persistent bottleneck: diverse, high-quality training data is hard to obtain for specialized domains such as cultural garments. Starting from human-drawn sketches, the pipeline automates triplet synthesis without manual annotation and produces a large-scale, multi-modal dataset for training and evaluating composed image retrieval systems.

Overall Pipeline

🎯 Motivation

Composed image retrieval (CIR) is the task of retrieving images from a database using a query that combines a reference image (here, a sketch) with a text modification. Obtaining diverse, high-quality training data remains a significant bottleneck, particularly for specialized domains such as cultural garments, where data is scarce.

This project addresses this challenge by:

  • Automating triplet generation from human-drawn sketch inputs, eliminating manual annotation requirements
  • Leveraging SANA-ControlNet to synthesize photorealistic, culturally accurate garment images conditioned on spatial sketches
  • Utilizing Qwen2.5 3B Instruct to generate consistent, attribute-rich captions starting with “A photo of…”
  • Modeling real-world uncertainty via a Multi-Target Query Design ($1 \rightarrow 3$ mapping) that addresses false-negative supervision in composed retrieval
  • Producing large-scale datasets (20,000 triplets) that enable training of robust CIR models for cultural garment applications

🔧 Technical Approach

The Synthesis Pipeline

Our pipeline follows a structured flow from sketch input to aligned multi-modal triplets, as shown in the Overall Pipeline figure above.

Key Components

  1. Attribute Sampling
    • 11 curated categories from fashion archives: Fabric, Silhouette, Neckline, Sleeve Style, Color, Pattern, Embroidery, Fit, Length, Collar Type, and Ornamental Details
    • Structured sampling ensures diverse, culturally authentic garment representations
    • Maintains semantic consistency across triplets
  2. Image Synthesis (SANA-ControlNet)
    • Generates high-quality garment images conditioned on sketch inputs and structured prompts
    • Spatial control via ControlNet ensures sketch-image semantic alignment
    • Produces diverse variations suitable for CIR training
    • Preserves cultural authenticity under photorealistic rendering
  3. Caption Generation (Qwen2.5 3B Instruct)
    • Distills the attribute set into neutral, factual single-sentence captions
    • All captions follow the format: “A photo of [garment description]…”
    • Captures essential attributes (style, color, fit, cultural elements) comprehensively
    • Produces descriptions that complement sketch-image pairs for robust triplet learning
  4. Triplet Formation & Multi-Target Query Design
    • Aggregates Sketch (S), Caption (C), and Multiple Target Images (I₁, I₂, I₃) into final dataset entries
    • One-to-Three mapping ($1 \rightarrow 3$) explicitly addresses false-negative supervision in CIR
    • Provides multiple valid targets for each query, reflecting real-world retrieval scenarios
    • Creates balanced datasets with diverse cultural garment categories
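
The four components above can be sketched end to end in a few lines. This is a minimal illustration and not the authors' implementation: the 11 category names come from the list above, but every attribute value, the `template_caption` helper (a plain template standing in for the Qwen2.5 3B Instruct step), the file paths, and the entry keys are hypothetical. No image synthesis is performed; the three target paths are placeholders for SANA-ControlNet outputs.

```python
import random

# Category names follow the text; all values are illustrative placeholders.
ATTRIBUTES = {
    "Fabric": ["silk", "velvet", "brocade"],
    "Silhouette": ["A-line", "straight"],
    "Neckline": ["round", "boat"],
    "Sleeve Style": ["raglan", "long fitted"],
    "Color": ["red", "ivory", "jade green"],
    "Pattern": ["floral", "plain"],
    "Embroidery": ["lotus motif", "none"],
    "Fit": ["slim", "relaxed"],
    "Length": ["ankle-length", "knee-length"],
    "Collar Type": ["mandarin", "collarless"],
    "Ornamental Details": ["beaded trim", "none"],
}


def sample_attributes(rng):
    """Draw one value per category so each garment spec covers all 11 categories."""
    return {category: rng.choice(values) for category, values in ATTRIBUTES.items()}


def template_caption(attrs):
    """Hypothetical stand-in for the LLM caption step; keeps the 'A photo of' prefix."""
    return (
        f"A photo of a {attrs['Color']} {attrs['Fabric']} garment with a "
        f"{attrs['Neckline']} neckline and {attrs['Sleeve Style']} sleeves."
    )


def form_triplet(sketch_path, caption, target_paths):
    """Bundle one sketch, one caption, and exactly three targets (1 -> 3 mapping)."""
    if len(target_paths) != 3:
        raise ValueError("expected exactly three target images per query")
    return {"sketch": sketch_path, "caption": caption, "targets": list(target_paths)}


rng = random.Random(0)  # fixed seed so the sampled attributes are reproducible
attrs = sample_attributes(rng)
entry = form_triplet(
    "sketches/0001.png",  # placeholder path
    template_caption(attrs),
    ["images/0001_a.png", "images/0001_b.png", "images/0001_c.png"],
)
print(entry["caption"])
```

Sampling one value per category keeps the caption and the image prompt tied to the same attribute set, which is one simple way to get the semantic consistency described above.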

📊 Dataset

Dataset Statistics

  • Total Triplets: 20,000 curated sketch-text-image triplets
  • Human Sketches: 650 unique human-drawn query sketches
  • Synthesized Images: 21,000 high-resolution fashion renderings (3 per triplet)
  • Domain: Vietnamese Cultural Garments (Áo Dài and related traditional clothing)
  • Format: JSON-based annotations with split support (train.json, val.json, test.json)
  • Modalities: Sketch, Image, Text with explicit one-to-three mapping

Data Organization

  • Triplet Format: Each entry contains (Sketch S, Caption C, Image₁ I₁, Image₂ I₂, Image₃ I₃)
  • Splits: Training, validation, and test sets with balanced class distributions
  • Image Quality: High-resolution photorealistic renderings suitable for deep learning models
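
A split file could be loaded and checked along these lines. The schema is an assumption: the text states only that annotations are JSON with `train.json`, `val.json`, and `test.json` splits, so the top-level list layout and the keys `sketch`, `caption`, and `targets` are hypothetical names chosen for illustration.

```python
import json

# Hypothetical field names; the released annotation files may use different keys.
REQUIRED_KEYS = ("sketch", "caption", "targets")


def validate_entry(entry):
    """Check that one entry has the expected keys and exactly three targets."""
    missing = [key for key in REQUIRED_KEYS if key not in entry]
    if missing:
        raise ValueError(f"missing keys: {missing}")
    if len(entry["targets"]) != 3:
        raise ValueError("each query should list exactly three target images")
    return entry


def parse_split(text):
    """Parse a JSON string assumed to hold a list of triplet entries."""
    return [validate_entry(entry) for entry in json.loads(text)]


def load_split(path):
    """Read and validate one split file, e.g. train.json."""
    with open(path, encoding="utf-8") as f:
        return parse_split(f.read())


# In-memory example so the sketch runs without the dataset on disk.
sample = json.dumps([
    {
        "sketch": "sketches/0001.png",
        "caption": "A photo of a red silk garment.",
        "targets": ["images/0001_a.png", "images/0001_b.png", "images/0001_c.png"],
    }
])
entries = parse_split(sample)
print(len(entries), entries[0]["caption"])
```

Validating the three-target invariant at load time surfaces malformed entries before they reach training or evaluation code.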

📚 Benchmarks

The project includes implementations and evaluations of:

  • Bi-Blip4CIR: Bidirectional BLIP-based composed image retrieval
  • CLIP4Cir: CLIP-based composed image retrieval with fine-tuning
  • Pic2Word: Zero-shot composed image retrieval that maps a picture to a pseudo-word token
  • SEARLE: Zero-shot composed image retrieval via textual inversion

Result

[Figure: benchmark results]

Citation

@inproceedings{Cao2026ICMR,
  title = {VietFashion: Benchmarking Sketch–Text Composed Image Retrieval for Cultural Outfits},
  author = {Cao, Hoang-Nguyen and Bui, Le-Hoang and Vo, Dinh-Khoi and Tran, Minh-Triet and Le, Trung-Nghia},
  booktitle = {International Conference on Multimedia Retrieval (ICMR)},
  year = {2026},
  note = {(B Rank)},
  presentation = {},
  project_page = {https://hng0303.github.io/VietFashion}
}