VietFashion

👗 VietFashion: Benchmarking Sketch-Text Composed Image Retrieval for Cultural Garments (Accepted at ACM ICMR 2026, a B-rank conference)

Official repository for VietFashion, a comprehensive pipeline for synthesizing high-quality triplet datasets (sketch-text-image) specifically for cultural garments. We address the data scarcity bottleneck in Composed Image Retrieval (CIR) by leveraging state-of-the-art generative models to produce 20,000 curated triplets of the Vietnamese áo dài.

📋 Overview

This research project presents a novel pipeline combining SANA-ControlNet for high-fidelity sketch-conditioned image synthesis with Qwen2.5 3B Instruct for semantic caption generation. We address the critical challenge of obtaining diverse, high-quality training data for specialized domains like cultural garments. Our approach automates the entire triplet synthesis workflow, eliminating manual annotation and producing a large-scale, multi-modal dataset suitable for training and evaluation of composed image retrieval systems.

Example Images

🎯 Motivation

Composed image retrieval (CIR) is the task of finding images in a database based on a query combining a reference image with text modifications. However, obtaining diverse, high-quality training data remains a significant bottleneck, particularly for specialized domains like cultural garments where data is scarce.

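This project addresses this challenge by synthesizing sketch-text-image triplets automatically with generative models, as outlined in the technical approach that follows.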

🔧 Technical Approach

The Synthesis Pipeline

Our pipeline follows a structured flow from abstract sketch input to aligned multi-modal triplets:

Overall Pipeline

Key Components

  1. Attribute Sampling
    • 11 curated categories from fashion archives: Fabric, Silhouette, Neckline, Sleeve Style, Color, Pattern, Embroidery, Fit, Length, Collar Type, and Ornamental Details
    • Structured sampling ensures diverse, culturally authentic garment representations
    • Maintains semantic consistency across triplets
  2. Image Synthesis (SANA-ControlNet)
    • Generates high-quality garment images conditioned on sketch inputs and structured prompts
    • Spatial control via ControlNet ensures sketch-image semantic alignment
    • Produces diverse variations suitable for CIR training with photorealistic details
    • Maintains cultural authenticity while adding photorealistic rendering
  3. Caption Generation (Qwen2.5 3B Instruct)
    • Distills the attribute set into neutral, factual single-sentence captions
    • All captions follow the format: “A photo of [garment description]…”
    • Captures essential attributes (style, color, fit, cultural elements) comprehensively
    • Produces descriptions that complement sketch-image pairs for robust triplet learning
  4. Triplet Formation & Multi-Target Query Design
    • Aggregates Sketch (S), Caption (C), and Multiple Target Images (I₁, I₂, I₃) into final dataset entries
    • One-to-Three mapping ($1 \rightarrow 3$) explicitly addresses false-negative supervision in CIR
    • Provides multiple valid targets for each query, reflecting real-world retrieval scenarios
    • Creates balanced datasets with diverse cultural garment categories
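
The attribute sampling and one-to-three triplet formation steps listed above can be illustrated with a minimal sketch. This is not the repository's implementation: the attribute values, file names, dictionary keys, and the templated caption are all hypothetical placeholders (in the actual pipeline the caption comes from Qwen2.5 and the images from SANA-ControlNet). Only the category names and the 1 → 3 structure are taken from the description above.

```python
import random

# Hypothetical attribute vocabulary: the category names come from the list above,
# but the values are placeholders, not the curated values used in VietFashion.
ATTRIBUTES = {
    "Fabric": ["silk", "velvet", "chiffon"],
    "Color": ["red", "white", "jade green"],
    "Sleeve Style": ["long sleeves", "three-quarter sleeves"],
    "Pattern": ["floral", "plain", "lotus motif"],
}


def sample_attributes(rng):
    """Pick one value per category so every category is always represented."""
    return {category: rng.choice(values) for category, values in ATTRIBUTES.items()}


def build_triplet(sketch_path, caption, target_images):
    """Bundle one sketch and caption with exactly three target images (1 -> 3)."""
    if len(target_images) != 3:
        raise ValueError("expected exactly three target images per query")
    return {"sketch": sketch_path, "caption": caption, "targets": list(target_images)}


rng = random.Random(0)  # fixed seed so the sampling is reproducible
attrs = sample_attributes(rng)
entry = build_triplet(
    "aodai/sketches/0001.png",  # hypothetical file names
    # Placeholder caption template; the real captions are generated by an LLM.
    "A photo of a " + attrs["Color"] + " " + attrs["Fabric"] + " áo dài.",
    ["aodai/images/0001_a.png", "aodai/images/0001_b.png", "aodai/images/0001_c.png"],
)
print(entry)
```

Sampling one value per category keeps every entry complete across categories, and rejecting anything other than three targets makes the one-to-three mapping an explicit invariant rather than a convention.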

📊 Dataset

Dataset Statistics

Data Organization

๐Ÿ“ Project Structure

BM_ICMR2026/
├── generation/              # Image and caption generation pipeline
│   ├── sana_inference.py   # Single image synthesis script
│   ├── sana_inference_multi.py  # Batch synthesis script
│   ├── Sana/               # SANA model repository
│   ├── prompts_ao_dai.json # Generation prompts
│   └── features.json       # Feature storage
│
├── notebook/               # Jupyter notebooks for analysis
│   └── Qwen2.5_captions.ipynb  # Caption generation notebook
│
├── aodai/                  # Dataset directory
│   ├── origin.json         # Original sketch-text pairs
│   ├── output_triplet.json # Generated triplets
│   ├── train.json, test.json, val.json  # Split annotations
│   ├── captions/           # Caption data
│   │   ├── cap.train.json, cap.test.json, cap.val.json
│   │   └── triplet.json    # Final triplet annotations
│   ├── images/             # Synthesized garment images
│   ├── sketches/           # Input sketches
│   └── images_split/       # Image-sketch mapping files
│
├── benchmark/              # Evaluation benchmarks
│   ├── Bi-Blip4CIR/        # BLIP-based CIR model
│   ├── CLIP4Cir/           # CLIP-based CIR model
│   ├── pic2word/           # Picture-to-word baseline
│   └── SEARLE/             # SEARLE benchmark
│
├── generated/              # Final outputs
│   ├── output_triplet_final.json      # Final triplet dataset
│   └── outputs_ao_dai_caption_refined.json
│
├── metric.py               # Evaluation metrics
├── process_triplets.py     # Triplet processing utilities
├── triplet.py              # Triplet data structures
├── requirements.txt        # Python dependencies
└── README.md              # This file

🚀 Getting Started

Requirements

Installation

  1. Clone the repository and install dependencies
    cd BM_ICMR2026
    pip install -r requirements.txt
    
  2. Setup SANA model (if generating new images)
    cd generation/Sana
    # Follow SANA installation instructions in generation/Sana/README.md
    
  3. Prepare input data
    # Ensure sketches are in aodai/sketches/
    # Ensure prompts are configured in generation/prompts_ao_dai.json
    

๐Ÿ“ Usage

Note: Download the AoDai dataset and place it in the aodai/ directory before running the steps below.

1. Image Synthesis with SANA

Single image synthesis:

cd generation
python sana_inference.py --config configs/sana_config.yaml --sketch_path path/to/sketch.png

Batch synthesis:

cd generation
bash inference_multi.sh
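
For illustration only, the single-image command shown earlier could also be repeated over a list of sketches. The sketch below only builds the argument lists and does not run anything; it assumes the `--config` and `--sketch_path` flags shown above, and the helper name `build_commands` and the example file names are hypothetical. The provided batch script remains the intended route.

```python
def build_commands(sketch_paths, config="configs/sana_config.yaml"):
    """Return one argv list per sketch, in sorted order for reproducible runs."""
    return [
        ["python", "sana_inference.py", "--config", config, "--sketch_path", path]
        for path in sorted(sketch_paths)
    ]


# Hypothetical sketch files; replace with the real contents of aodai/sketches/.
commands = build_commands(["sketches/b.png", "sketches/a.png"])
for argv in commands:
    print(" ".join(argv))
```

Each returned list can be passed to `subprocess.run(argv, check=True)` from inside the generation/ directory if running the commands from Python is preferred.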

2. Caption Generation with Qwen2.5

Generate captions for synthesized images using the provided notebook:

notebook/Qwen2.5_captions.ipynb

Or use the caption generation script directly (if available).
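
Since every caption is expected to be a single sentence beginning with "A photo of", a lightweight format check can catch malformed generations before triplets are formed. The following is a hedged sketch, not part of the repository: the function name `is_valid_caption` is hypothetical, and the single-sentence test is a simple heuristic based on punctuation.

```python
import re

CAPTION_PREFIX = "A photo of "


def is_valid_caption(caption):
    """Return True if the caption starts with the prefix and is a single sentence."""
    text = caption.strip()
    if not text.startswith(CAPTION_PREFIX):
        return False
    # Heuristic: ., !, or ? followed by whitespace and more text marks a second sentence.
    return re.search(r"[.!?]\s+\S", text) is None


print(is_valid_caption("A photo of a red silk áo dài with long sleeves."))
print(is_valid_caption("A photo of a dress. It is red."))
```

Captions that fail the check could be regenerated or set aside for manual review, depending on how strictly the format needs to be enforced.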

3. Triplet Formation

Process and form triplets from synthesized images and captions:

python process_triplets.py --input_dir aodai/ --output_file generated/output_triplet_final.json

4. Evaluation

Evaluate CIR models on the synthesized dataset. Each model under benchmark/ has its own evaluation script, adapted to the VietFashion dataset. The fine-tuned models will be made public soon.
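
As a rough illustration of how retrieval quality can be scored when each query has three valid targets, the sketch below computes Recall@K under the assumption that a query counts as a hit if any of its targets appears in the top K results. This is an assumed definition for illustration; the function names and image IDs are hypothetical, and the metric implemented in metric.py may differ.

```python
def recall_at_k(ranked_ids, target_ids, k):
    """Return 1.0 if any valid target appears in the top-k results, else 0.0."""
    targets = set(target_ids)
    return 1.0 if any(item in targets for item in ranked_ids[:k]) else 0.0


def mean_recall_at_k(rankings, targets, k):
    """Average the per-query hits over all queries."""
    if not rankings:
        raise ValueError("no queries to evaluate")
    hits = [recall_at_k(r, t, k) for r, t in zip(rankings, targets)]
    return sum(hits) / len(hits)


# Two toy queries, each with a ranked result list and three valid targets.
rankings = [["img7", "img2", "img9"], ["img4", "img5", "img1"]]
targets = [["img2", "img3", "img8"], ["img1", "img6", "img0"]]
for k in (1, 2, 3):
    print(k, mean_recall_at_k(rankings, targets, k))
```

Counting a hit for any valid target avoids penalizing a model for ranking one correct image above another, which is the false-negative issue the multi-target design is meant to address.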

๐Ÿ† Research Contributions

  1. Automated Triplet Synthesis: First pipeline to combine diffusion-based spatial control (SANA-ControlNet) with LLM semantic distillation (Qwen2.5) for data generation in the cultural garment domain

  2. Domain-Specific Benchmark: Establishes the first comprehensive benchmark for fine-grained composed image retrieval in the “Áo Dài” (Vietnamese cultural garment) domain with 20,000 high-quality triplets

  3. One-to-Many Mapping Design: Explicitly addresses false-negative supervision in composed retrieval through a structured Multi-Target Query Design ($1 \rightarrow 3$ mapping) that reflects real-world uncertainty

  4. Multi-Modal Consistency: Ensures semantic alignment across sketch, image, and text modalities through structured attribute sampling and LLM caption distillation

  5. Benchmark Evaluation: Comprehensive evaluation against state-of-the-art CIR models (Bi-Blip4CIR, CLIP4Cir, Pic2Word, SEARLE) demonstrating dataset utility

📚 Benchmarks

The project includes implementations and evaluations of Bi-Blip4CIR, CLIP4Cir, Pic2Word, and SEARLE.

Each benchmark includes pre-trained models along with training and evaluation scripts in the benchmark/ directory.

Result

Example Images

Citation

Coming Soon

📞 Contact

For questions, suggestions, or collaboration opportunities, please reach out to the project maintainers.

📖 References

📄 License

This project is licensed under the MIT License. See LICENSE file for details.


Status: Research Project
Last Updated: March 2026
Version: 1.0