Official repository for VietFashion, a comprehensive pipeline for synthesizing high-quality triplet datasets (sketch-text-image) specifically for cultural garments. We address the data scarcity bottleneck in Composed Image Retrieval (CIR) by leveraging state-of-the-art generative models to produce 20,000 curated triplets of the Vietnamese áo dài.
This research project presents a novel pipeline combining SANA-ControlNet for high-fidelity sketch-conditioned image synthesis with Qwen2.5 3B Instruct for semantic caption generation. We address the critical challenge of obtaining diverse, high-quality training data for specialized domains like cultural garments. Our approach automates the entire triplet synthesis workflow, eliminating manual annotation and producing a large-scale, multi-modal dataset suitable for training and evaluation of composed image retrieval systems.

Composed image retrieval (CIR) is the task of finding images in a database based on a query combining a reference image with text modifications. However, obtaining diverse, high-quality training data remains a significant bottleneck, particularly for specialized domains like cultural garments where data is scarce.
This project addresses this challenge by:
Our pipeline follows a structured flow from abstract sketch input to aligned multi-modal triplets:

aodai/captions/ directory with image-sketch mapping files
BM_ICMR2026/
├── generation/                        # Image and caption generation pipeline
│   ├── sana_inference.py              # Single image synthesis script
│   ├── sana_inference_multi.py        # Batch synthesis script
│   ├── Sana/                          # SANA model repository
│   ├── prompts_ao_dai.json            # Generation prompts
│   └── features.json                  # Feature storage
│
├── notebook/                          # Jupyter notebooks for analysis
│   └── Qwen2.5_captions.ipynb         # Caption generation notebook
│
├── aodai/                             # Dataset directory
│   ├── origin.json                    # Original sketch-text pairs
│   ├── output_triplet.json            # Generated triplets
│   ├── train.json, test.json, val.json # Split annotations
│   ├── captions/                      # Caption data
│   │   ├── cap.train.json, cap.test.json, cap.val.json
│   │   └── triplet.json               # Final triplet annotations
│   ├── images/                        # Synthesized garment images
│   ├── sketches/                      # Input sketches
│   └── images_split/                  # Image-sketch mapping files
│
├── benchmark/                         # Evaluation benchmarks
│   ├── Bi-Blip4CIR/                   # BLIP-based CIR model
│   ├── CLIP4Cir/                      # CLIP-based CIR model
│   ├── pic2word/                      # Picture-to-word baseline
│   └── SEARLE/                        # SEARLE benchmark
│
├── generated/                         # Final outputs
│   ├── output_triplet_final.json      # Final triplet dataset
│   └── outputs_ao_dai_caption_refined.json
│
├── metric.py                          # Evaluation metrics
├── process_triplets.py                # Triplet processing utilities
├── triplet.py                         # Triplet data structures
├── requirements.txt                   # Python dependencies
└── README.md                          # This file
requirements.txt for complete dependency list
cd BM_ICMR2026
pip install -r requirements.txt
cd generation/Sana
# Follow SANA installation instructions in generation/Sana/README.md
# Ensure sketches are in aodai/sketches/
# Ensure prompts are configured in generation/prompts_ao_dai.json
Single image synthesis:
cd generation
python sana_inference.py --config configs/sana_config.yaml --sketch_path path/to/sketch.png
Batch synthesis:
cd generation
bash inference_multi.sh
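The contents of `inference_multi.sh` are not shown here, so the following is only a minimal sketch of one way to drive the single-image command over many sketches. The helper name `build_commands` and the `.png` file names are assumptions for illustration; the flags are the ones used in the single-image command above.

```python
from pathlib import Path


def build_commands(sketch_paths, config="configs/sana_config.yaml"):
    """Return one argv list per sketch, in sorted order.

    Hypothetical helper: it only builds the commands and does not run them.
    """
    ordered = sorted(str(p) for p in sketch_paths)
    return [
        ["python", "sana_inference.py", "--config", config, "--sketch_path", p]
        for p in ordered
    ]


if __name__ == "__main__":
    # Assumes sketches are PNG files under aodai/sketches/, seen from generation/.
    sketches = Path("../aodai/sketches").glob("*.png")
    for argv in build_commands(sketches):
        print(" ".join(argv))
```

Printing the commands first makes it easy to check paths before launching a long batch; each list can then be passed to `subprocess.run` from inside `generation/`.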
Generate captions for synthesized images using the provided notebook:
notebook/Qwen2.5_captions.ipynb
Or use the caption generation script directly (if available).
Process and form triplets from synthesized images and captions:
python process_triplets.py --input_dir aodai/ --output_file generated/output_triplet_final.json
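As a rough illustration of what forming triplets involves, the sketch below groups synthesized images and captions under the sketch they were generated from. The record keys (`sketch`, `image`, `caption`) and the function name `form_triplets` are assumptions made for this example; the actual schema used by `process_triplets.py` may differ.

```python
import json
from collections import defaultdict


def form_triplets(records):
    """Group flat records into one entry per sketch.

    `records` is a list of dicts with the hypothetical keys
    "sketch", "image" and "caption".
    """
    grouped = defaultdict(list)
    for rec in records:
        grouped[rec["sketch"]].append(
            {"image": rec["image"], "caption": rec["caption"]}
        )
    # Each sketch keeps every image/caption pair generated from it.
    return [
        {"sketch": sketch, "targets": targets}
        for sketch, targets in sorted(grouped.items())
    ]


if __name__ == "__main__":
    demo = [
        {"sketch": "s002.png", "image": "s002_a.png", "caption": "white ao dai"},
        {"sketch": "s001.png", "image": "s001_a.png", "caption": "red silk ao dai"},
        {"sketch": "s001.png", "image": "s001_b.png", "caption": "blue floral ao dai"},
    ]
    print(json.dumps(form_triplets(demo), indent=2))
```

Keeping all targets for a sketch in one entry makes it straightforward to treat any of them as a correct answer at evaluation time.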
Evaluate CIR models on the synthesized dataset: each model in the benchmark/ directory has its own evaluation script, adapted to our VietFashion dataset. The fine-tuned models will be made public soon.
Automated Triplet Synthesis: First pipeline combining diffusion spatial control (SANA-ControlNet) with LLM semantic distillation (Qwen2.5) for cultural garment domain-specific data generation
Domain-Specific Benchmark: Establishes the first comprehensive benchmark for fine-grained composed image retrieval in the “Áo Dài” (Vietnamese cultural garment) domain with 20,000 high-quality triplets
One-to-Many Mapping Design: Explicitly addresses false-negative supervision in composed retrieval through a structured Multi-Target Query Design ($1 \rightarrow 3$ mapping) that reflects real-world uncertainty
Multi-Modal Consistency: Ensures semantic alignment across sketch, image, and text modalities through structured attribute sampling and LLM caption distillation
Benchmark Evaluation: Comprehensive evaluation against state-of-the-art CIR models (Bi-Blip4CIR, CLIP4Cir, Pic2Word, SEARLE) demonstrating dataset utility
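To make the one-to-many idea concrete, here is a minimal sketch of a Recall@K that counts a query as correct when any of its valid targets appears in the top K results. This scoring rule and the names `recall_at_k`, `rankings` and `targets` are assumptions for illustration, not necessarily what `metric.py` implements.

```python
def recall_at_k(rankings, targets, k):
    """Fraction of queries with at least one valid target in the top k.

    rankings: dict mapping query id -> list of candidate ids, best first
    targets:  dict mapping query id -> set of acceptable target ids
    """
    if not rankings:
        return 0.0
    hits = 0
    for query_id, ranked in rankings.items():
        # Any overlap between the top-k candidates and the valid targets is a hit.
        if targets[query_id] & set(ranked[:k]):
            hits += 1
    return hits / len(rankings)


if __name__ == "__main__":
    rankings = {
        "q1": ["a", "b", "c", "d"],
        "q2": ["x", "y", "z", "w"],
    }
    # Three acceptable targets per query, mirroring a 1 -> 3 mapping.
    targets = {
        "q1": {"c", "e", "f"},
        "q2": {"w", "u", "v"},
    }
    for k in (1, 3, 4):
        print(k, recall_at_k(rankings, targets, k))
```

With a single ground-truth target per query, retrieving another equally valid image would be scored as a miss; accepting any of the listed targets avoids penalizing those cases.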
The project includes implementations and evaluations of:
Each benchmark includes pre-trained models along with training and evaluation scripts in the benchmark/ directory.

Coming Soon
For questions, suggestions, or collaboration opportunities, please reach out to the project maintainers.
This project is licensed under the MIT License. See the LICENSE file for details.
Status: Research Project
Last Updated: March 2026
Version: 1.0