VietFashion: Benchmarking Sketch–Text Composed Image Retrieval for Cultural Outfits

Figure 1: Representative sketch-photo pairs of Áo Dài from our VietFashion dataset. The top row features sketches reflecting diverse levels of abstraction and detail, while the bottom rows present corresponding photographs of authentic garments.

650

Human-drawn sketches

21K

Synthesized garment images

7,000

Composed retrieval queries

1→3

Multi-target mapping

11

Attribute categories

Abstract

Cultural garments pose a unique challenge for visual retrieval systems, as their identity often depends on subtle structural and symbolic details that are poorly captured by standard AI models. We introduce VietFashion, a new benchmark for sketch–text composed image retrieval centered on the Áo Dài, a traditional Vietnamese garment. VietFashion enables designers and researchers to retrieve culturally meaningful outfits using a combination of hand-drawn sketches — which convey garment structure — and textual descriptions — which encode cultural semantics.

The dataset is initialized with 650 sketches and expanded using generative models to produce over 21,000 photorealistic images with aligned captions. Textual prompts describing detailed outfit attributes are extracted from fashion magazines to ensure authenticity and diversity. To better reflect the inherent ambiguity of design intent, VietFashion adopts a multi-target retrieval setting, where a single query may correspond to multiple valid results. Experimental results reveal significant performance gaps in modeling fine-grained cultural semantics and multi-modal composition, positioning VietFashion as a challenging benchmark for fine-grained fashion retrieval.

Dataset available at: https://hng0303.github.io/VietFashion

Dataset Construction Pipeline

A two-stage generative pipeline produces aligned sketch–text–image triplets from curated cultural attributes.

Figure 2: Overview of the VietFashion dataset construction pipeline. The pipeline begins with sketches (S) and sampled garment attributes (A). We utilize SANA-ControlNet to generate multi-target images (I) under spatial constraints, while Qwen-2.5-Instruct distills attributes into concise natural language captions (C) to form the final composed retrieval triplet.

STAGE 0

Sketch Collection

650 human-drawn sketches covering diverse Áo Dài silhouettes, collar types, sleeve variants, and structural compositions. Balanced across abstraction levels (low/medium/high).

650 sketches

STAGE 1

Attribute-Driven Synthesis

Random sampling from 11 curated attribute categories (fabric, neckline, sleeves, embroidery…). SANA-ControlNet generates photorealistic targets conditioned on sketch + structured prompt.

SANA-ControlNet
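The attribute sampling in this stage can be sketched as follows. This is a minimal illustration, not the authors' pipeline code: the four category names come from the description above, while the attribute values, the prompt template, and the function names `sample_attributes` and `build_prompt` are hypothetical placeholders. The resulting string stands in for the structured prompt that would be passed, together with a sketch, to the image generator.

```python
import random

# Category names are taken from the text; the values are illustrative
# placeholders, NOT the dataset's actual attribute lists.
ATTRIBUTES = {
    "fabric": ["silk", "velvet", "chiffon"],
    "neckline": ["high collar", "boat neck", "round neck"],
    "sleeves": ["long sleeves", "three-quarter sleeves", "short sleeves"],
    "embroidery": ["lotus embroidery", "crane embroidery", "no embroidery"],
}


def sample_attributes(rng: random.Random) -> dict[str, str]:
    """Pick one value per category, uniformly at random."""
    return {category: rng.choice(values) for category, values in ATTRIBUTES.items()}


def build_prompt(attributes: dict[str, str]) -> str:
    """Flatten sampled attributes into a structured prompt (hypothetical template)."""
    parts = [f"{category}: {value}" for category, value in attributes.items()]
    return "Áo Dài, " + ", ".join(parts)


# Seeding the generator makes the sampled attribute set reproducible.
attributes = sample_attributes(random.Random(0))
print(build_prompt(attributes))
```

Sampling one value per category keeps every prompt complete, so each generated image has a defined value for every attribute that a caption may later mention.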

STAGE 2

Caption Refinement

Qwen-2.5 3B Instruct distills structured attributes into concise, neutral single-sentence captions starting with "A photo of…" (avg. 42.46 words).

Qwen-2.5 3B
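The two format constraints stated for this stage (a single sentence that starts with "A photo of") are easy to check automatically. The snippet below is a hedged sketch of such a check; the function names and the sentence-boundary heuristic are assumptions made for illustration, and nothing here is claimed to be part of the released tooling.

```python
import re

PREFIX = "A photo of"


def is_valid_caption(caption: str) -> bool:
    """Check the two stated constraints: required prefix and a single sentence."""
    text = caption.strip()
    if not text.startswith(PREFIX):
        return False
    # Heuristic: strip terminal punctuation, then reject any remaining
    # sentence-ending mark followed by whitespace.
    body = text.rstrip(".!?")
    return len(body) > len(PREFIX) and re.search(r"[.!?]\s", body) is None


def word_count(caption: str) -> int:
    """Whitespace-delimited word count, the usual basis for an average length."""
    return len(caption.split())


print(is_valid_caption("A photo of a red silk Áo Dài with a high collar."))  # True
print(is_valid_caption("A photo of a dress. It is red."))                    # False
```

The heuristic would wrongly reject captions containing abbreviations such as "e.g. ", so a real filter would likely need a proper sentence splitter.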

STAGE 3

Triplet Alignment

Each query (Sketch, Caption) is paired with 3 semantically consistent target images, reducing false-negative supervision inherent in single-target designs.

1 → 3 mapping

Multi-Target Query Design

In real-world fashion scenarios, multiple garments may satisfy the same semantic description. Using only one positive target during training causes valid alternatives to be misclassified as negatives — a false-negative supervision problem. VietFashion addresses this by adopting a 1→3 multi-target mapping: each query is paired with three semantically consistent but visually distinct target images.
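To make the false-negative effect concrete, the sketch below compares a softmax loss that treats only one target as positive with a variant that pools probability mass over all three valid targets. This is one common multi-positive formulation, shown purely for illustration; the similarity scores and the temperature of 0.07 are made-up values, and the page does not state which training objective any of the benchmarked methods use.

```python
import math


def softmax_nll(scores, positives, temperature=0.07):
    """Negative log of the softmax probability mass assigned to the positives."""
    logits = [s / temperature for s in scores]
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(logit - peak) for logit in logits]
    positive_mass = sum(e for e, is_pos in zip(exps, positives) if is_pos)
    return -math.log(positive_mass / sum(exps))


# Hypothetical similarities of one query to five gallery images.
# The first three images are all valid targets for the query.
scores = [0.62, 0.60, 0.58, 0.20, 0.10]

# Single-target supervision: images 2 and 3 are pushed away as negatives.
single = softmax_nll(scores, [True, False, False, False, False])
# Multi-target supervision: all three valid images count as positives.
multi = softmax_nll(scores, [True, True, True, False, False])

print(f"single-target loss: {single:.3f}")
print(f"multi-target loss:  {multi:.3f}")
```

With a single positive, the model is penalized for scoring the two other valid garments highly, even though those scores are correct; pooling over all positives removes that penalty.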

Figure 3: Examples from the proposed VietFashion dataset. Each query contains a sketch of an Áo Dài, a natural-language caption describing garment attributes and context, and multiple valid target images.

Query format: (Sketch, Caption) → {Target Img.₁, Target Img.₂, Target Img.₃}
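One way to represent this query format in code is a small record type holding the sketch, the caption, and exactly three target identifiers. The class name, field names, identifiers, and example caption below are hypothetical and do not reflect the released annotation schema.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ComposedQuery:
    """A (Sketch, Caption) query with its three valid targets (1→3 mapping)."""

    sketch_id: str
    caption: str
    target_ids: tuple[str, str, str]

    def is_positive(self, image_id: str) -> bool:
        """True if the image is any one of the three valid targets."""
        return image_id in self.target_ids


# Identifiers and caption are made up for illustration.
query = ComposedQuery(
    sketch_id="sketch_0001",
    caption="A photo of a red silk Áo Dài with a high collar.",
    target_ids=("img_a", "img_b", "img_c"),
)
print(query.is_positive("img_b"))  # True
print(query.is_positive("img_z"))  # False
```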

Data splits: 5,200 training queries · 650 validation queries · 1,150 testing queries (split at the sketch level, so that no sketch appears in more than one split)
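A sketch-level split can be implemented by grouping queries by their sketch and assigning whole groups to a split. The code below is a minimal sketch under stated assumptions: the function name, the split ratios, and the toy data are hypothetical, and the ratios are not tuned to reproduce the exact 5,200 / 650 / 1,150 counts.

```python
import random
from collections import defaultdict


def split_by_sketch(queries, ratios=(0.75, 0.10, 0.15), seed=0):
    """Assign whole sketches to train/val/test so no sketch crosses splits.

    `queries` is a list of (sketch_id, caption) pairs; `ratios` are
    hypothetical train/val fractions (the remainder goes to test).
    """
    by_sketch = defaultdict(list)
    for sketch_id, caption in queries:
        by_sketch[sketch_id].append((sketch_id, caption))

    sketch_ids = sorted(by_sketch)  # sort first so the shuffle is reproducible
    random.Random(seed).shuffle(sketch_ids)

    n_train = int(len(sketch_ids) * ratios[0])
    n_val = int(len(sketch_ids) * ratios[1])
    groups = {
        "train": sketch_ids[:n_train],
        "val": sketch_ids[n_train:n_train + n_val],
        "test": sketch_ids[n_train + n_val:],
    }
    return {name: [q for s in ids for q in by_sketch[s]] for name, ids in groups.items()}


# Toy data: 20 sketches with 2 captions each.
toy = [(f"s{i}", f"caption {j}") for i in range(20) for j in range(2)]
splits = split_by_sketch(toy)
print({name: len(qs) for name, qs in splits.items()})
```

Splitting on sketch identifiers rather than on individual queries is what keeps a sketch seen during training from reappearing at test time with a different caption.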

Comparison with Existing Datasets

VietFashion uniquely targets cultural outfits and employs multi-target supervision to address the ambiguity inherent in fine-grained sketch-text retrieval.

Dataset	Year	Domain	Query Modality	CIR	Multi-Target
TU-Berlin	2012	General Object	Sketch	✗	✗
Sketchy Extended	2016	General Object	Sketch	✗	✗
FashionIQ	2019	Western Fashion	Image, Text	✓	✗
QuickDraw-Ext	2019	General Object	Sketch	✗	✗
CIRR	2021	Open-Domain	Image, Text	✓	✗
CIRCO	2022	Open-Domain	Image, Text	✓	✓
FACap	2025	Fashion	Image, Text	✓	✓
CSTBIR	2025	General Object	Sketch, Text	✓	✗
FIGROTD	2026	General Object	Sketch, Text, Image	✓	✗
VietFashion (Ours)	2026	Cultural Outfit	Sketch, Text	✓	✓

Benchmark Results

Retrieval performance on the VietFashion test set. BLIP4CIR attains the best value in every column. Methods are grouped by learning paradigm.

Method	Paradigm	R@1	R@5	R@10	mAP	MRR
ZSE-SBIR	SBIR	0.0285	0.0623	0.1077	0.0323	0.0539
S3BIR-DINO	SBIR	0.0157	0.0565	0.0948	0.0216	0.0428
TaskFormer	ST-CIR	0.0564	0.1472	0.2067	0.0269	0.0891
VaGFeM	ST-CIR	0.0750	0.1612	0.2201	0.0356	0.1142
CLIP4CIR	Supervised	0.0313	0.1149	0.1851	0.1064	0.1908
BLIP4CIR	Supervised	0.0877	0.2672	0.3703	0.2483	0.3950
SEARLE-ViT/B	Zero-shot	0.0000	0.0200	0.0400	0.0200	0.0500
SEARLE-ViT/L	Zero-shot	0.0100	0.0300	0.0400	0.0300	0.0600
Pic2Word	Zero-shot	0.0082	0.0210	0.0364	0.0253	0.0523
Pic2Word (Fine-tuned)	Fine-tuned	0.0087	0.0221	0.0374	0.0253	0.0527

Pic2Word (Fine-tuned) was adapted using sketches in our training set.
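For readers unfamiliar with the reported metrics, the sketch below shows one standard way to compute them for a single multi-target query. The definitions are assumptions: R@K is taken to count a hit when any valid target appears in the top K, and average precision is computed over the full ranking. The benchmark's official evaluation scripts may differ in these details. Averaging each quantity over all test queries would give R@K, MRR, and mAP.

```python
def recall_at_k(ranked, positives, k):
    """1.0 if any valid target appears in the top k results, else 0.0."""
    return float(any(item in positives for item in ranked[:k]))


def reciprocal_rank(ranked, positives):
    """Inverse rank of the first valid target (0.0 if none is retrieved)."""
    for rank, item in enumerate(ranked, start=1):
        if item in positives:
            return 1.0 / rank
    return 0.0


def average_precision(ranked, positives):
    """Mean of precision values measured at the rank of each valid target."""
    hits, total = 0, 0.0
    for rank, item in enumerate(ranked, start=1):
        if item in positives:
            hits += 1
            total += hits / rank
    return total / len(positives) if positives else 0.0


# Toy ranking: the three valid targets land at ranks 2, 4, and 5.
ranked = ["x", "a", "y", "b", "c"]
positives = {"a", "b", "c"}

print(recall_at_k(ranked, positives, 1))     # 0.0
print(recall_at_k(ranked, positives, 5))     # 1.0
print(reciprocal_rank(ranked, positives))    # 0.5
print(average_precision(ranked, positives))  # (1/2 + 2/4 + 3/5) / 3, about 0.533
```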

Key Findings

Multimodal composition matters

VaGFeM reaches roughly 2.6× the R@1 of the best SBIR baseline (0.0750 vs. 0.0285 for ZSE-SBIR), indicating that textual attribute conditioning is essential for fine-grained cultural retrieval.

Zero-shot models struggle

The best zero-shot model (SEARLE-ViT/L) reaches only R@1 = 0.01. General vision-language pretraining does not transfer well to abstract sketches paired with cultural garment semantics.

Fine-grained retrieval is hard

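Even the best model (BLIP4CIR) achieves R@1 below 0.09 while R@10 reaches 0.37. A valid target often lands in the top ten but rarely at rank one, which is consistent with many Áo Dài sharing nearly identical silhouettes and differing only in subtle embroidery or collar details.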

Architecture sensitivity

BLIP4CIR outperforms CLIP4CIR by a large margin (MRR 0.395 vs 0.191), suggesting stronger text-visual grounding is critical when visual differences are semantically subtle.

Multi-target complexity

Low mAP across all methods indicates the 1→3 design introduces genuine ambiguity that requires attribute-level discrimination rather than instance memorization.

Caption complexity trade-off

Captions average 42.46 words, long enough to carry several culturally specific attributes at once. The BLIP4CIR vs. CLIP4CIR gap suggests that models with stronger fine-grained grounding parse these long descriptions better.
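The ratios quoted in the findings above can be rechecked directly from the benchmark table. The snippet copies the relevant R@1 and MRR values from that table; only the variable names are introduced here.

```python
# Values copied from the benchmark results table.
r_at_1 = {"ZSE-SBIR": 0.0285, "S3BIR-DINO": 0.0157, "VaGFeM": 0.0750}
mrr = {"CLIP4CIR": 0.1908, "BLIP4CIR": 0.3950}

# Best SBIR baseline by R@1 (ZSE-SBIR).
best_sbir = max(r_at_1["ZSE-SBIR"], r_at_1["S3BIR-DINO"])

composition_gain = r_at_1["VaGFeM"] / best_sbir  # sketch+text vs. sketch only
mrr_gain = mrr["BLIP4CIR"] / mrr["CLIP4CIR"]     # architecture comparison

print(f"VaGFeM vs. best SBIR (R@1): {composition_gain:.2f}x")  # 2.63x
print(f"BLIP4CIR vs. CLIP4CIR (MRR): {mrr_gain:.2f}x")         # 2.07x
```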

Contributions

1. We introduce VietFashion, a benchmark for cultural preservation that pairs sketches with magazine-grounded text attributes to capture the fine-grained semantics of the Vietnamese Áo Dài.
2. We develop a generative synthesis pipeline utilizing Qwen-2.5 and SANA-ControlNet to synthesize 21,000 fashion images, bridging the gap between manual creativity and data scale.
3. We implement a multi-target query design (1→3) that provides three valid ground-truth images per (Sketch, Caption) query, reducing the false-negative supervision found in previous single-target benchmarks.
4. The VietFashion dataset is publicly available at https://hng0303.github.io/VietFashion, with code, annotations, and benchmark evaluation scripts.

Citation

If you find VietFashion useful in your research, please cite our paper:

@inproceedings{cao2026vietfashion,
  author = {Hoang-Nguyen Cao and Le-Hoang Bui and Dinh-Khoi Vo
            and Minh-Triet Tran and Trung-Nghia Le},
  title = {VietFashion: Benchmarking Sketch–Text Composed Image
            Retrieval for Cultural Outfits},
  booktitle = {International Conference on Multimedia Retrieval
            (ICMR '26)},
  year = {2026},
  address = {Amsterdam, Netherlands},
  doi = {10.1145/3805622.3810590},
}

Acknowledgments
This research is funded by Vietnam National University – Ho Chi Minh City (VNU-HCM) under Grant Number B2026-18-17.