[AAAI'25] ENCODER: Entity Mining and Modification Relation Binding for Composed Image Retrieval

¹School of Software, Shandong University,
²School of Computer Science and Technology, Harbin Institute of Technology (Shenzhen),
³School of Data Science, City University of Hong Kong

*Corresponding author.

News: 2025/04/16, We have released the full ENCODER code and checkpoints.

News: 2025/09/10, We have updated the evaluation code and the ENCODER checkpoints, which are now provided in "state_dict" format for stable evaluation.

News: 2025/10/08, Following feedback from researchers, we found that the installed open_clip version can affect model performance. To ensure consistent results, we have clarified the environment dependencies (see requirements.txt, explained in the README).
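
Since results depend on the installed open_clip version, it can help to check the environment against the pinned dependencies before evaluating. The following is a minimal sketch, not part of the ENCODER repository: the helper names `parse_pins` and `find_mismatches` are hypothetical, and it assumes that requirements.txt pins packages with simple `name==version` lines.

```python
from importlib import metadata


def parse_pins(requirements_text):
    """Return {package: version} for lines of the form `name==version`.

    Assumption: only exact `==` pins matter; other specifiers are skipped.
    """
    pins = {}
    for raw in requirements_text.splitlines():
        line = raw.split("#", 1)[0]   # drop trailing comments
        line = line.split(";", 1)[0]  # drop environment markers
        line = line.strip()
        if "==" not in line:
            continue  # blank, unpinned, or range-pinned line
        name, version = line.split("==", 1)
        name = name.split("[", 1)[0]  # drop extras such as pkg[extra]
        pins[name.strip().lower()] = version.strip()
    return pins


def find_mismatches(pins):
    """Compare pinned versions with the versions installed in this environment.

    Returns {package: (pinned, installed)}; `installed` is None when the
    package is not installed at all.
    """
    mismatches = {}
    for name, wanted in pins.items():
        try:
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            installed = None
        if installed != wanted:
            mismatches[name] = (wanted, installed)
    return mismatches
```

Reading requirements.txt, passing its text to `parse_pins`, and then calling `find_mismatches` on the result returns an empty dictionary when every pinned package matches the environment. Any entry it does return names a package whose version differs from the pin and is worth fixing before comparing numbers with the reported results.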

Abstract


The Phenomenon of Modification Relation Correspondence


(a) provides an illustrative example of the CIR task. (b) illustrates the phenomenon of modification relation correspondence in the CIR task, whereby the modification text frequently comprises a series of modification actions, each of which is associated with a visual entity in the reference image through a corresponding modification relation. For example, the modification actions “in front of the rock” and “dog stands” correspond to the visual entities “trees” and “dog” in the reference image, respectively.
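
To make the correspondence concrete, the example from the caption can be written down as explicit (action, entity) pairs. This is only an illustrative sketch: the `ModificationRelation` class and the `group_by_entity` helper are hypothetical names introduced here, and they do not reflect how ENCODER represents these relations internally.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModificationRelation:
    """One modification action bound to one visual entity (illustrative only)."""
    action: str  # phrase taken from the modification text
    entity: str  # visual entity in the reference image that the action targets


def group_by_entity(relations):
    """Collect the modification actions that target each visual entity."""
    grouped = {}
    for rel in relations:
        grouped.setdefault(rel.entity, []).append(rel.action)
    return grouped


# The two relations named in the caption above.
relations = [
    ModificationRelation(action="in front of the rock", entity="trees"),
    ModificationRelation(action="dog stands", entity="dog"),
]

print(group_by_entity(relations))
# {'trees': ['in front of the rock'], 'dog': ['dog stands']}
```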


Framework: Entity miNing and modifiCation relatiOn binDing nEtwoRk (ENCODER)


The proposed ENCODER consists of three key modules: (a) Latent Factor Filter, (b) Entity-Action Binding, and (c) Multi-scale Composition.


Experiment



Sensitivity to (a) Latent Factor Number P and (b) Query Number E of LRQ on the FashionIQ dataset.


Attention visualization for LRQ on (a) CIRR and (b) FashionIQ datasets.


Attention visualization of the LRQ in ENCODER on the FashionIQ and Shoes datasets. The three examples show that different LRQ queries focus on different entity-action pairs.


Attention visualization of the LRQ in ENCODER on the CIRR and Fashion200K datasets. The three examples show that different LRQ queries focus on different entity-action pairs.


Qualitative examples of ENCODER on the FashionIQ, Shoes, CIRR, and Fashion200K datasets. Ground-truth images are marked with colored boxes.

BibTeX


      @inproceedings{encoder,
        title={Encoder: Entity mining and modification relation binding for composed image retrieval},
        author={Li, Zixu and Chen, Zhiwei and Wen, Haokun and Fu, Zhiheng and Hu, Yupeng and Guan, Weili},
        booktitle={Proceedings of the AAAI Conference on Artificial Intelligence},
        volume={39},
        number={5},
        pages={5101--5109},
        year={2025}
      }