Reducing Semantic Confusion: Scene-aware Aggregation Network for Remote Sensing Cross-modal Retrieval (ICMR'23 Oral)

kinshingpoon/SWAN-pytorch

 
 


By Jiancheng Pan, Qing Ma, Cong Bai.

This repo is the official implementation of "Reducing Semantic Confusion: Scene-aware Aggregation Network for Remote Sensing Cross-modal Retrieval" (ICMR'23 Oral).

For more remote sensing image-text retrieval (RSITR) methods, see: https://github.com/jaychempan/Awesome-RSITR

ℹ️ Introduction

Remote sensing cross-modal retrieval has recently attracted considerable attention from researchers. However, the unique nature of remote sensing images leads to many semantic confusion zones in the semantic space, which greatly affects retrieval performance. We propose a novel scene-aware aggregation network (SWAN) to reduce semantic confusion by improving scene perception capability. For visual representation, a visual multiscale fusion module (VMSF) fuses visual features at different scales and serves as the visual representation backbone. Meanwhile, a scene fine-grained sensing module (SFGS) establishes associations between salient features at different granularities. The visual information generated by these two modules forms a scene-aware visual aggregation representation. For textual representation, a textual coarse-grained enhancement module (TCGE) is designed to enhance the semantics of the text and to align visual information. Furthermore, because the diversity and differentiation of remote sensing scenes weaken scene understanding, we propose a new metric, scene recall, which measures scene perception by evaluating scene-level retrieval performance and can also verify the effectiveness of our approach in reducing semantic confusion. Through performance comparisons, ablation studies, and visualization analysis, we validate the effectiveness and superiority of our approach on two datasets, RSICD and RSITMD.
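To make the idea of a scene-level metric concrete, here is a minimal sketch of one way such a score could be computed. It is not the paper's definition or this repo's evaluation code: it assumes that a query counts as a hit when at least one of its top-k retrieved items shares the query's scene label, and the function name `scene_recall_at_k` and its arguments are hypothetical.

```python
from typing import Sequence


def scene_recall_at_k(
    similarity: Sequence[Sequence[float]],
    query_scenes: Sequence[str],
    gallery_scenes: Sequence[str],
    k: int,
) -> float:
    """Percentage of queries whose top-k results contain the query's scene.

    ASSUMPTION: this hit criterion is illustrative only; the paper may
    define scene recall differently.
    """
    if k <= 0:
        raise ValueError("k must be positive")
    if not query_scenes:
        raise ValueError("at least one query is required")

    hits = 0
    for scores, scene in zip(similarity, query_scenes):
        # Rank gallery indices by descending similarity to this query.
        ranked = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)
        # A hit if any of the k best-ranked gallery items has the same scene.
        if any(gallery_scenes[j] == scene for j in ranked[:k]):
            hits += 1
    return 100.0 * hits / len(query_scenes)


# Toy data: 2 text queries scored against 3 gallery images.
sim = [
    [0.9, 0.2, 0.1],
    [0.3, 0.8, 0.4],
]
query_scenes = ["airport", "beach"]
gallery_scenes = ["airport", "farmland", "beach"]

print(scene_recall_at_k(sim, query_scenes, gallery_scenes, k=1))
print(scene_recall_at_k(sim, query_scenes, gallery_scenes, k=2))
```

In the toy data the second query ranks a "farmland" image first, so it only becomes a hit once k is large enough to include the "beach" image.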

pipeline

🎯 Implementation

Project Files

Note: Download the ResNet-50 weights pre-trained on the AID dataset from [Baidu Disk].

.
├── checkpoint
├── data
│   ├── rsicd_precomp
│   └── rsitmd_precomp
├── data.py
├── engine.py
├── fix_data
│   ├── rsicd_precomp
│   └── rsitmd_precomp
├── layers
│   ├── aid_28-rsp-resnet-50-ckpt.pth
│   ├── resnet50-19c8e357.pth
│   ├── resnet.py
│   └── SWAN.py
├── main.py
├── mytools.py
├── README.md
├── save_img_text_emb.py
├── test_ave.py
├── test_local_feature.py
├── test_single.py
├── train.py
├── utils.py
├── vocab
│   ├── rsicd_splits_vocab.json
│   └── rsitmd_splits_vocab.json
└── vocab.py

Environments

python==3.8.5
torch==1.11.0
torchvision==0.12.0
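Before training, it can help to confirm that the installed packages match the versions above. The following is a small sketch that uses only the standard library; the helper names `find_mismatches` and `installed_versions` are hypothetical and not part of this repo, and stripping a local build tag such as `+cu113` is an assumption about how the packages may have been installed.

```python
from importlib import metadata
from typing import Dict, Iterable, Optional, Tuple

# Versions listed in this README.
EXPECTED = {"torch": "1.11.0", "torchvision": "0.12.0"}


def find_mismatches(
    expected: Dict[str, str], installed: Dict[str, str]
) -> Dict[str, Tuple[str, Optional[str]]]:
    """Return {package: (wanted, found)} for every package that differs."""
    mismatches = {}
    for name, want in expected.items():
        have = installed.get(name)
        # Ignore local build tags such as "+cu113" when comparing.
        if have is None or have.split("+")[0] != want:
            mismatches[name] = (want, have)
    return mismatches


def installed_versions(names: Iterable[str]) -> Dict[str, str]:
    """Look up installed versions, skipping packages that are missing."""
    found = {}
    for name in names:
        try:
            found[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            pass
    return found


if __name__ == "__main__":
    problems = find_mismatches(EXPECTED, installed_versions(EXPECTED))
    print(problems or "environment matches the README")
```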

Train

# RSITMD Dataset
python train.py -g 0 -m SWAN -e SWAN --data_name rsitmd  -p checkpoint/ --epochs 50 -kf 1
# RSICD Dataset
python train.py -g 0 -m SWAN -e SWAN --data_name rsicd  -p checkpoint/ --epochs 50 -kf 1
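To train on both datasets in sequence, the two commands above can be wrapped in a short launcher. This is a convenience sketch and not part of the repo: `build_train_command` and `train_all` are hypothetical helpers, and they pass through exactly the flags shown above without interpreting them.

```python
import subprocess
import sys
from typing import List, Sequence


def build_train_command(data_name: str, gpu: str = "0", epochs: int = 50) -> List[str]:
    """Build the argument list for one training run, mirroring the README command."""
    return [
        sys.executable, "train.py",
        "-g", gpu,
        "-m", "SWAN",
        "-e", "SWAN",
        "--data_name", data_name,
        "-p", "checkpoint/",
        "--epochs", str(epochs),
        "-kf", "1",
    ]


def train_all(datasets: Sequence[str] = ("rsitmd", "rsicd")) -> None:
    """Run training for each dataset in turn, stopping at the first failure."""
    for name in datasets:
        # check=True raises CalledProcessError if train.py exits with an error.
        subprocess.run(build_train_command(name), check=True)
```

Call `train_all()` from the repository root so that `train.py` and `checkpoint/` resolve correctly.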

Test

python test_single.py --resume 'path to model checkpoint'

🌍 Datasets

All experiments use the RSITMD and RSICD datasets, which can also be downloaded from [Baidu Disk].

📊 Results

result

🙏 Acknowledgement

  • The code base builds on GaLR by Yuan et al.; thanks to the authors.

📝 Citation

If you find this code useful for your work or use it in your project, please cite our paper as:

@inproceedings{pan2023reducing,
  title={Reducing Semantic Confusion: Scene-aware Aggregation Network for Remote Sensing Cross-modal Retrieval},
  author={Pan, Jiancheng and Ma, Qing and Bai, Cong},
  booktitle={Proceedings of the 2023 ACM International Conference on Multimedia Retrieval},
  pages={398--406},
  year={2023}
}
