This is the official implementation of ViTEraser: Harnessing the Power of Vision Transformers for Scene Text Removal with SegMIM Pretraining (AAAI 2024). ViTEraser revisits the conventional single-step one-stage framework and improves it with Vision Transformers (ViTs) for feature modeling and the proposed SegMIM pretraining. Below are the frameworks of ViTEraser and SegMIM.
- Inference code and model weights
- ViTEraser training code
- SegMIM pre-training code
We recommend using Anaconda to manage environments. Run the following commands to install dependencies.
conda create -n viteraser python=3.7 -y
conda activate viteraser
pip install torch==1.8.2 torchvision==0.9.2 torchaudio==0.8.2 --extra-index-url https://download.pytorch.org/whl/lts/1.8/cu111
git clone https://github.com/shannanyinxiang/ViTEraser.git
cd ViTEraser
pip install -r requirements.txt
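After installation, the pinned versions can be compared against what is actually importable. The following is a minimal sketch and not part of the repository: `base_version` and `check_environment` are hypothetical helper names, and the check assumes each package exposes a `__version__` attribute.

```python
# Versions pinned by the install command above.
PINNED = {"torch": "1.8.2", "torchvision": "0.9.2", "torchaudio": "0.8.2"}


def base_version(version):
    """Drop a local build suffix such as '+cu111' from a version string."""
    return version.split("+", 1)[0]


def check_environment(pinned=PINNED):
    """Return a list of problems; an empty list means every pin matches."""
    problems = []
    for name, expected in pinned.items():
        try:
            module = __import__(name)
        except Exception as exc:  # missing or broken installation
            problems.append(f"{name} could not be imported: {exc}")
            continue
        found = base_version(getattr(module, "__version__", "unknown"))
        if found != expected:
            problems.append(f"{name}: expected {expected}, found {found}")
    return problems


if __name__ == "__main__":
    for line in check_environment() or ["all pinned packages match"]:
        print(line)
```

Run it inside the activated `viteraser` environment; any line other than "all pinned packages match" points at a package to reinstall.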
- SCUT-EnsText [paper]:
  - Download the training and testing sets of SCUT-EnsText at link.
  - Rename all_images and all_labels folders to image and label, respectively.
  - Generate text masks:
# Generating masks for the training set of SCUT-EnsText
python tools/generate_mask.py \
  --data_root data/TextErase/SCUT-EnsText/train
# Generating masks for the testing set of SCUT-EnsText
# Masks are not used for inference. Just keep the same data structure as the training stage.
python tools/generate_mask.py \
  --data_root data/TextErase/SCUT-EnsText/test
- SegMIM datasets (optional, only required by SegMIM pretraining):
  - ICDAR2013 [paper][download link]
  - ICDAR2015 [paper][download link]
  - MLT2017 [paper][download link]
  - ArT [paper][download link]
  - LSVT [paper][download link]
  - ReCTS [paper][download link]
  - TextOCR [paper][download link]
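The folder renaming step for SCUT-EnsText described above can also be scripted. This is a minimal sketch under the assumption that each split directory (train or test) directly contains the all_images and all_labels folders; `rename_scutens_folders` is a hypothetical helper, not a script shipped with the repository.

```python
from pathlib import Path

# Folder renames described in the SCUT-EnsText preparation steps.
RENAMES = {"all_images": "image", "all_labels": "label"}


def rename_scutens_folders(split_dir):
    """Rename all_images/all_labels to image/label inside one split directory.

    Returns the new folder names that were actually created. Folders that are
    missing, or whose target name already exists, are left untouched.
    """
    split_dir = Path(split_dir)
    renamed = []
    for old, new in RENAMES.items():
        src, dst = split_dir / old, split_dir / new
        if src.is_dir() and not dst.exists():
            src.rename(dst)
            renamed.append(new)
    return renamed


if __name__ == "__main__":
    root = Path("data/TextErase/SCUT-EnsText")
    for split in ("train", "test"):
        print(split, rename_scutens_folders(root / split))
```

The function is safe to run twice: the second call finds nothing left to rename and returns an empty list.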
Please prepare the above datasets into the data folder following the file structure below.
data
├─TextErase
│  └─SCUT-EnsText
│     ├─train
│     │  ├─image
│     │  ├─label
│     │  └─mask
│     └─test
│        ├─image
│        ├─label
│        └─mask
└─SegMIMDatasets
   ├─ArT
   ├─ICDAR2013
   ├─ICDAR2015
   ├─LSVT
   ├─MLT2017
   ├─ReCTS
   └─TextOCR
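Before training or inference, it can help to confirm that the data folder matches the structure above. The sketch below only checks that the expected directories exist; `missing_dirs` is a hypothetical helper, and it makes no assumption about the files inside each folder.

```python
from pathlib import Path

ERASE_SPLITS = ("train", "test")
ERASE_SUBDIRS = ("image", "label", "mask")
SEGMIM_DATASETS = ("ArT", "ICDAR2013", "ICDAR2015", "LSVT",
                   "MLT2017", "ReCTS", "TextOCR")


def missing_dirs(data_root, check_segmim=False):
    """Return expected directories (relative to data_root) that do not exist.

    Set check_segmim=True only if SegMIM pretraining is planned, since those
    datasets are optional.
    """
    root = Path(data_root)
    expected = [
        root / "TextErase" / "SCUT-EnsText" / split / sub
        for split in ERASE_SPLITS
        for sub in ERASE_SUBDIRS
    ]
    if check_segmim:
        expected += [root / "SegMIMDatasets" / name for name in SEGMIM_DATASETS]
    return [p.relative_to(root).as_posix() for p in expected if not p.is_dir()]


if __name__ == "__main__":
    for path in missing_dirs("data", check_segmim=False):
        print("missing:", path)
```

An empty result means every expected directory is present.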
The download links of pre-trained ViTEraser weights are provided in the following table.
Name | BaiduNetDisk | GoogleDrive |
---|---|---|
ViTEraser-Tiny | link | link |
ViTEraser-Small | link | link |
ViTEraser-Base | link | link |
An example command for inference with ViTEraser-Tiny:
CUDA_VISIBLE_DEVICES=0 \
python -m torch.distributed.launch \
--master_port=3151 \
--nproc_per_node 1 \
--use_env \
main.py \
--eval \
--data_root data/TextErase/ \
--val_dataset scutens_test \
--batch_size 1 \
--encoder swinv2 \
--decoder swinv2 \
--pred_mask false \
--intermediate_erase false \
--swin_enc_embed_dim 96 \
--swin_enc_depths 2 2 6 2 \
--swin_enc_num_heads 3 6 12 24 \
--swin_enc_window_size 16 \
--swin_dec_depths 2 6 2 2 2 \
--swin_dec_num_heads 24 12 6 3 2 \
--swin_dec_window_size 16 \
--output_dir path/to/save/output/ \
--resume path/to/weights/
The arguments that differ across ViTEraser scales are listed below:
Argument | Tiny | Small | Base |
---|---|---|---|
swin_enc_embed_dim | 96 | 96 | 128 |
swin_enc_depths | 2 2 6 2 | 2 2 18 2 | 2 2 18 2 |
swin_enc_num_heads | 3 6 12 24 | 3 6 12 24 | 4 8 16 32 |
swin_enc_window_size | 16 | 16 | 8 |
swin_dec_depths | 2 6 2 2 2 | 2 18 2 2 2 | 2 18 2 2 2 |
swin_dec_num_heads | 24 12 6 3 2 | 24 12 6 3 2 | 32 16 8 4 2 |
swin_dec_window_size | 16 | 8 | 8 |
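To avoid copying values by hand, the table above can be encoded as a lookup that yields the scale-specific flags. This is a sketch only: the values are transcribed from the table, while `SCALE_ARGS` and `scale_cli_args` are hypothetical names that do not exist in the repository.

```python
# Scale-specific arguments, transcribed from the table above.
SCALE_ARGS = {
    "tiny": {
        "swin_enc_embed_dim": "96",
        "swin_enc_depths": "2 2 6 2",
        "swin_enc_num_heads": "3 6 12 24",
        "swin_enc_window_size": "16",
        "swin_dec_depths": "2 6 2 2 2",
        "swin_dec_num_heads": "24 12 6 3 2",
        "swin_dec_window_size": "16",
    },
    "small": {
        "swin_enc_embed_dim": "96",
        "swin_enc_depths": "2 2 18 2",
        "swin_enc_num_heads": "3 6 12 24",
        "swin_enc_window_size": "16",
        "swin_dec_depths": "2 18 2 2 2",
        "swin_dec_num_heads": "24 12 6 3 2",
        "swin_dec_window_size": "8",
    },
    "base": {
        "swin_enc_embed_dim": "128",
        "swin_enc_depths": "2 2 18 2",
        "swin_enc_num_heads": "4 8 16 32",
        "swin_enc_window_size": "8",
        "swin_dec_depths": "2 18 2 2 2",
        "swin_dec_num_heads": "32 16 8 4 2",
        "swin_dec_window_size": "8",
    },
}


def scale_cli_args(scale):
    """Return the scale-specific flags as a flat argument list."""
    args = []
    for name, value in SCALE_ARGS[scale].items():
        args.append(f"--{name}")
        args.extend(value.split())  # multi-value flags become separate tokens
    return args


if __name__ == "__main__":
    # Print the flags in a form that can be pasted into the inference command.
    print(" ".join(scale_cli_args("small")))
```

The remaining flags of the inference command (for example --encoder, --decoder, --output_dir, and --resume) stay the same across scales, apart from pointing --resume at the matching weights.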
The commands for calculating metrics are as follows; the second one computes FID with pytorch_fid:
python eval/evaluation.py \
--gt_path data/TextErase/SCUT-EnsText/test/label/ \
--target_path path/to/model/output/
python -m pytorch_fid \
data/TextErase/SCUT-EnsText/test/label/ \
path/to/model/output/ \
--device cuda:0
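Both metric commands compare a ground-truth folder with a folder of model outputs, so a mismatch in file sets can skew or break the evaluation. The sketch below lists files that have no counterpart in the other folder. It assumes that outputs share the file name stem of their ground-truth image (extensions may differ), which is an assumption rather than documented behavior of the repository; `unpaired_stems` is a hypothetical helper.

```python
from pathlib import Path


def unpaired_stems(gt_path, target_path):
    """Compare two folders by file name stem.

    Returns (missing_outputs, extra_outputs): stems present only in the
    ground-truth folder, and stems present only in the output folder.
    """
    gt = {p.stem for p in Path(gt_path).iterdir() if p.is_file()}
    out = {p.stem for p in Path(target_path).iterdir() if p.is_file()}
    return sorted(gt - out), sorted(out - gt)


if __name__ == "__main__":
    import sys

    # Usage: python check_pairs.py <gt_path> <target_path>
    if len(sys.argv) == 3:
        missing, extra = unpaired_stems(sys.argv[1], sys.argv[2])
        print("outputs missing for:", missing)
        print("outputs without ground truth:", extra)
```

Two empty lists indicate that every ground-truth image has exactly one matching output name.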
- Download the ImageNet-pretrained weights of Swin Transformer V2 (Tiny: download link, Small: download link, Base: download link, originally released at repo).
- Download the ImageNet-pretrained weights of VGG-16 (download link, originally released by PyTorch).
- Put the pretrained weights into the pretrained folder.
- Run the example scripts in the scripts/viteraser-training-wosegmim folder. For instance, run the following command to train ViTEraser-Tiny without SegMIM pretraining.
bash scripts/viteraser-training-wosegmim/viteraser-tiny-train.sh
- Download the SegMIM pretraining weights for ViTEraser-Tiny (download link), ViTEraser-Small (download link), or ViTEraser-Base (download link).
- Download the ImageNet-pretrained weights of VGG-16 (download link, originally released by PyTorch).
- Put the pretrained weights into the pretrained folder.
- Run the example scripts in the scripts/viteraser-training-withsegmim folder. For instance, run the following command to train ViTEraser-Tiny with SegMIM pretraining.
bash scripts/viteraser-training-withsegmim/viteraser-tiny-train-withsegmim.sh
- Download the ImageNet-pretrained weights of Swin Transformer V2 (Tiny: download link, Small: download link, Base: download link, originally released at repo) into the pretrained folder.
- Run the example scripts in the scripts/segmim folder. For instance, run the following commands to perform SegMIM pretraining of ViTEraser-Tiny.
# end-to-end encoder-decoder pretraining
bash scripts/segmim/viteraser-tiny-segmim.sh
# standalone encoder finetuning
bash scripts/segmim/viteraser-tiny-encoder-finetune.sh
@inproceedings{peng2024viteraser,
title={ViTEraser: Harnessing the power of vision transformers for scene text removal with SegMIM pretraining},
author={Peng, Dezhi and Liu, Chongyu and Liu, Yuliang and Jin, Lianwen},
booktitle={Proceedings of the AAAI Conference on Artificial Intelligence},
volume={38},
number={5},
pages={4468--4477},
year={2024}
}
This repository can only be used for non-commercial research purposes.
For commercial use, please contact Prof. Lianwen Jin (eelwjin@scut.edu.cn).
Copyright 2024, Deep Learning and Vision Computing Lab, South China University of Technology.