English | 中文

SVTR

SVTR: Scene Text Recognition with a Single Visual Model

Introduction

Dominant scene text recognition models commonly contain two building blocks, a visual model for feature extraction and a sequence model for text transcription. This hybrid architecture, although accurate, is complex and less efficient. This paper proposes a Single Visual model for Scene Text recognition within the patch-wise image tokenization framework, which dispenses with the sequential modeling entirely. The method, termed SVTR, firstly decomposes an image text into small patches named character components. Afterward, hierarchical stages are recurrently carried out by component-level mixing, merging and/or combining. Global and local mixing blocks are devised to perceive the inter-character and intra-character patterns, leading to a multi-grained character component perception. Thus, characters are recognized by a simple linear prediction. Experimental results on both English and Chinese scene text recognition tasks demonstrate the effectiveness of SVTR. SVTR-L (Large) achieves highly competitive accuracy in English and outperforms existing methods by a large margin in Chinese, while running faster. In addition, SVTR-T (Tiny) is an effective and much smaller model, which shows appealing speed at inference. [1]

Figure 1. Architecture of SVTR [1]

Requirements

| mindspore | ascend driver | firmware | cann toolkit/kernel |
| --- | --- | --- | --- |
| 2.3.1 | 24.1.RC2 | 7.3.0.1.231 | 8.0.RC2.beta1 |

Quick Start

Preparation

Installation

Please refer to the installation instructions in MindOCR.

Dataset Preparation

MJSynth, validation and evaluation dataset

Part of the LMDB dataset for training and evaluation can be downloaded from here (ref: deep-text-recognition-benchmark). There are several zip files:

  • data_lmdb_release.zip contains the datasets including training data, validation data and evaluation data.
  • validation.zip: same as the validation/ within data_lmdb_release.zip
  • evaluation.zip: same as the evaluation/ within data_lmdb_release.zip

SynthText dataset

For SynthText, we do not use the LMDB dataset provided in data_lmdb_release.zip, since it contains only part of the cropped images. Please download the raw dataset from here and prepare the LMDB dataset using the following command:

python tools/dataset_converters/convert.py \
    --dataset_name synthtext \
    --task rec_lmdb \
    --image_dir path_to_SynthText \
    --label_dir path_to_SynthText_gt.mat \
    --output_path ST_full

The resulting ST_full folder contains the full set of cropped SynthText images in LMDB format. Please replace the ST folder with the ST_full folder.

Dataset Usage

Finally, the data structure should look like this:

data_lmdb_release/
├── evaluation
│   ├── CUTE80
│   │   ├── data.mdb
│   │   └── lock.mdb
│   ├── IC03_860
│   │   ├── data.mdb
│   │   └── lock.mdb
│   ├── IC03_867
│   │   ├── data.mdb
│   │   └── lock.mdb
│   ├── IC13_1015
│   │   ├── data.mdb
│   │   └── lock.mdb
│   ├── ...
├── training
│   ├── MJ
│   │   ├── MJ_test
│   │   │   ├── data.mdb
│   │   │   └── lock.mdb
│   │   ├── MJ_train
│   │   │   ├── data.mdb
│   │   │   └── lock.mdb
│   │   └── MJ_valid
│   │       ├── data.mdb
│   │       └── lock.mdb
│   └── ST_full
│       ├── data.mdb
│       └── lock.mdb
└── validation
    ├── data.mdb
    └── lock.mdb
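
Before training, the layout can be sanity-checked with a short script. The following is a minimal sketch and not part of MindOCR: the helper names `find_lmdb_dirs` and `missing_lock_files` are hypothetical, and the only assumption is that each LMDB dataset directory holds a `data.mdb` and a `lock.mdb` file, as in the tree.

```python
from pathlib import Path


def find_lmdb_dirs(root):
    """Return every directory under `root` that contains a `data.mdb` file."""
    root = Path(root)
    return sorted(p.parent for p in root.rglob("data.mdb"))


def missing_lock_files(root):
    """Return the LMDB directories under `root` that have no `lock.mdb` file."""
    return [d for d in find_lmdb_dirs(root) if not (d / "lock.mdb").is_file()]
```

Calling `find_lmdb_dirs("data_lmdb_release")` should list directories such as `evaluation/CUTE80`, `training/MJ/MJ_train`, `training/ST_full` and `validation`, and `missing_lock_files` should return an empty list for a complete download.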

We use the datasets under the training/ folder for training and the union dataset under validation/ for validation. After training, we use the datasets under evaluation/ to evaluate model accuracy.

Training: (total 16,185,770 samples)

  • MJSynth (MJ)
    • Train: 21.2 GB, 7224586 samples
    • Valid: 2.36 GB, 802731 samples
    • Test: 2.61 GB, 891924 samples
  • SynthText (ST)
    • Train: 17.0 GB, 7266529 samples

Validation:

  • Valid: 138 MB, 6992 samples

Evaluation: (total 12,067 samples)

Data configuration for model training

To reproduce the model training, it is recommended that you modify the configuration yaml as follows:

...
train:
  ...
  dataset:
    type: LMDBDataset
    dataset_root: dir/to/data_lmdb_release/                           # Root dir of training dataset
    data_dir: training/                                               # Dir of training dataset, concatenated with `dataset_root` to be the complete dir of training dataset
...
eval:
  dataset:
    type: LMDBDataset
    dataset_root: dir/to/data_lmdb_release/                           # Root dir of validation dataset
    data_dir: validation/                                             # Dir of validation dataset, concatenated with `dataset_root` to be the complete dir of validation dataset
  ...
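
The comments in the config say that `data_dir` is concatenated with `dataset_root` to form the complete dataset directory. The sketch below illustrates that rule with `os.path.join`. This is an assumption for illustration only: the helper `resolve_data_dir` is hypothetical, and MindOCR's own path handling may differ in detail.

```python
import os


def resolve_data_dir(dataset_root, data_dir):
    """Join the root and the relative data dir into one dataset path."""
    return os.path.join(dataset_root, data_dir)


# Paths taken from the config comments; replace them with your own.
train_dir = resolve_data_dir("dir/to/data_lmdb_release/", "training/")
eval_dir = resolve_data_dir("dir/to/data_lmdb_release/", "validation/")
print(train_dir)  # dir/to/data_lmdb_release/training/
print(eval_dir)   # dir/to/data_lmdb_release/validation/
```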

Data configuration for model evaluation

We use the dataset under evaluation/ as the benchmark dataset. On each individual dataset (e.g. CUTE80, IC03_860, etc.), we perform a full evaluation by setting the dataset's directory to the evaluation dataset. This way, we get a list of the corresponding accuracies for each dataset, and then the reported accuracies are the average of these values.

To reproduce the reported evaluation results, you can:

  • Option 1: Repeat the evaluation step for each individual dataset: CUTE80, IC03_860, IC03_867, IC13_857, IC13_1015, IC15_1811, IC15_2077, IIIT5k_3000, SVT, SVTP. Then take the average score.

  • Option 2: Put all the benchmark dataset folders under the same directory, e.g. evaluation/, and use the script tools/benchmarking/multi_dataset_eval.py.

  1. Evaluate on one specific dataset

For example, you can evaluate the model on dataset CUTE80 by modifying the config yaml as follows:

...
train:
  # NO NEED TO CHANGE ANYTHING IN TRAIN SINCE IT IS NOT USED
...
eval:
  dataset:
    type: LMDBDataset
    dataset_root: dir/to/data_lmdb_release/                           # Root dir of evaluation dataset
    data_dir: evaluation/CUTE80/                                      # Dir of evaluation dataset, concatenated with `dataset_root` to be the complete dir of evaluation dataset
  ...

By running tools/eval.py as noted in section Model Evaluation with the above config yaml, you can get the accuracy performance on dataset CUTE80.

  2. Evaluate on multiple datasets under the same folder

Assume you have put all benchmark datasets under evaluation/ as shown below:

data_lmdb_release/
├── evaluation
│   ├── CUTE80
│   │   ├── data.mdb
│   │   └── lock.mdb
│   ├── IC03_860
│   │   ├── data.mdb
│   │   └── lock.mdb
│   ├── IC03_867
│   │   ├── data.mdb
│   │   └── lock.mdb
│   ├── IC13_1015
│   │   ├── data.mdb
│   │   └── lock.mdb
│   ├── ...

Then you can evaluate on each dataset by modifying the config yaml as follows and running the script tools/benchmarking/multi_dataset_eval.py.

...
train:
  # NO NEED TO CHANGE ANYTHING IN TRAIN SINCE IT IS NOT USED
...
eval:
  dataset:
    type: LMDBDataset
    dataset_root: dir/to/data_lmdb_release/                           # Root dir of evaluation dataset
    data_dir: evaluation/                                   # Dir of evaluation dataset, concatenated with `dataset_root` to be the complete dir of evaluation dataset
  ...

Check YAML Config Files

Apart from the dataset setting, please also check the following important args: system.distribute, system.val_while_train, common.batch_size, train.ckpt_save_dir, train.dataset.dataset_root, train.dataset.data_dir, train.dataset.label_file, eval.ckpt_load_path, eval.dataset.dataset_root, eval.dataset.data_dir, eval.dataset.label_file, eval.loader.batch_size. Explanations of these important args:

system:
  distribute: True                                                    # `True` for distributed training, `False` for standalone training
  amp_level: 'O2'
  amp_level_infer: "O2"
  seed: 42
  val_while_train: True                                               # Validate while training
  drop_overflow_update: False
common:
  ...
  batch_size: &batch_size 512                                         # Batch size for training
...
train:
  ckpt_save_dir: './tmp_rec'                                          # The training result (including checkpoints, per-epoch performance and curves) saving directory
  dataset_sink_mode: False
  dataset:
    type: LMDBDataset
    dataset_root: dir/to/data_lmdb_release/                           # Root dir of training dataset
    data_dir: training/                                               # Dir of training dataset, concatenated with `dataset_root` to be the complete dir of training dataset
...
eval:
  ckpt_load_path: './tmp_rec/best.ckpt'                               # checkpoint file path
  dataset_sink_mode: False
  dataset:
    type: LMDBDataset
    dataset_root: dir/to/data_lmdb_release/                           # Root dir of validation/evaluation dataset
    data_dir: validation/                                             # Dir of validation/evaluation dataset, concatenated with `dataset_root` to be the complete dir of validation/evaluation dataset
  ...
  loader:
      shuffle: False
      batch_size: 512                                                 # Batch size for validation/evaluation
...

Notes:

  • As the global batch size (batch_size x num_devices) is important for reproducing the result, please adjust batch_size accordingly to keep the global batch size unchanged for a different number of NPUs, or adjust the learning rate linearly to a new global batch size.
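
The two adjustments described in the note can be sketched as follows. The reference values of 512 per device on 4 devices come from the config and the distributed training command in this document. The helper names and the learning rate of `1e-3` are placeholders for illustration, not values taken from the recipe.

```python
def global_batch_size(batch_size, num_devices):
    """Global batch size = per-device batch size x number of devices."""
    return batch_size * num_devices


def per_device_batch_size(target_global, num_devices):
    """Per-device batch size that keeps the global batch size unchanged."""
    if target_global % num_devices != 0:
        raise ValueError("global batch size must be divisible by num_devices")
    return target_global // num_devices


def scaled_learning_rate(base_lr, base_global, new_global):
    """Scale the learning rate linearly with the global batch size."""
    return base_lr * new_global / base_global


reference = global_batch_size(512, 4)        # 2048
print(per_device_batch_size(reference, 8))   # 256 per device on 8 NPUs
# Placeholder base_lr; use the learning rate from your own config.
print(scaled_learning_rate(1e-3, reference, 1024))
```

Either option alone is enough: keep the global batch size fixed by changing `batch_size`, or keep `batch_size` and rescale the learning rate.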

Model Training

  • Distributed Training

It is easy to reproduce the reported results with the pre-defined training recipe. For distributed training on multiple Ascend 910 devices, please set the configuration parameter distribute to True and run:

# distributed training on multiple Ascend devices
mpirun --allow-run-as-root -n 4 python tools/train.py --config configs/rec/svtr/svtr_tiny.yaml

  • Standalone Training

If you want to train or finetune the model on a smaller dataset without distributed training, please set the configuration parameter distribute to False and run:

# standalone training on a CPU/Ascend device
python tools/train.py --config configs/rec/svtr/svtr_tiny.yaml

The training results (including checkpoints, per-epoch performance and curves) will be saved in the directory specified by the arg ckpt_save_dir. The default directory is ./tmp_rec.

Model Evaluation

To evaluate the accuracy of the trained model, you can use tools/eval.py. Please set the arg ckpt_load_path in the eval section of the yaml config file to the checkpoint path, set distribute to False, and then run:

python tools/eval.py --config configs/rec/svtr/svtr_tiny.yaml

Character Dictionary

Default Setting

To transform the ground-truth text into label IDs, we have to provide a character dictionary in which keys are characters and values are IDs. By default, the dictionary is "0123456789abcdefghijklmnopqrstuvwxyz", which means id=0 corresponds to the character "0". In this case, the dictionary only considers digits and lowercase English letters, excluding spaces.
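
The default mapping can be illustrated with a few lines of Python. This is a minimal sketch, not MindOCR's label encoder: the `encode` helper is hypothetical, and silently dropping characters that are not in the dictionary is an assumption made for the example.

```python
DEFAULT_CHARSET = "0123456789abcdefghijklmnopqrstuvwxyz"

# Keys are characters, values are IDs; "0" gets id 0, "a" gets id 10.
char_to_id = {ch: idx for idx, ch in enumerate(DEFAULT_CHARSET)}


def encode(text, lower=True):
    """Map text to label IDs, skipping characters outside the dictionary."""
    if lower:
        text = text.lower()
    return [char_to_id[ch] for ch in text if ch in char_to_id]


print(encode("SVTR 2022"))  # [28, 31, 29, 27, 2, 0, 2, 2]
```

The space in the input is dropped because the default dictionary does not contain it.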

Built-in Dictionaries

There are some built-in dictionaries, which are placed in mindocr/utils/dict/, and you can choose the appropriate dictionary to use.

  • en_dict.txt is an English dictionary containing 94 characters, including numbers, common symbols, and uppercase and lowercase English letters.
  • ch_dict.txt is a Chinese dictionary containing 6623 characters, including commonly used simplified and traditional Chinese, numbers, common symbols, uppercase and lowercase English letters.

Customized Dictionary

You can also create a custom dictionary file (***.txt) and place it under mindocr/utils/dict/. The dictionary must be a .txt file with one character per line.

To use a specific dictionary, set the parameter character_dict_path to the path of the dictionary, and change the parameter num_classes to the corresponding number, which is the number of characters in the dictionary + 1.
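
The rule for `num_classes` can be checked against a dictionary file with a short helper. This is a sketch under the assumptions stated in this section only (one character per line, `num_classes` equal to the character count plus 1); the function names are hypothetical, and blank lines are ignored as a simplification.

```python
def load_dictionary(path):
    """Read a dictionary file with one character per line."""
    with open(path, encoding="utf-8") as f:
        lines = (line.rstrip("\r\n") for line in f)
        return [ch for ch in lines if ch]


def num_classes_for(chars):
    """num_classes = number of characters in the dictionary + 1."""
    return len(chars) + 1


# The 36-character default dictionary gives num_classes = 37.
print(num_classes_for("0123456789abcdefghijklmnopqrstuvwxyz"))  # 37
```

For a file on disk, `num_classes_for(load_dictionary(path))` gives the value to put in the config, where `path` is the same file that `character_dict_path` points to.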

Notes:

  • You can include the space character by setting the parameter use_space_char in configuration yaml to True.
  • Remember to check the value of dataset->transform_pipeline->RecAttnLabelEncode->lower in the configuration yaml. Set it to False if you prefer case-sensitive encoding.

Chinese Text Recognition Model Training

Currently, this model supports multilingual recognition and provides pre-trained models for different languages. Details are as follows:

Chinese Dataset Preparation and Configuration

We use a public Chinese text benchmark dataset Benchmarking-Chinese-Text-Recognition for SVTR training and evaluation.

For detailed instructions on data preparation and yaml configuration, please refer to ch_dataset.

Training

To train with the prepared datasets and config file, please run:

mpirun --allow-run-as-root -n 4 python tools/train.py --config configs/rec/svtr/svtr_tiny_ch.yaml

Training with Custom Datasets

You can train models for different languages with your own custom datasets. Loading the pretrained Chinese model to finetune on your own dataset usually yields better results than training from scratch. Please refer to the tutorial Training Recognition Network with Custom Datasets.

Performance

General Purpose Chinese Models

Experiments are tested on ascend 910* with mindspore 2.3.1 graph mode.

coming soon

Experiments are tested on ascend 910 with mindspore 2.3.1 graph mode.

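| model name | cards | batch size | languages | jit level | graph compile | ms/step | img/s | scene | web | document | recipe | weight |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| SVTR-Tiny | 4 | 256 | Chinese | O2 | 235.1 s | 37.75 | 1580 | 65.93% | 69.64% | 98.01% | svtr_tiny_ch.yaml | ckpt \| mindir |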

Specific Purpose Models

Experiments are tested on ascend 910* with mindspore 2.3.1 graph mode.

coming soon

Experiments are tested on ascend 910 with mindspore 2.3.1 graph mode.

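| model name | cards | batch size | jit level | graph compile | ms/step | img/s | accuracy | recipe | weight |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| SVTR-Tiny | 4 | 512 | O2 | 226.86 s | 49.38 | 4560 | 90.23% | yaml | ckpt \| mindir |
| SVTR-Tiny-8P | 8 | 512 | O2 | 230.74 s | 55.16 | 9840 | 90.32% | yaml | ckpt \| mindir |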

Detailed accuracy results for each benchmark dataset:

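| model name | IC03_860 | IC03_867 | IC13_857 | IC13_1015 | IC15_1811 | IC15_2077 | IIIT5k_3000 | SVT | SVTP | CUTE80 | average |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| SVTR-Tiny | 95.70% | 95.50% | 95.33% | 93.99% | 83.60% | 79.83% | 94.70% | 91.96% | 85.58% | 86.11% | 90.23% |
| SVTR-Tiny-8P | 95.93% | 95.62% | 95.33% | 93.89% | 84.32% | 80.55% | 94.33% | 90.57% | 86.20% | 86.46% | 90.32% |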
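
The reported average is the unweighted mean of the ten per-dataset accuracies, so every benchmark counts equally regardless of its size. The sketch below reproduces the SVTR-Tiny average from the values in the table; the variable and function names are illustrative only.

```python
# Per-dataset accuracies (%) of SVTR-Tiny, copied from the table.
svtr_tiny = {
    "IC03_860": 95.70, "IC03_867": 95.50, "IC13_857": 95.33,
    "IC13_1015": 93.99, "IC15_1811": 83.60, "IC15_2077": 79.83,
    "IIIT5k_3000": 94.70, "SVT": 91.96, "SVTP": 85.58, "CUTE80": 86.11,
}


def average_accuracy(scores):
    """Unweighted mean over datasets, each dataset counting once."""
    return sum(scores.values()) / len(scores)


print(f"{average_accuracy(svtr_tiny):.2f}%")  # 90.23%
```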

Notes

  • To reproduce the results in other contexts, please ensure the global batch size is the same.
  • The characters supported by the model are lowercase English letters from a to z and digits from 0 to 9. For more on the dictionary, please refer to the Character Dictionary section.
  • The models are trained from scratch without any pre-training. For more details on the training and evaluation datasets, please refer to the Dataset Preparation and Dataset Usage sections.
  • The input shape of the SVTR MindIR is (1, 3, 64, 256).

References

[1] Yongkun Du, Zhineng Chen, Caiyan Jia, Xiaoting Yin, Tianlun Zheng, Chenxia Li, Yuning Du, Yu-Gang Jiang. SVTR: Scene Text Recognition with a Single Visual Model. arXiv preprint arXiv:2205.00159, 2022.