[Paper List-3] Add 10 textrecog papers #1664

Open · wants to merge 2 commits into dev-1.x
@@ -0,0 +1,75 @@
Title: 'Hamming OCR: A Locality Sensitive Hashing Neural Network for Scene Text Recognition'
Abbreviation: HammingOCR
Tasks:
- TextRecog
Venue: arXiv
Year: 2020
Lab/Company:
- School of Computer and Information Technology, Beijing Jiaotong University, China
- Shanghai Collaborative Innovation Center of Intelligent Visual Computing, School of Computer Science, Fudan University, China
- Baidu Inc., China
URL:
  Venue: N/A
  Arxiv: 'https://arxiv.org/abs/2009.10874'
  Paper Reading URL: N/A
  Code: N/A
  Supported In MMOCR: N/S
PaperType:
- Algorithm
Abstract: 'Recently, inspired by Transformer, self-attention-based scene text
recognition approaches have achieved outstanding performance. However, we find
that model size expands rapidly as the lexicon grows. Specifically, the number
of parameters in the softmax classification layer and the output embedding
layer is proportional to the vocabulary size, which hinders the development of
lightweight text recognition models, especially for Chinese and multilingual
text. Thus, we propose a lightweight scene text recognition model named
Hamming OCR. In this model, a novel Hamming classifier, which adopts the
locality sensitive hashing (LSH) algorithm to encode each character, is
proposed to replace the softmax regression, and the generated LSH code is
directly employed to replace the output embedding. We also present a
simplified transformer decoder that reduces the number of parameters by
removing the feed-forward network and using a cross-layer parameter-sharing
technique. Compared with traditional
methods, the number of parameters in both the classification and embedding
layers is independent of the vocabulary size, which significantly reduces the
storage requirement without loss of accuracy. Experimental results on several
datasets, including four public benchmarks and a Chinese text dataset
synthesized by SynthText with more than 20,000 characters, show that Hamming
OCR achieves competitive results.'
MODELS:
Architecture:
- Transformer
Learning Method:
- Supervised
Language Modality:
- Implicit Language Model
Network Structure: 'https://user-images.githubusercontent.com/65173622/211144293-2f94c36f-a3ec-44ac-a70c-4854ccfa90af.png'
FPS:
  DEVICE: N/A
  ITEM: N/A
FLOPS:
  DEVICE: N/A
  ITEM: N/A
PARAMS: 6.6M
Experiment:
  Training DataSets:
    - MJ
  Test DataSets:
    Avg.: 74.0
    IIIT5K:
      WAICS: 82.6
    SVT:
      WAICS: 83.3
    IC13:
      WAICS: N/A
    IC15:
      WAICS: N/A
    SVTP:
      WAICS: 68.8
    CUTE:
      WAICS: 61.1
Bibtex: '@article{li2020hamming,
title={Hamming ocr: A locality sensitive hashing neural network for scene text recognition},
author={Li, Bingcong and Tang, Xin and Qi, Xianbiao and Chen, Yihao and Xiao, Rong},
journal={arXiv preprint arXiv:2009.10874},
year={2020}
}'
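For readers skimming this entry, here is a minimal sketch of the LSH character-coding idea summarized in the abstract above. The vocabulary, code length, and random-hyperplane hashing are my own illustration, not the authors' code: each character receives a short binary code, the model would predict bits instead of a |V|-way softmax, and decoding picks the nearest code by Hamming distance, so the classifier size stays independent of the vocabulary.

```python
# Illustrative only: random-hyperplane LSH codes for characters and
# nearest-Hamming-distance decoding, replacing a vocabulary-sized softmax.
import numpy as np

rng = np.random.default_rng(0)

vocab = list("abcdefghijklmnopqrstuvwxyz0123456789")  # toy vocabulary
code_bits = 16                                         # code length does not grow with len(vocab)

# Hash a stand-in embedding of every character into a binary code (sign of random projections).
char_embeddings = rng.normal(size=(len(vocab), 64))
hyperplanes = rng.normal(size=(64, code_bits))
char_codes = (char_embeddings @ hyperplanes > 0).astype(np.uint8)  # shape: (len(vocab), code_bits)

def decode(predicted_bits: np.ndarray) -> str:
    """Return the character whose LSH code is closest in Hamming distance."""
    dists = np.count_nonzero(char_codes != predicted_bits, axis=1)
    return vocab[int(np.argmin(dists))]

# A recognizer would emit one bit vector per decoding step; here we fake one
# by flipping a single bit of the true code for 'k'.
noisy = char_codes[vocab.index("k")].copy()
noisy[0] ^= 1
print(decode(noisy))  # very likely still 'k': its code remains the nearest despite the flipped bit
```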
@@ -0,0 +1,76 @@
Title: 'MASTER: Multi-Aspect Non-local Network for Scene Text Recognition'
Abbreviation: MASTER
Tasks:
- TextRecog
Venue: PR
Year: 2021
Lab/Company:
- School of Computer and Information Technology, Beijing Jiaotong University, China
- Shanghai Collaborative Innovation Center of Intelligent Visual Computing, School of Computer Science, Fudan University, China
- Baidu Inc., China
URL:
  Venue: 'https://www.sciencedirect.com/science/article/pii/S0031320321001679'
  Arxiv: 'https://arxiv.org/abs/1910.02562'
  Paper Reading URL: N/A
  Code: 'https://github.com/wenwenyu/MASTER-pytorch'
  Supported In MMOCR: 'https://github.com/open-mmlab/mmocr/tree/1.x/configs/textrecog/master'
PaperType:
- Algorithm
Abstract: 'Attention-based scene text recognizers have gained huge success; they
leverage a more compact intermediate representation to learn 1D or 2D attention
with an RNN-based encoder-decoder architecture. However, such methods suffer
from the attention-drift problem because high similarity among encoded features
leads to attention confusion under the RNN-based local attention mechanism.
Moreover, RNN-based methods have low efficiency due to poor parallelization.
To overcome these problems, we propose the MASTER, a self-attention based scene
text recognizer that (1) not only encodes the input-output attention but also
learns self-attention which encodes feature-feature and target-target relationships
inside the encoder and decoder, (2) learns an intermediate representation that
is more powerful and more robust to spatial distortion, and (3) offers high
training efficiency thanks to strong parallelization and high-speed inference
thanks to an efficient memory-cache mechanism. Extensive experiments on various
benchmarks demonstrate the superior performance of our MASTER on both regular
and irregular scene text. Pytorch code can be found at https://github.com/wenwenyu/MASTER-pytorch,
and Tensorflow code can be found at https://github.com/jiangxiluning/MASTER-TF.'
MODELS:
Architecture:
- Transformer
Learning Method:
- Supervised
Language Modality:
- Implicit Language Model
Network Structure: 'https://user-images.githubusercontent.com/65173622/211144560-9732023f-fb02-415e-abfe-0b0ff0ab8425.png'
FPS:
  DEVICE: 'NVIDIA 1080Ti'
  ITEM: 55.5
FLOPS:
  DEVICE: 'NVIDIA 1080Ti'
  ITEM: 6.07G
PARAMS: 38.81M
Experiment:
  Training DataSets:
    - MJ
    - ST
  Test DataSets:
    Avg.: 88.7
    IIIT5K:
      WAICS: 95.0
    SVT:
      WAICS: 90.6
    IC13:
      WAICS: 95.3
    IC15:
      WAICS: 79.4
    SVTP:
      WAICS: 84.5
    CUTE:
      WAICS: 87.5
Bibtex: '@article{lu2021master,
title={Master: Multi-aspect non-local network for scene text recognition},
author={Lu, Ning and Yu, Wenwen and Qi, Xianbiao and Chen, Yihao and Gong, Ping and Xiao, Rong and Bai, Xiang},
journal={Pattern Recognition},
volume={117},
pages={107980},
year={2021},
publisher={Elsevier}
}'
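To make "multi-aspect" concrete, below is a rough PyTorch sketch of a multi-aspect global-context block as I read the abstract above. The module name, channel split, and layer choices are illustrative assumptions, not the MASTER reference code linked in this entry: channels are split into several "aspects", each aspect pools its own global context, and the pooled contexts are transformed and added back to the feature map.

```python
import torch
import torch.nn as nn

class MultiAspectGlobalContext(nn.Module):
    """Toy global-context block: each channel group ("aspect") pools a separate
    global context, which is transformed and broadcast-added to the input."""

    def __init__(self, channels: int, aspects: int = 4):
        super().__init__()
        assert channels % aspects == 0
        self.aspects = aspects
        self.context_attn = nn.Conv2d(channels, aspects, kernel_size=1)  # one spatial attention map per aspect
        self.transform = nn.Sequential(
            nn.Conv2d(channels, channels, kernel_size=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, kernel_size=1),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        n, c, h, w = x.shape
        attn = self.context_attn(x).flatten(2).softmax(dim=-1)     # (n, aspects, h*w)
        feats = x.view(n, self.aspects, c // self.aspects, h * w)  # split channels into aspects
        context = torch.einsum("nak,nack->nac", attn, feats)       # per-aspect pooled context
        context = context.reshape(n, c, 1, 1)
        return x + self.transform(context)                         # broadcast-add back to the feature map

block = MultiAspectGlobalContext(channels=64, aspects=4)
print(block(torch.randn(2, 64, 8, 25)).shape)  # torch.Size([2, 64, 8, 25])
```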
@@ -0,0 +1,76 @@
Title: 'NRTR: A No-Recurrence Sequence-to-Sequence Model For Scene Text Recognition'
Abbreviation: NRTR
Tasks:
- TextRecog
Venue: ICDAR
Year: 2019
Lab/Company:
- Institute of Automation, Chinese Academy of Sciences
- University of Chinese Academy of Sciences
URL:
  Venue: 'https://ieeexplore.ieee.org/abstract/document/8978180/'
  Arxiv: 'https://arxiv.org/abs/1806.00926'
  Paper Reading URL: N/A
  Code: 'https://github.com/open-mmlab/mmocr/tree/1.x/configs/textrecog/nrtr'
  Supported In MMOCR: 'https://github.com/open-mmlab/mmocr/tree/1.x/configs/textrecog/nrtr'
PaperType:
- Algorithm
Abstract: 'Scene text recognition has attracted a great deal of research due to
its importance in various applications. Existing methods mainly adopt recurrence-
or convolution-based networks. Though they have obtained good performance, these
methods still suffer from two limitations: slow training speed due to the
internal recurrence of RNNs, and high complexity due to stacked convolutional
layers for long-term feature extraction. This paper, for the first time,
proposes a no-recurrence sequence-to-sequence text recognizer, named NRTR, that
dispenses with recurrences and convolutions entirely. NRTR follows the
encoder-decoder paradigm, where the encoder uses stacked self-attention to
extract image features, and the decoder applies stacked self-attention to
recognize text based on the encoder output. NRTR relies solely on the
self-attention mechanism and thus can be trained with more parallelization and
less complexity. Considering that scene images have large variations in text
and background, we further design a modality-transform block to effectively
transform 2D input images into 1D sequences, combined with the encoder to
extract more discriminative features. NRTR achieves state-of-the-art or highly
competitive performance on both regular and irregular benchmarks, while
requiring only a small fraction of the
training time compared to the best model from the literature (at least 8
times faster).'
MODELS:
Architecture:
- Transformer
Learning Method:
- Supervised
Language Modality:
- Implicit Language Model
Network Structure: 'https://user-images.githubusercontent.com/65173622/211147170-f8ceb124-cde4-4323-b770-493962cdfcb0.png'
FPS:
  DEVICE: N/A
  ITEM: N/A
FLOPS:
  DEVICE: N/A
  ITEM: N/A
PARAMS: N/A
Experiment:
  Training DataSets:
    - MJ
    - ST
  Test DataSets:
    Avg.: 87.4
    IIIT5K:
      WAICS: 90.1
    SVT:
      WAICS: 91.5
    IC13:
      WAICS: 95.8
    IC15:
      WAICS: 79.4
    SVTP:
      WAICS: 86.6
    CUTE:
      WAICS: 80.9
Bibtex: '@inproceedings{sheng2019nrtr,
title={NRTR: A no-recurrence sequence-to-sequence model for scene text recognition},
author={Sheng, Fenfen and Chen, Zhineng and Xu, Bo},
booktitle={2019 International conference on document analysis and recognition (ICDAR)},
pages={781--786},
year={2019},
organization={IEEE}
}'
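Alongside the MMOCR config linked in this entry, here is a self-contained toy sketch of the pipeline the abstract describes: a convolutional modality-transform block flattens the 2D image into a 1D sequence that feeds a recurrence-free Transformer encoder-decoder. Layer sizes and names are hypothetical and positional encodings are omitted for brevity; this is not the MMOCR implementation.

```python
import torch
import torch.nn as nn

class TinyNRTR(nn.Module):
    def __init__(self, vocab_size: int = 40, d_model: int = 128):
        super().__init__()
        # Modality-transform block: stride-2 convs shrink the image before it
        # is flattened into a feature sequence for the encoder.
        self.modality_transform = nn.Sequential(
            nn.Conv2d(1, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, d_model, 3, stride=2, padding=1), nn.ReLU(),
        )
        self.transformer = nn.Transformer(
            d_model=d_model, nhead=4, num_encoder_layers=2,
            num_decoder_layers=2, dim_feedforward=256, batch_first=True,
        )
        self.char_embed = nn.Embedding(vocab_size, d_model)
        self.classifier = nn.Linear(d_model, vocab_size)

    def forward(self, image: torch.Tensor, target_tokens: torch.Tensor) -> torch.Tensor:
        feats = self.modality_transform(image)                  # (N, d_model, H/4, W/4)
        n, c, h, w = feats.shape
        src = feats.permute(0, 3, 2, 1).reshape(n, w * h, c)    # 2D feature map -> 1D sequence
        tgt = self.char_embed(target_tokens)
        causal = self.transformer.generate_square_subsequent_mask(tgt.size(1))
        out = self.transformer(src, tgt, tgt_mask=causal)       # no recurrence anywhere
        return self.classifier(out)                             # per-step character logits

model = TinyNRTR()
logits = model(torch.randn(2, 1, 32, 100), torch.randint(0, 40, (2, 12)))
print(logits.shape)  # torch.Size([2, 12, 40])
```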
@@ -0,0 +1,75 @@
Title: 'Pushing the Performance Limit of Scene Text Recognizer without Human Annotation'
Abbreviation: Zheng et al.
Tasks:
- TextRecog
Venue: CVPR
Year: 2022
Lab/Company:
- School of Computer Science and Ningbo Institute, Northwestern Polytechnical University, China
- Samsung Advanced Institute of Technology (SAIT), South Korea
URL:
  Venue: 'https://openaccess.thecvf.com/content/CVPR2022/html/Zheng_Pushing_the_Performance_Limit_of_Scene_Text_Recognizer_Without_Human_CVPR_2022_paper.html'
  Arxiv: 'https://arxiv.org/abs/2204.07714'
  Paper Reading URL: N/A
  Code: N/A
  Supported In MMOCR: N/S
PaperType:
- Algorithm
Abstract: 'Scene text recognition (STR) has attracted much attention over the years
because of its wide application. Most methods train STR models in a fully
supervised manner, which requires large amounts of labeled data. Although
synthetic data contributes a lot to STR, it suffers from the real-to-synthetic
domain gap that restricts model performance. In this work, we aim to boost
STR models by leveraging both synthetic data and the numerous real unlabeled
images, entirely avoiding the cost of human annotation. A robust consistency
regularization based semi-supervised framework is proposed for STR, which can
effectively solve the instability issue due to domain inconsistency between
synthetic and real images. A character-level consistency regularization is
designed to mitigate the misalignment between characters in sequence recognition.
Extensive experiments on standard text recognition benchmarks demonstrate
the effectiveness of the proposed method. It can steadily improve existing
STR models, and boost an STR model to achieve new state-of-the-art results.
To the best of our knowledge, this is the first consistency regularization based
framework that applies successfully to STR.'
MODELS:
Architecture:
- Attention
Learning Method:
- Self-Supervised
- Semi-Supervised
- Supervised
Language Modality:
- Implicit Language Model
Network Structure: 'https://user-images.githubusercontent.com/65173622/211144099-f6db366a-e34b-401d-9b3c-13d08c1c1068.png'
FPS:
  DEVICE: N/A
  ITEM: N/A
FLOPS:
  DEVICE: N/A
  ITEM: N/A
PARAMS: N/A
Experiment:
  Training DataSets:
    - MJ
    - ST
  Test DataSets:
    Avg.: 94.5
    IIIT5K:
      WAICS: 96.5
    SVT:
      WAICS: 96.3
    IC13:
      WAICS: 98.3
    IC15:
      WAICS: 89.3
    SVTP:
      WAICS: 93.3
    CUTE:
      WAICS: 93.4
Bibtex: '@inproceedings{zheng2022pushing,
title={Pushing the Performance Limit of Scene Text Recognizer without Human Annotation},
author={Zheng, Caiyuan and Li, Hui and Rhee, Seon-Min and Han, Seungju and Han, Jae-Joon and Wang, Peng},
booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
pages={14116--14125},
year={2022}
}'
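Since no official code is linked above, here is a schematic sketch of the character-level consistency regularization idea from the abstract, under my own simplifying assumptions (confidence-masked pseudo-labels standing in for the paper's exact loss): a teacher reads a weakly augmented unlabeled crop, a student reads a strongly augmented view, and the loss aligns their predictions character by character rather than on the whole string.

```python
import torch
import torch.nn.functional as F

def char_consistency_loss(student_logits: torch.Tensor,
                          teacher_logits: torch.Tensor,
                          conf_threshold: float = 0.5) -> torch.Tensor:
    """Both inputs have shape (N, T, num_classes): per-decoding-step character
    logits from the student (strong aug.) and the teacher (weak aug.)."""
    with torch.no_grad():
        teacher_probs = teacher_logits.softmax(dim=-1)
        conf, pseudo_labels = teacher_probs.max(dim=-1)   # per-character pseudo labels
        mask = (conf > conf_threshold).float()            # keep only confident characters
    # Character-level alignment: the loss is computed step by step, so one
    # misaligned position does not contaminate the whole sequence.
    loss = F.cross_entropy(
        student_logits.reshape(-1, student_logits.size(-1)),
        pseudo_labels.reshape(-1),
        reduction="none",
    ).reshape_as(mask)
    return (loss * mask).sum() / mask.sum().clamp(min=1.0)

# Toy usage: a 5-step decoder over 37 character classes, batch of 4 unlabeled crops.
student = torch.randn(4, 5, 37, requires_grad=True)
teacher = torch.randn(4, 5, 37)
print(char_consistency_loss(student, teacher))
```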
@@ -0,0 +1,82 @@
Title: 'RobustScanner: Dynamically Enhancing Positional Clues for Robust Text Recognition'
Abbreviation: RobustScanner
Tasks:
- TextRecog
Venue: ECCV
Year: 2020
Lab/Company:
- SenseTime Research, Hong Kong, China
- School of Cyber Science and Engineering, Xi’an Jiaotong University, Xi’an, China
URL:
  Venue: 'https://link.springer.com/chapter/10.1007/978-3-030-58529-7_9'
  Arxiv: 'https://arxiv.org/abs/2007.07542'
  Paper Reading URL: N/A
  Code: 'https://github.com/open-mmlab/mmocr/tree/1.x/configs/textrecog/robust_scanner'
  Supported In MMOCR: 'https://github.com/open-mmlab/mmocr/tree/1.x/configs/textrecog/robust_scanner'
PaperType:
- Algorithm
Abstract: 'The attention-based encoder-decoder framework has recently achieved
impressive results for scene text recognition, and many variants have emerged
with improvements in recognition quality. However, it performs poorly on
contextless texts (e.g., random character sequences), which is unacceptable in
most real application scenarios. In this paper, we first deeply investigate
the decoding process of the decoder. We empirically find that a representative
character-level sequence decoder utilizes not only context information but also
positional information. Contextual information, which the existing approaches
heavily rely on, causes the problem of attention drift. To suppress this
side effect, we propose a novel position enhancement branch and dynamically
fuse its outputs with those of the decoder attention module for scene text
recognition. Specifically, it contains a position aware module to enable the
encoder to output feature vectors encoding their own spatial positions, and an
attention module to estimate glimpses using the positional clue (i.e., the
current decoding time step) only. The dynamic fusion is conducted via an
element-wise gate mechanism to obtain a more robust feature. Theoretically,
our proposed method, dubbed RobustScanner, decodes individual characters with
a dynamic ratio between context and positional clues, relying more on
positional ones when decoding sequences with scarce context, and is thus
robust and practical. Empirically, it has achieved new state-of-the-art
results on popular regular and irregular text recognition benchmarks without
much performance drop on contextless benchmarks, validating its robustness in
both contextual and
contextless application scenarios.'
MODELS:
Architecture:
- Attention
Learning Method:
- Supervised
Language Modality:
- Implicit Language Model
Network Structure: 'https://user-images.githubusercontent.com/65173622/211147345-0515c292-00d1-458f-b5c7-b3a940a0c12c.png'
FPS:
  DEVICE: N/A
  ITEM: N/A
FLOPS:
  DEVICE: N/A
  ITEM: N/A
PARAMS: N/A
Experiment:
  Training DataSets:
    - MJ
    - ST
    - Real
  Test DataSets:
    Avg.: 88.9
    IIIT5K:
      WAICS: 95.4
    SVT:
      WAICS: 89.3
    IC13:
      WAICS: 94.1
    IC15:
      WAICS: 79.2
    SVTP:
      WAICS: 82.9
    CUTE:
      WAICS: 92.4
Bibtex: '@inproceedings{yue2020robustscanner,
title={Robustscanner: Dynamically enhancing positional clues for robust text recognition},
author={Yue, Xiaoyu and Kuang, Zhanghui and Lin, Chenhao and Sun, Hongbin and Zhang, Wayne},
booktitle={European Conference on Computer Vision},
pages={135--151},
year={2020},
organization={Springer}
}'
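To make the "dynamic ratio between context and positional clues" from the abstract concrete, here is a small PyTorch sketch of one plausible element-wise gated fusion; the module name and exact gating formula are my own illustration, not the MMOCR robust_scanner implementation linked in this entry.

```python
import torch
import torch.nn as nn

class DynamicFusion(nn.Module):
    """Element-wise gate deciding, per channel, how much the context-driven
    glimpse versus the position-driven glimpse contributes at each decode step."""

    def __init__(self, dim: int = 256):
        super().__init__()
        self.gate = nn.Sequential(nn.Linear(2 * dim, dim), nn.Sigmoid())

    def forward(self, context_glimpse: torch.Tensor, position_glimpse: torch.Tensor) -> torch.Tensor:
        both = torch.cat([context_glimpse, position_glimpse], dim=-1)
        ratio = self.gate(both)                    # in (0, 1), computed from both clues
        # Scarce-context inputs can push the ratio toward the positional branch.
        return ratio * context_glimpse + (1.0 - ratio) * position_glimpse

fusion = DynamicFusion(dim=256)
fused = fusion(torch.randn(2, 256), torch.randn(2, 256))
print(fused.shape)  # torch.Size([2, 256])
```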