[Paper List-3] Add 10 textrecog papers #1664

Open · wants to merge 2 commits into dev-1.x
@@ -0,0 +1,75 @@
Title: 'Hamming OCR: A Locality Sensitive Hashing Neural Network for Scene Text Recognition'
Abbreviation: HammingOCR
Tasks:
- TextRecog
Venue: arXiv
Year: 2020
Lab/Company:
- School of Computer and Information Technology, Beijing Jiaotong University, China
- Shanghai Collaborative Innovation Center of Intelligent Visual Computing, School of Computer Science, Fudan University, China
- Baidu Inc., China
URL:
  Venue: N/A
  Arxiv: 'https://arxiv.org/abs/2009.10874'
  Paper Reading URL: N/A
  Code: N/A
  Supported In MMOCR: N/S
PaperType:
- Algorithm
Abstract: 'Recently, inspired by Transformer, self-attention-based scene text
recognition approaches have achieved outstanding performance. However, we find
that model size expands rapidly as the lexicon grows. Specifically, the number
of parameters in the softmax classification layer and the output embedding
layer is proportional to the vocabulary size, which hinders the development of
lightweight text recognition models, especially for Chinese and multilingual
text. Thus, we propose a lightweight scene text recognition model named
Hamming OCR. In this model, a novel Hamming classifier, which adopts the
locality sensitive hashing (LSH) algorithm to encode each character, is
proposed to replace the softmax regression, and the generated LSH code is
directly employed to replace the output embedding. We also present a
simplified transformer decoder that reduces the number of parameters by
removing the feed-forward network and using a cross-layer parameter-sharing
technique. Compared with traditional
methods, the number of parameters in both the classification and embedding
layers is independent of the vocabulary size, which significantly reduces the
storage requirement without loss of accuracy. Experimental results on several
datasets, including four public benchmarks and a Chinese text dataset
synthesized by SynthText with more than 20,000 characters, show that Hamming
OCR achieves competitive results.'
MODELS:
Architecture:
- Transformer
Learning Method:
- Supervised
Language Modality:
- Implicit Language Model
Network Structure: 'https://user-images.githubusercontent.com/65173622/211144293-2f94c36f-a3ec-44ac-a70c-4854ccfa90af.png'
FPS:
  DEVICE: N/A
  ITEM: N/A
FLOPS:
  DEVICE: N/A
  ITEM: N/A
PARAMS: 6.6M
Experiment:
  Training DataSets:
    - MJ
  Test DataSets:
    Avg.: 74.0
    IIIT5K:
      WAICS: 82.6
    SVT:
      WAICS: 83.3
    IC13:
      WAICS: N/A
    IC15:
      WAICS: N/A
    SVTP:
      WAICS: 68.8
    CUTE:
      WAICS: 61.1
Bibtex: '@article{li2020hamming,
title={Hamming ocr: A locality sensitive hashing neural network for scene text recognition},
author={Li, Bingcong and Tang, Xin and Qi, Xianbiao and Chen, Yihao and Xiao, Rong},
journal={arXiv preprint arXiv:2009.10874},
year={2020}
}'
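For readers skimming this entry, here is a minimal sketch of the LSH character-coding idea summarized in the abstract above. The vocabulary, code length, and random-hyperplane hashing are my own illustration, not the authors' code: each character receives a short binary code, the model would predict bits instead of a |V|-way softmax, and decoding picks the nearest code by Hamming distance, so the classifier size stays independent of the vocabulary.

```python
# Illustrative only: random-hyperplane LSH codes for characters and
# nearest-Hamming-distance decoding, replacing a vocabulary-sized softmax.
import numpy as np

rng = np.random.default_rng(0)

vocab = list("abcdefghijklmnopqrstuvwxyz0123456789")  # toy vocabulary
code_bits = 16                                         # code length does not grow with len(vocab)

# Hash a stand-in embedding of every character into a binary code (sign of random projections).
char_embeddings = rng.normal(size=(len(vocab), 64))
hyperplanes = rng.normal(size=(64, code_bits))
char_codes = (char_embeddings @ hyperplanes > 0).astype(np.uint8)  # shape: (len(vocab), code_bits)

def decode(predicted_bits: np.ndarray) -> str:
    """Return the character whose LSH code is closest in Hamming distance."""
    dists = np.count_nonzero(char_codes != predicted_bits, axis=1)
    return vocab[int(np.argmin(dists))]

# A recognizer would emit one bit vector per decoding step; here we fake one
# by flipping a single bit of the true code for 'k'.
noisy = char_codes[vocab.index("k")].copy()
noisy[0] ^= 1
print(decode(noisy))  # very likely still 'k': its code remains the nearest despite the flipped bit
```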
@@ -0,0 +1,76 @@
Title: 'MASTER: Multi-Aspect Non-local Network for Scene Text Recognition'
Abbreviation: MASTER
Tasks:
- TextRecog
Venue: PR
Year: 2021
Lab/Company:
- School of Computer and Information Technology, Beijing Jiaotong University, China
- Shanghai Collaborative Innovation Center of Intelligent Visual Computing, School of Computer Science, Fudan University, China
- Baidu Inc., China
URL:
  Venue: 'https://www.sciencedirect.com/science/article/pii/S0031320321001679'
  Arxiv: 'https://arxiv.org/abs/1910.02562'
  Paper Reading URL: N/A
  Code: 'https://github.com/wenwenyu/MASTER-pytorch'
  Supported In MMOCR: 'https://github.com/open-mmlab/mmocr/tree/1.x/configs/textrecog/master'
PaperType:
- Algorithm
Abstract: 'Attention-based scene text recognizers have gained huge success; they
leverage a more compact intermediate representation to learn 1D or 2D attention
with an RNN-based encoder-decoder architecture. However, such methods suffer
from the attention-drift problem because high similarity among encoded features
leads to attention confusion under the RNN-based local attention mechanism.
Moreover, RNN-based methods have low efficiency due to poor parallelization.
To overcome these problems, we propose the MASTER, a self-attention based scene
text recognizer that (1) not only encodes the input-output attention but also
learns self-attention which encodes feature-feature and target-target relationships
inside the encoder and decoder, (2) learns an intermediate representation that
is more powerful and more robust to spatial distortion, and (3) offers high
training efficiency thanks to strong parallelization and high-speed inference
thanks to an efficient memory-cache mechanism. Extensive experiments on various
benchmarks demonstrate the superior performance of our MASTER on both regular
and irregular scene text. Pytorch code can be found at https://github.com/wenwenyu/MASTER-pytorch,
and Tensorflow code can be found at https://github.com/jiangxiluning/MASTER-TF.'
MODELS:
Architecture:
- Transformer
Learning Method:
- Supervised
Language Modality:
- Implicit Language Model
Network Structure: 'https://user-images.githubusercontent.com/65173622/211144560-9732023f-fb02-415e-abfe-0b0ff0ab8425.png'
FPS:
  DEVICE: 'NVIDIA 1080Ti'
  ITEM: 55.5
FLOPS:
  DEVICE: 'NVIDIA 1080Ti'
  ITEM: 6.07G
PARAMS: 38.81M
Experiment:
  Training DataSets:
    - MJ
    - ST
  Test DataSets:
    Avg.: 88.7
    IIIT5K:
      WAICS: 95.0
    SVT:
      WAICS: 90.6
    IC13:
      WAICS: 95.3
    IC15:
      WAICS: 79.4
    SVTP:
      WAICS: 84.5
    CUTE:
      WAICS: 87.5
Bibtex: '@article{lu2021master,
title={Master: Multi-aspect non-local network for scene text recognition},
author={Lu, Ning and Yu, Wenwen and Qi, Xianbiao and Chen, Yihao and Gong, Ping and Xiao, Rong and Bai, Xiang},
journal={Pattern Recognition},
volume={117},
pages={107980},
year={2021},
publisher={Elsevier}
}'
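To make "multi-aspect" concrete, below is a rough PyTorch sketch of a multi-aspect global-context block as I read the abstract above. The module name, channel split, and layer choices are illustrative assumptions, not the MASTER reference code linked in this entry: channels are split into several "aspects", each aspect pools its own global context, and the pooled contexts are transformed and added back to the feature map.

```python
import torch
import torch.nn as nn

class MultiAspectGlobalContext(nn.Module):
    """Toy global-context block: each channel group ("aspect") pools a separate
    global context, which is transformed and broadcast-added to the input."""

    def __init__(self, channels: int, aspects: int = 4):
        super().__init__()
        assert channels % aspects == 0
        self.aspects = aspects
        self.context_attn = nn.Conv2d(channels, aspects, kernel_size=1)  # one spatial attention map per aspect
        self.transform = nn.Sequential(
            nn.Conv2d(channels, channels, kernel_size=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, kernel_size=1),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        n, c, h, w = x.shape
        attn = self.context_attn(x).flatten(2).softmax(dim=-1)     # (n, aspects, h*w)
        feats = x.view(n, self.aspects, c // self.aspects, h * w)  # split channels into aspects
        context = torch.einsum("nak,nack->nac", attn, feats)       # per-aspect pooled context
        context = context.reshape(n, c, 1, 1)
        return x + self.transform(context)                         # broadcast-add back to the feature map

block = MultiAspectGlobalContext(channels=64, aspects=4)
print(block(torch.randn(2, 64, 8, 25)).shape)  # torch.Size([2, 64, 8, 25])
```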
@@ -0,0 +1,76 @@
Title: 'NRTR: A No-Recurrence Sequence-to-Sequence Model For Scene Text Recognition'
Abbreviation: NRTR
Tasks:
- TextRecog
Venue: ICDAR
Year: 2019
Lab/Company:
- Institute of Automation, Chinese Academy of Sciences
- University of Chinese Academy of Sciences
URL:
  Venue: 'https://ieeexplore.ieee.org/abstract/document/8978180/'
  Arxiv: 'https://arxiv.org/abs/1806.00926'
  Paper Reading URL: N/A
  Code: 'https://github.com/open-mmlab/mmocr/tree/1.x/configs/textrecog/nrtr'
  Supported In MMOCR: 'https://github.com/open-mmlab/mmocr/tree/1.x/configs/textrecog/nrtr'
PaperType:
- Algorithm
Abstract: 'Scene text recognition has attracted a great deal of research due to
its importance in various applications. Existing methods mainly adopt recurrence-
or convolution-based networks. Though they have obtained good performance, these
methods still suffer from two limitations: slow training speed due to the
internal recurrence of RNNs, and high complexity due to stacked convolutional
layers for long-term feature extraction. This paper, for the first time,
proposes a no-recurrence sequence-to-sequence text recognizer, named NRTR, that
dispenses with recurrences and convolutions entirely. NRTR follows the
encoder-decoder paradigm, where the encoder uses stacked self-attention to
extract image features, and the decoder applies stacked self-attention to
recognize text based on the encoder output. NRTR relies solely on the
self-attention mechanism and thus can be trained with more parallelization and
less complexity. Considering that scene images have large variations in text
and background, we further design a modality-transform block to effectively
transform 2D input images into 1D sequences, combined with the encoder to
extract more discriminative features. NRTR achieves state-of-the-art or highly
competitive performance on both regular and irregular benchmarks, while
requiring only a small fraction of the
training time compared to the best model from the literature (at least 8
times faster).'
MODELS:
Architecture:
- Transformer
Learning Method:
- Supervised
Language Modality:
- Implicit Language Model
Network Structure: 'https://user-images.githubusercontent.com/65173622/211147170-f8ceb124-cde4-4323-b770-493962cdfcb0.png'
FPS:
  DEVICE: N/A
  ITEM: N/A
FLOPS:
  DEVICE: N/A
  ITEM: N/A
PARAMS: N/A
Experiment:
  Training DataSets:
    - MJ
    - ST
  Test DataSets:
    Avg.: 87.4
    IIIT5K:
      WAICS: 90.1
    SVT:
      WAICS: 91.5
    IC13:
      WAICS: 95.8
    IC15:
      WAICS: 79.4
    SVTP:
      WAICS: 86.6
    CUTE:
      WAICS: 80.9
Bibtex: '@inproceedings{sheng2019nrtr,
title={NRTR: A no-recurrence sequence-to-sequence model for scene text recognition},
author={Sheng, Fenfen and Chen, Zhineng and Xu, Bo},
booktitle={2019 International conference on document analysis and recognition (ICDAR)},
pages={781--786},
year={2019},
organization={IEEE}
}'
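Alongside the MMOCR config linked in this entry, here is a self-contained toy sketch of the pipeline the abstract describes: a convolutional modality-transform block flattens the 2D image into a 1D sequence that feeds a recurrence-free Transformer encoder-decoder. Layer sizes and names are hypothetical and positional encodings are omitted for brevity; this is not the MMOCR implementation.

```python
import torch
import torch.nn as nn

class TinyNRTR(nn.Module):
    def __init__(self, vocab_size: int = 40, d_model: int = 128):
        super().__init__()
        # Modality-transform block: stride-2 convs shrink the image before it
        # is flattened into a feature sequence for the encoder.
        self.modality_transform = nn.Sequential(
            nn.Conv2d(1, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, d_model, 3, stride=2, padding=1), nn.ReLU(),
        )
        self.transformer = nn.Transformer(
            d_model=d_model, nhead=4, num_encoder_layers=2,
            num_decoder_layers=2, dim_feedforward=256, batch_first=True,
        )
        self.char_embed = nn.Embedding(vocab_size, d_model)
        self.classifier = nn.Linear(d_model, vocab_size)

    def forward(self, image: torch.Tensor, target_tokens: torch.Tensor) -> torch.Tensor:
        feats = self.modality_transform(image)                  # (N, d_model, H/4, W/4)
        n, c, h, w = feats.shape
        src = feats.permute(0, 3, 2, 1).reshape(n, w * h, c)    # 2D feature map -> 1D sequence
        tgt = self.char_embed(target_tokens)
        causal = self.transformer.generate_square_subsequent_mask(tgt.size(1))
        out = self.transformer(src, tgt, tgt_mask=causal)       # no recurrence anywhere
        return self.classifier(out)                             # per-step character logits

model = TinyNRTR()
logits = model(torch.randn(2, 1, 32, 100), torch.randint(0, 40, (2, 12)))
print(logits.shape)  # torch.Size([2, 12, 40])
```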
@@ -0,0 +1,75 @@
Title: 'Pushing the Performance Limit of Scene Text Recognizer without Human Annotation'
Abbreviation: Zheng et al.
Tasks:
- TextRecog
Venue: CVPR
Year: 2022
Lab/Company:
- School of Computer Science and Ningbo Institute, Northwestern Polytechnical University, China
- Samsung Advanced Institute of Technology (SAIT), South Korea
URL:
  Venue: 'https://openaccess.thecvf.com/content/CVPR2022/html/Zheng_Pushing_the_Performance_Limit_of_Scene_Text_Recognizer_Without_Human_CVPR_2022_paper.html'
  Arxiv: 'https://arxiv.org/abs/2204.07714'
  Paper Reading URL: N/A
  Code: N/A
  Supported In MMOCR: N/S
PaperType:
- Algorithm
Abstract: 'Scene text recognition (STR) has attracted much attention over the years
because of its wide application. Most methods train STR models in a fully
supervised manner, which requires large amounts of labeled data. Although
synthetic data contributes a lot to STR, it suffers from the real-to-synthetic
domain gap that restricts model performance. In this work, we aim to boost
STR models by leveraging both synthetic data and the numerous real unlabeled
images, entirely avoiding the cost of human annotation. A robust consistency
regularization based semi-supervised framework is proposed for STR, which can
effectively solve the instability issue due to domain inconsistency between
synthetic and real images. A character-level consistency regularization is
designed to mitigate the misalignment between characters in sequence recognition.
Extensive experiments on standard text recognition benchmarks demonstrate
the effectiveness of the proposed method. It can steadily improve existing
STR models, and boost an STR model to achieve new state-of-the-art results.
To the best of our knowledge, this is the first consistency regularization based
framework that applies successfully to STR.'
MODELS:
Architecture:
- Attention
Learning Method:
- Self-Supervised
- Semi-Supervised
- Supervised
Language Modality:
- Implicit Language Model
Network Structure: 'https://user-images.githubusercontent.com/65173622/211144099-f6db366a-e34b-401d-9b3c-13d08c1c1068.png'
FPS:
  DEVICE: N/A
  ITEM: N/A
FLOPS:
  DEVICE: N/A
  ITEM: N/A
PARAMS: N/A
Experiment:
  Training DataSets:
    - MJ
    - ST
  Test DataSets:
    Avg.: 94.5
    IIIT5K:
      WAICS: 96.5
    SVT:
      WAICS: 96.3
    IC13:
      WAICS: 98.3
    IC15:
      WAICS: 89.3
    SVTP:
      WAICS: 93.3
    CUTE:
      WAICS: 93.4
Bibtex: '@inproceedings{zheng2022pushing,
title={Pushing the Performance Limit of Scene Text Recognizer without Human Annotation},
author={Zheng, Caiyuan and Li, Hui and Rhee, Seon-Min and Han, Seungju and Han, Jae-Joon and Wang, Peng},
booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
pages={14116--14125},
year={2022}
}'
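Since no official code is linked above, here is a schematic sketch of the character-level consistency regularization idea from the abstract, under my own simplifying assumptions (confidence-masked pseudo-labels standing in for the paper's exact loss): a teacher reads a weakly augmented unlabeled crop, a student reads a strongly augmented view, and the loss aligns their predictions character by character rather than on the whole string.

```python
import torch
import torch.nn.functional as F

def char_consistency_loss(student_logits: torch.Tensor,
                          teacher_logits: torch.Tensor,
                          conf_threshold: float = 0.5) -> torch.Tensor:
    """Both inputs have shape (N, T, num_classes): per-decoding-step character
    logits from the student (strong aug.) and the teacher (weak aug.)."""
    with torch.no_grad():
        teacher_probs = teacher_logits.softmax(dim=-1)
        conf, pseudo_labels = teacher_probs.max(dim=-1)   # per-character pseudo labels
        mask = (conf > conf_threshold).float()            # keep only confident characters
    # Character-level alignment: the loss is computed step by step, so one
    # misaligned position does not contaminate the whole sequence.
    loss = F.cross_entropy(
        student_logits.reshape(-1, student_logits.size(-1)),
        pseudo_labels.reshape(-1),
        reduction="none",
    ).reshape_as(mask)
    return (loss * mask).sum() / mask.sum().clamp(min=1.0)

# Toy usage: a 5-step decoder over 37 character classes, batch of 4 unlabeled crops.
student = torch.randn(4, 5, 37, requires_grad=True)
teacher = torch.randn(4, 5, 37)
print(char_consistency_loss(student, teacher))
```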
@@ -0,0 +1,82 @@
Title: 'RobustScanner: Dynamically Enhancing Positional Clues for Robust Text Recognition'
Abbreviation: RobustScanner
Tasks:
- TextRecog
Venue: ECCV
Year: 2020
Lab/Company:
- SenseTime Research, Hong Kong, China
- School of Cyber Science and Engineering, Xi’an Jiaotong University, Xi’an, China
URL:
  Venue: 'https://link.springer.com/chapter/10.1007/978-3-030-58529-7_9'
  Arxiv: 'https://arxiv.org/abs/2007.07542'
  Paper Reading URL: N/A
  Code: 'https://github.com/open-mmlab/mmocr/tree/1.x/configs/textrecog/robust_scanner'
  Supported In MMOCR: 'https://github.com/open-mmlab/mmocr/tree/1.x/configs/textrecog/robust_scanner'
PaperType:
- Algorithm
Abstract: 'The attention-based encoder-decoder framework has recently achieved
impressive results for scene text recognition, and many variants have emerged
with improvements in recognition quality. However, it performs poorly on
contextless texts (e.g., random character sequences), which is unacceptable in
most real application scenarios. In this paper, we first deeply investigate
the decoding process of the decoder. We empirically find that a representative
character-level sequence decoder utilizes not only context information but also
positional information. Contextual information, which the existing approaches
heavily rely on, causes the problem of attention drift. To suppress this
side effect, we propose a novel position enhancement branch and dynamically
fuse its outputs with those of the decoder attention module for scene text
recognition. Specifically, it contains a position aware module to enable the
encoder to output feature vectors encoding their own spatial positions, and an
attention module to estimate glimpses using the positional clue (i.e., the
current decoding time step) only. The dynamic fusion is conducted via an
element-wise gate mechanism to obtain a more robust feature. Theoretically,
our proposed method, dubbed RobustScanner, decodes individual characters with
a dynamic ratio between context and positional clues, relying more on
positional ones when decoding sequences with scarce context, and is thus
robust and practical. Empirically, it has achieved new state-of-the-art
results on popular regular and irregular text recognition benchmarks without
much performance drop on contextless benchmarks, validating its robustness in
both contextual and
contextless application scenarios.'
MODELS:
Architecture:
- Attention
Learning Method:
- Supervised
Language Modality:
- Implicit Language Model
Network Structure: 'https://user-images.githubusercontent.com/65173622/211147345-0515c292-00d1-458f-b5c7-b3a940a0c12c.png'
FPS:
  DEVICE: N/A
  ITEM: N/A
FLOPS:
  DEVICE: N/A
  ITEM: N/A
PARAMS: N/A
Experiment:
  Training DataSets:
    - MJ
    - ST
    - Real
  Test DataSets:
    Avg.: 88.9
    IIIT5K:
      WAICS: 95.4
    SVT:
      WAICS: 89.3
    IC13:
      WAICS: 94.1
    IC15:
      WAICS: 79.2
    SVTP:
      WAICS: 82.9
    CUTE:
      WAICS: 92.4
Bibtex: '@inproceedings{yue2020robustscanner,
title={Robustscanner: Dynamically enhancing positional clues for robust text recognition},
author={Yue, Xiaoyu and Kuang, Zhanghui and Lin, Chenhao and Sun, Hongbin and Zhang, Wayne},
booktitle={European Conference on Computer Vision},
pages={135--151},
year={2020},
organization={Springer}
}'
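To make the "dynamic ratio between context and positional clues" from the abstract concrete, here is a small PyTorch sketch of one plausible element-wise gated fusion; the module name and exact gating formula are my own illustration, not the MMOCR robust_scanner implementation linked in this entry.

```python
import torch
import torch.nn as nn

class DynamicFusion(nn.Module):
    """Element-wise gate deciding, per channel, how much the context-driven
    glimpse versus the position-driven glimpse contributes at each decode step."""

    def __init__(self, dim: int = 256):
        super().__init__()
        self.gate = nn.Sequential(nn.Linear(2 * dim, dim), nn.Sigmoid())

    def forward(self, context_glimpse: torch.Tensor, position_glimpse: torch.Tensor) -> torch.Tensor:
        both = torch.cat([context_glimpse, position_glimpse], dim=-1)
        ratio = self.gate(both)                    # in (0, 1), computed from both clues
        # Scarce-context inputs can push the ratio toward the positional branch.
        return ratio * context_glimpse + (1.0 - ratio) * position_glimpse

fusion = DynamicFusion(dim=256)
fused = fusion(torch.randn(2, 256), torch.randn(2, 256))
print(fused.shape)  # torch.Size([2, 256])
```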