Title: 'A Vision Transformer Based Scene Text Recognizer with Multi-grained Encoding and Decoding'
Abbreviation: Qiao et al
Tasks:
- TextRecog
Venue: ICFHR
Year: 2022
Lab/Company:
- Tomorrow Advancing Life, Beijing, China
URL:
Venue: 'https://link.springer.com/chapter/10.1007/978-3-031-21648-0_14'
Arxiv: 'https://books.google.fr/books?hl=zh-CN&lr=&id=hvmdEAAAQBAJ&oi=fnd&pg=PA198&ots=Gg_BaAnXLm&sig=gpJ2h9NjKz1PjLWSfwDpyd8eLZE&redir_esc=y#v=onepage&q&f=false'
Paper Reading URL: N/A
Code: N/A
Supported In MMOCR: N/S
PaperType:
- Algorithm
Abstract: 'Recently, vision Transformer (ViT) has attracted more and more attention,
many works introduce the ViT into concrete vision tasks and achieve impressive
performance. However, there are only a few works focused on the applications of
the ViT for scene text recognition. This paper takes a further step and proposes
a strong scene text recognizer with a fully ViT-based architecture.
Specifically, we introduce multi-grained features into both the encoder and
decoder. For the encoder, we adopt a two-stage ViT with different grained
patches, where the first stage extracts extent visual features with 2D
fine-grained patches and the second stage aims at the sequence of contextual
features with 1D coarse-grained patches. The decoder integrates Connectionist
Temporal Classification (CTC)-based and attention-based decoding, where the
two decoding schemes introduce different grained features into the decoder and
benefit from each other with a deep interaction. To improve the extraction of
fine-grained features, we additionally explore self-supervised learning for
text recognition with masked autoencoders. Furthermore, a focusing mechanism is
proposed to let the model target the pixel reconstruction of the text area. Our
proposed method achieves state-of-the-art or comparable accuracies on benchmarks
of scene text recognition with a faster inference speed and nearly 50% reduction
of parameters compared with other recent works.'
MODELS:
Architecture:
- CTC
- Attention
- Transformer
Learning Method:
- Self-Supervised
- Supervised
Language Modality:
- Implicit Language Model
Network Structure: 'https://user-images.githubusercontent.com/65173622/210053998-385587ef-2b0e-4c9b-a8b8-d6171261c621.png'
FPS:
DEVICE: N/A
ITEM: N/A
FLOPS:
DEVICE: N/A
ITEM: N/A
PARAMS: N/A
Experiment:
Training DataSets:
- ST
- MJ
Test DataSets:
Avg.: 90.5
IIIT5K:
WAICS: 96.1
SVT:
WAICS: 92.3
IC13:
WAICS: 95.0
IC15:
WAICS: 86.0
SVTP:
WAICS: 87.0
CUTE:
WAICS: 86.8
Bibtex: '@inproceedings{qiao2022vision,
title={A Vision Transformer Based Scene Text Recognizer with Multi-grained Encoding and Decoding},
author={Qiao, Zhi and Ji, Zhilong and Yuan, Ye and Bai, Jinfeng},
booktitle={International Conference on Frontiers in Handwriting Recognition},
pages={198--212},
year={2022},
organization={Springer}
}'
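
The abstract above outlines a two-stage encoder (fine-grained 2D patches, then a
coarse-grained 1D sequence) whose output feeds both a CTC head and an attention
decoder. As a rough illustration only, and not the authors' implementation, a
PyTorch sketch of that data flow might look like the following; the class name
`TwoGrainRecognizer`, the patch size and all dimensions are assumptions.

    import torch
    import torch.nn as nn

    class TwoGrainRecognizer(nn.Module):
        def __init__(self, d=256, n_cls=97):
            super().__init__()
            # stage 1: fine-grained 2D patches (4x4 pixels each)
            self.fine_embed = nn.Conv2d(3, d, kernel_size=4, stride=4)
            self.fine_enc = nn.TransformerEncoder(
                nn.TransformerEncoderLayer(d, nhead=8, batch_first=True), num_layers=6)
            # stage 2: coarse-grained 1D patches (one token per image column)
            self.coarse_enc = nn.TransformerEncoder(
                nn.TransformerEncoderLayer(d, nhead=8, batch_first=True), num_layers=6)
            self.ctc_head = nn.Linear(d, n_cls + 1)      # +1 for the CTC blank
            self.attn_dec = nn.TransformerDecoder(
                nn.TransformerDecoderLayer(d, nhead=8, batch_first=True), num_layers=3)
            self.char_embed = nn.Embedding(n_cls, d)
            self.attn_head = nn.Linear(d, n_cls)

        def forward(self, img, prev_chars):
            f = self.fine_embed(img)                              # (B, d, H/4, W/4)
            B, d, H, W = f.shape
            fine = self.fine_enc(f.flatten(2).transpose(1, 2))    # (B, H*W, d)
            coarse = fine.view(B, H, W, d).mean(dim=1)            # pool columns -> 1D sequence
            coarse = self.coarse_enc(coarse)                      # (B, W, d)
            ctc_logits = self.ctc_head(coarse)                    # CTC branch
            q = self.char_embed(prev_chars)                       # attention branch
            attn_logits = self.attn_head(self.attn_dec(q, coarse))
            return ctc_logits, attn_logits

    model = TwoGrainRecognizer()
    ctc, attn = model(torch.randn(2, 3, 32, 128), torch.zeros(2, 25, dtype=torch.long))
    print(ctc.shape, attn.shape)   # (2, 32, 98) and (2, 25, 97)
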
---
# paper_zoo/textrecog/Levenshtein OCR.yaml
Title: 'Levenshtein OCR'
Abbreviation: Lev-OCR
Tasks:
- TextRecog
Venue: ECCV
Year: 2022
Lab/Company:
- Alibaba DAMO Academy, Beijing, China
URL:
Venue: 'https://link.springer.com/chapter/10.1007/978-3-031-19815-1_19'
Arxiv: 'https://arxiv.org/abs/2209.03594'
Paper Reading URL: 'https://mp.weixin.qq.com/s/Nuc8j3V5YeaXpY64SsIeCw'
Code: 'https://github.com/AlibabaResearch/AdvancedLiterateMachinery/tree/main/OCR/LevOCR'
Supported In MMOCR: N/S
PaperType:
- Algorithm
Abstract: 'A novel scene text recognizer based on Vision-Language Transformer
(VLT) is presented. Inspired by Levenshtein Transformer in the area of NLP, the
proposed method (named Levenshtein OCR, and LevOCR for short) explores an
alternative way for automatically transcribing textual content from cropped
natural images. Specifically, we cast the problem of scene text recognition as
an iterative sequence refinement process. The initial prediction sequence
produced by a pure vision model is encoded and fed into a cross-modal
transformer to interact and fuse with the visual features, to progressively
approximate the ground truth. The refinement process is accomplished via two
basic character-level operations: deletion and insertion, which are learned with
imitation learning and allow for parallel decoding, dynamic length change and
good interpretability. The quantitative experiments clearly demonstrate that
LevOCR achieves state-of-the-art performances on standard benchmarks and the
qualitative analyses verify the effectiveness and advantage of the proposed
LevOCR algorithm. Code will be released soon.'
MODELS:
Architecture:
- Transformer
Learning Method:
- Supervised
Language Modality:
- Explicit Language Model
Network Structure: 'https://user-images.githubusercontent.com/65173622/210163468-bb6c14ba-134a-4dd5-881e-a7adb4058dcd.png'
FPS:
DEVICE: N/A
ITEM: N/A
FLOPS:
DEVICE: N/A
ITEM: N/A
PARAMS: N/A
Experiment:
Training DataSets:
- ST
- MJ
Test DataSets:
Avg.: 92.1
IIIT5K:
WAICS: 96.6
SVT:
WAICS: 92.9
IC13:
WAICS: 96.9
IC15:
WAICS: 86.4
SVTP:
WAICS: 88.1
CUTE:
WAICS: 91.7
Bibtex: '@inproceedings{da2022levenshtein,
title={Levenshtein OCR},
author={Da, Cheng and Wang, Peng and Yao, Cong},
booktitle={European Conference on Computer Vision},
pages={322--338},
year={2022},
organization={Springer}
}'
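
The refinement described above alternates learned character-level deletion and
insertion over the current prediction. The toy sketch below only shows that
control flow; `delete_policy` and `insert_policy` stand in for LevOCR's learned,
imitation-trained classifiers and are replaced here by hand-written lambdas.

    from typing import Callable, List

    def refine(tokens: List[str],
               delete_policy: Callable[[List[str]], List[bool]],
               insert_policy: Callable[[List[str]], List[List[str]]],
               max_iter: int = 3) -> List[str]:
        """Iteratively delete and insert characters.

        delete_policy returns one keep/drop flag per token; insert_policy
        returns, for each of the len(tokens)+1 slots, the characters to insert.
        With trained policies, later iterations become no-ops once the
        sequence is correct.
        """
        for _ in range(max_iter):
            keep = delete_policy(tokens)
            tokens = [t for t, k in zip(tokens, keep) if k]
            inserts = insert_policy(tokens)
            new_tokens: List[str] = list(inserts[0])     # slot before the first token
            for t, ins in zip(tokens, inserts[1:]):
                new_tokens.append(t)
                new_tokens.extend(ins)
            tokens = new_tokens
        return tokens

    # Example: repair the vision model's initial guess "h0use" into "house".
    fixed = refine(list("h0use"),
                   delete_policy=lambda ts: [t != "0" for t in ts],
                   insert_policy=lambda ts: [[]] + [["o"] if t == "h" else [] for t in ts],
                   max_iter=1)
    print("".join(fixed))   # house
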
---
Title: 'Multi-Granularity Prediction for Scene Text Recognition'
Abbreviation: MGP-STR
Tasks:
- TextRecog
Venue: ECCV
Year: 2022
Lab/Company:
- Alibaba DAMO Academy, Beijing, China
URL:
Venue: 'https://link.springer.com/chapter/10.1007/978-3-031-19815-1_20'
Arxiv: 'https://arxiv.org/abs/2209.03592'
Paper Reading URL: N/A
Code: 'https://github.com/AlibabaResearch/AdvancedLiterateMachinery/tree/main/OCR/MGP-STR'
Supported In MMOCR: N/S
PaperType:
- Algorithm
Abstract: 'Scene text recognition (STR) has been an active research topic in
computer vision for years. To tackle this challenging problem, numerous
innovative methods have been successively proposed and incorporating linguistic
knowledge into STR models has recently become a prominent trend. In this work,
we first draw inspiration from the recent progress in Vision Transformer (ViT)
to construct a conceptually simple yet powerful vision STR model, which is built
upon ViT and outperforms previous state-of-the-art models for scene text
recognition, including both pure vision models and language-augmented methods.
To integrate linguistic knowledge, we further propose a Multi-Granularity
Prediction strategy to inject information from the language modality into the
model in an implicit way, i.e., subword representations (BPE and WordPiece)
widely-used in NLP are introduced into the output space, in addition to the
conventional character level representation, while no independent language model
(LM) is adopted. The resultant algorithm (termed MGP-STR) is able to push the
performance envelope of STR to an even higher level. Specifically, it achieves
an average recognition accuracy of 93.35% on standard benchmarks. Code will be
released soon.'
MODELS:
Architecture:
- Transformer
Learning Method:
- Supervised
Language Modality:
- Explicit Language Model
Network Structure: 'https://user-images.githubusercontent.com/65173622/210163378-fc11a79b-fb7d-4a3f-947e-a8f6dfd14dd2.png'
FPS:
DEVICE: N/A
ITEM: N/A
FLOPS:
DEVICE: N/A
ITEM: N/A
PARAMS: N/A
Experiment:
Training DataSets:
- ST
- MJ
Test DataSets:
Avg.: 92.8
IIIT5K:
WAICS: 96.4
SVT:
WAICS: 94.7
IC13:
WAICS: 97.3
IC15:
WAICS: 87.2
SVTP:
WAICS: 91.0
CUTE:
WAICS: 90.3
Bibtex: '@inproceedings{wang2022multi,
title={Multi-granularity Prediction for Scene Text Recognition},
author={Wang, Peng and Da, Cheng and Yao, Cong},
booktitle={European Conference on Computer Vision},
pages={339--355},
year={2022},
organization={Springer}
}'
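
To make the multi-granularity idea concrete, the sketch below attaches three
attention-pooled classification heads (character, BPE, WordPiece) to the ViT
token sequence and fuses their decoded strings by picking the most confident
one. The vocabulary sizes and module names are illustrative assumptions, not
the released MGP-STR code.

    import torch
    import torch.nn as nn

    class MultiGranularityHeads(nn.Module):
        def __init__(self, d=192, max_len=27, n_char=38, n_bpe=50257, n_wp=30522):
            super().__init__()
            # one set of attention-pooling queries and one classifier per granularity
            self.queries = nn.ParameterDict({
                k: nn.Parameter(torch.randn(max_len, d)) for k in ("char", "bpe", "wp")})
            self.heads = nn.ModuleDict({
                "char": nn.Linear(d, n_char),
                "bpe": nn.Linear(d, n_bpe),
                "wp": nn.Linear(d, n_wp)})

        def forward(self, vit_tokens):                   # (B, N, d) ViT outputs
            out = {}
            for k, q in self.queries.items():
                attn = torch.softmax(q @ vit_tokens.transpose(1, 2), dim=-1)  # (B, L, N)
                out[k] = self.heads[k](attn @ vit_tokens)                     # (B, L, vocab_k)
            return out

    def fuse(decoded):
        """Keep the granularity whose decoded string is the most confident."""
        return max(decoded, key=lambda item: item["confidence"])["text"]

    logits = MultiGranularityHeads()(torch.randn(2, 257, 192))
    print({k: tuple(v.shape) for k, v in logits.items()})
    print(fuse([{"text": "coffee", "confidence": 0.98},
                {"text": "coffe",  "confidence": 0.91}]))   # coffee
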
---
Title: 'On Vocabulary Reliance in Scene Text Recognition'
Abbreviation: Wan et al
Tasks:
- TextRecog
Venue: CVPR
Year: 2020
Lab/Company:
- Megvii
- China University of Mining and Technology
- University of Rochester
URL:
Venue: 'http://openaccess.thecvf.com/content_CVPR_2020/html/Wan_On_Vocabulary_Reliance_in_Scene_Text_Recognition_CVPR_2020_paper.html'
Arxiv: 'https://arxiv.org/abs/2005.03959'
Paper Reading URL: N/A
Code: N/A
Supported In MMOCR: N/S
PaperType:
- Algorithm
Abstract: 'The pursuit of high performance on public benchmarks has been the
driving force for research in scene text recognition, and notable progress has
been achieved. However, a close investigation reveals a startling fact that the
state-of-the-art methods perform well on images with words within vocabulary but
generalize poorly to images with words outside vocabulary. We call this
phenomenon “vocabulary reliance”. In this paper, we establish an analytical
framework to conduct an in-depth study on the problem of vocabulary reliance
in scene text recognition. Key findings include: (1) Vocabulary reliance is
ubiquitous, i.e., all existing algorithms more or less exhibit such
characteristic; (2) Attention-based decoders prove weak in generalizing to
words outside vocabulary and segmentation-based decoders perform well in
utilizing visual features; (3) Context modeling is highly coupled with the
prediction layers. These findings provide new insights and can benefit future
research in scene text recognition. Furthermore, we propose a simple yet
effective mutual learning strategy to allow models of two families
(attention-based and segmentation-based) to learn collaboratively. This remedy
alleviates the problem of vocabulary reliance and improves the overall scene
text recognition performance.'
MODELS:
Architecture:
- CTC
- Attention
Learning Method:
- Supervised
Language Modality:
- Implicit Language Model
Network Structure: 'https://user-images.githubusercontent.com/65173622/210054683-5d5f3117-4bee-43d6-a36c-8e645d47c2b1.png'
FPS:
DEVICE: N/A
ITEM: N/A
FLOPS:
DEVICE: N/A
ITEM: N/A
PARAMS: N/A
Experiment:
Training DataSets:
- ST
- MJ
Test DataSets:
Avg.: N/A
IIIT5K:
WAICS: N/A
SVT:
WAICS: N/A
IC13:
WAICS: N/A
IC15:
WAICS: N/A
SVTP:
WAICS: N/A
CUTE:
WAICS: N/A
Bibtex: '@inproceedings{wan2020vocabulary,
title={On vocabulary reliance in scene text recognition},
author={Wan, Zhaoyi and Zhang, Jielei and Zhang, Liang and Luo, Jiebo and Yao, Cong},
booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
pages={11425--11434},
year={2020}
}'
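
The mutual learning remedy described above can be pictured as a joint loss in
which the attention-based and segmentation-based recognizers exchange soft
targets. The function below is a simplified rendering of that idea, not the
authors' exact formulation.

    import torch
    import torch.nn.functional as F

    def mutual_learning_loss(logits_attn, logits_seg, target, alpha=1.0):
        """logits_*: (B, L, C) per-character predictions of the two model
        families; target: (B, L) ground-truth character indices."""
        ce = (F.cross_entropy(logits_attn.flatten(0, 1), target.flatten()) +
              F.cross_entropy(logits_seg.flatten(0, 1), target.flatten()))
        log_p_attn = F.log_softmax(logits_attn, dim=-1)
        log_p_seg = F.log_softmax(logits_seg, dim=-1)
        # symmetric KL so each model also learns from the other's predictions
        kl = (F.kl_div(log_p_attn, log_p_seg.exp().detach(), reduction="batchmean") +
              F.kl_div(log_p_seg, log_p_attn.exp().detach(), reduction="batchmean"))
        return ce + alpha * kl

    loss = mutual_learning_loss(torch.randn(2, 25, 37, requires_grad=True),
                                torch.randn(2, 25, 37, requires_grad=True),
                                torch.randint(0, 37, (2, 25)))
    loss.backward()
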
---
Title: 'Parallel and Robust Text Rectifier for Scene Text Recognition'
Abbreviation: PRTR
Tasks:
- TextRecog
Venue: BMVC
Year: 2022
Lab/Company:
- Visual Computing Group, Ping An Property & Casualty Insurance Company, Shenzhen, China
- Ping An Technology (Shenzhen) Co. Ltd.
- School of Information and Telecommunication Engineering, Guangzhou Maritime University, Guangzhou, China
URL:
Venue: 'https://bmvc2022.mpi-inf.mpg.de/0770.pdf'
Arxiv: 'https://bmvc2022.mpi-inf.mpg.de/0770.pdf'
Paper Reading URL: N/A
Code: N/A
Supported In MMOCR: N/S
PaperType:
- Algorithm
Abstract: 'Scene text recognition (STR) is to recognize text appearing in images.
Current state-of-the-art STR methods usually adopt a multi-stage framework which
uses a rectifier to iteratively rectify errors from previous stage. However, the
rectifiers of those models are not proficient in addressing the misalignment
problem. To alleviate this problem, we proposed a novel network named Parallel
and Robust Text Rectifier (PRTR), which consists of a bi-directional position
attention initial decoder and a sequence of stacked Robust Visual Semantic
Rectifiers (RVSRs). In essence, PRTR is creatively designed as a coarse-to-fine
architecture that exploits a sequence of rectifiers for repeatedly refining the
prediction in a stage-wise manner. RVSR is a core component in the proposed
model which comprises two key modules, Dual-Path Semantic Alignment (DPSA)
module and Visual-Linguistic Alignment (VLA). DPSA can rectify the linguistic
misalignment issues via the global semantic features that are derived from the
recognized characters as a whole, while VLA re-aligns the linguistic features
with visual features by an attention model to avoid the overfitting of
linguistic features. All parts of PRTR are non-autoregressive (parallel), and
its RVSR re-aligns its output according to the linguistic features and the
visual features, so it is robust to the mis-aligned error. Extensive experiments
on mainstream benchmarks demonstrate that the proposed model can alleviate
the misalignment problem to a large extent and outperformed state-of-the-art
models.'
MODELS:
Architecture:
- Transformer
Learning Method:
- Supervised
Language Modality:
- Explicit Language Model
Network Structure: 'https://user-images.githubusercontent.com/65173622/210052800-ab1f29d1-de7c-43bd-8297-b13cd83e28d3.png'
FPS:
DEVICE: N/A
ITEM: N/A
FLOPS:
DEVICE: N/A
ITEM: N/A
PARAMS: N/A
Experiment:
Training DataSets:
- ST
- SA
- MJ
Test DataSets:
Avg.: 93.3
IIIT5K:
WAICS: 97.0
SVT:
WAICS: 94.4
IC13:
WAICS: 95.8
IC15:
WAICS: 86.1
SVTP:
WAICS: 89.8
CUTE:
WAICS: 96.5
Bibtex: '@article{tang2021visual,
title={Visual-semantic transformer for scene text recognition},
author={Tang, Xin and Lai, Yongquan and Liu, Ying and Fu, Yuanyuan and Fang, Rui},
journal={arXiv preprint arXiv:2112.00948},
year={2021}
}'
---
Title: 'SGBANet: Semantic GAN and Balanced Attention Network for Arbitrarily Oriented Scene Text Recognition'
Abbreviation: SGBANet
Tasks:
- TextRecog
Venue: ECCV
Year: 2022
Lab/Company:
- Shanghai Key Laboratory of Multidimensional Information Processing, East China Normal University, Shanghai, China
- Faculty of Computer Science and Information Technology, University of Malaya, Kuala Lumpur, Malaysia
- iFLYTEK Research, iFLYTEK, Hefei, China
- CVPR Unit, Indian Statistical Institute, Kolkata, India
URL:
Venue: 'https://link.springer.com/chapter/10.1007/978-3-031-19815-1_27'
Arxiv: 'https://arxiv.org/abs/2207.10256'
Paper Reading URL: N/A
Code: N/A
Supported In MMOCR: N/S
PaperType:
- Algorithm
Abstract: 'Scene text recognition is a challenging task due to the complex
backgrounds and diverse variations of text instances. In this paper, we
propose a novel Semantic GAN and Balanced Attention Network (SGBANet) to
recognize the texts in scene images. The proposed method first generates
the simple semantic feature using Semantic GAN and then recognizes the scene
text with the Balanced Attention Module. The Semantic GAN aims to align the
semantic feature distribution between the support domain and target domain.
Different from the conventional image-to-image translation methods that
perform at the image level, the Semantic GAN performs the generation and
discrimination on the semantic level with the Semantic Generator Module
(SGM) and Semantic Discriminator Module (SDM). For target images (scene text
images), the Semantic Generator Module generates simple semantic features
that share the same feature distribution with support images (clear text
images). The Semantic Discriminator Module is used to distinguish the semantic
features between the support domain and target domain. In addition, a
Balanced Attention Module is designed to alleviate the problem of attention
drift. The Balanced Attention Module first learns a balancing parameter based
on the visual glimpse vector and semantic glimpse vector, and then performs
the balancing operation for obtaining a balanced glimpse vector. Experiments
on six benchmarks, including regular datasets, i.e., IIIT5K, SVT, ICDAR2013,
and irregular datasets, i.e., ICDAR2015, SVTP, CUTE80, validate the
effectiveness of our proposed method.'
MODELS:
Architecture:
- Attention
Learning Method:
- Supervised
Language Modality:
- Implicit Language Model
Network Structure: 'https://user-images.githubusercontent.com/65173622/210163800-3ecb592b-daae-450f-907b-cd239b2af1c0.png'
FPS:
DEVICE: N/A
ITEM: N/A
FLOPS:
DEVICE: N/A
ITEM: N/A
PARAMS: N/A
Experiment:
Training DataSets:
- ST
- MJ
Test DataSets:
Avg.: 88.21
IIIT5K:
WAICS: 95.4
SVT:
WAICS: 89.1
IC13:
WAICS: 95.1
IC15:
WAICS: 78.4
SVTP:
WAICS: 83.1
CUTE:
WAICS: 88.2
Bibtex: '@inproceedings{zhong2022sgbanet,
title={SGBANet: Semantic GAN and Balanced Attention Network for Arbitrarily Oriented Scene Text Recognition},
author={Zhong, Dajian and Lyu, Shujing and Shivakumara, Palaiahnakote and Yin, Bing and Wu, Jiajia and Pal, Umapada and Lu, Yue},
booktitle={European Conference on Computer Vision},
pages={464--480},
year={2022},
organization={Springer}
}'
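
The balancing operation described above reduces to a learned gate between the
visual and semantic glimpse vectors. A minimal sketch, with assumed dimensions
and module name, is given below.

    import torch
    import torch.nn as nn

    class BalancedAttention(nn.Module):
        def __init__(self, d=256):
            super().__init__()
            self.gate = nn.Linear(2 * d, 1)   # produces the balancing parameter

        def forward(self, g_visual, g_semantic):          # both (B, d)
            alpha = torch.sigmoid(self.gate(torch.cat([g_visual, g_semantic], dim=-1)))
            return alpha * g_visual + (1 - alpha) * g_semantic   # balanced glimpse

    balanced = BalancedAttention()(torch.randn(4, 256), torch.randn(4, 256))
    print(balanced.shape)   # torch.Size([4, 256])
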
---
Title: 'Scene Text Detection and Recognition: The Deep Learning Era'
Abbreviation: Long et al
Tasks:
- TextRecog
- TextDet
Venue: IJCV
Year: 2021
Lab/Company:
- Alibaba DAMO Academy, Beijing, China
URL:
Venue: 'https://link.springer.com/article/10.1007/s11263-020-01369-0'
Arxiv: 'https://arxiv.org/abs/1811.04256'
Paper Reading URL: N/A
Code: 'https://github.com/Jyouhou/SceneTextPapers'
Supported In MMOCR: N/S
PaperType:
- Survey
Abstract: 'With the rise and development of deep learning, computer vision has
been tremendously transformed and reshaped. As an important research area in
computer vision, scene text detection and recognition has been inevitably
influenced by this wave of revolution, consequently entering the era of
deep learning. In recent years, the community has witnessed substantial
advancements in mindset, methodology and performance. This survey is aimed at
summarizing and analyzing the major changes and significant progresses of
scene text detection and recognition in the deep learning era. Through this
article, we devote to: (1) introduce new insights and ideas; (2) highlight
recent techniques and benchmarks; (3) look ahead into future trends.
Specifically, we will emphasize the dramatic differences brought by deep
learning and remaining grand challenges. We expect that this review paper
would serve as a reference book for researchers in this field.'
MODELS:
Architecture:
- CTC
- Attention
- Transformer
Learning Method:
- Supervised
Language Modality:
- Explicit Language Model
- Implicit Language Model
Network Structure: N/A
FPS:
DEVICE: N/A
ITEM: N/A
FLOPS:
DEVICE: N/A
ITEM: N/A
PARAMS: N/A
Experiment:
Training DataSets: N/A
Test DataSets:
Avg.: N/A
IIIT5K:
WAICS: N/A
SVT:
WAICS: N/A
IC13:
WAICS: N/A
IC15:
WAICS: N/A
SVTP:
WAICS: N/A
CUTE:
WAICS: N/A
Bibtex: '@article{long2021scene,
title={Scene text detection and recognition: The deep learning era},
author={Long, Shangbang and He, Xin and Yao, Cong},
journal={International Journal of Computer Vision},
volume={129},
number={1},
pages={161--184},
year={2021},
publisher={Springer}
}'
---
Title: 'Vision Transformer for Fast and Efficient Scene Text Recognition'
Abbreviation: ViTSTR
Tasks:
- TextRecog
Venue: ICDAR
Year: 2021
Lab/Company:
- Electrical and Electronics Engineering Institute, University of the Philippines, Quezon City, Philippines
URL:
Venue: 'https://link.springer.com/chapter/10.1007/978-3-030-86549-8_21'
Arxiv: 'https://arxiv.org/abs/2105.08582'
Paper Reading URL: N/A
Code: 'https://github.com/roatienza/deep-text-recognition-benchmark'
Supported In MMOCR: N/S
PaperType:
- Algorithm
Abstract: 'Scene text recognition (STR) enables computers to read text in natural
scenes such as object labels, road signs and instructions. STR helps machines
perform informed decisions such as what object to pick, which direction to go,
and what is the next step of action. In the body of work on STR, the focus has
always been on recognition accuracy. There is little emphasis placed on speed
and computational efficiency which are equally important especially for
energy-constrained mobile machines. In this paper we propose ViTSTR, an STR
with a simple single stage model architecture built on a compute and parameter
efficient vision transformer (ViT). On a comparable strong baseline method such
as TRBA with accuracy of 84.3%, our small ViTSTR achieves a competitive accuracy
of 82.6% (84.2% with data augmentation) at 2.4× speed up, using only 43.4% of
the number of parameters and 42.2% FLOPS. The tiny version of ViTSTR achieves
80.3% accuracy (82.1% with data augmentation), at 2.5× the speed, requiring
only 10.9% of the number of parameters and 11.9% FLOPS. With data augmentation,
our base ViTSTR outperforms TRBA at 85.2% accuracy (83.7% without augmentation)
at 2.3× the speed but requires 73.2% more parameters and 61.5% more FLOPS. In
terms of trade-offs, nearly all ViTSTR configurations are at or near the frontiers
to maximize accuracy, speed and computational efficiency all at the same time.'
MODELS:
Architecture:
- Transformer
Learning Method:
- Supervised
Language Modality:
- Implicit Language Model
Network Structure: 'https://user-images.githubusercontent.com/65173622/210161050-476296e7-10e5-4ec9-9024-af6b5c5ee84b.png'
FPS:
DEVICE: N/A
ITEM: N/A
FLOPS:
DEVICE: 2080Ti
ITEM: 17.6e9
PARAMS: 85.8e6
Experiment:
Training DataSets:
- ST
- MJ
Test DataSets:
Avg.: 84.0
IIIT5K:
WAICS: 88.4
SVT:
WAICS: 87.7
IC13:
WAICS: 92.4
IC15:
WAICS: 72.6
SVTP:
WAICS: 81.8
CUTE:
WAICS: 81.3
Bibtex: '@inproceedings{atienza2021vision,
title={Vision transformer for fast and efficient scene text recognition},
author={Atienza, Rowel},
booktitle={International Conference on Document Analysis and Recognition},
pages={319--334},
year={2021},
organization={Springer}
}'
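
ViTSTR's single-stage design amounts to a plain ViT encoder whose leading
output tokens are classified directly into characters. The sketch below shows
that shape of model with made-up sizes; it is not the released implementation.

    import torch
    import torch.nn as nn

    class TinyViTSTR(nn.Module):
        def __init__(self, d=192, depth=12, max_len=25, n_cls=96):
            super().__init__()
            self.patch = nn.Conv2d(3, d, kernel_size=8, stride=8)    # 8x8 patches
            self.pos = nn.Parameter(torch.zeros(1, 64, d))           # 4x16 patches for a 32x128 input
            self.encoder = nn.TransformerEncoder(
                nn.TransformerEncoderLayer(d, nhead=3, batch_first=True), num_layers=depth)
            self.head = nn.Linear(d, n_cls)
            self.max_len = max_len

        def forward(self, img):                                      # (B, 3, 32, 128)
            x = self.patch(img).flatten(2).transpose(1, 2) + self.pos
            x = self.encoder(x)
            return self.head(x[:, :self.max_len])                    # (B, max_len, n_cls)

    logits = TinyViTSTR()(torch.randn(2, 3, 32, 128))
    print(logits.shape)   # torch.Size([2, 25, 96])
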
---
Title: 'Visual-Semantic Transformer for Scene Text Recognition'
Abbreviation: VST
Tasks:
- TextRecog
Venue: BMVC
Year: 2022
Lab/Company:
- Visual Computing Group, Ping An Property & Casualty Insurance Company, Shenzhen, China
- Ping An Technology (Shenzhen) Co. Ltd.
- School of Information and Telecommunication Engineering, Guangzhou Maritime University, Guangzhou, China
URL:
Venue: 'https://bmvc2022.mpi-inf.mpg.de/0772.pdf'
Arxiv: 'https://arxiv.org/abs/2112.00948'
Paper Reading URL: N/A
Code: N/A
Supported In MMOCR: N/S
PaperType:
- Algorithm
Abstract: 'Semantic information plays an important role in scene text recognition
(STR) as well as visual information. Although state-of-the-art models have
achieved great improvement in STR, they usually rely on extra external language
models to refine the semantic features through context information, and the
separate utilization of semantic and visual information leads to biased
results, which limits the performance of those models. In this paper, we
propose a novel model called Visual-Semantic Transformer (VST) for text
recognition. VST consists of several key modules, including a ConvNet, a visual
module, two visual-semantic modules, a visual-semantic feature interaction
module and a semantic module. VST is a conceptually much simpler model.
Different from existing STR models, VST can efficiently extract semantic
features without using external language models and it also allows visual
features and semantic features to interact with each other in parallel so that
global information from two domains can be fully exploited and more powerful
representations can be learned. The working mechanism of VST is highly similar
to our cognitive system, where the visual information is first captured by our
sensory organ, and is simultaneously transformed to semantic information by our
brain. Extensive experiments on seven public benchmarks including regular/
irregular text recognition datasets verify the effectiveness of VST; it
outperformed 14 other popular models on four out of seven benchmark datasets
and yielded competitive performance on the other three datasets.'
MODELS:
Architecture:
- Transformer
Learning Method:
- Supervised
Language Modality:
- Explicit Language Model
Network Structure: 'https://user-images.githubusercontent.com/65173622/210052231-22092115-0eba-4c2c-9050-b8fc9aff38ca.png'
FPS:
DEVICE: N/A
ITEM: N/A
FLOPS:
DEVICE: N/A
ITEM: N/A
PARAMS: N/A
Experiment:
Training DataSets:
- ST
- MJ
Test DataSets:
Avg.: 92.9
IIIT5K:
WAICS: 96.7
SVT:
WAICS: 94.0
IC13:
WAICS: 96.7
IC15:
WAICS: 85.4
SVTP:
WAICS: 89.0
CUTE:
WAICS: 95.5
Bibtex: '@article{tang2021visual,
title={Visual-semantic transformer for scene text recognition},
author={Tang, Xin and Lai, Yongquan and Liu, Ying and Fu, Yuanyuan and Fang, Rui},
journal={arXiv preprint arXiv:2112.00948},
year={2021}
}'
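
The parallel visual-semantic interaction described above can be sketched as two
cross-attention passes computed from the same input features, so neither
modality waits on the other. The module below is an illustrative stand-in, not
the paper's architecture.

    import torch
    import torch.nn as nn

    class VisualSemanticInteraction(nn.Module):
        def __init__(self, d=256, nhead=8):
            super().__init__()
            self.v_from_s = nn.MultiheadAttention(d, nhead, batch_first=True)
            self.s_from_v = nn.MultiheadAttention(d, nhead, batch_first=True)

        def forward(self, visual, semantic):              # (B, Nv, d), (B, Ns, d)
            # both directions read the *input* features, i.e. they run in parallel
            v_new, _ = self.v_from_s(visual, semantic, semantic)   # visual queries semantic
            s_new, _ = self.s_from_v(semantic, visual, visual)     # semantic queries visual
            return visual + v_new, semantic + s_new

    vsi = VisualSemanticInteraction()
    v, s = vsi(torch.randn(2, 64, 256), torch.randn(2, 25, 256))
    print(v.shape, s.shape)   # (2, 64, 256) (2, 25, 256)
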
---
Title: 'Why You Should Try the Real Data for the Scene Text Recognition'
Abbreviation: Loginov et al
Tasks:
- TextRecog
Venue: arXiv
Year: 2021
Lab/Company:
- Intel Corporation
URL:
Venue: 'https://arxiv.org/abs/2107.13938'
Arxiv: 'https://arxiv.org/abs/2107.13938'
Paper Reading URL: N/A
Code: 'https://github.com/openvinotoolkit/training_extensions'
Supported In MMOCR: N/S
PaperType:
- Algorithm
Abstract: 'Recent works in the text recognition area have pushed forward the
recognition results to the new horizons. But for a long time a lack of large
human-labeled natural text recognition datasets has been forcing researchers
to use synthetic data for training text recognition models. Even though
synthetic datasets are very large (MJSynth and SynthText, two most famous
synthetic datasets, have several million images each), their diversity could
be insufficient, compared to natural datasets like ICDAR and others.
Fortunately, the recently released text recognition annotation for the OpenImages
V5 dataset has a number of instances comparable to the synthetic datasets and more
diverse examples. We have used this annotation with a Text Recognition head
architecture from the Yet Another Mask Text Spotter and got comparable to the
SOTA results. On some datasets we have even outperformed previous SOTA models.
In this paper we also introduce a text recognition model. The model’s code is
available.'
MODELS:
Architecture:
- Attention
Learning Method:
- Supervised
Language Modality:
- Implicit Language Model
Network Structure: 'https://user-images.githubusercontent.com/65173622/210163669-0848839e-185f-4d8c-9de1-ac34e957d685.png'
FPS:
DEVICE: N/A
ITEM: N/A
FLOPS:
DEVICE: N/A
ITEM: N/A
PARAMS: N/A
Experiment:
Training DataSets:
- ST
- MJ
- Real
Test DataSets:
Avg.: 91.0
IIIT5K:
WAICS: 93.5
SVT:
WAICS: 94.7
IC13:
WAICS: 96.8
IC15:
WAICS: 80.2
SVTP:
WAICS: 89.9
CUTE:
WAICS: N/A
Bibtex: '@article{loginov2021you,
title={Why You Should Try the Real Data for the Scene Text Recognition},
author={Loginov, Vladimir},
journal={arXiv preprint arXiv:2107.13938},
year={2021}
}'