Title: 'A Vision Transformer Based Scene Text Recognizer with Multi-grained Encoding and Decoding'
Abbreviation: Qiao et al
Tasks:
- TextRecog
Venue: ICFHR
Year: 2022
Lab/Company:
- Tomorrow Advancing Life, Beijing, China
URL:
Venue: 'https://link.springer.com/chapter/10.1007/978-3-031-21648-0_14'
Arxiv: 'https://books.google.fr/books?hl=zh-CN&lr=&id=hvmdEAAAQBAJ&oi=fnd&pg=PA198&ots=Gg_BaAnXLm&sig=gpJ2h9NjKz1PjLWSfwDpyd8eLZE&redir_esc=y#v=onepage&q&f=false'
Paper Reading URL: N/A
Code: N/A
Supported In MMOCR: N/S
PaperType:
- Algorithm
Abstract: 'Recently, vision Transformer (ViT) has attracted more and more attention,
many works introduce the ViT into concrete vision tasks and achieve impressive
performance. However, there are only a few works focused on the applications of
the ViT for scene text recognition. This paper takes a further step and proposes
a strong scene text recognizer with a fully ViT-based architecture.
Specifically, we introduce multi-grained features into both the encoder and
decoder. For the encoder, we adopt a two-stage ViT with different grained
patches, where the first stage extracts extent visual features with 2D
fine-grained patches and the second stage aims at the sequence of contextual
features with 1D coarse-grained patches. The decoder integrates Connectionist
Temporal Classification (CTC)-based and attention-based decoding, where the
two decoding schemes introduce different grained features into the decoder and
benefit from each other with a deep interaction. To improve the extraction of
fine-grained features, we additionally explore self-supervised learning for
text recognition with masked autoencoders. Furthermore, a focusing mechanism is
proposed to let the model target the pixel reconstruction of the text area. Our
proposed method achieves state-of-the-art or comparable accuracies on benchmarks
of scene text recognition with a faster inference speed and nearly 50% reduction
of parameters compared with other recent works.'
MODELS:
Architecture:
- CTC
- Attention
- Transformer
Learning Method:
- Self-Supervised
- Supervised
Language Modality:
- Implicit Language Model
Network Structure: 'https://user-images.githubusercontent.com/65173622/210053998-385587ef-2b0e-4c9b-a8b8-d6171261c621.png'
FPS:
DEVICE: N/A
ITEM: N/A
FLOPS:
DEVICE: N/A
ITEM: N/A
PARAMS: N/A
Experiment:
Training DataSets:
- ST
- MJ
Test DataSets:
Avg.: 90.5
IIIT5K:
WAICS: 96.1
SVT:
WAICS: 92.3
IC13:
WAICS: 95.0
IC15:
WAICS: 86.0
SVTP:
WAICS: 87.0
CUTE:
WAICS: 86.8
Bibtex: '@inproceedings{qiao2022vision,
title={A Vision Transformer Based Scene Text Recognizer with Multi-grained Encoding and Decoding},
author={Qiao, Zhi and Ji, Zhilong and Yuan, Ye and Bai, Jinfeng},
booktitle={International Conference on Frontiers in Handwriting Recognition},
pages={198--212},
year={2022},
organization={Springer}
}'
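
The abstract above outlines a two-stage encoder (fine-grained 2D patches, then a
coarse-grained 1D sequence) whose output feeds both a CTC head and an attention
decoder. As a rough illustration only, and not the authors' implementation, a
PyTorch sketch of that data flow might look like the following; the class name
`TwoGrainRecognizer`, the patch size and all dimensions are assumptions.

    import torch
    import torch.nn as nn

    class TwoGrainRecognizer(nn.Module):
        def __init__(self, d=256, n_cls=97):
            super().__init__()
            # stage 1: fine-grained 2D patches (4x4 pixels each)
            self.fine_embed = nn.Conv2d(3, d, kernel_size=4, stride=4)
            self.fine_enc = nn.TransformerEncoder(
                nn.TransformerEncoderLayer(d, nhead=8, batch_first=True), num_layers=6)
            # stage 2: coarse-grained 1D patches (one token per image column)
            self.coarse_enc = nn.TransformerEncoder(
                nn.TransformerEncoderLayer(d, nhead=8, batch_first=True), num_layers=6)
            self.ctc_head = nn.Linear(d, n_cls + 1)      # +1 for the CTC blank
            self.attn_dec = nn.TransformerDecoder(
                nn.TransformerDecoderLayer(d, nhead=8, batch_first=True), num_layers=3)
            self.char_embed = nn.Embedding(n_cls, d)
            self.attn_head = nn.Linear(d, n_cls)

        def forward(self, img, prev_chars):
            f = self.fine_embed(img)                              # (B, d, H/4, W/4)
            B, d, H, W = f.shape
            fine = self.fine_enc(f.flatten(2).transpose(1, 2))    # (B, H*W, d)
            coarse = fine.view(B, H, W, d).mean(dim=1)            # pool columns -> 1D sequence
            coarse = self.coarse_enc(coarse)                      # (B, W, d)
            ctc_logits = self.ctc_head(coarse)                    # CTC branch
            q = self.char_embed(prev_chars)                       # attention branch
            attn_logits = self.attn_head(self.attn_dec(q, coarse))
            return ctc_logits, attn_logits

    model = TwoGrainRecognizer()
    ctc, attn = model(torch.randn(2, 3, 32, 128), torch.zeros(2, 25, dtype=torch.long))
    print(ctc.shape, attn.shape)   # (2, 32, 98) and (2, 25, 97)
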
---
# paper_zoo/textrecog/Levenshtein OCR.yaml
Title: 'Levenshtein OCR'
Abbreviation: Lev-OCR
Tasks:
- TextRecog
Venue: ECCV
Year: 2022
Lab/Company:
- Alibaba DAMO Academy, Beijing, China
URL:
Venue: 'https://link.springer.com/chapter/10.1007/978-3-031-19815-1_19'
Arxiv: 'https://arxiv.org/abs/2209.03594'
Paper Reading URL: 'https://mp.weixin.qq.com/s/Nuc8j3V5YeaXpY64SsIeCw'
Code: 'https://github.com/AlibabaResearch/AdvancedLiterateMachinery/tree/main/OCR/LevOCR'
Supported In MMOCR: N/S
PaperType:
- Algorithm
Abstract: 'A novel scene text recognizer based on Vision-Language Transformer
(VLT) is presented. Inspired by Levenshtein Transformer in the area of NLP, the
proposed method (named Levenshtein OCR, and LevOCR for short) explores an
alternative way for automatically transcribing textual content from cropped
natural images. Specifically, we cast the problem of scene text recognition as
an iterative sequence refinement process. The initial prediction sequence
produced by a pure vision model is encoded and fed into a cross-modal
transformer to interact and fuse with the visual features, to progressively
approximate the ground truth. The refinement process is accomplished via two
basic character-level operations: deletion and insertion, which are learned with
imitation learning and allow for parallel decoding, dynamic length change and
good interpretability. The quantitative experiments clearly demonstrate that
LevOCR achieves state-of-the-art performances on standard benchmarks and the
qualitative analyses verify the effectiveness and advantage of the proposed
LevOCR algorithm. Code will be released soon.'
MODELS:
Architecture:
- Transformer
Learning Method:
- Supervised
Language Modality:
- Explicit Language Model
Network Structure: 'https://user-images.githubusercontent.com/65173622/210163468-bb6c14ba-134a-4dd5-881e-a7adb4058dcd.png'
FPS:
DEVICE: N/A
ITEM: N/A
FLOPS:
DEVICE: N/A
ITEM: N/A
PARAMS: N/A
Experiment:
Training DataSets:
- ST
- MJ
Test DataSets:
Avg.: 92.1
IIIT5K:
WAICS: 96.6
SVT:
WAICS: 92.9
IC13:
WAICS: 96.9
IC15:
WAICS: 86.4
SVTP:
WAICS: 88.1
CUTE:
WAICS: 91.7
Bibtex: '@inproceedings{da2022levenshtein,
title={Levenshtein OCR},
author={Da, Cheng and Wang, Peng and Yao, Cong},
booktitle={European Conference on Computer Vision},
pages={322--338},
year={2022},
organization={Springer}
}'
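
The refinement described above alternates learned character-level deletion and
insertion over the current prediction. The toy sketch below only shows that
control flow; `delete_policy` and `insert_policy` stand in for LevOCR's learned,
imitation-trained classifiers and are replaced here by hand-written lambdas.

    from typing import Callable, List

    def refine(tokens: List[str],
               delete_policy: Callable[[List[str]], List[bool]],
               insert_policy: Callable[[List[str]], List[List[str]]],
               max_iter: int = 3) -> List[str]:
        """Iteratively delete and insert characters.

        delete_policy returns one keep/drop flag per token; insert_policy
        returns, for each of the len(tokens)+1 slots, the characters to insert.
        With trained policies, later iterations become no-ops once the
        sequence is correct.
        """
        for _ in range(max_iter):
            keep = delete_policy(tokens)
            tokens = [t for t, k in zip(tokens, keep) if k]
            inserts = insert_policy(tokens)
            new_tokens: List[str] = list(inserts[0])     # slot before the first token
            for t, ins in zip(tokens, inserts[1:]):
                new_tokens.append(t)
                new_tokens.extend(ins)
            tokens = new_tokens
        return tokens

    # Example: repair the vision model's initial guess "h0use" into "house".
    fixed = refine(list("h0use"),
                   delete_policy=lambda ts: [t != "0" for t in ts],
                   insert_policy=lambda ts: [[]] + [["o"] if t == "h" else [] for t in ts],
                   max_iter=1)
    print("".join(fixed))   # house
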
---
Title: 'Multi-Granularity Prediction for Scene Text Recognition'
Abbreviation: MGP-STR
Tasks:
- TextRecog
Venue: ECCV
Year: 2022
Lab/Company:
- Alibaba DAMO Academy, Beijing, China
URL:
Venue: 'https://link.springer.com/chapter/10.1007/978-3-031-19815-1_20'
Arxiv: 'https://arxiv.org/abs/2209.03592'
Paper Reading URL: N/A
Code: 'https://github.com/AlibabaResearch/AdvancedLiterateMachinery/tree/main/OCR/MGP-STR'
Supported In MMOCR: N/S
PaperType:
- Algorithm
Abstract: 'Scene text recognition (STR) has been an active research topic in
computer vision for years. To tackle this challenging problem, numerous
innovative methods have been successively proposed and incorporating linguistic
knowledge into STR models has recently become a prominent trend. In this work,
we first draw inspiration from the recent progress in Vision Transformer (ViT)
to construct a conceptually simple yet powerful vision STR model, which is built
upon ViT and outperforms previous state-of-the-art models for scene text
recognition, including both pure vision models and language-augmented methods.
To integrate linguistic knowledge, we further propose a Multi-Granularity
Prediction strategy to inject information from the language modality into the
model in an implicit way, i.e., subword representations (BPE and WordPiece)
widely-used in NLP are introduced into the output space, in addition to the
conventional character level representation, while no independent language model
(LM) is adopted. The resultant algorithm (termed MGP-STR) is able to push the
performance envelope of STR to an even higher level. Specifically, it achieves
an average recognition accuracy of 93.35% on standard benchmarks. Code will be
released soon.'
MODELS:
Architecture:
- Transformer
Learning Method:
- Supervised
Language Modality:
- Explicit Language Model
Network Structure: 'https://user-images.githubusercontent.com/65173622/210163378-fc11a79b-fb7d-4a3f-947e-a8f6dfd14dd2.png'
FPS:
DEVICE: N/A
ITEM: N/A
FLOPS:
DEVICE: N/A
ITEM: N/A
PARAMS: N/A
Experiment:
Training DataSets:
- ST
- MJ
Test DataSets:
Avg.: 92.8
IIIT5K:
WAICS: 96.4
SVT:
WAICS: 94.7
IC13:
WAICS: 97.3
IC15:
WAICS: 87.2
SVTP:
WAICS: 91.0
CUTE:
WAICS: 90.3
Bibtex: '@inproceedings{wang2022multi,
title={Multi-granularity Prediction for Scene Text Recognition},
author={Wang, Peng and Da, Cheng and Yao, Cong},
booktitle={European Conference on Computer Vision},
pages={339--355},
year={2022},
organization={Springer}
}'
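
To make the multi-granularity idea concrete, the sketch below attaches three
attention-pooled classification heads (character, BPE, WordPiece) to the ViT
token sequence and fuses their decoded strings by picking the most confident
one. The vocabulary sizes and module names are illustrative assumptions, not
the released MGP-STR code.

    import torch
    import torch.nn as nn

    class MultiGranularityHeads(nn.Module):
        def __init__(self, d=192, max_len=27, n_char=38, n_bpe=50257, n_wp=30522):
            super().__init__()
            # one set of attention-pooling queries and one classifier per granularity
            self.queries = nn.ParameterDict({
                k: nn.Parameter(torch.randn(max_len, d)) for k in ("char", "bpe", "wp")})
            self.heads = nn.ModuleDict({
                "char": nn.Linear(d, n_char),
                "bpe": nn.Linear(d, n_bpe),
                "wp": nn.Linear(d, n_wp)})

        def forward(self, vit_tokens):                   # (B, N, d) ViT outputs
            out = {}
            for k, q in self.queries.items():
                attn = torch.softmax(q @ vit_tokens.transpose(1, 2), dim=-1)  # (B, L, N)
                out[k] = self.heads[k](attn @ vit_tokens)                     # (B, L, vocab_k)
            return out

    def fuse(decoded):
        """Keep the granularity whose decoded string is the most confident."""
        return max(decoded, key=lambda item: item["confidence"])["text"]

    logits = MultiGranularityHeads()(torch.randn(2, 257, 192))
    print({k: tuple(v.shape) for k, v in logits.items()})
    print(fuse([{"text": "coffee", "confidence": 0.98},
                {"text": "coffe",  "confidence": 0.91}]))   # coffee
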
---
Title: 'On Vocabulary Reliance in Scene Text Recognition'
Abbreviation: Wan et al
Tasks:
- TextRecog
Venue: CVPR
Year: 2020
Lab/Company:
- Megvii
- China University of Mining and Technology
- University of Rochester
URL:
Venue: 'http://openaccess.thecvf.com/content_CVPR_2020/html/Wan_On_Vocabulary_Reliance_in_Scene_Text_Recognition_CVPR_2020_paper.html'
Arxiv: 'https://arxiv.org/abs/2005.03959'
Paper Reading URL: N/A
Code: N/A
Supported In MMOCR: N/S
PaperType:
- Algorithm
Abstract: 'The pursuit of high performance on public benchmarks has been the
driving force for research in scene text recognition, and notable progress has
been achieved. However, a close investigation reveals a startling fact that the
state-of-the-art methods perform well on images with words within vocabulary but
generalize poorly to images with words outside vocabulary. We call this
phenomenon “vocabulary reliance”. In this paper, we establish an analytical
framework to conduct an in-depth study on the problem of vocabulary reliance
in scene text recognition. Key findings include: (1) Vocabulary reliance is
ubiquitous, i.e., all existing algorithms more or less exhibit such
characteristic; (2) Attention-based decoders prove weak in generalizing to
words outside vocabulary and segmentation-based decoders perform well in
utilizing visual features; (3) Context modeling is highly coupled with the
prediction layers. These findings provide new insights and can benefit future
research in scene text recognition. Furthermore, we propose a simple yet
effective mutual learning strategy to allow models of two families
(attention-based and segmentation-based) to learn collaboratively. This remedy
alleviates the problem of vocabulary reliance and improves the overall scene
text recognition performance.'
MODELS:
Architecture:
- CTC
- Attention
Learning Method:
- Supervised
Language Modality:
- Implicit Language Model
Network Structure: 'https://user-images.githubusercontent.com/65173622/210054683-5d5f3117-4bee-43d6-a36c-8e645d47c2b1.png'
FPS:
DEVICE: N/A
ITEM: N/A
FLOPS:
DEVICE: N/A
ITEM: N/A
PARAMS: N/A
Experiment:
Training DataSets:
- ST
- MJ
Test DataSets:
Avg.: N/A
IIIT5K:
WAICS: N/A
SVT:
WAICS: N/A
IC13:
WAICS: N/A
IC15:
WAICS: N/A
SVTP:
WAICS: N/A
CUTE:
WAICS: N/A
Bibtex: '@inproceedings{wan2020vocabulary,
title={On vocabulary reliance in scene text recognition},
author={Wan, Zhaoyi and Zhang, Jielei and Zhang, Liang and Luo, Jiebo and Yao, Cong},
booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
pages={11425--11434},
year={2020}
}'
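
The mutual learning remedy described above can be pictured as a joint loss in
which the attention-based and segmentation-based recognizers exchange soft
targets. The function below is a simplified rendering of that idea, not the
authors' exact formulation.

    import torch
    import torch.nn.functional as F

    def mutual_learning_loss(logits_attn, logits_seg, target, alpha=1.0):
        """logits_*: (B, L, C) per-character predictions of the two model
        families; target: (B, L) ground-truth character indices."""
        ce = (F.cross_entropy(logits_attn.flatten(0, 1), target.flatten()) +
              F.cross_entropy(logits_seg.flatten(0, 1), target.flatten()))
        log_p_attn = F.log_softmax(logits_attn, dim=-1)
        log_p_seg = F.log_softmax(logits_seg, dim=-1)
        # symmetric KL so each model also learns from the other's predictions
        kl = (F.kl_div(log_p_attn, log_p_seg.exp().detach(), reduction="batchmean") +
              F.kl_div(log_p_seg, log_p_attn.exp().detach(), reduction="batchmean"))
        return ce + alpha * kl

    loss = mutual_learning_loss(torch.randn(2, 25, 37, requires_grad=True),
                                torch.randn(2, 25, 37, requires_grad=True),
                                torch.randint(0, 37, (2, 25)))
    loss.backward()
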
---
Title: 'Parallel and Robust Text Rectifier for Scene Text Recognition'
Abbreviation: PRTR
Tasks:
- TextRecog
Venue: BMVC
Year: 2022
Lab/Company:
- Visual Computing Group, Ping An Property & Casualty Insurance Company, Shenzhen, China
- Ping An Technology (Shenzhen) Co. Ltd.
- School of Information and Telecommunication Engineering, Guangzhou Maritime University, Guangzhou, China
URL:
Venue: 'https://bmvc2022.mpi-inf.mpg.de/0770.pdf'
Arxiv: 'https://bmvc2022.mpi-inf.mpg.de/0770.pdf'
Paper Reading URL: N/A
Code: N/A
Supported In MMOCR: N/S
PaperType:
- Algorithm
Abstract: 'Scene text recognition (STR) is to recognize text appearing in images.
Current state-of-the-art STR methods usually adopt a multi-stage framework which
uses a rectifier to iteratively rectify errors from previous stage. However, the
rectifiers of those models are not proficient in addressing the misalignment
problem. To alleviate this problem, we proposed a novel network named Parallel
and Robust Text Rectifier (PRTR), which consists of a bi-directional position
attention initial decoder and a sequence of stacked Robust Visual Semantic
Rectifiers (RVSRs). In essence, PRTR is creatively designed as a coarse-to-fine
architecture that exploits a sequence of rectifiers for repeatedly refining the
prediction in a stage-wise manner. RVSR is a core component in the proposed
model which comprises two key modules, Dual-Path Semantic Alignment (DPSA)
module and Visual-Linguistic Alignment (VLA). DPSA can rectify the linguistic
misalignment issues via the global semantic features that are derived from the
recognized characters as a whole, while VLA re-aligns the linguistic features
with visual features by an attention model to avoid the overfitting of
linguistic features. All parts of PRTR are non-autoregressive (parallel), and
its RVSR re-aligns its output according to the linguistic features and the
visual features, so it is robust to the mis-aligned error. Extensive experiments
on mainstream benchmarks demonstrate that the proposed model can alleviate
the misalignment problem to a large extent and outperformed state-of-the-art
models.'
MODELS:
Architecture:
- Transformer
Learning Method:
- Supervised
Language Modality:
- Explicit Language Model
Network Structure: 'https://user-images.githubusercontent.com/65173622/210052800-ab1f29d1-de7c-43bd-8297-b13cd83e28d3.png'
FPS:
DEVICE: N/A
ITEM: N/A
FLOPS:
DEVICE: N/A
ITEM: N/A
PARAMS: N/A
Experiment:
Training DataSets:
- ST
- SA
- MJ
Test DataSets:
Avg.: 93.3
IIIT5K:
WAICS: 97.0
SVT:
WAICS: 94.4
IC13:
WAICS: 95.8
IC15:
WAICS: 86.1
SVTP:
WAICS: 89.8
CUTE:
WAICS: 96.5
Bibtex: '@article{tang2021visual,
title={Visual-semantic transformer for scene text recognition},
author={Tang, Xin and Lai, Yongquan and Liu, Ying and Fu, Yuanyuan and Fang, Rui},
journal={arXiv preprint arXiv:2112.00948},
year={2021}
}'
---
Title: 'SGBANet: Semantic GAN and Balanced Attention Network for Arbitrarily Oriented Scene Text Recognition'
Abbreviation: SGBANet
Tasks:
- TextRecog
Venue: ECCV
Year: 2022
Lab/Company:
- Shanghai Key Laboratory of Multidimensional Information Processing, East China Normal University, Shanghai, China
- Faculty of Computer Science and Information Technology, University of Malaya, Kuala Lumpur, Malaysia
- iFLYTEK Research, iFLYTEK, Hefei, China
- CVPR Unit, Indian Statistical Institute, Kolkata, India
URL:
Venue: 'https://link.springer.com/chapter/10.1007/978-3-031-19815-1_27'
Arxiv: 'https://arxiv.org/abs/2207.10256'
Paper Reading URL: N/A
Code: N/A
Supported In MMOCR: N/S
PaperType:
- Algorithm
Abstract: 'Scene text recognition is a challenging task due to the complex
backgrounds and diverse variations of text instances. In this paper, we
propose a novel Semantic GAN and Balanced Attention Network (SGBANet) to
recognize the texts in scene images. The proposed method first generates
the simple semantic feature using Semantic GAN and then recognizes the scene
text with the Balanced Attention Module. The Semantic GAN aims to align the
semantic feature distribution between the support domain and target domain.
Different from the conventional image-to-image translation methods that
perform at the image level, the Semantic GAN performs the generation and
discrimination on the semantic level with the Semantic Generator Module
(SGM) and Semantic Discriminator Module (SDM). For target images (scene text
images), the Semantic Generator Module generates simple semantic features
that share the same feature distribution with support images (clear text
images). The Semantic Discriminator Module is used to distinguish the semantic
features between the support domain and target domain. In addition, a
Balanced Attention Module is designed to alleviate the problem of attention
drift. The Balanced Attention Module first learns a balancing parameter based
on the visual glimpse vector and semantic glimpse vector, and then performs
the balancing operation for obtaining a balanced glimpse vector. Experiments
on six benchmarks, including regular datasets, i.e., IIIT5K, SVT, ICDAR2013,
and irregular datasets, i.e., ICDAR2015, SVTP, CUTE80, validate the
effectiveness of our proposed method.'
MODELS:
Architecture:
- Attention
Learning Method:
- Supervised
Language Modality:
- Implicit Language Model
Network Structure: 'https://user-images.githubusercontent.com/65173622/210163800-3ecb592b-daae-450f-907b-cd239b2af1c0.png'
FPS:
DEVICE: N/A
ITEM: N/A
FLOPS:
DEVICE: N/A
ITEM: N/A
PARAMS: N/A
Experiment:
Training DataSets:
- ST
- MJ
Test DataSets:
Avg.: 88.21
IIIT5K:
WAICS: 95.4
SVT:
WAICS: 89.1
IC13:
WAICS: 95.1
IC15:
WAICS: 78.4
SVTP:
WAICS: 83.1
CUTE:
WAICS: 88.2
Bibtex: '@inproceedings{zhong2022sgbanet,
title={SGBANet: Semantic GAN and Balanced Attention Network for Arbitrarily Oriented Scene Text Recognition},
author={Zhong, Dajian and Lyu, Shujing and Shivakumara, Palaiahnakote and Yin, Bing and Wu, Jiajia and Pal, Umapada and Lu, Yue},
booktitle={European Conference on Computer Vision},
pages={464--480},
year={2022},
organization={Springer}
}'
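
The balancing operation described above reduces to a learned gate between the
visual and semantic glimpse vectors. A minimal sketch, with assumed dimensions
and module name, is given below.

    import torch
    import torch.nn as nn

    class BalancedAttention(nn.Module):
        def __init__(self, d=256):
            super().__init__()
            self.gate = nn.Linear(2 * d, 1)   # produces the balancing parameter

        def forward(self, g_visual, g_semantic):          # both (B, d)
            alpha = torch.sigmoid(self.gate(torch.cat([g_visual, g_semantic], dim=-1)))
            return alpha * g_visual + (1 - alpha) * g_semantic   # balanced glimpse

    balanced = BalancedAttention()(torch.randn(4, 256), torch.randn(4, 256))
    print(balanced.shape)   # torch.Size([4, 256])
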
---
Title: 'Scene Text Detection and Recognition: The Deep Learning Era'
Abbreviation: Long et al
Tasks:
- TextRecog
- TextDet
Venue: IJCV
Year: 2021
Lab/Company:
- Alibaba DAMO Academy, Beijing, China
URL:
Venue: 'https://link.springer.com/article/10.1007/s11263-020-01369-0'
Arxiv: 'https://arxiv.org/abs/1811.04256'
Paper Reading URL: N/A
Code: 'https://github.com/Jyouhou/SceneTextPapers'
Supported In MMOCR: N/S
PaperType:
- Survey
Abstract: 'With the rise and development of deep learning, computer vision has
been tremendously transformed and reshaped. As an important research area in
computer vision, scene text detection and recognition has been inevitably
influenced by this wave of revolution, consequently entering the era of
deep learning. In recent years, the community has witnessed substantial
advancements in mindset, methodology and performance. This survey is aimed at
summarizing and analyzing the major changes and significant progresses of
scene text detection and recognition in the deep learning era. Through this
article, we devote to: (1) introduce new insights and ideas; (2) highlight
recent techniques and benchmarks; (3) look ahead into future trends.
Specifically, we will emphasize the dramatic differences brought by deep
learning and remaining grand challenges. We expect that this review paper
would serve as a reference book for researchers in this field.'
MODELS:
Architecture:
- CTC
- Attention
- Transformer
Learning Method:
- Supervised
Language Modality:
- Explicit Language Model
- Implicit Language Model
Network Structure: N/A
FPS:
DEVICE: N/A
ITEM: N/A
FLOPS:
DEVICE: N/A
ITEM: N/A
PARAMS: N/A
Experiment:
Training DataSets: N/A
Test DataSets:
Avg.: N/A
IIIT5K:
WAICS: N/A
SVT:
WAICS: N/A
IC13:
WAICS: N/A
IC15:
WAICS: N/A
SVTP:
WAICS: N/A
CUTE:
WAICS: N/A
Bibtex: '@article{long2021scene,
title={Scene text detection and recognition: The deep learning era},
author={Long, Shangbang and He, Xin and Yao, Cong},
journal={International Journal of Computer Vision},
volume={129},
number={1},
pages={161--184},
year={2021},
publisher={Springer}
}'
---
Title: 'Vision Transformer for Fast and Efficient Scene Text Recognition'
Abbreviation: ViTSTR
Tasks:
- TextRecog
Venue: ICDAR
Year: 2021
Lab/Company:
- Electrical and Electronics Engineering Institute, University of the Philippines, Quezon City, Philippines
URL:
Venue: 'https://link.springer.com/chapter/10.1007/978-3-030-86549-8_21'
Arxiv: 'https://arxiv.org/abs/2105.08582'
Paper Reading URL: N/A
Code: 'https://github.com/roatienza/deep-text-recognition-benchmark'
Supported In MMOCR: N/S
PaperType:
- Algorithm
Abstract: 'Scene text recognition (STR) enables computers to read text in natural
scenes such as object labels, road signs and instructions. STR helps machines
perform informed decisions such as what object to pick, which direction to go,
and what is the next step of action. In the body of work on STR, the focus has
always been on recognition accuracy. There is little emphasis placed on speed
and computational efficiency which are equally important especially for
energy-constrained mobile machines. In this paper we propose ViTSTR, an STR
with a simple single stage model architecture built on a compute and parameter
efficient vision transformer (ViT). On a comparable strong baseline method such
as TRBA with accuracy of 84.3%, our small ViTSTR achieves a competitive accuracy
of 82.6% (84.2% with data augmentation) at 2.4× speed up, using only 43.4% of
the number of parameters and 42.2% FLOPS. The tiny version of ViTSTR achieves
80.3% accuracy (82.1% with data augmentation), at 2.5× the speed, requiring
only 10.9% of the number of parameters and 11.9% FLOPS. With data augmentation,
our base ViTSTR outperforms TRBA at 85.2% accuracy (83.7% without augmentation)
at 2.3× the speed but requires 73.2% more parameters and 61.5% more FLOPS. In
terms of trade-offs, nearly all ViTSTR configurations are at or near the frontiers
to maximize accuracy, speed and computational efficiency all at the same time.'
MODELS:
Architecture:
- Transformer
Learning Method:
- Supervised
Language Modality:
- Implicit Language Model
Network Structure: 'https://user-images.githubusercontent.com/65173622/210161050-476296e7-10e5-4ec9-9024-af6b5c5ee84b.png'
FPS:
DEVICE: N/A
ITEM: N/A
FLOPS:
DEVICE: 2080Ti
ITEM: 17.6e9
PARAMS: 85.8e6
Experiment:
Training DataSets:
- ST
- MJ
Test DataSets:
Avg.: 84.0
IIIT5K:
WAICS: 88.4
SVT:
WAICS: 87.7
IC13:
WAICS: 92.4
IC15:
WAICS: 72.6
SVTP:
WAICS: 81.8
CUTE:
WAICS: 81.3
Bibtex: '@inproceedings{atienza2021vision,
title={Vision transformer for fast and efficient scene text recognition},
author={Atienza, Rowel},
booktitle={International Conference on Document Analysis and Recognition},
pages={319--334},
year={2021},
organization={Springer}
}'
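
ViTSTR's single-stage design amounts to a plain ViT encoder whose leading
output tokens are classified directly into characters. The sketch below shows
that shape of model with made-up sizes; it is not the released implementation.

    import torch
    import torch.nn as nn

    class TinyViTSTR(nn.Module):
        def __init__(self, d=192, depth=12, max_len=25, n_cls=96):
            super().__init__()
            self.patch = nn.Conv2d(3, d, kernel_size=8, stride=8)    # 8x8 patches
            self.pos = nn.Parameter(torch.zeros(1, 64, d))           # 4x16 patches for a 32x128 input
            self.encoder = nn.TransformerEncoder(
                nn.TransformerEncoderLayer(d, nhead=3, batch_first=True), num_layers=depth)
            self.head = nn.Linear(d, n_cls)
            self.max_len = max_len

        def forward(self, img):                                      # (B, 3, 32, 128)
            x = self.patch(img).flatten(2).transpose(1, 2) + self.pos
            x = self.encoder(x)
            return self.head(x[:, :self.max_len])                    # (B, max_len, n_cls)

    logits = TinyViTSTR()(torch.randn(2, 3, 32, 128))
    print(logits.shape)   # torch.Size([2, 25, 96])
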
---
Title: 'Visual-Semantic Transformer for Scene Text Recognition'
Abbreviation: VST
Tasks:
- TextRecog
Venue: BMVC
Year: 2022
Lab/Company:
- Visual Computing Group, Ping An Property & Casualty Insurance Company, Shenzhen, China
- Ping An Technology (Shenzhen) Co. Ltd.
- School of Information and Telecommunication Engineering, Guangzhou Maritime University, Guangzhou, China
URL:
Venue: 'https://bmvc2022.mpi-inf.mpg.de/0772.pdf'
Arxiv: 'https://arxiv.org/abs/2112.00948'
Paper Reading URL: N/A
Code: N/A
Supported In MMOCR: N/S
PaperType:
- Algorithm
Abstract: 'Semantic information plays an important role in scene text recognition
(STR) as well as visual information. Although state-of-the-art models have
achieved great improvement in STR, they usually rely on extra external language
models to refine the semantic features through context information, and the
separate utilization of semantic and visual information leads to biased
results, which limits the performance of those models. In this paper, we
propose a novel model called Visual-Semantic Transformer (VST) for text
recognition. VST consists of several key modules, including a ConvNet, a visual
module, two visual-semantic modules, a visual-semantic feature interaction
module and a semantic module. VST is a conceptually much simpler model.
Different from existing STR models, VST can efficiently extract semantic
features without using external language models and it also allows visual
features and semantic features to interact with each other in parallel so that
global information from two domains can be fully exploited and more powerful
representations can be learned. The working mechanism of VST is highly similar
to our cognitive system, where the visual information is first captured by our
sensory organ, and is simultaneously transformed to semantic information by our
brain. Extensive experiments on seven public benchmarks including regular/
irregular text recognition datasets verify the effectiveness of VST; it
outperformed 14 other popular models on four out of seven benchmark datasets
and yielded competitive performance on the other three datasets.'
MODELS:
Architecture:
- Transformer
Learning Method:
- Supervised
Language Modality:
- Explicit Language Model
Network Structure: 'https://user-images.githubusercontent.com/65173622/210052231-22092115-0eba-4c2c-9050-b8fc9aff38ca.png'
FPS:
DEVICE: N/A
ITEM: N/A
FLOPS:
DEVICE: N/A
ITEM: N/A
PARAMS: N/A
Experiment:
Training DataSets:
- ST
- MJ
Test DataSets:
Avg.: 92.9
IIIT5K:
WAICS: 96.7
SVT:
WAICS: 94.0
IC13:
WAICS: 96.7
IC15:
WAICS: 85.4
SVTP:
WAICS: 89.0
CUTE:
WAICS: 95.5
Bibtex: '@article{tang2021visual,
title={Visual-semantic transformer for scene text recognition},
author={Tang, Xin and Lai, Yongquan and Liu, Ying and Fu, Yuanyuan and Fang, Rui},
journal={arXiv preprint arXiv:2112.00948},
year={2021}
}'
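
The parallel visual-semantic interaction described above can be sketched as two
cross-attention passes computed from the same input features, so neither
modality waits on the other. The module below is an illustrative stand-in, not
the paper's architecture.

    import torch
    import torch.nn as nn

    class VisualSemanticInteraction(nn.Module):
        def __init__(self, d=256, nhead=8):
            super().__init__()
            self.v_from_s = nn.MultiheadAttention(d, nhead, batch_first=True)
            self.s_from_v = nn.MultiheadAttention(d, nhead, batch_first=True)

        def forward(self, visual, semantic):              # (B, Nv, d), (B, Ns, d)
            # both directions read the *input* features, i.e. they run in parallel
            v_new, _ = self.v_from_s(visual, semantic, semantic)   # visual queries semantic
            s_new, _ = self.s_from_v(semantic, visual, visual)     # semantic queries visual
            return visual + v_new, semantic + s_new

    vsi = VisualSemanticInteraction()
    v, s = vsi(torch.randn(2, 64, 256), torch.randn(2, 25, 256))
    print(v.shape, s.shape)   # (2, 64, 256) (2, 25, 256)
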
---
Title: 'Why You Should Try the Real Data for the Scene Text Recognition'
Abbreviation: Loginov et al
Tasks:
- TextRecog
Venue: arXiv
Year: 2021
Lab/Company:
- Intel Corporation
URL:
Venue: 'https://arxiv.org/abs/2107.13938'
Arxiv: 'https://arxiv.org/abs/2107.13938'
Paper Reading URL: N/A
Code: 'https://github.com/openvinotoolkit/training_extensions'
Supported In MMOCR: N/S
PaperType:
- Algorithm
Abstract: 'Recent works in the text recognition area have pushed forward the
recognition results to the new horizons. But for a long time a lack of large
human-labeled natural text recognition datasets has been forcing researchers
to use synthetic data for training text recognition models. Even though
synthetic datasets are very large (MJSynth and SynthText, two most famous
synthetic datasets, have several million images each), their diversity could
be insufficient, compared to natural datasets like ICDAR and others.
Fortunately, the recently released text recognition annotation for the OpenImages
V5 dataset has a number of instances comparable to the synthetic datasets and more
diverse examples. We have used this annotation with a Text Recognition head
architecture from the Yet Another Mask Text Spotter and got comparable to the
SOTA results. On some datasets we have even outperformed previous SOTA models.
In this paper we also introduce a text recognition model. The model’s code is
available.'
MODELS:
Architecture:
- Attention
Learning Method:
- Supervised
Language Modality:
- Implicit Language Model
Network Structure: 'https://user-images.githubusercontent.com/65173622/210163669-0848839e-185f-4d8c-9de1-ac34e957d685.png'
FPS:
DEVICE: N/A
ITEM: N/A
FLOPS:
DEVICE: N/A
ITEM: N/A
PARAMS: N/A
Experiment:
Training DataSets:
- ST
- MJ
- Real
Test DataSets:
Avg.: 91.0
IIIT5K:
WAICS: 93.5
SVT:
WAICS: 94.7
IC13:
WAICS: 96.8
IC15:
WAICS: 80.2
SVTP:
WAICS: 89.9
CUTE:
WAICS: N/A
Bibtex: '@article{loginov2021you,
title={Why You Should Try the Real Data for the Scene Text Recognition},
author={Loginov, Vladimir},
journal={arXiv preprint arXiv:2107.13938},
year={2021}
}'