Commit

docs: Remove doc/datasets directory and fix docs/datasets documents (#…
SWHL authored Aug 19, 2024
1 parent b12d4ff commit 0ca03dd
Showing 48 changed files with 108 additions and 69 deletions.
Binary file removed doc/datasets/CASIA_0.jpg
Binary file removed doc/datasets/CDLA_demo/val_0633.jpg
Binary file removed doc/datasets/CDLA_demo/val_0941.jpg
Binary file removed doc/datasets/LSVT_1.jpg
Binary file removed doc/datasets/LSVT_2.jpg
Binary file removed doc/datasets/VoTT.jpg
Binary file removed doc/datasets/captcha_demo.png
Binary file removed doc/datasets/ccpd_demo.png
Binary file removed doc/datasets/ch_doc1.jpg
Binary file removed doc/datasets/ch_doc3.jpg
Binary file removed doc/datasets/ch_street_rec_1.png
Binary file removed doc/datasets/ch_street_rec_2.png
Binary file removed doc/datasets/cmb_demo.jpg
Binary file removed doc/datasets/crohme_demo/hme_00.jpg
Binary file removed doc/datasets/crohme_demo/hme_01.jpg
Binary file removed doc/datasets/crohme_demo/hme_02.jpg
Binary file removed doc/datasets/doc.jpg
Binary file removed doc/datasets/funsd_demo/gt_train_00040534.jpg
Binary file removed doc/datasets/funsd_demo/gt_train_00070353.jpg
Binary file removed doc/datasets/ic15_location_download.png
Binary file removed doc/datasets/icdar_rec.png
Binary file removed doc/datasets/labelimg.jpg
Binary file removed doc/datasets/labelme.jpg
Binary file removed doc/datasets/nist_demo.png
Binary file removed doc/datasets/publaynet_demo/gt_PMC3724501_00006.jpg
Binary file removed doc/datasets/publaynet_demo/gt_PMC5086060_00002.jpg
Binary file removed doc/datasets/rctw.jpg
Binary file removed doc/datasets/roLabelImg.png
Binary file removed doc/datasets/table_tal_demo/1.jpg
Binary file removed doc/datasets/table_tal_demo/2.jpg
Binary file removed doc/datasets/tablebank_demo/004.png
Binary file removed doc/datasets/tablebank_demo/005.png
Binary file removed doc/datasets/wildreceipt_demo/2769.jpeg
Binary file removed doc/datasets/xfund_demo/gt_zh_train_0.jpg
Binary file removed doc/datasets/xfund_demo/gt_zh_train_1.jpg
48 changes: 41 additions & 7 deletions docs/datasets/datasets.en.md
@@ -32,7 +32,9 @@ In addition to opensource data, users can also use synthesis tools to synthesize
- **Introduction**: A total of 290000 images are included, of which 210000 form the training set (with labels) and 80000 form the test set (without labels). The dataset was collected from Chinese street views by cropping text-line regions (such as shop signs and landmarks) out of the street-view pictures. All images are preprocessed: an affine transform proportionally maps each text region to an image with a height of 48 pixels, as shown in the figure:

![](./images/ch_street_rec_1.png)

(a) Label: 魅派集成吊顶

![](./images/ch_street_rec_2.png)
(b) Label: 母婴用品连锁
- **Download link**
@@ -42,13 +44,15 @@ In addition to opensource data, users can also use synthesis tools to synthesize

- **Data sources**<https://github.com/YCG09/chinese_ocr>
- **Introduction**
- A total of 3.64 million images, split into a training set and a validation set at a ratio of 99:1.
- The data is randomly generated from a Chinese corpus (news + classical Chinese) by varying font, size, grayscale, blur, perspective, stretching, etc.
- 5990 characters in total, including Chinese characters, English letters, digits and punctuation (character set: <https://github.com/YCG09/chinese_ocr/blob/master/train/char_std_5990.txt>)
- Each sample contains exactly 10 characters, cut at random from sentences in the corpus
- Image resolution is 280x32
![](./images/ch_doc1.jpg)
![](./images/ch_doc3.jpg)
- A total of 3.64 million images, split into a training set and a validation set at a ratio of 99:1.
- The data is randomly generated from a Chinese corpus (news + classical Chinese) by varying font, size, grayscale, blur, perspective, stretching, etc.
- 5990 characters in total, including Chinese characters, English letters, digits and punctuation (character set: <https://github.com/YCG09/chinese_ocr/blob/master/train/char_std_5990.txt>)
- Each sample contains exactly 10 characters, cut at random from sentences in the corpus
- Image resolution is 280x32

![](./images/ch_doc1.jpg)

![](./images/ch_doc3.jpg)
- **Download link**<https://pan.baidu.com/s/1QkI7kjah8SPHwOQ40rS1Pw> (Password: lu7m)
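
The recipe described for this synthetic dataset (fixed 10-character samples cut at random from corpus sentences, then a 99:1 train/validation split) can be sketched as follows. This is only an illustration, not the script used to build the dataset: the function names `cut_fixed_length` and `split_train_val`, the seeds, and the toy corpus are assumptions, and image rendering (font, blur, perspective and so on) is left out.

```python
import random


def cut_fixed_length(sentence, length=10, rng=None):
    """Return a random `length`-character slice of `sentence`, or None if it is too short."""
    rng = rng or random
    if len(sentence) < length:
        return None
    start = rng.randint(0, len(sentence) - length)  # randint is inclusive on both ends
    return sentence[start:start + length]


def split_train_val(samples, train_parts=99, val_parts=1, seed=0):
    """Shuffle `samples` and split them at a train_parts:val_parts ratio."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)
    n_val = len(shuffled) * val_parts // (train_parts + val_parts)
    return shuffled[n_val:], shuffled[:n_val]


# Toy corpus (hypothetical); a real corpus would be news and classical Chinese text.
corpus = ["abcdefghijklmnopqrstuvwxyz", "0123456789ABCDEFGHIJ", "short"]
rng = random.Random(42)
labels = [cut_fixed_length(s, rng=rng) for s in corpus]
labels = [text for text in labels if text is not None]  # drop sentences under 10 characters
train, val = split_train_val(range(1000))
print(labels, len(train), len(val))
```

Sentences shorter than the fixed length are skipped rather than padded; that is a choice made for this sketch, since the source does not say how short sentences were handled.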

#### 5、ICDAR2019-ArT
@@ -57,3 +61,33 @@ In addition to opensource data, users can also use synthesis tools to synthesize
- **Introduction**: It includes 10166 images, 5603 in the training set and 4563 in the test set. It is composed of three parts: Total-Text, SCUT-CTW1500 and Baidu Curved Scene Text, and covers text of various shapes such as horizontal, multi-oriented and curved.
![](./images/ArT.jpg)
- **Download link**<https://ai.baidu.com/broad/download?dataset=art>

#### 6. Electronic seal dataset

- **Data source**: <https://aistudio.baidu.com/aistudio/datasetdetail/154271/0>
- **Data introduction**: Contains 10,000 images in total: 8,000 in the training set and 2,000 in the test set. The dataset is synthesized by a program and does not involve privacy or security concerns. It is mainly used for training and detecting curved text on seals. Contributed by developer [jingsongliujing](https://github.com/jingsongliujing)
- **Download address**: <https://aistudio.baidu.com/aistudio/datasetdetail/154271/0>

## References

**ICDAR 2019-LSVT Challenge**

```bibtex
@article{sun2019icdar,
title={ICDAR 2019 Competition on Large-scale Street View Text with Partial Labeling--RRC-LSVT},
author={Sun, Yipeng and Ni, Zihan and Chng, Chee-Kheng and Liu, Yuliang and Luo, Canjie and Ng, Chun Chet and Han, Junyu and Ding, Errui and Liu, Jingtuo and Karatzas, Dimosthenis and others},
journal={arXiv preprint arXiv:1909.07741},
year={2019}
}
```

**ICDAR 2019-ArT Challenge**

```bibtex
@article{chng2019icdar2019,
title={ICDAR2019 Robust Reading Challenge on Arbitrary-Shaped Text (RRC-ArT)},
author={Chng, Chee-Kheng and Liu, Yuliang and Sun, Yipeng and Ng, Chun Chet and Luo, Canjie and Ni, Zihan and Fang, ChuanMing and Zhang, Shuaitao and Han, Junyu and Ding, Errui and others},
journal={arXiv preprint arXiv:1909.07145},
year={2019}
}
```
19 changes: 12 additions & 7 deletions docs/datasets/datasets.md
@@ -30,8 +30,11 @@ comments: true

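- **Data source**: <https://aistudio.baidu.com/aistudio/competition/detail/8>
- **Introduction**: The ICDAR2019-LSVT text-line recognition task includes 290000 images in total, of which 210000 form the training set (annotated) and 80000 form the test set (unannotated). The dataset was collected from Chinese street views by cropping text-line regions (such as shop signs and landmarks) out of the street-view images. All images are preprocessed: an affine transform proportionally maps each text region to an image 48 pixels high, as shown in the figure:

![](./images/ch_street_rec_1.png)

(a) Label: 魅派集成吊顶

![](./images/ch_street_rec_2.png)
(b) Label: 母婴用品连锁
- **Download link**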
@@ -41,13 +44,15 @@ comments: true

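- **Data source**: <https://github.com/YCG09/chinese_ocr>
- **Introduction**
- About 3.64 million images in total, split into a training set and a validation set at a ratio of 99:1.
- The data is randomly generated from a Chinese corpus (news + classical Chinese) by varying font, size, grayscale, blur, perspective, stretching, etc.
- 5990 characters in total, covering Chinese characters, English letters, digits and punctuation (character set: <https://github.com/YCG09/chinese_ocr/blob/master/train/char_std_5990.txt>)
- Each sample contains exactly 10 characters, cut at random from sentences in the corpus
- Image resolution is uniformly 280x32
![](./images/ch_doc1.jpg)
![](./images/ch_doc3.jpg)
- About 3.64 million images in total, split into a training set and a validation set at a ratio of 99:1.
- The data is randomly generated from a Chinese corpus (news + classical Chinese) by varying font, size, grayscale, blur, perspective, stretching, etc.
- 5990 characters in total, covering Chinese characters, English letters, digits and punctuation (character set: <https://github.com/YCG09/chinese_ocr/blob/master/train/char_std_5990.txt>)
- Each sample contains exactly 10 characters, cut at random from sentences in the corpus
- Image resolution is uniformly 280x32

![](./images/ch_doc1.jpg)

![](./images/ch_doc3.jpg)
- **Download link**: <https://pan.baidu.com/s/1QkI7kjah8SPHwOQ40rS1Pw> (password: lu7m)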

#### 5、ICDAR2019-ArT
4 changes: 2 additions & 2 deletions docs/datasets/handwritten_datasets.en.md
@@ -14,9 +14,9 @@ Here we have sorted out the commonly used handwritten OCR dataset datasets, whic

- **Data source**: <http://www.nlpr.ia.ac.cn/databases/handwriting/Download.html>
- **Data introduction**:
- It includes online and offline handwritten data. `HWDB1.0~1.2` contains 3895135 handwritten single-character samples in 7356 categories (7185 Chinese characters and 171 English letters, digits and symbols); `HWDB2.0~2.2` contains 5091 page images, segmented into 52230 text lines and 1349414 characters. All character and text samples are stored as grayscale images. Some single-character samples are shown below.
- It includes online and offline handwritten data. `HWDB1.0~1.2` contains 3895135 handwritten single-character samples in 7356 categories (7185 Chinese characters and 171 English letters, digits and symbols); `HWDB2.0~2.2` contains 5091 page images, segmented into 52230 text lines and 1349414 characters. All character and text samples are stored as grayscale images. Some single-character samples are shown below.

![](./images/CASIA_0.jpg)
![](./images/CASIA_0.jpg)

- **Download address**:<http://www.nlpr.ia.ac.cn/databases/handwriting/Download.html>
- **Usage suggestions**: The data consists of single characters on a white background, which can be combined into a large number of text lines for training. The white background can be made transparent so that various backgrounds are easy to add. Where semantics matter, it is suggested to pick single characters according to a real corpus to form text lines.
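
The suggestion to turn the white background transparent and then add other backgrounds can be illustrated with a dependency-free sketch. Everything here is an assumption made for illustration: the images are plain nested lists of 0-255 grayscale values, the helper names `white_to_transparent` and `paste_on_background` are hypothetical, and the threshold of 250 is arbitrary. A real pipeline would normally use an image library instead.

```python
def white_to_transparent(gray_rows, threshold=250):
    """Map grayscale pixels to RGBA tuples; near-white pixels get alpha 0."""
    return [
        [(v, v, v, 0 if v >= threshold else 255) for v in row]
        for row in gray_rows
    ]


def paste_on_background(rgba_rows, background_rows):
    """Overlay RGBA glyph pixels on an RGB background of the same size."""
    out = []
    for glyph_row, bg_row in zip(rgba_rows, background_rows):
        out.append([
            bg if a == 0 else (r, g, b)  # transparent pixels show the background
            for (r, g, b, a), bg in zip(glyph_row, bg_row)
        ])
    return out


# Toy 2x2 "character": 0 is ink, 255 is the white paper.
glyph = [[255, 0], [0, 255]]
background = [[(10, 20, 30), (10, 20, 30)], [(40, 50, 60), (40, 50, 60)]]
composited = paste_on_background(white_to_transparent(glyph), background)
print(composited)
```

A hard threshold leaves jagged edges around anti-aliased strokes; scaling alpha with darkness would be a smoother alternative.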
2 changes: 1 addition & 1 deletion docs/datasets/handwritten_datasets.md
@@ -11,7 +11,7 @@ comments: true

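- **Data source**: <http://www.nlpr.ia.ac.cn/databases/handwriting/Download.html>
- **Introduction**
- It includes online and offline handwritten data. `HWDB1.0~1.2` contains 3895135 handwritten single-character samples in 7356 categories (7185 Chinese characters and 171 English letters, digits and symbols); `HWDB2.0~2.2` contains 5091 page images, segmented into 52230 text lines and 1349414 characters. All character and text samples are stored as grayscale images. Some single-character samples are shown below.
- It includes online and offline handwritten data. `HWDB1.0~1.2` contains 3895135 handwritten single-character samples in 7356 categories (7185 Chinese characters and 171 English letters, digits and symbols); `HWDB2.0~2.2` contains 5091 page images, segmented into 52230 text lines and 1349414 characters. All character and text samples are stored as grayscale images. Some single-character samples are shown below.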

![](./images/CASIA_0.jpg)

File renamed without changes
4 changes: 2 additions & 2 deletions docs/datasets/kie_datasets.en.md
@@ -43,5 +43,5 @@ Here are the common datasets key information extraction, which are being updated
**Note:** Boxes with category `Ignore` or `Others` are not visualized here.

- **Download address**
- Official dataset: [link](https://download.openmmlab.com/mmocr/data/wildreceipt.tar)
- Dataset converted for PaddleOCR training process: [link](https://paddleocr.bj.bcebos.com/ppstructure/dataset/wildreceipt.tar)
- Official dataset: [link](https://download.openmmlab.com/mmocr/data/wildreceipt.tar)
- Dataset converted for PaddleOCR training process: [link](https://paddleocr.bj.bcebos.com/ppstructure/dataset/wildreceipt.tar)
4 changes: 2 additions & 2 deletions docs/datasets/kie_datasets.md
@@ -45,5 +45,5 @@ comments: true
**Note:** Text with category `Ignore` or `Others` is not visualized here.

- **Download address**
- Original dataset: [link](https://download.openmmlab.com/mmocr/data/wildreceipt.tar)
- Dataset converted to the format used for PaddleOCR training: [link](https://paddleocr.bj.bcebos.com/ppstructure/dataset/wildreceipt.tar)
- Original dataset: [link](https://download.openmmlab.com/mmocr/data/wildreceipt.tar)
- Dataset converted to the format used for PaddleOCR training: [link](https://paddleocr.bj.bcebos.com/ppstructure/dataset/wildreceipt.tar)
2 changes: 1 addition & 1 deletion docs/datasets/ocr_datasets.en.md
@@ -14,7 +14,7 @@ Here is a list of public datasets commonly used in OCR, which are being continuo
The annotation file formats supported by the PaddleOCR text detection algorithm are as follows, separated by "\t":

```text linenums="1"
" Image file name Image annotation information encoded by json.dumps"
"Image file name Image annotation information encoded by json.dumps"
ch4_test_images/img_61.jpg [{"transcription": "MASA", "points": [[310, 104], [416, 141], [418, 216], [312, 179]]}, {...}]
```
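
A label line in this format can be read back by splitting on the first tab and decoding the rest with `json.loads`. The helper name `parse_det_label_line` is hypothetical, and this is a minimal sketch rather than PaddleOCR's own loader.

```python
import json


def parse_det_label_line(line):
    """Split one detection label line into (image_path, annotations)."""
    # Split only on the first tab so the JSON payload stays intact.
    image_path, raw_json = line.rstrip("\n").split("\t", 1)
    return image_path, json.loads(raw_json)


sample = (
    "ch4_test_images/img_61.jpg\t"
    '[{"transcription": "MASA", '
    '"points": [[310, 104], [416, 141], [418, 216], [312, 179]]}]'
)
image_path, annotations = parse_det_label_line(sample)
for ann in annotations:
    # Each annotation holds the text and the four corner points of its box.
    print(image_path, ann["transcription"], ann["points"])
```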

48 changes: 24 additions & 24 deletions docs/datasets/vertical_and_multilingual_datasets.en.md
@@ -18,38 +18,38 @@ Here we have sorted out the commonly used vertical multi-language OCR dataset da

- **Data introduction**: It contains more than 250000 images of Chinese city license plates, with annotations for license plate detection and recognition. It covers license plate images from the following scenes.

- CCPD-Base: General license plate images
- CCPD-DB: The license plate area is bright, dark or unevenly lit
- CCPD-FN: The license plate is relatively far from or close to the camera
- CCPD-Rotate: The license plate is rotated (horizontal 20\~50 degrees, vertical -10\~10 degrees)
- CCPD-Tilt: The license plate is rotated (horizontal 15\~45 degrees, vertical 15\~45 degrees)
- CCPD-Blur: The license plate is blurred due to camera shake
- CCPD-Weather: The license plate was photographed in rain, snow or fog
- CCPD-Challenge: Some of the most challenging images to date for license plate detection and recognition
- CCPD-NP: Images of new cars without license plates.

![](./images/ccpd_demo.png)
- CCPD-Base: General license plate images
- CCPD-DB: The license plate area is bright, dark or unevenly lit
- CCPD-FN: The license plate is relatively far from or close to the camera
- CCPD-Rotate: The license plate is rotated (horizontal 20\~50 degrees, vertical -10\~10 degrees)
- CCPD-Tilt: The license plate is rotated (horizontal 15\~45 degrees, vertical 15\~45 degrees)
- CCPD-Blur: The license plate is blurred due to camera shake
- CCPD-Weather: The license plate was photographed in rain, snow or fog
- CCPD-Challenge: Some of the most challenging images to date for license plate detection and recognition
- CCPD-NP: Images of new cars without license plates.

![](./images/ccpd_demo.png)

- **Download address**
- Baidu cloud download address (extraction code: hm0U): [https://pan.baidu.com/s/1i5AOjAbtkwb17Zy-NQGqkw](https://pan.baidu.com/s/1i5AOjAbtkwb17Zy-NQGqkw)
- Google Drive download address: [https://drive.google.com/file/d/1rdEsCUcIUaYOVRkx5IMTRNA7PcGMmSgc/view](https://drive.google.com/file/d/1rdEsCUcIUaYOVRkx5IMTRNA7PcGMmSgc/view)
- Baidu cloud download address (extraction code: hm0U): [https://pan.baidu.com/s/1i5AOjAbtkwb17Zy-NQGqkw](https://pan.baidu.com/s/1i5AOjAbtkwb17Zy-NQGqkw)
- Google Drive download address: [https://drive.google.com/file/d/1rdEsCUcIUaYOVRkx5IMTRNA7PcGMmSgc/view](https://drive.google.com/file/d/1rdEsCUcIUaYOVRkx5IMTRNA7PcGMmSgc/view)

## Bank credit card dataset

- **Data source**: [source](https://www.kesci.com/home/dataset/5954cf1372ead054a5e25870)

- **Data introduction**: There are three types of training data
- 1. Sample card data of China Merchants Bank: card face images and annotation data, 618 images in total
- 2. Single-character data: images and annotation data, 37 images in total.
- 3. Card faces of other banks only, without more detailed information, 50 images in total.
- 1. Sample card data of China Merchants Bank: card face images and annotation data, 618 images in total
- 2. Single-character data: images and annotation data, 37 images in total.
- 3. Card faces of other banks only, without more detailed information, 50 images in total.

- The demo image is shown below. The annotation information is stored in an Excel sheet, and the demo image below is annotated as
- First 8 digits of the card number: 62257583
- Card type: card issued by this bank
- Expiry date: 07/41
- Cardholder name in pinyin: MICHAEL
- The demo image is shown below. The annotation information is stored in an Excel sheet, and the demo image below is annotated as
- First 8 digits of the card number: 62257583
- Card type: card issued by this bank
- Expiry date: 07/41
- Cardholder name in pinyin: MICHAEL

![](./images/cmb_demo.jpg)
![](./images/cmb_demo.jpg)

- **Download address**: [cmb2017-2.zip](https://cdn.kesci.com/cmb2017-2.zip)

@@ -66,7 +66,7 @@ Here we have sorted out the commonly used vertical multi-language OCR dataset da

- **Data source**: [source](https://rrc.cvc.uab.es/?ch=15&com=downloads)
- **Data introduction**: The multilingual detection dataset MLT covers both language identification and detection tasks.
- In the detection task, the training set contains 10000 images in 10 languages, and each language contains 1000 training images. The test set contains 10000 images.
- In the recognition task, the training set contains 111998 samples.
- In the detection task, the training set contains 10000 images in 10 languages, and each language contains 1000 training images. The test set contains 10000 images.
- In the recognition task, the training set contains 111998 samples.
- **Download address**: The training set is large and can be downloaded in two parts. It can only be downloaded after registering on the website:
[source](https://rrc.cvc.uab.es/?ch=15&com=downloads)
46 changes: 23 additions & 23 deletions docs/datasets/vertical_and_multilingual_datasets.md
@@ -11,36 +11,36 @@ comments: true

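- **Data source**: [CCPD](https://github.com/detectRecog/CCPD)
- **Introduction**: Contains more than 250000 images of Chinese city license plates, with annotations for license plate detection and recognition. It covers license plate images from the following scenes.
- CCPD-Base: General license plate images
- CCPD-DB: The license plate area is bright, dark or unevenly lit
- CCPD-FN: The license plate is relatively far from or close to the camera
- CCPD-Rotate: The license plate is rotated (horizontal 20\~50 degrees, vertical -10\~10 degrees)
- CCPD-Tilt: The license plate is rotated (horizontal 15\~45 degrees, vertical 15\~45 degrees)
- CCPD-Blur: The license plate is blurred due to camera shake
- CCPD-Weather: The license plate was photographed in rain, snow or fog
- CCPD-Challenge: Some of the most challenging images to date for license plate detection and recognition
- CCPD-NP: Images of new cars without license plates.

![](./images/ccpd_demo.png)
- CCPD-Base: General license plate images
- CCPD-DB: The license plate area is bright, dark or unevenly lit
- CCPD-FN: The license plate is relatively far from or close to the camera
- CCPD-Rotate: The license plate is rotated (horizontal 20\~50 degrees, vertical -10\~10 degrees)
- CCPD-Tilt: The license plate is rotated (horizontal 15\~45 degrees, vertical 15\~45 degrees)
- CCPD-Blur: The license plate is blurred due to camera shake
- CCPD-Weather: The license plate was photographed in rain, snow or fog
- CCPD-Challenge: Some of the most challenging images to date for license plate detection and recognition
- CCPD-NP: Images of new cars without license plates.

![](./images/ccpd_demo.png)

- **Download address**
- Baidu cloud download address (extraction code: hm0U): [link](https://pan.baidu.com/s/1i5AOjAbtkwb17Zy-NQGqkw)
- Google Drive download address: [link](https://drive.google.com/file/d/1rdEsCUcIUaYOVRkx5IMTRNA7PcGMmSgc/view)
- Baidu cloud download address (extraction code: hm0U): [link](https://pan.baidu.com/s/1i5AOjAbtkwb17Zy-NQGqkw)
- Google Drive download address: [link](https://drive.google.com/file/d/1rdEsCUcIUaYOVRkx5IMTRNA7PcGMmSgc/view)

## Bank credit card dataset

- **Data source**: [source](https://www.kesci.com/home/dataset/5954cf1372ead054a5e25870)

- **Introduction**: The training data comes in three types
- 1. Sample card data of China Merchants Bank: card face images and annotation data, 618 images in total
- 2. Single-character data: images and annotation data, 37 images in total.
- 3. Card faces of other banks only, without more detailed information, 50 images in total.
- 1. Sample card data of China Merchants Bank: card face images and annotation data, 618 images in total
- 2. Single-character data: images and annotation data, 37 images in total.
- 3. Card faces of other banks only, without more detailed information, 50 images in total.

- The demo image is shown below. The annotation information is stored in an Excel sheet, and the demo image below is annotated as
- First 8 digits of the card number: 62257583
- Card type: card issued by this bank
- Expiry date: 07/41
- Cardholder name in pinyin: MICHAEL
- The demo image is shown below. The annotation information is stored in an Excel sheet, and the demo image below is annotated as
- First 8 digits of the card number: 62257583
- Card type: card issued by this bank
- Expiry date: 07/41
- Cardholder name in pinyin: MICHAEL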

![](./images/cmb_demo.jpg)

@@ -59,7 +59,7 @@ comments: true

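- **Data source**: [source](https://rrc.cvc.uab.es/?ch=15&com=downloads)
- **Introduction**: The multilingual detection dataset MLT covers both language identification and detection tasks.
- In the detection task, the training set contains 10000 images in 10 languages, with 1000 training images per language. The test set contains 10000 images.
- In the recognition task, the training set contains 111998 samples.
- In the detection task, the training set contains 10000 images in 10 languages, with 1000 training images per language. The test set contains 10000 images.
- In the recognition task, the training set contains 111998 samples.
- **Download address**: The training set is large and is downloaded in 2 parts; registration on the website is required before downloading:
[link](https://rrc.cvc.uab.es/?ch=15&com=downloads)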
