
Parsing line Error: list index out of range #5101

Closed
WZMIAOMIAO opened this issue Dec 28, 2021 · 2 comments
WZMIAOMIAO commented Dec 28, 2021

  • System Environment: Ubuntu 18.04
  • Version: Paddle 2.2.1
  • PaddleOCR: release2.4
  • Related components: pyclipper
  • Dataset: https://paddleocr.bj.bcebos.com/dataset/det_data_lesson_demo.tar
  • Command: !python tools/train.py -c configs/det/det_mv3_db.yml
  • Complete Error Message:
[2021/12/23 12:41:26] root INFO: train dataloader has 94 iters
[2021/12/23 12:41:26] root INFO: valid dataloader has 250 iters
[2021/12/23 12:41:26] root INFO: During the training process, after the 0th iteration, an evaluation is run every 500 iterations
[2021/12/23 12:41:26] root INFO: Initialize indexs of datasets:['/home/aistudio/work/data/det_data_lesson_demo/train.txt']
[2021/12/23 12:41:54] root INFO: epoch: [1/100], iter: 10, lr: 0.000027, loss: 9.582685, loss_shrink_maps: 4.681584, loss_threshold_maps: 3.961636, loss_binary_maps: 0.939466, reader_cost: 1.81348 s, batch_cost: 2.77511 s, samples: 88, ips: 3.17105
[2021/12/23 12:41:58] root ERROR: When parsing line mtwi/train/TB1_5H8n3vD8KJjy0FlXXagBFXa_!!0-item_pic.jpg.jpg	[{"transcription": "\u6d53\u7f29\u9664\u81ed\u6db2", "points": [[473.55, 99.64], [456.18, 41.82], [778.73, 39.82], [777.73, 105.82]]}, {"transcription": "1000ml", "points": [[476.27, 158.73], [477.27, 129.09], [618.55, 124.09], [618.55, 158.73]]}, {"transcription": "\u62b510\u74f6", "points": [[647.55, 121.64], [652.55, 165.64], [771.09, 166.64], [773.09, 121.64]]}, {"transcription": "\u9001", "points": [[691.82, 347.45], [690.82, 437.36], [768.55, 426.36], [777.0, 345.36]]}, {"transcription": "YaHo\u4e9a\u79be", "points": [[94.0, 289.0], [94.0, 305.73], [164.73, 305.73], [164.73, 287.0]]}, {"transcription": "YaHo\u4e9a\u79be", "points": [[242.55, 290.0], [242.55, 303.27], [317.45, 303.27], [316.45, 287.0]]}, {"transcription": "YaHo\u4e9a\u79be", "points": [[650.55, 476.36], [651.55, 485.82], [694.91, 486.82], [695.91, 477.36]]}, {"transcription": "Disiuf", "points": [[48.36, 325.55], [46.55, 359.09], [154.55, 363.45], [156.55, 330.55]]}, {"transcription": "spray", "points": [[61.45, 362.73], [61.45, 378.91], [123.27, 377.0], [121.27, 361.91]]}, {"transcription": "spray", "points": [[211.73, 360.27], [214.73, 377.73], [272.64, 377.73], [269.64, 360.27]]}, {"transcription": "Disiufectaut", "points": [[198.73, 324.55], [199.73, 361.82], [387.0, 357.82], [390.0, 328.55]]}, {"transcription": "\u5ba0\u7269\u9664\u81ed\u6db2", "points": [[271.64, 379.64], [272.64, 400.0], [369.82, 399.0], [371.82, 381.64]]}, {"transcription": "\u5ba0\u7269", "points": [[125.73, 379.64], [122.73, 399.27], [153.82, 401.27], [154.82, 381.64]]}, {"transcription": "\u6d53\u7f29\u578b", "points": [[63.82, 380.82], [67.82, 397.55], [116.73, 399.55], [114.73, 382.82]]}, {"transcription": "\u6d53\u7f29\u578b", "points": [[216.91, 382.55], [214.91, 401.27], [267.27, 399.27], [269.27, 383.55]]}, {"transcription": "\u8309\u8389\u82b1\u82ac\u82b3", "points": [[63.18, 422.09], [61.18, 
429.36], [104.82, 429.36], [105.82, 422.09]]}, {"transcription": "\u8309\u8389\u82b1\u82ac\u82b3", "points": [[211.73, 421.09], [211.73, 429.82], [256.64, 430.82], [254.64, 421.09]]}, {"transcription": "\u51c0\u542b\u91cf\uff1a1000ML", "points": [[63.18, 442.91], [61.18, 454.55], [144.36, 453.55], [143.36, 442.91]]}, {"transcription": "\u51c0\u542b\u91cf\uff1a1000ML", "points": [[216.18, 444.64], [213.18, 457.27], [294.91, 454.27], [296.91, 445.64]]}, {"transcription": "\u51c0\u542b\u91cf", "points": [[703.8, 615.47], [705.8, 621.8], [721.4, 620.8], [723.4, 615.47]]}, {"transcription": "500ML", "points": [[704.2, 622.4], [702.2, 626.67], [720.2, 627.67], [720.2, 622.4]]}, {"transcription": "\u8309\u8389\u82b1\u82ac\u82b3", "points": [[661.07, 546.13], [661.07, 553.53], [689.93, 553.53], [691.93, 547.13]]}, {"transcription": "\u5ba0\u7269\u795b\u5473\u55b7\u96fe", "points": [[640.8, 537.13], [642.8, 526.87], [713.2, 526.87], [710.2, 537.13]]}, {"transcription": "Healthy", "points": [[62.27, 408.93], [64.27, 419.6], [101.07, 417.6], [103.07, 408.93]]}, {"transcription": "Antiscptic", "points": [[104.27, 408.53], [104.27, 420.67], [152.33, 420.67], [154.33, 408.53]]}, {"transcription": "###", "points": [[213.13, 406.93], [215.13, 417.0], [249.33, 417.0], [249.33, 406.93]]}, {"transcription": "Antiscptic&DeodorantForPet", "points": [[253.13, 417.0], [253.6, 408.47], [408.2, 407.93], [401.27, 418.07]]}, {"transcription": "Healthy", "points": [[224.2, 406.4], [224.2, 407.93], [221.67, 406.93], [223.67, 407.4]]}, {"transcription": "DEODORANT", "points": [[627.0, 505.07], [627.0, 496.47], [724.53, 495.47], [724.53, 505.07]]}, {"transcription": "SPRAY", "points": [[651.47, 519.87], [651.47, 509.2], [698.0, 508.2], [702.0, 518.87]]}, {"transcription": "\u4e70\u4e00\u9001\u4e00", "points": [[27.07, 790.8], [12.93, 645.2], [484.67, 631.93], [450.8, 786.27]]}, {"transcription": "###", "points": [[123.93, 700.2], [121.93, 700.73], [124.47, 700.73], [124.47, 700.2]]}, 
{"transcription": "\u5206\u89e3\u81ed\u5473", "points": [[515.0, 793.07], [514.0, 710.53], [786.0, 700.53], [793.0, 784.07]]}, {"transcription": "\u9001\u9664\u81ed\u55b7\u96fe500ml", "points": [[522.2, 698.0], [522.2, 666.47], [786.87, 662.47], [788.87, 697.0]]}]
, error happened with msg: Traceback (most recent call last):
  File "/home/aistudio/work/PaddleOCR/ppocr/data/simple_dataset.py", line 119, in __getitem__
    outs = transform(data, self.ops)
  File "/home/aistudio/work/PaddleOCR/ppocr/data/imaug/__init__.py", line 43, in transform
    data = op(data)
  File "/home/aistudio/work/PaddleOCR/ppocr/data/imaug/make_border_map.py", line 60, in __call__
    self.draw_border_map(text_polys[i], canvas, mask=mask)
  File "/home/aistudio/work/PaddleOCR/ppocr/data/imaug/make_border_map.py", line 81, in draw_border_map
    padded_polygon = np.array(padding.Execute(distance)[0])
IndexError: list index out of range
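For reference, each line in a PaddleOCR detection label file is an image path and a JSON list of annotations separated by a tab, as seen in the failing line above. A minimal sketch of how such a line parses (the sample line here is a shortened stand-in, not the full failing line):

```python
import json

# Stand-in label line: image path, a tab, then a JSON list of annotations.
line = ('test.jpg\t[{"transcription": "1000ml", "points": '
        '[[476.27, 158.73], [477.27, 129.09], [618.55, 124.09], [618.55, 158.73]]}]')

img_path, label = line.split('\t', 1)
annotations = json.loads(label)

print(img_path)                          # test.jpg
print(annotations[0]["transcription"])   # 1000ml
print(len(annotations[0]["points"]))     # 4
```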

Problem Description

I previously opened a related issue #5029, which raised two problems:

  • The first was an image-reading problem in Paddle (I have already submitted a PR for that).
  • The second was an error while parsing label data, whose cause I only tracked down in the last couple of days.

Through debugging I found that for some especially small target regions (or badly annotated data), the result returned by pyclipper after shrinking is an empty list. Here is a minimal test script:

import numpy as np
import pyclipper

# Degenerate quadrilateral taken from the failing label line above
subject = [(179.67479414146868, 330.7079846112311),
           (179.72345774838013, 331.9309867710583),
           (177.66925958423326, 331.2121355291577),
           (179.2829373650108, 331.5241968142549)]
distance = 0.07327365999299185

padding = pyclipper.PyclipperOffset()
padding.AddPath(subject, pyclipper.JT_ROUND, pyclipper.ET_CLOSEDPOLYGON)
result = padding.Execute(distance)
print(result)  # []
padded_polygon = np.array(padding.Execute(distance)[0])  # IndexError: list index out of range
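The subject polygon above is essentially degenerate: it encloses only about half a square pixel, which is consistent with pyclipper offsetting it to nothing. The shoelace formula makes this easy to check (a quick verification script, not PaddleOCR code):

```python
def shoelace_area(points):
    # Signed polygon area via the shoelace formula; near-zero means degenerate.
    n = len(points)
    total = 0.0
    for i in range(n):
        x1, y1 = points[i]
        x2, y2 = points[(i + 1) % n]
        total += x1 * y2 - x2 * y1
    return total / 2.0

# The same quadrilateral from the failing label line
subject = [(179.67479414146868, 330.7079846112311),
           (179.72345774838013, 331.9309867710583),
           (177.66925958423326, 331.2121355291577),
           (179.2829373650108, 331.5241968142549)]

print(abs(shoelace_area(subject)))  # roughly 0.52 square pixels
```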

The official PaddleOCR source does not check the result of padding.Execute, so when the result is an empty list an Error is raised:

padding = pyclipper.PyclipperOffset()
padding.AddPath(subject, pyclipper.JT_ROUND, pyclipper.ET_CLOSEDPOLYGON)
padded_polygon = np.array(padding.Execute(distance)[0])


A Simple Fix

Change:

padded_polygon = np.array(padding.Execute(distance)[0])

to:

result = padding.Execute(distance)
if len(result) == 0:
    return
padded_polygon = np.array(result[0])

If the maintainers think this is acceptable, I can open a PR; if there is a better solution, I will wait for the official fix.


Steps to Reproduce

  • First, create a test.txt label file in the project root containing a single line (the line that failed to parse):
./test.jpg	[{"transcription": "\u6d53\u7f29\u9664\u81ed\u6db2", "points": [[473.55, 99.64], [456.18, 41.82], [778.73, 39.82], [777.73, 105.82]]}, {"transcription": "1000ml", "points": [[476.27, 158.73], [477.27, 129.09], [618.55, 124.09], [618.55, 158.73]]}, {"transcription": "\u62b510\u74f6", "points": [[647.55, 121.64], [652.55, 165.64], [771.09, 166.64], [773.09, 121.64]]}, {"transcription": "\u9001", "points": [[691.82, 347.45], [690.82, 437.36], [768.55, 426.36], [777.0, 345.36]]}, {"transcription": "YaHo\u4e9a\u79be", "points": [[94.0, 289.0], [94.0, 305.73], [164.73, 305.73], [164.73, 287.0]]}, {"transcription": "YaHo\u4e9a\u79be", "points": [[242.55, 290.0], [242.55, 303.27], [317.45, 303.27], [316.45, 287.0]]}, {"transcription": "YaHo\u4e9a\u79be", "points": [[650.55, 476.36], [651.55, 485.82], [694.91, 486.82], [695.91, 477.36]]}, {"transcription": "Disiuf", "points": [[48.36, 325.55], [46.55, 359.09], [154.55, 363.45], [156.55, 330.55]]}, {"transcription": "spray", "points": [[61.45, 362.73], [61.45, 378.91], [123.27, 377.0], [121.27, 361.91]]}, {"transcription": "spray", "points": [[211.73, 360.27], [214.73, 377.73], [272.64, 377.73], [269.64, 360.27]]}, {"transcription": "Disiufectaut", "points": [[198.73, 324.55], [199.73, 361.82], [387.0, 357.82], [390.0, 328.55]]}, {"transcription": "\u5ba0\u7269\u9664\u81ed\u6db2", "points": [[271.64, 379.64], [272.64, 400.0], [369.82, 399.0], [371.82, 381.64]]}, {"transcription": "\u5ba0\u7269", "points": [[125.73, 379.64], [122.73, 399.27], [153.82, 401.27], [154.82, 381.64]]}, {"transcription": "\u6d53\u7f29\u578b", "points": [[63.82, 380.82], [67.82, 397.55], [116.73, 399.55], [114.73, 382.82]]}, {"transcription": "\u6d53\u7f29\u578b", "points": [[216.91, 382.55], [214.91, 401.27], [267.27, 399.27], [269.27, 383.55]]}, {"transcription": "\u8309\u8389\u82b1\u82ac\u82b3", "points": [[63.18, 422.09], [61.18, 429.36], [104.82, 429.36], [105.82, 422.09]]}, {"transcription": "\u8309\u8389\u82b1\u82ac\u82b3", 
"points": [[211.73, 421.09], [211.73, 429.82], [256.64, 430.82], [254.64, 421.09]]}, {"transcription": "\u51c0\u542b\u91cf\uff1a1000ML", "points": [[63.18, 442.91], [61.18, 454.55], [144.36, 453.55], [143.36, 442.91]]}, {"transcription": "\u51c0\u542b\u91cf\uff1a1000ML", "points": [[216.18, 444.64], [213.18, 457.27], [294.91, 454.27], [296.91, 445.64]]}, {"transcription": "\u51c0\u542b\u91cf", "points": [[703.8, 615.47], [705.8, 621.8], [721.4, 620.8], [723.4, 615.47]]}, {"transcription": "500ML", "points": [[704.2, 622.4], [702.2, 626.67], [720.2, 627.67], [720.2, 622.4]]}, {"transcription": "\u8309\u8389\u82b1\u82ac\u82b3", "points": [[661.07, 546.13], [661.07, 553.53], [689.93, 553.53], [691.93, 547.13]]}, {"transcription": "\u5ba0\u7269\u795b\u5473\u55b7\u96fe", "points": [[640.8, 537.13], [642.8, 526.87], [713.2, 526.87], [710.2, 537.13]]}, {"transcription": "Healthy", "points": [[62.27, 408.93], [64.27, 419.6], [101.07, 417.6], [103.07, 408.93]]}, {"transcription": "Antiscptic", "points": [[104.27, 408.53], [104.27, 420.67], [152.33, 420.67], [154.33, 408.53]]}, {"transcription": "###", "points": [[213.13, 406.93], [215.13, 417.0], [249.33, 417.0], [249.33, 406.93]]}, {"transcription": "Antiscptic&DeodorantForPet", "points": [[253.13, 417.0], [253.6, 408.47], [408.2, 407.93], [401.27, 418.07]]}, {"transcription": "Healthy", "points": [[224.2, 406.4], [224.2, 407.93], [221.67, 406.93], [223.67, 407.4]]}, {"transcription": "DEODORANT", "points": [[627.0, 505.07], [627.0, 496.47], [724.53, 495.47], [724.53, 505.07]]}, {"transcription": "SPRAY", "points": [[651.47, 519.87], [651.47, 509.2], [698.0, 508.2], [702.0, 518.87]]}, {"transcription": "\u4e70\u4e00\u9001\u4e00", "points": [[27.07, 790.8], [12.93, 645.2], [484.67, 631.93], [450.8, 786.27]]}, {"transcription": "###", "points": [[123.93, 700.2], [121.93, 700.73], [124.47, 700.73], [124.47, 700.2]]}, {"transcription": "\u5206\u89e3\u81ed\u5473", "points": [[515.0, 793.07], [514.0, 710.53], [786.0, 700.53], 
[793.0, 784.07]]}, {"transcription": "\u9001\u9664\u81ed\u55b7\u96fe500ml", "points": [[522.2, 698.0], [522.2, 666.47], [786.87, 662.47], [788.87, 697.0]]}]
  • Then run the following code:
from PIL import Image
from tools.program import load_config, get_logger
from ppocr.data.simple_dataset import SimpleDataSet

# Create an image the same size as mtwi/train/TB1_5H8n3vD8KJjy0FlXXagBFXa_!!0-item_pic.jpg.jpg, so the dataset does not need to be downloaded
img = Image.new('RGB', size=(800, 800))
img.save("test.jpg")

config_path = "./configs/det/det_mv3_db.yml"
global_config = load_config(config_path)
global_config["Train"]["dataset"]["data_dir"] = "./"
global_config["Train"]["dataset"]["label_file_list"] = ["test.txt"]
logging = get_logger(name="root")
dataset = SimpleDataSet(global_config, "Train", logging, seed=0)
result = dataset[0]
  • Running the code above sometimes raises the error and sometimes does not, presumably because of the randomness introduced by data augmentation. Further analysis of the transforms:
transforms:
      - DecodeImage: # load image
          img_mode: BGR
          channel_first: False
      - DetLabelEncode: # Class handling label
      - IaaAugment:
#          augmenter_args:
#            - { 'type': Fliplr, 'args': { 'p': 0.5 } }
#            - { 'type': Affine, 'args': { 'rotate': [-10, 10] } }
#            - { 'type': Resize, 'args': { 'size': [0.5, 3] } }
      - EastRandomCropData:
          size: [640, 640]
          max_tries: 50
          keep_ratio: true
      - MakeBorderMap:
          shrink_ratio: 0.4
          thresh_min: 0.3
          thresh_max: 0.7
      - MakeShrinkMap:
          shrink_ratio: 0.4
          min_text_size: 8
      - NormalizeImage:
          scale: 1./255.
          mean: [0.485, 0.456, 0.406]
          std: [0.229, 0.224, 0.225]
          order: 'hwc'

First I disabled augmenter_args, and the problem persisted. But once I disabled EastRandomCropData, the problem could no longer be reproduced, so the randomness in EastRandomCropData must be the cause. The np.random calls in EastRandomCropData's __call__ method do not use a fixed seed, so I fixed the seed to 1 in random_crop_data.py, which makes the problem reproducible on every run. As a side suggestion, it would be helpful if the code offered a way to fix the random seed, so that errors like this are easier to reproduce.

# Top of ppocr/data/imaug/random_crop_data.py, with the seed fixed for reproducibility
import numpy as np
import cv2
import random

np.random.seed(1)
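Since the pipeline mixes stdlib and NumPy randomness, a reproducible run needs every RNG in play seeded, not only NumPy's. A minimal sketch (the helper name `seed_everything` is mine, not PaddleOCR API):

```python
import random
import numpy as np

def seed_everything(seed: int = 1) -> None:
    # Fix both the stdlib and NumPy RNGs so augmentation randomness is reproducible.
    random.seed(seed)
    np.random.seed(seed)

seed_everything(1)
a = np.random.randint(0, 100)
seed_everything(1)
b = np.random.randint(0, 100)
print(a == b)  # True: identical draws after reseeding
```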

@vineethbabu

Hi, I am currently facing this issue. I was able to run table structure recognition without any errors a few days ago, but now I get this error:

Traceback (most recent call last):
  File "table/predict_table.py", line 221, in <module>
    main(args)
  File "table/predict_table.py", line 197, in main
    pred_html = text_sys(img)
  File "table/predict_table.py", line 88, in __call__
    rec_res, elapse = self.text_recognizer(img_crop_list)
  File "/home/vineeth/Documents/PaddleOCR/tools/infer/predict_rec.py", line 368, in __call__
    rec_result = self.postprocess_op(preds)
  File "/home/vineeth/Documents/PaddleOCR/ppocr/postprocess/rec_postprocess.py", line 97, in __call__
    text = self.decode(preds_idx, preds_prob, is_remove_duplicate=True)
  File "/home/vineeth/Documents/PaddleOCR/ppocr/postprocess/rec_postprocess.py", line 69, in decode
    idx])])
IndexError: list index out of range

Command run:
python3 table/predict_table.py --det_model_dir=inference/en_ppocr_mobile_v2.0_table_det_infer --rec_model_dir=inference/en_ppocr_mobile_v2.0_table_rec_infer --table_model_dir=inference/en_ppocr_mobile_v2.0_table_structure_infer --image_dir=/home/vineeth/Downloads/J01_crop1.jpg --rec_char_dict_path=../ppocr/utils/dict/table_dict.txt --table_char_dict_path=../ppocr/utils/dict/table_structure_dict.txt --rec_char_dict_path=../ppocr/utils/dict/en_dict.txt --det_limit_side_len=736 --det_limit_type=min --output ../output/table

Thanks in advance.

This was referenced Dec 29, 2021
@paddle-bot-old

Since you haven't replied for more than 3 months, we have closed this issue/PR.
If the problem is not solved or there is a follow-up, please reopen it at any time and we will continue to follow up.
It is recommended to pull and try the latest code first.

an1018 pushed a commit to an1018/PaddleOCR that referenced this issue Aug 17, 2022