Processor terminates with exception: ocrd-tesserocr-segment-region exited with non-zero #139

Closed
stweil opened this issue Aug 15, 2020 · 7 comments · Fixed by #145

Comments

@stweil
Contributor

stweil commented Aug 15, 2020

Physical page 129 of http://nbn-resolving.de/urn:nbn:de:bsz:180-digad-10044 raises a fatal exception in the standard workflow for slower processors:

20:13:52.948 INFO processor.TesserocrSegmentRegion - INPUT FILE 128 / IMG_OCR-D-BIN-DENOISE-DESKEW_460017
20:13:52.971 INFO processor.TesserocrSegmentRegion - Page 'IMG_OCR-D-BIN-DENOISE-DESKEW_460017' images will use 300 DPI from image meta-data
20:13:52.971 INFO processor.TesserocrSegmentRegion - Detecting regions in page 'IMG_OCR-D-BIN-DENOISE-DESKEW_460017'
20:13:53.352 INFO processor.TesserocrSegmentRegion - Detected region 'region0000': 1425,976 1421,976 1424,2217 1425,2217 (HORZ_LINE)
20:13:53.353 INFO processor.TesserocrSegmentRegion - Detected region 'region0001': 1425,1788 1423,976 1413,976 1416,2194 1425,2194 (HORZ_LINE)
20:13:53.354 INFO processor.TesserocrSegmentRegion - Detected region 'region0002': 523,97 525,967 958,966 956,97 (TABLE)
20:13:53.354 INFO processor.TesserocrSegmentRegion - Detected region 'region0003': 392,292 432,292 434,1160 394,1160 (FLOWING_TEXT)
20:13:53.355 INFO processor.TesserocrSegmentRegion - Detected region 'region0004': 378,162 401,162 401,307 378,307 (FLOWING_TEXT)
20:13:53.355 INFO processor.TesserocrSegmentRegion - Detected region 'region0005': 92,110 330,109 332,1157 94,1158 (TABLE)
Traceback (most recent call last):
  File "/home/stweil/src/github/OCR-D/venv-20200815/bin/ocrd-tesserocr-segment-region", line 8, in <module>
    sys.exit(ocrd_tesserocr_segment_region())
  File "/home/stweil/src/github/OCR-D/venv-20200815/lib/python3.7/site-packages/click/core.py", line 829, in __call__
    return self.main(*args, **kwargs)
  File "/home/stweil/src/github/OCR-D/venv-20200815/lib/python3.7/site-packages/click/core.py", line 782, in main
    rv = self.invoke(ctx)
  File "/home/stweil/src/github/OCR-D/venv-20200815/lib/python3.7/site-packages/click/core.py", line 1066, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/home/stweil/src/github/OCR-D/venv-20200815/lib/python3.7/site-packages/click/core.py", line 610, in invoke
    return callback(*args, **kwargs)
  File "/home/stweil/src/github/OCR-D/venv-20200815/lib/python3.7/site-packages/ocrd_tesserocr/cli.py", line 16, in ocrd_tesserocr_segment_region
    return ocrd_cli_wrap_processor(TesserocrSegmentRegion, *args, **kwargs)
  File "/home/stweil/src/github/OCR-D/venv-20200815/lib/python3.7/site-packages/ocrd/decorators.py", line 102, in ocrd_cli_wrap_processor
    run_processor(processorClass, ocrd_tool, mets, workspace=workspace, **kwargs)
  File "/home/stweil/src/github/OCR-D/venv-20200815/lib/python3.7/site-packages/ocrd/processor/base.py", line 61, in run_processor
    processor.process()
  File "/home/stweil/src/github/OCR-D/venv-20200815/lib/python3.7/site-packages/ocrd_tesserocr/segment_region.py", line 168, in process
    self._process_page(layout, page, page_image, page_coords, input_file.pageId)
  File "/home/stweil/src/github/OCR-D/venv-20200815/lib/python3.7/site-packages/ocrd_tesserocr/segment_region.py", line 221, in _process_page
    polygon = polygon_for_parent(polygon, page)
  File "/home/stweil/src/github/OCR-D/venv-20200815/lib/python3.7/site-packages/ocrd_tesserocr/segment_region.py", line 327, in polygon_for_parent
    raise Exception("intersection of would-be segment with parent is empty")
Exception: intersection of would-be segment with parent is empty
Traceback (most recent call last):
  File "/home/stweil/src/github/OCR-D/venv-20200815/bin/ocrd", line 8, in <module>
    sys.exit(cli())
  File "/home/stweil/src/github/OCR-D/venv-20200815/lib/python3.7/site-packages/click/core.py", line 829, in __call__
    return self.main(*args, **kwargs)
  File "/home/stweil/src/github/OCR-D/venv-20200815/lib/python3.7/site-packages/click/core.py", line 782, in main
    rv = self.invoke(ctx)
  File "/home/stweil/src/github/OCR-D/venv-20200815/lib/python3.7/site-packages/click/core.py", line 1259, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/home/stweil/src/github/OCR-D/venv-20200815/lib/python3.7/site-packages/click/core.py", line 1066, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/home/stweil/src/github/OCR-D/venv-20200815/lib/python3.7/site-packages/click/core.py", line 610, in invoke
    return callback(*args, **kwargs)
  File "/home/stweil/src/github/OCR-D/venv-20200815/lib/python3.7/site-packages/ocrd/cli/process.py", line 27, in process_cli
    run_tasks(mets, log_level, page_id, tasks, overwrite)
  File "/home/stweil/src/github/OCR-D/venv-20200815/lib/python3.7/site-packages/ocrd/task_sequence.py", line 148, in run_tasks
    raise Exception("%s exited with non-zero return value %s" % (task.executable, returncode))
Exception: ocrd-tesserocr-segment-region exited with non-zero return value 1
@stweil
Contributor Author

stweil commented Aug 15, 2020

See input image.

stweil added a commit to stweil/ocrd_tesserocr that referenced this issue Aug 15, 2020
Signed-off-by: Stefan Weil <sw@weilnetz.de>
@bertsky
Collaborator

bertsky commented Aug 16, 2020

I cannot tell for sure without checking the PAGE input (esp. Border), but my guess is that Tesseract detected some region entirely outside the annotated Border polygon here (which judging by the workflow is a rotated rectangle touching the edges of the image fed to Tesseract). Because we always mask/clip away non-rectangular background to white when cropping in core, there's nothing in that area of the image which Tesseract could actually "see".

So to mitigate, I believe we could simply ignore these regions, which would fix the cause of the exception directly.
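
A minimal sketch of that mitigation, not the actual ocrd_tesserocr code: clip each detected region against the parent polygon and skip the region when nothing is left, instead of raising. The helper name `clip_to_parent` is hypothetical, and the pure-Python Sutherland-Hodgman clipping is a stand-in that only works for a convex parent such as a rectangular Border. The sample coordinates are copied from the log output posted further down in this thread; treating Border and region as sharing one coordinate system is an assumption.

```python
from typing import List, Optional, Tuple

Point = Tuple[float, float]


def _signed_area(pts: List[Point]) -> float:
    """Shoelace formula; the sign encodes the vertex orientation."""
    return 0.5 * sum(x1 * y2 - x2 * y1
                     for (x1, y1), (x2, y2) in zip(pts, pts[1:] + pts[:1]))


def _inside(p: Point, a: Point, b: Point) -> bool:
    """True if p lies on, or on the interior side of, the directed edge a->b."""
    return (b[0] - a[0]) * (p[1] - a[1]) - (b[1] - a[1]) * (p[0] - a[0]) >= 0


def _cross_point(p: Point, q: Point, a: Point, b: Point) -> Point:
    """Intersection of segment p-q with the infinite line through a and b."""
    dx, dy = q[0] - p[0], q[1] - p[1]
    ex, ey = b[0] - a[0], b[1] - a[1]
    t = ((a[0] - p[0]) * ey - (a[1] - p[1]) * ex) / (dx * ey - dy * ex)
    return (p[0] + t * dx, p[1] + t * dy)


def clip_to_parent(polygon: List[Point],
                   parent: List[Point]) -> Optional[List[Point]]:
    """Clip polygon to a convex parent; return None if the result is empty.

    Hypothetical helper: returning None lets the caller skip the region
    rather than abort the whole page with an exception.
    """
    if _signed_area(parent) < 0:
        parent = parent[::-1]  # normalise orientation for _inside()
    clipped = list(polygon)
    for a, b in zip(parent, parent[1:] + parent[:1]):
        if not clipped:
            break
        previous, clipped = clipped, []
        prev = previous[-1]
        for cur in previous:
            if _inside(cur, a, b):
                if not _inside(prev, a, b):
                    clipped.append(_cross_point(prev, cur, a, b))
                clipped.append(cur)
            elif _inside(prev, a, b):
                clipped.append(_cross_point(prev, cur, a, b))
            prev = cur
    return clipped if len(clipped) >= 3 else None


# Coordinates copied from the DEBUG log in this thread (assumed comparable).
BORDER = [(109, 78), (1419, 78), (1419, 2215), (109, 2215)]
OUTSIDE = [(72, 1214), (72, 1214), (85, 1565), (85, 1565)]
INSIDE = [(589, 182), (1080, 164), (1083, 245), (592, 263)]

for name, region in (("outside", OUTSIDE), ("inside", INSIDE)):
    if clip_to_parent(region, BORDER) is None:
        print("Ignoring region:", name)
    else:
        print("Keeping region:", name)
```

With these values the first region lies entirely to the left of the Border (x below 109), so the sketch skips it, while the second region is kept unchanged. A concave Border would need a general polygon intersection instead of this clipping routine.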

@stweil @kba @wrznr opinions?

@bertsky
Collaborator

bertsky commented Aug 16, 2020

@stweil could you please run with -l DEBUG so we can see the Border and last region coordinates in the log (just to be sure about the above analysis)?

@bertsky
Collaborator

bertsky commented Aug 24, 2020

@stweil I think #145 is a valid fix, but please give me the -l DEBUG output (before and after this change) just to be on the safe side!

@stweil
Contributor Author

stweil commented Sep 4, 2020

Before (workflow with image https://digi.bib.uni-mannheim.de/fileadmin/vl/ubmadesb/66726/max/66726_0139.jpg):

20:53:05.525 INFO processor.TesserocrSegmentRegion - INPUT FILE 139 / phys396193
20:53:05.525 DEBUG ocrd.workspace - download_file <OcrdFile fileGrp=OCR-D-BIN-DENOISE-DESKEW, ID=IMG_OCR-D-BIN-DENOISE-DESKEW_396193, mimetype=application/vnd.prima.page+xml, url=OCR-D-BIN-DENOISE-DESKEW/IMG_OCR-D-BIN-DENOISE-DESKEW_396193.xml, local_filename=OCR-D-BIN-DENOISE-DESKEW/IMG_OCR-D-BIN-DENOISE-DESKEW_396193.xml]/>  [_recursion_count=0]
20:53:05.541 DEBUG ocrd.workspace - download_file <OcrdFile fileGrp=MAX, ID=IMG_MAX_396193, mimetype=image/jpeg, url=MAX/IMG_MAX_396193.jpg, local_filename=MAX/IMG_MAX_396193.jpg]/>  [_recursion_count=0]
20:53:05.616 DEBUG ocrd.workspace - page 'phys396193' has border, orientation=0 skew=-2.11
20:53:05.616 DEBUG ocrd.workspace - Using AlternativeImage 5 (cropped,binarized,despeckled,deskewed) for page 'phys396193'
20:53:05.633 DEBUG ocrd.workspace - download_file <OcrdFile fileGrp=OCR-D-BIN-DENOISE-DESKEW, ID=IMG_OCR-D-BIN-DENOISE-DESKEW_396193.IMG-DESKEW, mimetype=image/png, url=OCR-D-BIN-DENOISE-DESKEW/IMG_OCR-D-BIN-DENOISE-DESKEW_396193.IMG-DESKEW.png, local_filename=OCR-D-BIN-DENOISE-DESKEW/IMG_OCR-D-BIN-DENOISE-DESKEW_396193.IMG-DESKEW.png]/>  [_recursion_count=0]
20:53:05.634 DEBUG PIL.PngImagePlugin - STREAM b'IHDR' 16 13
20:53:05.634 DEBUG PIL.PngImagePlugin - STREAM b'IDAT' 41 44204
20:53:05.659 DEBUG ocrd.workspace - Using explicitly set page border '109,78 1419,78 1419,2215 109,2215' for page 'phys396193'
20:53:05.660 DEBUG ocrd_utils.coords.shift_coordinates - shifting coordinates by [-109  -78]
20:53:05.660 DEBUG ocrd_utils.coords.shift_coordinates - shifting coordinates by [ -655.  -1068.5]
20:53:05.660 DEBUG ocrd_utils.coords.rotate_coordinates - rotating coordinates by -2.11° around [ 655.  1068.5]
20:53:05.661 DEBUG ocrd_utils.coords.shift_coordinates - shifting coordinates by [ 693.81581848 1091.84410477]
20:53:05.661 INFO processor.TesserocrSegmentRegion - Page 'phys396193' images will use DPI estimated from segmentation
20:53:05.661 INFO processor.TesserocrSegmentRegion - Detecting regions in page 'phys396193'
20:53:06.013 INFO processor.TesserocrSegmentRegion - Detected region 'region0000': 589,182 1080,164 1083,245 592,263 (VERTICAL_TEXT)
20:53:06.014 INFO processor.TesserocrSegmentRegion - Detected region 'region0001': 277,214 316,213 316,236 277,238 (VERTICAL_TEXT)
20:53:06.014 INFO processor.TesserocrSegmentRegion - Detected region 'region0002': 278,196 491,188 493,259 281,267 (VERTICAL_TEXT)
Traceback (most recent call last):
  File "/home/stweil/src/github/OCR-D/venv-20200904/bin/ocrd-tesserocr-segment-region", line 8, in <module>
    sys.exit(ocrd_tesserocr_segment_region())
  File "/home/stweil/src/github/OCR-D/venv-20200904/lib/python3.7/site-packages/click/core.py", line 829, in __call__
    return self.main(*args, **kwargs)
  File "/home/stweil/src/github/OCR-D/venv-20200904/lib/python3.7/site-packages/click/core.py", line 782, in main
    rv = self.invoke(ctx)
  File "/home/stweil/src/github/OCR-D/venv-20200904/lib/python3.7/site-packages/click/core.py", line 1066, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/home/stweil/src/github/OCR-D/venv-20200904/lib/python3.7/site-packages/click/core.py", line 610, in invoke
    return callback(*args, **kwargs)
  File "/home/stweil/src/github/OCR-D/venv-20200904/lib/python3.7/site-packages/ocrd_tesserocr/cli.py", line 16, in ocrd_tesserocr_segment_region
    return ocrd_cli_wrap_processor(TesserocrSegmentRegion, *args, **kwargs)
  File "/home/stweil/src/github/OCR-D/venv-20200904/lib/python3.7/site-packages/ocrd/decorators.py", line 102, in ocrd_cli_wrap_processor
    run_processor(processorClass, ocrd_tool, mets, workspace=workspace, **kwargs)
  File "/home/stweil/src/github/OCR-D/venv-20200904/lib/python3.7/site-packages/ocrd/processor/helpers.py", line 69, in run_processor
    processor.process()
  File "/home/stweil/src/github/OCR-D/venv-20200904/lib/python3.7/site-packages/ocrd_tesserocr/segment_region.py", line 172, in process
    self._process_page(layout, page, page_image, page_coords, input_file.pageId)
  File "/home/stweil/src/github/OCR-D/venv-20200904/lib/python3.7/site-packages/ocrd_tesserocr/segment_region.py", line 222, in _process_page
    polygon = polygon_for_parent(polygon, page)
  File "/home/stweil/src/github/OCR-D/venv-20200904/lib/python3.7/site-packages/ocrd_tesserocr/segment_region.py", line 328, in polygon_for_parent
    raise Exception("intersection of would-be segment with parent is empty")
Exception: intersection of would-be segment with parent is empty
(venv-20200904) stweil@ub-backup:~/src/github/OCR-D/ocrd_all/pilot/20200904/urn:nbn:de:bsz:180-digad-8419$ 

@stweil
Contributor Author

stweil commented Sep 4, 2020

After 9b30ee4 was applied:

21:00:39.059 INFO processor.TesserocrSegmentRegion - INPUT FILE 139 / phys396193
21:00:39.060 DEBUG ocrd.workspace - download_file <OcrdFile fileGrp=OCR-D-BIN-DENOISE-DESKEW, ID=IMG_OCR-D-BIN-DENOISE-DESKEW_396193, mimetype=application/vnd.prima.page+xml, url=OCR-D-BIN-DENOISE-DESKEW/IMG_OCR-D-BIN-DENOISE-DESKEW_396193.xml, local_filename=OCR-D-BIN-DENOISE-DESKEW/IMG_OCR-D-BIN-DENOISE-DESKEW_396193.xml]/>  [_recursion_count=0]
21:00:39.073 DEBUG ocrd.workspace - download_file <OcrdFile fileGrp=MAX, ID=IMG_MAX_396193, mimetype=image/jpeg, url=MAX/IMG_MAX_396193.jpg, local_filename=MAX/IMG_MAX_396193.jpg]/>  [_recursion_count=0]
21:00:39.135 DEBUG ocrd.workspace - page 'phys396193' has border, orientation=0 skew=-2.11
21:00:39.135 DEBUG ocrd.workspace - Using AlternativeImage 5 (cropped,binarized,despeckled,deskewed) for page 'phys396193'
21:00:39.149 DEBUG ocrd.workspace - download_file <OcrdFile fileGrp=OCR-D-BIN-DENOISE-DESKEW, ID=IMG_OCR-D-BIN-DENOISE-DESKEW_396193.IMG-DESKEW, mimetype=image/png, url=OCR-D-BIN-DENOISE-DESKEW/IMG_OCR-D-BIN-DENOISE-DESKEW_396193.IMG-DESKEW.png, local_filename=OCR-D-BIN-DENOISE-DESKEW/IMG_OCR-D-BIN-DENOISE-DESKEW_396193.IMG-DESKEW.png]/>  [_recursion_count=0]
21:00:39.149 DEBUG PIL.PngImagePlugin - STREAM b'IHDR' 16 13
21:00:39.149 DEBUG PIL.PngImagePlugin - STREAM b'IDAT' 41 44204
21:00:39.176 DEBUG ocrd.workspace - Using explicitly set page border '109,78 1419,78 1419,2215 109,2215' for page 'phys396193'
21:00:39.176 DEBUG ocrd_utils.coords.shift_coordinates - shifting coordinates by [-109  -78]
21:00:39.177 DEBUG ocrd_utils.coords.shift_coordinates - shifting coordinates by [ -655.  -1068.5]
21:00:39.177 DEBUG ocrd_utils.coords.rotate_coordinates - rotating coordinates by -2.11° around [ 655.  1068.5]
21:00:39.177 DEBUG ocrd_utils.coords.shift_coordinates - shifting coordinates by [ 693.81581848 1091.84410477]
21:00:39.177 INFO processor.TesserocrSegmentRegion - Page 'phys396193' images will use DPI estimated from segmentation
21:00:39.177 INFO processor.TesserocrSegmentRegion - Detecting regions in page 'phys396193'
21:00:39.537 INFO processor.TesserocrSegmentRegion - Detected region 'region0000': 589,182 1080,164 1083,245 592,263 (VERTICAL_TEXT)
21:00:39.537 INFO processor.TesserocrSegmentRegion - Detected region 'region0001': 277,214 316,213 316,236 277,238 (VERTICAL_TEXT)
21:00:39.537 INFO processor.TesserocrSegmentRegion - Detected region 'region0002': 278,196 491,188 493,259 281,267 (VERTICAL_TEXT)
21:00:39.538 INFO processor.TesserocrSegmentRegion - Ignoring extant region: 72,1214 72,1214 85,1565 85,1565
21:00:39.538 INFO processor.TesserocrSegmentRegion - Detected region 'region0003': 859,413 1057,405 1098,1504 900,1511 (PULLOUT_TEXT)
21:00:39.538 INFO processor.TesserocrSegmentRegion - Detected region 'region0004': 782,557 880,553 883,612 784,616 (FLOWING_TEXT)
21:00:39.539 INFO processor.TesserocrSegmentRegion - Ignoring extant region: 72,1214 72,1214 84,1546 84,1546
21:00:39.539 INFO processor.TesserocrSegmentRegion - Detected region 'region0005': 791,617 1053,608 1092,1681 830,1691 (FLOWING_TEXT)
21:00:39.539 INFO processor.TesserocrSegmentRegion - Detected region 'region0006': 791,616 814,615 846,1500 823,1501 (FLOWING_TEXT)
21:00:39.539 INFO processor.TesserocrSegmentRegion - Detected region 'region0007': 510,532 772,522 812,1616 551,1626 (PULLOUT_IMAGE)
21:00:39.542 DEBUG ocrd.workspace - outputfile file_grp=OCR-D-SEG-REG local_filename=OCR-D-SEG-REG/IMG_OCR-D-SEG-REG_396193.xml content=True

@bertsky
Collaborator

bertsky commented Sep 4, 2020

Thanks @stweil, I feel confident merging this now.
