This repository has been archived by the owner on Oct 3, 2022. It is now read-only.

crashes on non-UTF-8 file identifiers #4

Closed
jwilk opened this issue Apr 26, 2013 · 3 comments

jwilk commented Apr 26, 2013

Issue reported by GStager at Bitbucket:

stager@stager-laptop:~/massocr$ ~/ocrodjvu-0.7.15/ocrodjvu --in-place -e tesseract -t words --html5 --clear-text -lrus+eng 9af500e27db4351d7391f463c0e3f017.djvu
Processing '9af500e27db4351d7391f463c0e3f017.djvu':
Intermediate files were left in the '/tmp/ocrodjvu.z1KiDA' directory.
Traceback (most recent call last):
  File "/home/stager1/ocrodjvu-0.7.15/ocrodjvu", line 7, in <module>
    _.main(sys.argv)
  File "/home/stager1/ocrodjvu-0.7.15/lib/cli/ocrodjvu.py", line 533, in main
    context.process(options.path, options.pages)
  File "/home/stager1/ocrodjvu-0.7.15/lib/cli/ocrodjvu.py", line 515, in process
    self._process(*args, **kwargs)
  File "/home/stager1/ocrodjvu-0.7.15/lib/cli/ocrodjvu.py", line 471, in _process
    file_id = page.file.id.encode(system_encoding)
  File "decode.pyx", line 840, in djvu.decode.File.id.__get__ (djvu/decode.c:7605)
  File "common.pxi", line 128, in djvu.decode.decode_utf8 (djvu/decode.c:2802)
UnicodeDecodeError: 'utf8' codec can't decode byte 0xf2 in position 0: invalid continuation byte

Original file

jwilk commented Apr 26, 2013

Thanks for the bug report.

This is really a problem with the DjVu file in question. Its page identifiers are not in UTF-8, but in some legacy 8-bit encoding instead.

I'll add a work-around in ocrodjvu for this, but you should fix the DjVu file. You can do that by converting it to indirect (with djvm), and then perhaps back to bundled.
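
For illustration only, here is a minimal sketch of what such a fallback could look like. It is not the code from the actual fix. The helper name `decode_file_id` and the choice of `cp1251` as the legacy encoding are assumptions (the report uses `-lrus`, so a Cyrillic code page seems plausible), and the sketch works on a raw byte string rather than on the python-djvulibre `File.id` property.

```python
def decode_file_id(raw: bytes, fallback: str = "cp1251") -> str:
    """Decode a DjVu file identifier, tolerating non-UTF-8 bytes.

    Hypothetical helper: 'cp1251' is only a guess at the legacy
    8-bit encoding; the real encoding of the file is unknown.
    """
    try:
        # Well-formed DjVu files store identifiers as UTF-8.
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        # Fall back to the assumed legacy encoding; bytes that are
        # undefined in it become U+FFFD instead of raising again.
        return raw.decode(fallback, errors="replace")


if __name__ == "__main__":
    legacy_id = "тест".encode("cp1251")  # first byte is 0xf2, as in the traceback
    print(decode_file_id(b"p0001.djvu"))
    print(decode_file_id(legacy_id))
```

A byte such as 0xf2 starts a four-byte sequence in UTF-8, so a legacy-encoded identifier beginning with it fails strict decoding right at position 0, which matches the error in the report.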

jwilk commented Apr 26, 2013

Fixed in 9a11f6a.

jwilk commented Apr 28, 2013

Fixed in 0.7.16.
