This repository has been archived by the owner on Oct 3, 2022. It is now read-only.
stager@stager-laptop:~/massocr$ ~/ocrodjvu-0.7.15/ocrodjvu --in-place -e tesseract -t words --html5 --clear-text -lrus+eng 9af500e27db4351d7391f463c0e3f017.djvu
Processing '9af500e27db4351d7391f463c0e3f017.djvu':
Intermediate files were left in the '/tmp/ocrodjvu.z1KiDA' directory.
Traceback (most recent call last):
  File "/home/stager1/ocrodjvu-0.7.15/ocrodjvu", line 7, in <module>
    _.main(sys.argv)
  File "/home/stager1/ocrodjvu-0.7.15/lib/cli/ocrodjvu.py", line 533, in main
    context.process(options.path, options.pages)
  File "/home/stager1/ocrodjvu-0.7.15/lib/cli/ocrodjvu.py", line 515, in process
    self._process(*args, **kwargs)
  File "/home/stager1/ocrodjvu-0.7.15/lib/cli/ocrodjvu.py", line 471, in _process
    file_id = page.file.id.encode(system_encoding)
  File "decode.pyx", line 840, in djvu.decode.File.id.__get__ (djvu/decode.c:7605)
  File "common.pxi", line 128, in djvu.decode.decode_utf8 (djvu/decode.c:2802)
UnicodeDecodeError: 'utf8' codec can't decode byte 0xf2 in position 0: invalid continuation byte
This is really a problem with the DjVu file in question. Its page identifiers are not in UTF-8, but in some legacy 8-bit encoding instead.
I'll add a work-around in ocrodjvu for this, but you should fix the DjVu file. You can do that by converting it to indirect (with djvm), and then perhaps back to bundled.
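To illustrate the failure mode, here is a minimal Python 3 sketch of one way a tolerant decoder could behave. It is not the actual ocrodjvu work-around. The helper name `decode_page_id`, the sample identifier bytes, and the choice of `cp1251` as the fallback are all assumptions made for the example; the traceback shows only that the first byte is 0xf2, which would be consistent with a Cyrillic 8-bit encoding but does not prove it.

```python
def decode_page_id(raw, fallback="cp1251"):
    """Decode raw page-identifier bytes, preferring UTF-8.

    Hypothetical helper: `fallback` is a guess at the legacy 8-bit
    encoding and is not taken from the DjVu file or from ocrodjvu.
    """
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        # Not valid UTF-8: fall back to the assumed legacy encoding,
        # replacing any bytes that encoding cannot map.
        return raw.decode(fallback, errors="replace")


# Made-up identifier whose first byte is 0xf2, as in the traceback.
legacy_id = b"\xf2\xe5\xf1\xf2.djvu"

print(decode_page_id(b"p0001.djvu"))  # plain ASCII is already valid UTF-8
print(decode_page_id(legacy_id))      # decoded with the assumed fallback
```

A fallback like this only hides the symptom, since a wrong guess at the encoding yields unreadable identifiers. Re-saving the file so that its identifiers are valid UTF-8 remains the real fix.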
Issue reported by GStager at Bitbucket: Original file