uncaught decoder failure #356

herrold · 2020-01-09T15:36:40Z

Describe the bug
The decoder hits a byte it cannot decode and crashes with an uncaught exception, rather than erroring gracefully

To Reproduce
target file creating the issue is at:
http://gallery.herrold.com/stuff/Murray93DrunkAndDog.pdf

CentOS 7 with EPEL

[herrold@localhost prices]$ rpm -V python36-pdfminer
[herrold@localhost prices]$ rpm -q python36-pdfminer
python36-pdfminer-20160614-5.el7.noarch

[herrold@localhost prices]$ /usr/bin/latin2ascii Murray93DrunkAndDog.pdf
Traceback (most recent call last):
  File "/usr/bin/latin2ascii", line 130, in <module>
    if __name__ == '__main__': sys.exit(main(sys.argv))
  File "/usr/bin/latin2ascii", line 125, in main
    for line in fileinput.input(args):
  File "/usr/lib64/python3.6/fileinput.py", line 250, in __next__
    line = self._readline()
  File "/usr/lib64/python3.6/fileinput.py", line 364, in _readline
    return self._readline()
  File "/usr/lib64/python3.6/encodings/ascii.py", line 26, in decode
    return codecs.ascii_decode(input, self.errors)[0]
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 10: ordinal not in range(128)
[herrold@localhost prices]$

ask if you need more information to reproduce, but this should suffice

Murray93DrunkAndDog.pdf
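
For readers following the traceback: the failure comes from `fileinput` opening the PDF in text mode and decoding it as ASCII, which raises on the first byte above 0x7F. The sketch below reproduces that on a small hypothetical sample (a PDF-style header followed by a binary-marker comment, not the contents of the reported file) and shows one way to read such input without raising, using `fileinput.hook_encoded` with `errors="replace"`. The names `SAMPLE`, `strict_ascii_error`, and `read_lines_tolerantly` are illustrative and do not come from pdfminer.

```python
import fileinput

# Hypothetical sample bytes: a PDF-style header line followed by a comment
# line holding high-bit bytes. This is NOT the reported file's content.
SAMPLE = b"%PDF-1.4\n%\xe2\xe3\xcf\xd3\n"


def strict_ascii_error(data):
    """Return the UnicodeDecodeError raised by strict ASCII decoding, or None."""
    try:
        data.decode("ascii")
    except UnicodeDecodeError as exc:
        return exc
    return None


def read_lines_tolerantly(paths):
    """Read lines via fileinput, substituting U+FFFD for undecodable bytes."""
    # hook_encoded accepts an `errors` argument from Python 3.6 onward.
    hook = fileinput.hook_encoded("ascii", errors="replace")
    with fileinput.input(files=paths, openhook=hook) as stream:
        return list(stream)
```

This only avoids the exception; it does not extract text from a PDF, since the input is still being treated as plain bytes.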

pietermarsman · 2020-01-09T18:04:13Z

The latin2ascii.py file has not had any meaningful changes since 2010. I guess it is outdated and there are (much) better ways to achieve what you want. What is it that you want to do?

For your information: the latin2ascii command does not extract the content of the PDF. It just prints all the bytes from the file (in your case the PDF) in ASCII notation.
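
To illustrate the distinction, here is a rough sketch of what "bytes in ASCII notation" can mean. It is not the code of latin2ascii.py; the function name `latin1_bytes_to_ascii` and the choice of backslash escapes are assumptions made for the example. Each byte is interpreted as a Latin-1 character (which never fails, since Latin-1 maps all 256 byte values), and anything outside ASCII is written as an escape.

```python
def latin1_bytes_to_ascii(data):
    """Render raw bytes as an ASCII-only string (illustrative helper)."""
    # Latin-1 maps every byte 0x00-0xFF to exactly one code point,
    # so this decode step cannot raise UnicodeDecodeError.
    text = data.decode("latin-1")
    # Characters above 0x7F become backslash escapes such as \xe9.
    return text.encode("ascii", errors="backslashreplace").decode("ascii")
```

Applied to a PDF, this only re-spells the file's raw bytes; the page text stays hidden inside the PDF's object streams.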

pietermarsman added the type: bug label Jan 9, 2020

pietermarsman mentioned this issue Jan 14, 2020

Remove latin2ascii.py because it converts the latin-interpreted bytes of a file to ascii, but this has not much to do with PDF's. #360

Merged

2 tasks

pietermarsman closed this as completed in #360 Jan 16, 2020
