uncaught decoder failure #356

Closed
herrold opened this issue Jan 9, 2020 · 1 comment · Fixed by #360

herrold commented Jan 9, 2020

Describe the bug
The decoder hits a byte it cannot decode and crashes with an uncaught exception, rather than failing gracefully.

To Reproduce
The target file that triggers the issue is at:
http://gallery.herrold.com/stuff/Murray93DrunkAndDog.pdf

CentOS 7 with EPEL

[herrold@localhost prices]$ rpm -V python36-pdfminer
[herrold@localhost prices]$ rpm -q python36-pdfminer
python36-pdfminer-20160614-5.el7.noarch

[herrold@localhost prices]$ /usr/bin/latin2ascii Murray93DrunkAndDog.pdf
Traceback (most recent call last):
  File "/usr/bin/latin2ascii", line 130, in <module>
    if __name__ == '__main__': sys.exit(main(sys.argv))
  File "/usr/bin/latin2ascii", line 125, in main
    for line in fileinput.input(args):
  File "/usr/lib64/python3.6/fileinput.py", line 250, in __next__
    line = self._readline()
  File "/usr/lib64/python3.6/fileinput.py", line 364, in _readline
    return self._readline()
  File "/usr/lib64/python3.6/encodings/ascii.py", line 26, in decode
    return codecs.ascii_decode(input, self.errors)[0]
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 10: ordinal not in range(128)
[herrold@localhost prices]$

Ask if you need more information to reproduce, but this should suffice.

Murray93DrunkAndDog.pdf

@pietermarsman
Member

The latin2ascii.py file has not had any meaningful changes since 2010. I guess it is outdated and there are (much) better ways to achieve what you want. What is it that you want to do?

For your information, the latin2ascii command does not extract the content of the PDF. It just prints all the bytes from the file (in your case the PDF) in ASCII notation.
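
For anyone landing here with the same traceback, the failure can probably be reproduced without the PDF. The sketch below is not the latin2ascii code. It only illustrates what happens when binary data is read as ASCII text, and two ways to avoid the exception. `SAMPLE` is an assumption: many PDFs start with a `%PDF-x.y` line followed by a comment line of high bytes such as `0xe2 0xe3 0xcf 0xd3`, which would put `0xe2` at position 10 as in the traceback, but I have not checked that this is what the attached file contains. The function names are made up for the example.

```python
import fileinput

# Assumed stand-in for the start of a PDF: header line, then a binary marker line.
SAMPLE = b"%PDF-1.4\n%\xe2\xe3\xcf\xd3\n"


def read_lines_strict(path):
    """Decode as strict ASCII; raises UnicodeDecodeError on any byte >= 0x80."""
    with open(path, encoding="ascii") as handle:
        return list(handle)


def read_lines_tolerant(path):
    """Decode as ASCII, replacing each undecodable byte with U+FFFD."""
    with open(path, encoding="ascii", errors="replace") as handle:
        return list(handle)


def read_lines_binary(path):
    """Read raw bytes through fileinput, so no decoding takes place at all."""
    with fileinput.input(files=[path], mode="rb") as stream:
        return list(stream)
```

With an ASCII locale, the strict variant behaves like the `fileinput.input(args)` call in the traceback. The tolerant and binary variants are the usual ways to make a line-oriented script survive arbitrary bytes, although neither of them extracts text from a PDF.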
