This repository has been archived by the owner on Apr 15, 2024. It is now read-only.

Problem converting pdf to txt with pdf2txt.py #104

Open
JSB97 opened this issue Apr 4, 2015 · 3 comments

Comments


JSB97 commented Apr 4, 2015

I am trying to convert the following PDF to text:
http://www.kabupro.jp/edp/20140529/S1001UPO.pdf

I am using the following command:
pdf2txt.py -o text.txt S1001UPO.pdf

The document is encrypted, so I removed the encryption first; however, even after doing this I get the error below.

I suspect the issue is related to "TypeError: must be encoded string without NULL bytes, not str", for which this seems to offer a solution:
http://stackoverflow.com/questions/18265084/typeerror-must-be-string-without-null-bytes-not-str

Could someone point me to a workaround? Thank you!

    Traceback (most recent call last):
      File "/Users/JB1/anaconda/bin/pdf2txt.py", line 115, in <module>
        if __name__ == '__main__': sys.exit(main(sys.argv))
      File "/Users/JB1/anaconda/bin/pdf2txt.py", line 109, in main
        interpreter.process_page(page)
      File "/Users/JB1/anaconda/lib/python2.7/site-packages/pdfminer/pdfinterp.py", line 833, in process_page
        self.render_contents(page.resources, page.contents, ctm=ctm)
      File "/Users/JB1/anaconda/lib/python2.7/site-packages/pdfminer/pdfinterp.py", line 844, in render_contents
        self.init_resources(resources)
      File "/Users/JB1/anaconda/lib/python2.7/site-packages/pdfminer/pdfinterp.py", line 348, in init_resources
        self.fontmap[fontid] = self.rsrcmgr.get_font(objid, spec)
      File "/Users/JB1/anaconda/lib/python2.7/site-packages/pdfminer/pdfinterp.py", line 196, in get_font
        font = self.get_font(None, subspec)
      File "/Users/JB1/anaconda/lib/python2.7/site-packages/pdfminer/pdfinterp.py", line 187, in get_font
        font = PDFCIDFont(self, spec)
      File "/Users/JB1/anaconda/lib/python2.7/site-packages/pdfminer/pdffont.py", line 668, in __init__
        self.unicode_map = CMapDB.get_unicode_map(self.cidcoding, self.cmap.is_vertical())
      File "/Users/JB1/anaconda/lib/python2.7/site-packages/pdfminer/cmapdb.py", line 276, in get_unicode_map
        data = klass._load_data('to-unicode-%s' % name)
      File "/Users/JB1/anaconda/lib/python2.7/site-packages/pdfminer/cmapdb.py", line 247, in _load_data
        if os.path.exists(path):
      File "/Users/JB1/anaconda/lib/python2.7/genericpath.py", line 18, in exists
        os.stat(path)
    TypeError: must be encoded string without NULL bytes, not str
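
For anyone who wants to confirm that the last frame of the traceback is about the path string itself and not about the PDF, here is a minimal sketch that reproduces the failure outside pdfminer. The helper name `stat_rejects_nul` and the sample path are made up for illustration and are not part of pdfminer. As far as I know, Python 2 raises TypeError here and Python 3 raises ValueError, so the sketch catches both.

```python
import os


def stat_rejects_nul(path):
    """Return True if os.stat refuses `path` before even touching the disk.

    Hypothetical helper for illustration only; not part of pdfminer.
    """
    try:
        os.stat(path)
    except (TypeError, ValueError):
        # Python 2 reports a TypeError, Python 3 a ValueError,
        # when the path contains an embedded NUL byte.
        return True
    except OSError:
        # The path is well-formed; the file simply does not exist.
        return False
    return False


# Made-up path with an embedded NUL byte, similar in shape to a CMap filename.
print(stat_rejects_nul('some-cmap-name\x00.pickle.gz'))
# Same path with the NUL byte removed.
print(stat_rejects_nul('some-cmap-name.pickle.gz'))
```

If the first call reports True and the second False, the error comes from NUL bytes in the path being handed to os.stat, not from the file being missing.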


tataganesh commented Aug 17, 2017

@JSB97 I have also encountered the same error. The problematic snippet in cmapdb.py seems to be:

    def _load_data(klass, name):
        filename = '%s.pickle.gz' % name
        if klass.debug:
            print >>sys.stderr, 'loading:', name
        cmap_paths = (os.environ.get('CMAP_PATH', '/usr/share/pdfminer/'),
                      os.path.join(os.path.dirname(__file__), 'cmap'),)
        for directory in cmap_paths:
            path = os.path.join(directory, filename)

Printing the variable "filename" gives me:
to-unicode-PDFXC30-Identity.pickle.gz
Printing "repr(filename)" yields:
'to-unicode-PDFXC30-Identity\x00\x00.pickle.gz'
Apparently, these \x00 (NUL) characters are causing the issue, since os.stat rejects a path that contains them. One fix that worked for me was to add this line right after "filename" is built:
filename = filename.replace('\0', '')
I am not sure what is introducing the NUL characters in the first place, though.
@euske Is there a way to make a permanent fix for this?
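
The workaround above can be sketched as a small standalone function, so it can be tried without patching the installed package. This is only a sketch: `cmap_data_filename` is a hypothetical name and not a pdfminer function. The assumption is that the NUL bytes arrive as part of the CMap name taken from the PDF, as the repr output above suggests.

```python
def cmap_data_filename(name):
    """Build the '<name>.pickle.gz' filename, dropping any NUL bytes first.

    Hypothetical helper for illustration only; not part of pdfminer.
    """
    clean = name.replace('\x00', '')  # same effect as replace('\0', '')
    return '%s.pickle.gz' % clean


# Name as it shows up in the repr above, with two trailing NUL bytes.
raw_name = 'to-unicode-PDFXC30-Identity\x00\x00'
print(repr(cmap_data_filename(raw_name)))
```

Stripping the bytes from the name before the filename is built does the same job as calling replace on the finished filename. Either way it only hides the symptom, and it does not explain why the font's encoding name carries padding.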

@tataganesh

A fork of the repository pdfminer.six has been created at https://github.com/strideai/pdfminer.six
This issue has been fixed in the fork, and we will be maintaining the fork from now on.

@softboy99

Hi @tataganesh, I tested it and it still fails.
simple1.pdf
