Problem converting pdf to txt with pdf2txt.py #104

JSB97 · 2015-04-04T07:58:35Z

I am trying to convert the following pdf to txt.
http://www.kabupro.jp/edp/20140529/S1001UPO.pdf

Using the following command
pdf2txt.py -o text.txt S1001UPO.pdf

The document is encrypted, so I removed the encryption first; however, even after doing this I get the error below.

I suspect the issue is with "TypeError: must be encoded string without NULL bytes, not str", to which this seems to offer a solution -
http://stackoverflow.com/questions/18265084/typeerror-must-be-string-without-null-bytes-not-str

Could someone point me to a workaround? Thank you!!

Traceback (most recent call last):
File "/Users/JB1/anaconda/bin/pdf2txt.py", line 115, in <module>
if __name__ == '__main__': sys.exit(main(sys.argv))
File "/Users/JB1/anaconda/bin/pdf2txt.py", line 109, in main
interpreter.process_page(page)
File "/Users/JB1/anaconda/lib/python2.7/site-packages/pdfminer/pdfinterp.py", line 833, in process_page
self.render_contents(page.resources, page.contents, ctm=ctm)
File "/Users/JB1/anaconda/lib/python2.7/site-packages/pdfminer/pdfinterp.py", line 844, in render_contents
self.init_resources(resources)
File "/Users/JB1/anaconda/lib/python2.7/site-packages/pdfminer/pdfinterp.py", line 348, in init_resources
self.fontmap[fontid] = self.rsrcmgr.get_font(objid, spec)
File "/Users/JB1/anaconda/lib/python2.7/site-packages/pdfminer/pdfinterp.py", line 196, in get_font
font = self.get_font(None, subspec)
File "/Users/JB1/anaconda/lib/python2.7/site-packages/pdfminer/pdfinterp.py", line 187, in get_font
font = PDFCIDFont(self, spec)
File "/Users/JB1/anaconda/lib/python2.7/site-packages/pdfminer/pdffont.py", line 668, in __init__
self.unicode_map = CMapDB.get_unicode_map(self.cidcoding, self.cmap.is_vertical())
File "/Users/JB1/anaconda/lib/python2.7/site-packages/pdfminer/cmapdb.py", line 276, in get_unicode_map
data = klass._load_data('to-unicode-%s' % name)
File "/Users/JB1/anaconda/lib/python2.7/site-packages/pdfminer/cmapdb.py", line 247, in _load_data
if os.path.exists(path):
File "/Users/JB1/anaconda/lib/python2.7/genericpath.py", line 18, in exists
os.stat(path)
TypeError: must be encoded string without NULL bytes, not str

tataganesh · 2017-08-17T05:38:01Z

@JSB97 I have also encountered the same error. The problematic snippet in cmapdb.py seems to be -

    def _load_data(klass, name):
        filename = '%s.pickle.gz' % name
        if klass.debug:
            print >>sys.stderr, 'loading:', name
        cmap_paths = (os.environ.get('CMAP_PATH', '/usr/share/pdfminer/'),
                      os.path.join(os.path.dirname(__file__), 'cmap'),)
        for directory in cmap_paths:
            path = os.path.join(directory, filename)

Printing the variable "filename" gives me -
to-unicode-PDFXC30-Identity.pickle.gz
Printing "repr(filename)" yields -
'to-unicode-PDFXC30-Identity\x00\x00.pickle.gz'
Apparently, these \x00 (NUL) characters, which are invisible in plain print output, are causing the issue. One fix that solved this issue for me was -
filename = filename.replace('\0', '')
I am not sure what is causing this issue, though.
@euske Is there a way to make a permanent fix for this?
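
A minimal sketch of the workaround above, assuming the NUL bytes are only padding in the encoding name and carry no meaning. The helper name `cmap_filename` is hypothetical and not part of pdfminer; it strips the NUL bytes before the file name is built, which gives the same result as calling `replace` on `filename` afterwards.

```python
def cmap_filename(name):
    """Build the '<name>.pickle.gz' file name with NUL bytes removed.

    Hypothetical helper for illustration; pdfminer's _load_data builds
    the file name inline rather than through a function like this.
    """
    # Assumption: the NUL bytes are padding and can be dropped safely.
    clean = name.replace('\x00', '')
    return '%s.pickle.gz' % clean


raw = 'to-unicode-PDFXC30-Identity\x00\x00'

# repr() exposes the NUL bytes that a plain print hides.
print(repr('%s.pickle.gz' % raw))
print(repr(cmap_filename(raw)))
```

Inside `_load_data`, the equivalent change is the one-line `filename = filename.replace('\0', '')` placed right after `filename` is assigned.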

tataganesh · 2017-11-07T05:43:03Z

A fork of the repository pdfminer.six has been created at - https://github.com/strideai/pdfminer.six . This issue has been fixed in this fork, and we will now be maintaining the forked repository.

softboy99 · 2023-04-28T13:44:07Z

Hi @tataganesh, I tested this and it still fails for me.
simple1.pdf

tataganesh mentioned this issue Oct 29, 2017

Problem converting pdf to txt with pdf2txt.py pdfminer/pdfminer.six#100

Closed
