This repository has been archived by the owner on Jun 14, 2018. It is now read-only.

pyocr with latest Tesseract fails with pyocr.error.TesseractError: "Error, unknown command line argument '-psm'\n") #99

Open
ddddavidmartin opened this issue Jun 7, 2018 · 5 comments

@ddddavidmartin
Contributor

Good day,

I'm using pyocr through Paperless on an Ubuntu setup with the tesseract-ocr PPA [0]. With the latest Tesseract version from that PPA [1], pyocr throws an error.

[0]

cat /etc/apt/sources.list.d/alex-p-ubuntu-tesseract-ocr-artful.list
deb http://ppa.launchpad.net/alex-p/tesseract-ocr/ubuntu artful main

[1]

tesseract --version
tesseract 4.0.0-beta.1-302-g3aa9
 leptonica-1.76.0
  libgif 5.1.4 : libjpeg 8d (libjpeg-turbo 1.5.2) : libpng 1.6.34 : libtiff 4.0.8 : zlib 1.2.11 : libwebp 0.6.0 : libopenjp2 2.3.0

Traceback:

littlebig@littlebig:~/Dev/paperless$ python3 /home/littlebig/Dev/paperless/src/manage.py document_consumer
Starting document consumer at /home/littlebig/paperless_consumption_dir with inotify
Parsers available: RasterisedDocumentParser
Consuming /home/littlebig/paperless_consumption_dir/BRW90CDB68D60F5_000798.pdf
Processing sheet #1: /tmp/paperless/paperless-b5bgnwtm/convert-0000.pnm -> /tmp/paperless/paperless-b5bgnwtm/convert-0000.unpaper.pnm
[pgm_pipe @ 0x55cbcbdfb980] Stream #0: not enough frames to estimate rate; consider increasing probesize
[image2 @ 0x55cbcbe00140] Using AVStream.codec to pass codec parameters to muxers is deprecated, use AVStream.codecpar instead.
[image2 @ 0x55cbcbe00140] Encoder did not produce proper pts, making some up.
OCRing the document
Parsing for eng
multiprocessing.pool.RemoteTraceback:
"""
Traceback (most recent call last):
  File "/usr/lib/python3.6/multiprocessing/pool.py", line 119, in worker
    result = (True, func(*args, **kwds))
  File "/usr/lib/python3.6/multiprocessing/pool.py", line 44, in mapstar
    return list(map(*args))
  File "/home/littlebig/Dev/paperless/src/paperless_tesseract/parsers.py", line 290, in image_to_string
    return ocr.image_to_string(f, lang=lang)
  File "/home/littlebig/.local/lib/python3.6/site-packages/pyocr/tesseract.py", line 367, in image_to_string
    raise TesseractError(status, errors)
pyocr.error.TesseractError: (1, b"Error, unknown command line argument '-psm'\n")
"""

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/home/littlebig/Dev/paperless/src/manage.py", line 18, in <module>
    execute_from_command_line(sys.argv)
  File "/home/littlebig/.local/lib/python3.6/site-packages/django/core/management/__init__.py", line 364, in execute_from_command_line
    utility.execute()
  File "/home/littlebig/.local/lib/python3.6/site-packages/django/core/management/__init__.py", line 356, in execute
    self.fetch_command(subcommand).run_from_argv(self.argv)
  File "/home/littlebig/.local/lib/python3.6/site-packages/django/core/management/base.py", line 283, in run_from_argv
    self.execute(*args, **cmd_options)
  File "/home/littlebig/.local/lib/python3.6/site-packages/django/core/management/base.py", line 330, in execute
    output = self.handle(*args, **options)
  File "/home/littlebig/Dev/paperless/src/documents/management/commands/document_consumer.py", line 98, in handle
    self.loop_inotify(mail_delta)
  File "/home/littlebig/Dev/paperless/src/documents/management/commands/document_consumer.py", line 131, in loop_inotify
    self.loop_step(mail_delta)
  File "/home/littlebig/Dev/paperless/src/documents/management/commands/document_consumer.py", line 123, in loop_step
    self.file_consumer.consume_new_files()
  File "/home/littlebig/Dev/paperless/src/documents/consumer.py", line 107, in consume_new_files
    if not self.try_consume_file(file):
  File "/home/littlebig/Dev/paperless/src/documents/consumer.py", line 145, in try_consume_file
    date = parsed_document.get_date()
  File "/home/littlebig/Dev/paperless/src/paperless_tesseract/parsers.py", line 209, in get_date
    text = self.get_text()
  File "/home/littlebig/Dev/paperless/src/paperless_tesseract/parsers.py", line 80, in get_text
    self._text = self._get_ocr(images)
  File "/home/littlebig/Dev/paperless/src/paperless_tesseract/parsers.py", line 140, in _get_ocr
    raw_text = self._ocr([imgs[middle]], self.DEFAULT_OCR_LANGUAGE)
  File "/home/littlebig/Dev/paperless/src/paperless_tesseract/parsers.py", line 189, in _ocr
    r = pool.map(image_to_string, itertools.product(imgs, [lang]))
  File "/usr/lib/python3.6/multiprocessing/pool.py", line 266, in map
    return self._map_async(func, iterable, mapstar, chunksize).get()
  File "/usr/lib/python3.6/multiprocessing/pool.py", line 644, in get
    raise self._value
pyocr.error.TesseractError: (1, b"Error, unknown command line argument '-psm'\n")
littlebig@littlebig:~/Dev/paperless$

Has anyone else come across this? Thanks!

@ddddavidmartin ddddavidmartin changed the title pyocr with latest Tesseract fails with pyocr.error.TesseractError: "Error, unknown command line argument '-psm'\n") pyocr with latest Tesseract fails with pyocr.error.TesseractError: "Error, unknown command line argument '-psm'\n") Jun 7, 2018
@ddddavidmartin
Contributor Author

Having a look through the pyocr sources this stands out to me:

src/pyocr/builders.py
307-        file_ext = ["txt"]
308:        tess_flags = ["-psm", str(tesseract_layout)]
309-        cun_args = ["-f", "text"]
--
564-        file_ext = ["html", "hocr"]
565:        tess_flags = ["-psm", str(tesseract_layout)]
566-        tess_conf = ["hocr"]
--
640-        file_ext = ["html", "hocr"]
641:        tess_flags = ["-psm", str(tesseract_layout)]
642-        tess_conf = ["hocr"]

Does pyocr just use -psm instead of --psm as the parameter? I'm wondering whether that is not accepted anymore now.

@ddddavidmartin
Contributor Author

ddddavidmartin commented Jun 7, 2018

> Does pyocr just use -psm instead of --psm as the parameter? I'm wondering whether that is not accepted anymore now.

It looks like this is the problem. I changed the options passed in builders.py from -psm to --psm and it works fine now. I might create a pull request for this, though I'm not sure whether the change has any other implications.

The commit in question in tesseract is the following:
tesseract-ocr/tesseract@ee201e1
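
A rough sketch of a version-aware variant of this change, assuming that Tesseract 3.x builds expect -psm while 4.x builds such as the 4.0.0-beta.1 shown above only accept --psm. The helper names (parse_major_version, psm_flag, build_tess_flags) are hypothetical and not part of pyocr; the version text is expected in the format printed by `tesseract --version`:

```python
import re


def parse_major_version(version_output):
    """Extract the major version from `tesseract --version` style output."""
    match = re.search(r"tesseract\s+v?(\d+)", version_output)
    if match is None:
        raise ValueError("no Tesseract version found in: %r" % version_output)
    return int(match.group(1))


def psm_flag(major_version):
    """Pick the page-segmentation flag spelling.

    Assumption: 3.x and older take the single-dash form, 4.x and newer
    take the double-dash form.
    """
    return "-psm" if major_version <= 3 else "--psm"


def build_tess_flags(version_output, tesseract_layout):
    """Build the flag list that builders.py currently hard-codes with "-psm"."""
    return [psm_flag(parse_major_version(version_output)), str(tesseract_layout)]


if __name__ == "__main__":
    print(build_tess_flags("tesseract 4.0.0-beta.1-302-g3aa9", 3))
    print(build_tess_flags("tesseract 3.04.01", 3))
```

Choosing the flag at call time rather than hard-coding --psm would keep older Tesseract installs working, if the assumption about 3.x holds for them.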

@simonm3

simonm3 commented Jun 10, 2018

I also came across this today. I note that -psm is used not just in builders.py but also in tesseract.py.
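
To find every remaining occurrence before patching, a small scan along these lines might help. It is only a sketch: find_legacy_psm is a hypothetical helper, and it only catches the flag when it appears as a quoted string literal such as "-psm" or '-psm':

```python
import re
from pathlib import Path

# Matches the quoted single-dash literal ("-psm" or '-psm') but not "--psm".
LEGACY_PSM = re.compile(r"""["']-psm["']""")


def find_legacy_psm(root):
    """Return (path, line number, stripped line) for each legacy literal under root."""
    hits = []
    for path in sorted(Path(root).rglob("*.py")):
        text = path.read_text(encoding="utf-8", errors="replace")
        for lineno, line in enumerate(text.splitlines(), start=1):
            if LEGACY_PSM.search(line):
                hits.append((str(path), lineno, line.strip()))
    return hits


if __name__ == "__main__":
    # "src/pyocr" is the source directory named earlier in this thread.
    for path, lineno, line in find_legacy_psm("src/pyocr"):
        print("%s:%d: %s" % (path, lineno, line))
```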

@jflesch
Member

jflesch commented Jun 10, 2018

#100

@ddddavidmartin
Contributor Author

I haven't had a chance yet to work out the circular import statements that I introduced in https://github.com/ddddavidmartin/pyocr/tree/update_deprecated_psm_option_string. If anyone wants to step in, feel free to give it a go.

For now, a quick and dirty fix is to just apply c136838.
