
Recurring error #20

Closed
thiswillbeyourgithub opened this issue May 10, 2021 · 7 comments

@thiswillbeyourgithub

thiswillbeyourgithub commented May 10, 2021

Hi, I frequently get this error message:

Cancelled OCR processing with message :
Traceback (most recent call last):
  File "/home/USERNAME/.local/share/Anki2/addons21/450181164/gui.py", line 56, in on_run_ocr
    ocr.run_ocr_on_notes(note_ids=selected_nids)
  File "/home/USERNAME/.local/share/Anki2/addons21/450181164/ocr.py", line 306, in run_ocr_on_notes
    notes_query = self.run_ocr_on_query(note_ids=note_ids)
  File "/home/USERNAME/.local/share/Anki2/addons21/450181164/ocr.py", line 285, in run_ocr_on_query
    raw_results = self._ocr_unbatched_process(image_paths=image_paths)
  File "/home/USERNAME/.local/share/Anki2/addons21/450181164/ocr.py", line 149, in _ocr_unbatched_process
    raw_results[image_path] = future.result()
  File "concurrent/futures/_base.py", line 432, in result
  File "concurrent/futures/_base.py", line 388, in __get_result
  File "concurrent/futures/thread.py", line 57, in run
  File "/home/USERNAME/.local/share/Anki2/addons21/450181164/ocr.py", line 262, in _ocr_img
    return pytesseract.image_to_string(str(img_pth), lang="+".join(languages or ["eng"]),
  File "/home/USERNAME/.local/share/Anki2/addons21/450181164/_vendor/pytesseract/pytesseract.py", line 368, in image_to_string
    return {
  File "/home/USERNAME/.local/share/Anki2/addons21/450181164/_vendor/pytesseract/pytesseract.py", line 371, in <lambda>
    Output.STRING: lambda: run_and_get_output(*args),
  File "/home/USERNAME/.local/share/Anki2/addons21/450181164/_vendor/pytesseract/pytesseract.py", line 280, in run_and_get_output
    run_tesseract(**kwargs)
  File "/home/USERNAME/.local/share/Anki2/addons21/450181164/_vendor/pytesseract/pytesseract.py", line 257, in run_tesseract
    raise TesseractError(proc.returncode, get_errors(error_string))
450181164._vendor.pytesseract.pytesseract.TesseractError: (-11, "read_params_file: Can't open txt read_params_file: Can't open txt Tesseract Open Source OCR Engine v4.0.0-beta.1 with Leptonica Detected 33 diacritics contains_unichar_id(unichar_id):Error:Assert failed:in file ../ccutil/unicharset.h, line 513")

My OS is Ubuntu 18 and my Anki version is 2.1.35. I have the latest AnkiOCR release available for this Anki version.

I can't tell what kind of notes cause it. I got it with some lesson text that used unusual characters (like the female and male symbols, e.g. ♂️), but it also happened on random image occlusions.

Unfortunately, this error can happen after AnkiOCR has run through hundreds of cards in a row, and none of the changes are saved :/ So I have to redo the run over and over again until I find the culprit cards.
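
To illustrate the "find the culprit cards" problem, here is a minimal sketch of processing images one at a time and recording failures instead of aborting the whole run. This is not the add-on's actual code: `ocr_images_individually` and its `ocr_fn` parameter are hypothetical names, and `ocr_fn` stands in for whatever performs the OCR (for example a thin wrapper around `pytesseract.image_to_string`).

```python
from typing import Callable, Dict, Iterable, Tuple


def ocr_images_individually(
    image_paths: Iterable[str],
    ocr_fn: Callable[[str], str],
) -> Tuple[Dict[str, str], Dict[str, str]]:
    """Run ocr_fn on each path and return (results, failures).

    A failure on one image is recorded and skipped, so the text already
    extracted from the other images is not lost.
    """
    results: Dict[str, str] = {}
    failures: Dict[str, str] = {}
    for path in image_paths:
        try:
            results[path] = ocr_fn(path)
        except Exception as exc:  # e.g. pytesseract's TesseractError
            # Keep the error text so the problematic image can be inspected later.
            failures[path] = repr(exc)
    return results, failures
```

With something like this, the keys of `failures` would point directly at the images that trigger the crash, without rerunning everything.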

Let me know if I can help you troubleshoot that :)

@thiswillbeyourgithub
Author

I seem to have found a fix. I have no idea what it does; I just followed the advice in tesseract-ocr/tesseract#1205.

I'll open a PR right away.

@thiswillbeyourgithub
Author

Feel free to close this

@cfculhane
Owner

cfculhane commented May 21, 2021 via email

@cfculhane
Owner

cfculhane commented May 21, 2021 via email

@thiswillbeyourgithub
Author

thiswillbeyourgithub commented May 21, 2021

Flat out with uni too :) good luck with your exams!

Beware, I have no idea whether this fix causes issues on other setups. Maybe it's just my installation of pytesseract or its Linux integration, who knows...

@cfculhane
Owner

Closed with new release

cfculhane added a commit that referenced this issue Sep 4, 2021
@thiswillbeyourgithub
Author

Just a heads up: I've been getting consistently better results with the following tesseract config: --oem 2 --psm 12 -c preserve_interword_spaces=1

Relevant documentation:

 --psm N
     Set Tesseract to only run a subset of layout analysis and assume a
     certain form of image. The options for N are:
         0 = Orientation and script detection (OSD) only.
         1 = Automatic page segmentation with OSD.
         2 = Automatic page segmentation, but no OSD, or OCR. (not implemented)
         3 = Fully automatic page segmentation, but no OSD. (Default)
         4 = Assume a single column of text of variable sizes.
         5 = Assume a single uniform block of vertically aligned text.
         6 = Assume a single uniform block of text.
         7 = Treat the image as a single text line.
         8 = Treat the image as a single word.
         9 = Treat the image as a single word in a circle.
         10 = Treat the image as a single character.
         11 = Sparse text. Find as much text as possible in no particular order.
         12 = Sparse text with OSD.
         13 = Raw line. Treat the image as a single text line,
              bypassing hacks that are Tesseract-specific.
 --oem N
     Specify OCR Engine mode. The options for N are:
         0 = Original Tesseract only.
         1 = Neural nets LSTM only.
         2 = Tesseract + LSTM.
         3 = Default, based on what is available.

I think changing the --oem value could be a breaking change, because it depends on which engines are available on the user's computer, but the two other flags seem like a free lunch.
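
As a rough sketch of how these flags could be passed through pytesseract: `image_to_string` accepts a `config` string that is forwarded to the tesseract command line. The helper names `build_tesseract_config` and `ocr_with_config` are made up for illustration and are not part of the add-on. `--oem` is left out unless explicitly requested, for the reason given above.

```python
from typing import Optional


def build_tesseract_config(
    psm: int = 12,
    oem: Optional[int] = None,
    preserve_interword_spaces: bool = True,
) -> str:
    """Assemble a tesseract config string from the flags discussed above."""
    parts = []
    if oem is not None:
        # Only add --oem on request: the engine may not exist on every install.
        parts.append(f"--oem {oem}")
    parts.append(f"--psm {psm}")
    if preserve_interword_spaces:
        parts.append("-c preserve_interword_spaces=1")
    return " ".join(parts)


def ocr_with_config(image_path: str, lang: str = "eng",
                    config: Optional[str] = None) -> str:
    """Hypothetical wrapper: OCR one image with the assembled config."""
    # Imported lazily so build_tesseract_config works without pytesseract.
    import pytesseract

    return pytesseract.image_to_string(
        image_path, lang=lang, config=config or build_tesseract_config()
    )
```

Calling `build_tesseract_config(oem=2)` reproduces the full config from the comment above, while the default call gives only the two flags that do not depend on the installed engines.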
