extra space in the result pdf when the input pdf is in Chinese #715
Comments
The equivalent to For the WinError, try running with the argument You can also try running |
Thanks for your reply!
The details are in Reply.pdf, because running with the argument --verbose 2 produced a lot of output.
The zip contains my test material. My tesseract engine is tesseract-ocr-w64-setup-v5.0.0-alpha.20201127. I hope this information helps you reproduce my problem.
Please let me know if you need any further information.
Thanks for your time again!!!
|
Unfortunately, your attachment did not come through. I don't think GitHub posts email attachments. I believe you have to use the web interface to attach files. |
--oem 1 --psm 6 -l chi_sim -c preserve_interword_spaces=1
pdfocr.ocr(inputpath,'ocr-'+filename,language=language0,tesseract_oem=1,tesseract_pagesegmode=6) |
I have encountered the same problems, the version of ocrmypdf is |
I can't support or fix versions as old as 9.6.0. |
@jbarlow83 I have just installed the newest version ( |
I encountered the same issue. The "--sidecar" txt was correct but output pdf contained extra spaces. My environment is Windows 11, ocrmypdf version 13.4.7, tesseract v5.1.0.20220510. |
Extra spaces in words is usually a PDF viewer issue. This is partly because PDF viewers have to decide where word breaks are - and sometimes they don't do this well. Try a different PDF viewer. In particular check Adobe Reader. |
@jbarlow83 Thanks for your advice, but I have tried |
I am experiencing the same issue of additional spaces in chi_sim text on a Mac running version 13.7.0.
Without extra spaces:
边疆既是一个地域概念,也是一个政治概念。就地域层面而
With extra spaces:
边疆 既是 一 个 地 域 概念 , 也 是 一 个 政治 概念 。 就 地 域 层 面 而
I guess the issue has something to do with tokenization, as the characters connected without spaces are valid tokens. |
Any workaround to get rid of spaces? 👀 |
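One post-processing workaround, for text that has already been copied or extracted from the PDF, is to delete the spaces that sit between two CJK characters. The sketch below is an illustration only and does not change the PDF text layer. The function name `strip_cjk_spaces` and the chosen Unicode ranges are assumptions, not part of ocrmypdf or Tesseract.

```python
import re

# Assumption: "CJK" here means CJK punctuation, the main CJK Unified
# Ideographs block, and fullwidth forms. Extend the ranges if needed.
_CJK = r"\u3001-\u303f\u4e00-\u9fff\uff00-\uffef"

# A run of spaces or tabs with a CJK character on both sides.
_GAP = re.compile(rf"(?<=[{_CJK}])[ \t]+(?=[{_CJK}])")


def strip_cjk_spaces(text: str) -> str:
    """Remove spaces between two CJK characters; keep all other spaces."""
    return _GAP.sub("", text)


if __name__ == "__main__":
    print(strip_cjk_spaces("每 个 字 之 间 都 有 多 余 的 空 格 。"))
    print(strip_cjk_spaces("there is no space between words"))
```

Spaces between Latin words, and between Latin and CJK text, are left alone, so English passages keep their word breaks. ASCII punctuation such as "," is not in the ranges above, so spaces around it survive unless the ranges are widened.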
I wrote an article in Chinese describing nearly all the possible solutions, although none of them completely solves the problem. Readers who do not read Chinese can use translation software to read it. Link: https://www.cnblogs.com/issacnew/p/17468697.html |
The gist of the article above is that creating a tesseract config file with the contents @ZetaLin Please understand that the issue is due to Tesseract producing PDFs that some PDF readers do not interpret correctly, and no one has a solution at this time. |
Yes, I tested tesseract v5.3.1.20230401 like this: I get the same result as with ocrmypdf: the output txt has no spaces, but the text copied from the PDF still has spaces. So this problem comes from Tesseract, not ocrmypdf. More users should know about this conclusion. |
Currently, the only workaround for getting ocrmypdf to produce a PDF whose copied text has no spaces seems to be oem 0 (the non-LSTM model), and it is not a particularly good one because its recognition quality is poor.
This person's test confirmed my claim: tesseract-ocr/tesseract#2814 (comment) |
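For anyone who wants to try the oem 0 workaround, the following sketch assembles the corresponding ocrmypdf command line. The helper `build_ocrmypdf_command` is hypothetical; the flags `-l`, `--tesseract-oem`, `--tesseract-pagesegmode`, and `--sidecar` are ocrmypdf command-line options. It assumes the installed chi_sim traineddata includes the non-LSTM model, which LSTM-only data files do not.

```python
import shlex


def build_ocrmypdf_command(input_pdf, output_pdf, language="chi_sim",
                           oem=None, psm=None, sidecar=None):
    """Return an argv list for ocrmypdf (hypothetical helper, runs nothing)."""
    cmd = ["ocrmypdf", "-l", language]
    if oem is not None:
        # 0 selects the non-LSTM engine, 1 selects LSTM only.
        cmd += ["--tesseract-oem", str(oem)]
    if psm is not None:
        # Tesseract page segmentation mode, e.g. 6.
        cmd += ["--tesseract-pagesegmode", str(psm)]
    if sidecar is not None:
        # Also write the recognized text to a plain-text file.
        cmd += ["--sidecar", sidecar]
    cmd += [input_pdf, output_pdf]
    return cmd


if __name__ == "__main__":
    argv = build_ocrmypdf_command("input_name.pdf", "OCR-output_name.pdf", oem=0)
    print(shlex.join(argv))
```

The list can be passed to `subprocess.run(argv, check=True)` once ocrmypdf is installed. Since the sidecar text was reported as correct in this thread, adding `sidecar="output.txt"` gives a space-free text copy even when the PDF text layer is affected.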
Is there some way to solve it? |
#1191 input requested |
Hi.
First, sorry for my poor English.
Description
Recently I upgraded my tesseract engine from v4.0.0.20181030 to v5.0.0-alpha.20201127, and two things happened.
First, when I OCR a PDF with pure English text, there is now a space between every word, which is good; I didn't get those spaces when my engine was v4.0. That is, I used to get text like "thereisnospacebetweenwords", and now I get "there is no space between words". Second, with the v5.0 engine, things go wrong when my input PDF is in Chinese: there is an extra space between every character. The result now looks like 每 个 字 之 间 都 有 多 余 的 空 格 。 (FYI, I didn't get those extra spaces when using ocrmypdf to OCR Chinese PDFs with tesseract v4.0.)
To Reproduce
My tesseract engines are the following, downloaded from https://digi.bib.uni-mannheim.de/tesseract/:
tesseract-ocr-w64-setup-v4.0.0.20181030.exe
tesseract-ocr-w64-setup-v5.0.0-alpha.20201127.exe
I just typed
ocrmypdf input_name.pdf OCR-output_name.pdf -l chi_sim
in the CLI.
Expected behavior
I want to keep the spaces in English text while omitting the extra spaces in Chinese text.
System (please complete the following information):
Additional context
I tested the tesseract v5.0 engine directly, and the output text is fine once I use the parameter --psm 6. But the parameter doesn't seem to work as well with ocrmypdf. (It does help a little in ocrmypdf: the text layer changed from 每 个 字 都 有 多 余 的 空 格 to 每 个 字 都有 多余 的 空格.)
FYI: the solution for the extra spaces in CJK text in tesseract: tesseract-ocr/tesseract#991
Please let me know if you need any further information. Thanks!