extra space in the result pdf when the input pdf is in Chinese #715

Open
Eyxxxxx opened this issue Jan 15, 2021 · 20 comments
Labels
third party issue: Problem with a third party dependency

Comments

@Eyxxxxx
Eyxxxxx commented Jan 15, 2021

Hi.
First, sorry for my poor English.

Description
I recently upgraded my Tesseract engine from v4.0.0.20181030 to v5.0.0-alpha.20201127, and two things changed.
First, when I OCR a PDF that contains only English text, the output now has a space between every word, which is good; I did not get those spaces with the v4.0 engine. Before, I got text like "thereisnospacebetweenwords"; now I get "there is no space between words". Second, with the v5.0 engine the output is wrong when the input PDF is in Chinese: there is an extra space between every character. The result now looks like 每 个 字 之 间 都 有 多 余 的 空 格 。 (the sentence reads "there is an extra space between every character"). FYI, I did not get these extra spaces when using ocrmypdf to OCR Chinese PDFs with Tesseract v4.0.

To Reproduce
My Tesseract engines are the following installers, downloaded from https://digi.bib.uni-mannheim.de/tesseract/:
tesseract-ocr-w64-setup-v4.0.0.20181030.exe
tesseract-ocr-w64-setup-v5.0.0-alpha.20201127.exe

I just ran ocrmypdf input_name.pdf OCR-output_name.pdf -l chi_sim on the command line.

Expected behavior
I want to keep the spaces in English text while omitting the extra spaces in Chinese text.

System (please complete the following information):

  • OS: Windows
  • Python version: 3.8.2
  • OCRmyPDF version: I upgraded ocrmypdf from 11.2.1 to 11.5.0 today, and another problem occurred:

[WinError 2] The system cannot find the file specified.
The warning appeared twice, as shown in the screenshot.
(screenshot: winerror 2)
I still get a result, though, and it is the same as the one I got before upgrading ocrmypdf.

Additional context
I tested the Tesseract v5.0 engine directly, and the output text is fine when I use the parameter --psm 6. But the parameter does not seem to work as well in ocrmypdf. (It does help a little: the text layer changed from 每 个 字 都 有 多 余 的 空 格 to 每 个 字 都有 多余 的 空格.)

FYI: a solution for the extra spaces in CJK text in Tesseract is discussed in tesseract-ocr/tesseract#991

Please let me know if you need any further information. Thanks!

@jbarlow83
Collaborator

The equivalent to --psm 6 in ocrmypdf is --tesseract-psm 6.

For the WinError, try running with the argument --verbose 2. That should allow us to see what is happening immediately before this exception to resolve that issue.

You can also try running ocrmypdf --sidecar output.txt. If there are extra spaces in the sidecar file, then the problem lies with tesseract.
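
A minimal sketch of the sidecar check suggested above, assuming the sidecar was written to output.txt and that "extra spaces" means spaces or tabs sitting between two CJK characters. The function names and the Unicode ranges are illustrative choices, not part of ocrmypdf or Tesseract:

```python
import re

# Assumption: "CJK" means the common ideograph blocks plus CJK and
# full-width punctuation. Extend the ranges if your text needs more.
_CJK = r"\u3000-\u303f\u3400-\u4dbf\u4e00-\u9fff\uff00-\uffef"

# A run of spaces/tabs with a CJK character on both sides.
_GAP = re.compile(rf"(?<=[{_CJK}])[ \t]+(?=[{_CJK}])")


def count_cjk_gaps(text: str) -> int:
    """Count runs of spaces/tabs that sit between two CJK characters."""
    return len(_GAP.findall(text))


def count_cjk_gaps_in_file(path: str) -> int:
    """Same count for a UTF-8 text file such as an ocrmypdf sidecar."""
    with open(path, encoding="utf-8") as handle:
        return count_cjk_gaps(handle.read())
```

If count_cjk_gaps_in_file("output.txt") is well above zero, the spaces are probably already present in Tesseract's text output; if it is zero or close to it, they are more likely introduced in the PDF text layer or by the viewer.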

@Eyxxxxx
Author

Eyxxxxx commented Jan 15, 2021 via email

@jbarlow83
Collaborator

Unfortunately, your attachment did not come through. I don't think GitHub will post email attachments. I believe you have to use the web interface to attach files.

@Eyxxxxx
Author

Eyxxxxx commented Jan 16, 2021

Reply.pdf
test.zip
Sorry, I'm new to GitHub.

@jbarlow83
Collaborator

The parameter is actually --tesseract-pagesegmode, not --tesseract-psm.

If you create a pure image version of the file, Tesseract also inserts spaces when it should not. I cannot resolve the issue, because I rely on Tesseract to properly insert spaces.

For example, using the following file:
(image: input)

and running Tesseract:

tesseract -l chi_sim input.png output pdf

will give you a file with similar issues.

Please report the issue to github.com/tesseract-ocr/tesseract.

jbarlow83 added the "third party issue" label (Problem with a third party dependency) on Jan 17, 2021
@woaidianqian
woaidianqian commented May 27, 2021

--oem 1 --psm 6 -l chi_sim -c preserve_interword_spaces=1
The parameter preserve_interword_spaces=1 can fix this problem.

  • But how do I set this parameter in OCRmyPDF?
    And what does -c mean?
    My ocrmypdf Python code:

pdfocr.ocr(inputpath,'ocr-'+filename,language=language0,tesseract_oem=1,tesseract_pagesegmode=6)
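
One possible answer to the two questions above, as a hedged sketch. On the Tesseract command line, -c VAR=VALUE sets a single config variable for that run. In ocrmypdf, Tesseract variables can instead be supplied through a config file passed with --tesseract-config. The sketch assumes the Python API exposes the same option as the tesseract_config keyword of ocrmypdf.ocr, which is worth checking against the installed version. The helper names and the file name cjk.cfg are hypothetical. Later comments in this thread report that this setting helps only in some situations.

```python
from pathlib import Path


def write_tesseract_config(path: str) -> str:
    """Write a Tesseract config file; each line is '<variable> <value>'."""
    Path(path).write_text("preserve_interword_spaces 1\n", encoding="utf-8")
    return path


def build_ocr_kwargs(config_path: str, language: str = "chi_sim") -> dict:
    """Keyword arguments for ocrmypdf.ocr(), mirroring the call above.

    Assumption: 'tesseract_config' is accepted by the installed ocrmypdf.
    """
    return {
        "language": [language],
        "tesseract_oem": 1,
        "tesseract_pagesegmode": 6,
        "tesseract_config": [config_path],
    }


def run_ocr(input_pdf: str, output_pdf: str, config_path: str = "cjk.cfg") -> None:
    """Write the config file, then run OCR with it."""
    import ocrmypdf  # imported lazily so the helpers above work without it

    write_tesseract_config(config_path)
    ocrmypdf.ocr(input_pdf, output_pdf, **build_ocr_kwargs(config_path))
```

The command-line equivalent would be along the lines of ocrmypdf -l chi_sim --tesseract-oem 1 --tesseract-pagesegmode 6 --tesseract-config cjk.cfg input.pdf output.pdf.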

@SimonZh1234

I have encountered the same problem. My ocrmypdf version is 9.6.0+dfsg on Ubuntu. I ran
ocrmypdf -l chi_sim --sidecar test.txt test.pdf test.pdf.pdf
as suggested. The text in test.txt is correct, but unexpected spaces appear in test.pdf.pdf.
test.zip

@jbarlow83
Collaborator

I can't support or fix versions as old as 9.6.0.

@SimonZh1234

@jbarlow83 I have just installed the newest version (13.4.5) of ocrmypdf via pip on Ubuntu, but the problem persists:
ocrmypdf -l chi_sim --sidecar test.txt test.pdf test.pdf.pdf
gives a correct test.txt, but test.pdf.pdf contains extra spaces.
test.zip

@Kder
Kder commented Jun 12, 2022

I encountered the same issue. The --sidecar text file was correct, but the output PDF contained extra spaces. My environment is Windows 11, ocrmypdf 13.4.7, Tesseract v5.1.0.20220510.

@jbarlow83
Collaborator

Extra spaces in words are usually a PDF viewer issue. This is partly because PDF viewers have to decide where word breaks are, and sometimes they don't do this well. Try a different PDF viewer; in particular, check Adobe Reader.

@SimonZh1234

@jbarlow83 Thanks for your advice, but I have tried Adobe Reader, Foxit Reader, Xodo, and Evince on this example, and none of them can copy the text without spaces. Could you try it with the zip file I uploaded?

@cliveparkinson
cliveparkinson commented Aug 27, 2022

I am experiencing the same issue of additional spaces in chi_sim text on a Mac running version 13.7.0.

  • Sidecar text output:

边疆既是一个地域概念,也是一个政治概念。就地域层面而
言,是指国家毗连边界线、与内地 〈内陆、内海) 相对而言的区
域。一般而言,历史上中国的边疆是在秦统一中原、其重心部分
形成之后确立的,有着两千多年的历史沿革。相应地,中国的边
疆研究也有着悠久的历史和优良的传统,并与国家和边疆的安危
息息相关。

  • ocrmypdf output pasted from Adobe Reader (spacing is identical in Microsoft Edge):

边疆 既是 一 个 地 域 概念 , 也 是 一 个 政治 概念 。 就 地 域 层 面 而
言 , 是 指 国家 毗连 边界 线 、 与 内 地 〈 内 陆 、 内 海 ) 相对 而 言 的 区
域 。 一 般 而 言 , 历 史上 中 国 的 边疆 是 在 秦 统 一 中 原 、 其 重心 部 分
形成 之 后 确立 的 , 有 着 两 千 多 年 的 历史 沿革 。 相 应 地 , 中 国 的 边
疆 研 究 也 有 着 悠久 的 历史 和 优良 的 传统 , 并 与 国家 和 边疆 的 安危
息息相关 。

I guess the issue has something to do with tokenization, as the characters connected without spaces are valid tokens.
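
As a stopgap for text that has already been copied out of the PDF, the stray spaces can be removed after the fact. The sketch below does not fix the PDF text layer itself. It assumes that any space or tab touching a CJK character is unwanted, so it also removes intended spaces between Chinese and Latin text. The function name and Unicode ranges are illustrative:

```python
import re

# Assumption: "CJK" means the common ideograph blocks plus CJK and
# full-width punctuation. Extend the ranges if your text needs more.
_CJK = r"\u3000-\u303f\u3400-\u4dbf\u4e00-\u9fff\uff00-\uffef"

# A run of spaces/tabs with a CJK character on at least one side.
_SPACE_NEAR_CJK = re.compile(rf"(?<=[{_CJK}])[ \t]+|[ \t]+(?=[{_CJK}])")


def strip_cjk_spaces(text: str) -> str:
    """Remove spaces/tabs adjacent to CJK characters; keep line breaks."""
    return _SPACE_NEAR_CJK.sub("", text)
```

Applied to the first pasted line above, strip_cjk_spaces should reproduce the corresponding sidecar line, while plain English text passes through unchanged.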

@liblaf
liblaf commented Mar 20, 2023

Any workaround to get rid of spaces? 👀

@ZetaLin
ZetaLin commented Jun 9, 2023

I wrote an article in Chinese describing almost all of the possible solutions, although none of them solves the problem completely. Readers who do not know Chinese can use translation software to read it. Link: https://www.cnblogs.com/issacnew/p/17468697.html

@jbarlow83
Collaborator

The gist of the article above is that creating a tesseract config file with the contents preserve_interword_spaces 1 will improve output in some situations.

@ZetaLin Please understand that the issue is due to Tesseract producing PDFs that some PDF readers do not interpret correctly, and no one has a solution at this time.

@ZetaLin
ZetaLin commented Jun 9, 2023

Yes, I tested Tesseract v5.3.1.20230401 like this:
tesseract input.png out -l chi_sim --oem 1 --psm 6 -c preserve_interword_spaces=1 pdf

I get the same result as with ocrmypdf: the output txt has no spaces, but the text copied from the PDF still has spaces.

So this problem comes from Tesseract, not ocrmypdf. More users need to know this.

@ZetaLin
ZetaLin commented Jun 9, 2023

Currently, the only way I have found to get text without spaces when copying from ocrmypdf's output PDF is to use OEM 0, and it is not a particularly good one: it uses the non-LSTM model, which does not recognize text as well.

ocrmypdf -l chi_sim --tesseract-oem 0 input.pdf output.pdf
With this method, text copied directly from the PDF has no spaces, but some of it is not recognized correctly.

This person's test confirms my claim: tesseract-ocr/tesseract#2814 (comment)

@hhiyorimi

Is there some way to solve it?

@jbarlow83
Collaborator

#1191 input requested

9 participants