extra space in the result pdf when the input pdf is in Chinese #715

Eyxxxxx · 2021-01-15T04:23:08Z

Hi.
First, sorry for my poor English.

Description
Recently I upgraded my Tesseract engine from v4.0.0.20181030 to v5.0.0-alpha.20201127, and two things happened.
First, there is now a space between every word when I OCR a PDF with pure English text, which is good; I did not get those spaces when my engine was v4.0. That is, I used to get text like "thereisnospacebetweenwords", and now I get "there is no space between words". However, with the v5.0 engine, things go wrong when my input PDF is in Chinese: there is an extra space between every single character. The result now looks like 每个字之间都有多余的空格。 (FYI, I didn't get those extra spaces when using ocrmypdf to OCR a Chinese PDF with Tesseract v4.0.)

To Reproduce
My Tesseract engines are the following, downloaded from https://digi.bib.uni-mannheim.de/tesseract/:
tesseract-ocr-w64-setup-v4.0.0.20181030.exe
tesseract-ocr-w64-setup-v5.0.0-alpha.20201127.exe

I just ran ocrmypdf input_name.pdf OCR-output_name.pdf -l chi_sim on the command line.

Expected behavior
I want to keep the spaces in English text while omitting the extra spaces in Chinese text.

System (please complete the following information):

OS: Windows
Python version: 3.8.2
OCRmyPDF version: 11.5.0 (upgraded from 11.2.1 today). After the upgrade, another problem occurred:

[WinError 2] The system can not find the file specified.
The warning appeared twice, as shown in the picture.

But I still get a result, which is the same as the one I got before upgrading ocrmypdf.

Additional context
I tested the Tesseract v5.0 engine directly, and the output text is fine once I pass --psm 6. But the parameter does not seem to work as well in ocrmypdf. (It does help a little in ocrmypdf: the text layer changed from 每个字都有多余的空格 to 每个字都有多余的空格.)

FYI: the solution for extra spaces in CJK text in Tesseract is discussed in tesseract-ocr/tesseract#991

Please let me know if you need any further information. Thanks!

jbarlow83 · 2021-01-15T09:58:28Z

The equivalent to --psm 6 in ocrmypdf is --tesseract-psm 6.

For the WinError, try running with the argument --verbose 2. That should allow us to see what is happening immediately before this exception to resolve that issue.

You can also try running ocrmypdf --sidecar output.txt. If there are extra spaces in the sidecar file, then the problem lies with tesseract.
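
To make the sidecar check less subjective, one option is to count the whitespace runs that sit between two CJK characters in the sidecar text. The following is a minimal sketch using only the Python standard library. The function name count_cjk_gaps and the decision to treat only the basic CJK Unified Ideographs block (U+4E00 to U+9FFF) as "CJK" are assumptions made for illustration; neither is part of ocrmypdf or Tesseract.

```python
import re

# Assumption: "CJK" means the basic CJK Unified Ideographs block only.
CJK = "\u4e00-\u9fff"

# One or more spaces or tabs with a CJK character on both sides.
_GAP = re.compile(rf"(?<=[{CJK}])[ \t]+(?=[{CJK}])")


def count_cjk_gaps(text: str) -> int:
    """Return how many whitespace runs separate two CJK characters."""
    return len(_GAP.findall(text))


# Illustrative strings, not real OCR output.
print(count_cjk_gaps("每 个 字"))            # spaced-out Chinese text
print(count_cjk_gaps("每个字"))              # clean Chinese text
print(count_cjk_gaps("there is no space"))  # English spaces are not counted
```

A count of zero on the sidecar file together with visible spaces in text copied from the PDF would suggest the problem is in the PDF text layer or the viewer rather than in the recognized text itself.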

Eyxxxxx · 2021-01-15T15:16:40Z

Thanks for your reply! The details are in Reply.pdf, because a lot of output appears when I use the argument --verbose 2. The zip contains my test material. My Tesseract engine is tesseract-ocr-w64-setup-v5.0.0-alpha.20201127. I hope this information helps you reproduce my problem. Please let me know if you need anything further. Thanks again for your time!

jbarlow83 · 2021-01-15T19:33:06Z

Unfortunately your attachment did not come through. I don't think GitHub will post email attachments. I believe you have to use the web interface to provide attached files.

Eyxxxxx · 2021-01-16T03:35:23Z

Reply.pdf
test.zip
Sorry, I'm new to GitHub.

jbarlow83 · 2021-01-17T21:14:14Z

The parameter is actually --tesseract-pagesegmode, not --tesseract-psm.

If you create a pure image version of the file, Tesseract also inserts spaces when it should not. I cannot resolve the issue, because I rely on Tesseract to properly insert spaces.

For example, with an image version of the page saved as input.png, running Tesseract directly:

tesseract -l chi_sim input.png output pdf

will give you a file with similar issues.

Please report the issue to github.com/tesseract-ocr/tesseract.

woaidianqian · 2021-05-27T07:57:16Z

--oem 1 --psm 6 -l chi_sim -c preserve_interword_spaces=1

The parameter preserve_interword_spaces=1 can fix this problem.

But how do I set this parameter in OCRmyPDF? And what does -c mean?
My ocrmypdf Python code:

pdfocr.ocr(inputpath,'ocr-'+filename,language=language0,tesseract_oem=1,tesseract_pagesegmode=6)
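
On the two questions above: in the Tesseract command line, -c VAR=VALUE sets a Tesseract configuration variable. ocrmypdf does not take -c directly, but it accepts a Tesseract config file through its --tesseract-config option, and a config file line of the form "preserve_interword_spaces 1" sets the same variable. The sketch below writes such a file and builds (without running) an ocrmypdf command line. The helper names, the file name cjk_spaces.cfg, and the input and output paths are hypothetical, and whether this setting actually removes the spaces from the PDF text layer is not established here.

```python
from __future__ import annotations

import tempfile
from pathlib import Path


def write_tesseract_config(directory: Path) -> Path:
    """Write a config file equivalent to -c preserve_interword_spaces=1."""
    cfg = directory / "cjk_spaces.cfg"  # hypothetical file name
    cfg.write_text("preserve_interword_spaces 1\n", encoding="utf-8")
    return cfg


def build_ocrmypdf_args(input_pdf: str, output_pdf: str, cfg: Path) -> list[str]:
    """Build, but do not run, an ocrmypdf command line."""
    return [
        "ocrmypdf",
        "-l", "chi_sim",
        "--tesseract-oem", "1",
        "--tesseract-pagesegmode", "6",
        "--tesseract-config", str(cfg),
        input_pdf,
        output_pdf,
    ]


with tempfile.TemporaryDirectory() as tmp:
    cfg = write_tesseract_config(Path(tmp))
    args = build_ocrmypdf_args("input.pdf", "ocr-input.pdf", cfg)
    print(" ".join(args))
    # To actually run OCR: subprocess.run(args, check=True)
```

Building the argument list separately from running it keeps the sketch testable on a machine without ocrmypdf or Tesseract installed.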

SimonZh1234 · 2022-05-26T07:51:10Z

I have encountered the same problem. My ocrmypdf version is 9.6.0+dfsg on Ubuntu. I ran
ocrmypdf -l chi_sim --sidecar test.txt test.pdf test.pdf.pdf
as suggested. The text in test.txt is correct, but there are unexpected spaces in test.pdf.pdf.
test.zip

jbarlow83 · 2022-05-26T07:57:02Z

I can't support or fix versions as old as 9.6.0.

SimonZh1234 · 2022-05-26T14:11:56Z

@jbarlow83 I have just installed the newest version (13.4.5) of ocrmypdf via pip on Ubuntu, but the problem persists:
ocrmypdf -l chi_sim --sidecar test.txt test.pdf test.pdf.pdf
gives a correct test.txt, but test.pdf.pdf contains extra spaces.
test.zip

Kder · 2022-06-12T06:30:43Z

I encountered the same issue. The --sidecar text file was correct, but the output PDF contained extra spaces. My environment is Windows 11, ocrmypdf version 13.4.7, tesseract v5.1.0.20220510.

jbarlow83 · 2022-06-12T06:51:24Z

Extra spaces in words are usually a PDF viewer issue. This is partly because PDF viewers have to decide where word breaks are, and sometimes they don't do this well. Try a different PDF viewer. In particular, check Adobe Reader.

SimonZh1234 · 2022-06-12T07:04:49Z

@jbarlow83 Thanks for your advice, but I have tried Adobe Reader, Foxit Reader, Xodo and evince on this example, and none of them can copy the text without spaces. Would you be able to try the zip file I uploaded?

cliveparkinson · 2022-08-27T09:15:43Z

I am experiencing the same issue of additional spaces in chi_sim text, running version 13.7.0 on a Mac.

Sidecar text output:

边疆既是一个地域概念，也是一个政治概念。就地域层面而
言，是指国家毗连边界线、与内地〈内陆、内海) 相对而言的区
域。一般而言，历史上中国的边疆是在秦统一中原、其重心部分
形成之后确立的，有着两千多年的历史沿革。相应地，中国的边
疆研究也有着悠久的历史和优良的传统，并与国家和边疆的安危
息息相关。

ocrmypdf output pasted from Adobe Reader (spacing is identical in Microsoft Edge):

边疆既是一个地域概念，也是一个政治概念。就地域层面而
言，是指国家毗连边界线、与内地〈内陆、内海 ) 相对而言的区
域。一般而言，历史上中国的边疆是在秦统一中原、其重心部分
形成之后确立的，有着两千多年的历史沿革。相应地，中国的边
疆研究也有着悠久的历史和优良的传统，并与国家和边疆的安危
息息相关。

I guess the issue has something to do with tokenization, as the characters connected without spaces are valid tokens.
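
Because the sidecar text is correct and only text copied from the PDF is affected, one pragmatic workaround for copied text is to strip whitespace that sits between two CJK characters after copying. This is a minimal sketch using only the Python standard library; it does not change the PDF itself. The function name strip_cjk_spaces and the character ranges (CJK ideographs, CJK punctuation, fullwidth forms) are assumptions for illustration, and the sample strings are made up rather than taken from real output.

```python
import re

# Assumed ranges: CJK ideographs, CJK punctuation, fullwidth forms.
_CJK = "\u4e00-\u9fff\u3000-\u303f\uff00-\uffef"

# Spaces or tabs that have a CJK character on both sides.
_BETWEEN_CJK = re.compile(rf"(?<=[{_CJK}])[ \t]+(?=[{_CJK}])")


def strip_cjk_spaces(text: str) -> str:
    """Remove whitespace between CJK characters; leave other spaces alone."""
    return _BETWEEN_CJK.sub("", text)


# Made-up example of text copied from a PDF viewer.
print(strip_cjk_spaces("边 疆 既 是 一 个 地 域 概 念 ， 也 是"))
# Spaces next to Latin letters are kept.
print(strip_cjk_spaces("OCR 结果 and English words stay apart"))
```

Spaces adjacent to Latin letters or digits are deliberately left untouched, so mixed Chinese and English text keeps its English word breaks.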

liblaf · 2023-03-20T06:48:44Z

Any workaround to get rid of spaces? 👀

ZetaLin · 2023-06-09T03:21:45Z

I wrote an article in Chinese describing almost all of the possible solutions, though none of them solves the problem completely. Readers who don't read Chinese can use translation software. Link: https://www.cnblogs.com/issacnew/p/17468697.html

jbarlow83 · 2023-06-09T06:37:13Z

The gist of the article above is that creating a tesseract config file with the contents preserve_interword_spaces 1 will improve output in some situations.

@ZetaLin Please understand that the issue is due to Tesseract producing PDFs that some PDF readers do not interpret correctly, and no one has a solution at this time.

ZetaLin · 2023-06-09T07:22:43Z

Yes, I tested tesseract v5.3.1.20230401 like this:
tesseract input.png out -l chi_sim --oem 1 --psm 6 -c preserve_interword_spaces=1 pdf

I get the same result as with ocrmypdf: the output txt has no spaces, but the text copied from the PDF still has spaces.

Thus, this problem comes from Tesseract, NOT ocrmypdf. More users need to know this.

ZetaLin · 2023-06-09T09:08:16Z

Currently, the only way I have found to make ocrmypdf produce a PDF whose copied text has no spaces is to use oem 0, and it is not a particularly good solution: oem 0 uses the non-LSTM model, which does not recognize text as well.

ocrmypdf -l chi_sim --tesseract-oem 0 input.pdf output.pdf
With this method, text copied from the PDF has no spaces, but some of it is not recognized correctly.

This person's test confirmed my claim: tesseract-ocr/tesseract#2814 (comment)

hhiyorimi · 2023-10-11T11:21:55Z

Is there some way to solve it?

jbarlow83 · 2023-11-21T08:33:20Z

#1191 input requested

jbarlow83 added the third party issue Problem with a third party dependency label Jan 17, 2021

hhiyorimi mentioned this issue Oct 11, 2023

How can I remove extra space between every characters #1166

Closed

tenpai-git mentioned this issue Mar 20, 2024

Add Japanese Vertical Support Branch for Tesseract and Ocrmypdf OCR eikek/docspell#2505

Merged

wywzxxz mentioned this issue Sep 21, 2024

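当使用ocrmypdf输入 PDF 为中文时，结果复制PDF 中有额外的空格 (When the input PDF to ocrmypdf is in Chinese, text copied from the resulting PDF has extra spaces) #1391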

Open

3 tasks
