Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

BUG: Issue in text extraction (spaces) (#1153) #2882

Merged
merged 62 commits into from
Oct 3, 2024
Merged

Conversation

ssjkamei
Copy link
Contributor

Closes #1153

The change if abs(moved_height) > 0.8 * f: at line 129 of the crlf_space_check function changes the function to look at both the bottom and top misalignment, but if you prefer not to change it here, I commit the reverted code.

The test for hello-world.pdf fails.
I think a hidden bug has surfaced, but I don't seem to have read enough of the documentation to determine how to address it from within the PDF 1.7 specification.

I am judging by the position specified by Td and Tm, but perhaps I am judging both as if they were absolute coordinates, causing a misalignment, but I don't know how to correct it correctly.

Target String: สวัสดีชาวโลก
iss1153_2024-09-28

@ssjkamei
Copy link
Contributor Author

I forgot one more thing. I don't understand what the 15 specified here means after all.

and abs(delta_x) > spacewidth * f * 15

@pubpub-zz
Copy link
Collaborator

pubpub-zz commented Sep 28, 2024

I forgot one more thing. I don't understand what the 15 specified here means after all.

and abs(delta_x) > spacewidth * f * 15

this is a long time ago and I must admit that I cannot remember why this value. will try to gather my memories

@ssjkamei
Copy link
Contributor Author

ssjkamei commented Sep 28, 2024

Sorry, I added the code for CIDFont, but it seems to need mapping to cmap[1].

Also, I am calculating the font each time, which is very inefficient.

It doesn't seem to support vertical type characters, but since it seems to be from the original, I don't intend to include that in this PR.

@ssjkamei
Copy link
Contributor Author

ssjkamei commented Sep 28, 2024

CIDFont is supported based on the PDF in the following Issue.

Spaces (that do not exist in the original PDF) appear in the output of extract_text() #2336
I think we can close #2336 as well.

The following will remain, but the spaces are moving correctly (because the letters are shaved, the distance traveled is greater than the letters, and the spaces are showing)

Expected: R. Ottokar Doerffel, 1489, Atiradores
Actual: R. O okar Doerff el, 1489, Ar adores

This is probably analogous to the following problem.
Ligature issue when converting PDF to text #1351

Dealing with the calculation of fonts each time will be revised later.

ssjkamei and others added 2 commits October 1, 2024 23:07
Co-authored-by: Stefan <96178532+stefan6419846@users.noreply.github.com>
Co-authored-by: Stefan <96178532+stefan6419846@users.noreply.github.com>
@ssjkamei
Copy link
Contributor Author

ssjkamei commented Oct 1, 2024

I don't know how to handle PDFObject, but I would like some help. Most of the old code is not taken into account and can be fixed if you get one pattern.

Are there any rules that should take over for casting PdfObjects?
If I refer to _data_structures.py, does it have all the types listed and do I just check all the types that can be cast?
Or is there a document that can determine what types /W and /Widths can have?

I would be grateful for tips on how to get started.

@ssjkamei
Copy link
Contributor Author

ssjkamei commented Oct 2, 2024

I noticed that when creating a combination of character map and width length, there is a pattern where the value after character map conversion is not represented by a single character. Current tests pass, but this is probably a bug.

To fix this, we need to change from checking the length of the string length after conversion, which is what we are doing in this fix, to checking the length of the string length before conversion. In conjunction with this, I think including width in the contents of the cmap will speed up the process and save memory.

We do not plan to include it in this modification along with the vertical character support (for /W2 and /DW2).

@stefan6419846
Copy link
Collaborator

Or is there a document that can determine what types /W and /Widths can have?

This is documented inside the PDF reference and mapped into actual classes by pypdf. As a general rule of thumb (at least for new code): Besides the expected types, we might have additional IndirectObject references as well.

@ssjkamei
Copy link
Contributor Author

ssjkamei commented Oct 2, 2024

@stefan6419846 Thanks a lot!

@ssjkamei
Copy link
Contributor Author

ssjkamei commented Oct 2, 2024

@stefan6419846 I have completed the corrections you indicated.
Other omissions regarding single point heights have been added.

if abs(moved_height) > 0.8 * str_height * scale_prev_y:

Consider the case of a move for a new font: if abs(moved_height) > 0.8 * min(str_height * scale_prev_y, font_size * scale_y):

I have confirmed that the following five issues have been fixed in this bug fix.
#1153, #1362, #1974, #2336, #2777

@ssjkamei
Copy link
Contributor Author

ssjkamei commented Oct 2, 2024

Sorry, there was a coverage error, so I will delete the unnecessary lines.

@ssjkamei
Copy link
Contributor Author

ssjkamei commented Oct 2, 2024

I fixed it.

@ssjkamei
Copy link
Contributor Author

ssjkamei commented Oct 3, 2024

@stefan6419846 @pubpub-zz
I am very sorry that you reviewed the code, but we have come up with a form of support for the following without including it in the cmap.
The code is easier to understand and the process is more efficient. Would it be better to include this as well?

To fix this, we need to change from checking the length of the string length after conversion, which is what we are doing in this fix, to checking the length of the string length before conversion. In conjunction with this, I think including width in the contents of the cmap will speed up the process and save memory.

@stefan6419846
Copy link
Collaborator

How much of the existing code from this PR would stay the same? If it is just an extension of the code from this PR, I would propose to merge this PR first (which already is an improvement) and do the further improvements in a separate PR to make reviewing easier.

@ssjkamei
Copy link
Contributor Author

ssjkamei commented Oct 3, 2024

@stefan6419846 Thank you! I even tried to test the changes, but there were too many changes. I will issue a PR once the merge is complete.

@ssjkamei
Copy link
Contributor Author

ssjkamei commented Oct 3, 2024

It looked like #234 could be closed too.

Copy link
Collaborator

@stefan6419846 stefan6419846 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for working on this. I am going to merge this as-is for now and defer possible updates to later PR. For now, all tests pass and at first look the performance did not degrade substantially.

@stefan6419846 stefan6419846 merged commit d5233a0 into py-pdf:main Oct 3, 2024
16 checks passed
stefan6419846 pushed a commit that referenced this pull request Oct 4, 2024
This is a fix for the problem that occurred when #2882 was changed.

The string length of characters was checked after conversion by cmap, but after cmap conversion, there is a pattern where the string length is more than one character, and it cannot be measured accurately.

This is necessary, for example, when considering whether to measure the distance from the ligature or the base character corresponding to the ligature in fixing #1351.

The change in handle_tj is because it cannot pass Ruff's check.
Error: PLR0915 Too many statements (nnn > 176)

The following code is only used to get the character code for a space.
However, I think it would be better to split the code into parts for obtaining the character code.
Style changes are considered in another PR.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Issue in text extraction (spaces)
3 participants