Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Find feature produces some wrong highlightments #15094

Closed
calixteman opened this issue Jun 24, 2022 · 3 comments · Fixed by #15105
Closed

Find feature produces some wrong highlightments #15094

calixteman opened this issue Jun 24, 2022 · 3 comments · Fixed by #15105
Assignees

Comments

@calixteman
Copy link
Contributor

Attach (recommended) or Link to PDF file here:
cd.pdf

Configuration:

  • Web browser and its version: Firefox nightly
  • Operating system and its version: Windows 11

Steps to reproduce the problem:

  1. Search for e and Highlight all

image

Something is wrong around the B(cb, cs).
From the find_controler:
...of the blend function B (c b , cs ) shall be the so...

and from devtools:
image

The problem is the space between c and b: it's either an extra space in the searched string or a missing one in the html.

@calixteman
Copy link
Contributor Author

In both cases, we call getTextContent but in the TextLayer case we pass includeMarkedContent = true when in the search case we pass a false.
Hence, in the TextLayer case we hit these path:

flushTextContentItem();

flushTextContentItem();

then
textContentItem.initialized = false;

but not in the search case.
And finally it makes a difference here:
https://github.com/mozilla/pdf.js/blob/master/src/core/evaluator.js#L2471 which is called with the operator OPS.setTextMatrix.

@calixteman
Copy link
Contributor Author

calixteman commented Jun 24, 2022

So in the TextLayer case we have a forced flush between the c and the b.
In the other one, we compare the positions of the two glyphs but with different font size and since the b is a bit smaller then a white space is guessed.
Having a space between c and b isn't a big deal, the real problem is that we've different behaviors which lead to something really wrong.
So fixing the extra space is another problem, but I think that in the search case we must flush the text chunk exactly at the same time as in the TextLayer case, just to have something consistent.

@Snuffleupagus, do you have any thoughts here ?

@Snuffleupagus
Copy link
Collaborator

but I think that in the search case we must flush the text chunk exactly at the same time as in the TextLayer case, just to have something consistent.

That sounds reasonable, since we obviously must return consistent textContent-data regardless of the includeMarkedContent value used.

Looking at the surrounding code, I can't help wondering why we don't also "flush" in the following case?

pdf.js/src/core/evaluator.js

Lines 3292 to 3299 in cd35b9b

case OPS.beginMarkedContent:
if (includeMarkedContent) {
textContent.items.push({
type: "beginMarkedContent",
tag: args[0] instanceof Name ? args[0].name : null,
});
}
break;

@calixteman calixteman self-assigned this Jun 25, 2022
calixteman added a commit to calixteman/pdf.js that referenced this issue Jun 25, 2022
calixteman added a commit to calixteman/pdf.js that referenced this issue Jun 25, 2022
calixteman added a commit to calixteman/pdf.js that referenced this issue Jun 25, 2022
Snuffleupagus added a commit that referenced this issue Jun 25, 2022
Always flush the current item with MarkedContent stuff when getting text (#15094)
rousek pushed a commit to signosoft/pdf.js that referenced this issue Aug 10, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants