Find feature produces some wrong highlightments #15094

calixteman · 2022-06-24T12:53:23Z

Attach (recommended) or Link to PDF file here:
cd.pdf

Configuration:

Web browser and its version: Firefox nightly
Operating system and its version: Windows 11

Steps to reproduce the problem:

Search for e and Highlight all

Something is wrong around the B(cb, cs).
From the find_controller:
...of the blend function B (c b , cs ) shall be the so...

and from devtools:

The problem is the space between c and b: it's either an extra space in the searched string or a missing one in the html.

calixteman · 2022-06-24T14:24:18Z

In both cases we call getTextContent, but in the TextLayer case we pass includeMarkedContent = true, whereas in the search case we pass false.
Hence, in the TextLayer case we hit these paths:

pdf.js/src/core/evaluator.js

Line 3302 in b5fea8f

flushTextContentItem();

pdf.js/src/core/evaluator.js

Line 3318 in b5fea8f

flushTextContentItem();

then

pdf.js/src/core/evaluator.js

Line 2921 in b5fea8f

textContentItem.initialized = false;

but not in the search case.
And finally, it makes a difference here:
https://github.com/mozilla/pdf.js/blob/master/src/core/evaluator.js#L2471 which is called for the operator OPS.setTextMatrix.
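
One way to observe the difference described above is to request the text content of the same page in both modes and compare the joined strings. This is a minimal sketch, not code from pdf.js: `compareTextContent` is a hypothetical helper, and `page` is assumed to be a pdf.js page object (or anything exposing a compatible `getTextContent` method that resolves to `{ items }`).

```javascript
// Hypothetical helper: compares the text returned by getTextContent
// with and without marked-content items.
async function compareTextContent(page) {
  // Marked-content markers carry a `type` but no `str`, so skip them.
  const joinStrings = ({ items }) =>
    items
      .filter(item => typeof item.str === "string")
      .map(item => item.str)
      .join("");

  const [withMarked, withoutMarked] = await Promise.all([
    page.getTextContent({ includeMarkedContent: true }),
    page.getTextContent({ includeMarkedContent: false }),
  ]);

  const textLayer = joinStrings(withMarked);
  const search = joinStrings(withoutMarked);
  return { textLayer, search, consistent: textLayer === search };
}
```

If `consistent` is false for a page, the text layer and the find feature are working from different strings, which is the situation reported in this issue.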

calixteman · 2022-06-24T14:35:21Z

So in the TextLayer case we have a forced flush between the c and the b.
In the search case, we compare the positions of the two glyphs, which have different font sizes, and since the b is a bit smaller, a white space is guessed.
Having a space between c and b isn't a big deal; the real problem is that we have different behaviors, which leads to something really wrong.
So fixing the extra space is another problem, but I think that in the search case we must flush the text chunk exactly at the same time as in the TextLayer case, just to have something consistent.

@Snuffleupagus, do you have any thoughts here?
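
The two behaviors described above can be illustrated with a toy model. Everything in it is an assumption made for illustration: the op shapes, the `extractChunks` name, and the `fontSize / 10` gap threshold are hypothetical and do not come from evaluator.js. The model only captures the two ideas from the comment: a flush at marked-content boundaries skips the position comparison, while without a flush a gap larger than a font-size-dependent threshold is turned into a space.

```javascript
// Toy model, not pdf.js code. Ops are either glyphs or marked-content markers.
function extractChunks(ops, includeMarkedContent) {
  const chunks = [];
  let current = null; // { str, endX } for the chunk being built

  const flush = () => {
    if (current !== null) {
      chunks.push(current.str);
      current = null;
    }
  };

  for (const op of ops) {
    if (op.type === "beginMarkedContent" || op.type === "endMarkedContent") {
      // Behavior described in this issue: flush only when markers are requested.
      if (includeMarkedContent) {
        flush();
      }
      continue;
    }
    if (current === null) {
      // Fresh chunk: there is no previous glyph to compare positions with.
      current = { str: "", endX: op.x };
    } else if (op.x - current.endX > op.fontSize / 10) {
      // Hypothetical heuristic: the gap is wide relative to this glyph's size.
      current.str += " ";
    }
    current.str += op.char;
    current.endX = op.x + op.width;
  }
  flush();
  return chunks;
}

// "c" followed by a smaller "b" wrapped in marked content.
const ops = [
  { type: "glyph", char: "c", x: 0, width: 50, fontSize: 100 },
  { type: "beginMarkedContent" },
  { type: "glyph", char: "b", x: 58, width: 35, fontSize: 70 },
  { type: "endMarkedContent" },
];

const textLayerText = extractChunks(ops, true).join(""); // "cb"
const searchText = extractChunks(ops, false).join(""); // "c b"
console.log(JSON.stringify(textLayerText), JSON.stringify(searchText));
```

In this model the gap of 8 units is below the threshold for the larger glyph (10) but above the threshold for the smaller one (7), so the same input produces "cb" in one mode and "c b" in the other.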

Snuffleupagus · 2022-06-25T10:33:29Z

> but I think that in the search case we must flush the text chunk exactly at the same time as in the TextLayer case, just to have something consistent.

That sounds reasonable, since we obviously must return consistent textContent-data regardless of the includeMarkedContent value used.

Looking at the surrounding code, I can't help wondering why we don't also "flush" in the following case?

pdf.js/src/core/evaluator.js

Lines 3292 to 3299 in cd35b9b

case OPS.beginMarkedContent:
  if (includeMarkedContent) {
    textContent.items.push({
      type: "beginMarkedContent",
      tag: args[0] instanceof Name ? args[0].name : null,
    });
  }
  break;
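
One reading of the question above is that the flush should happen at every marked-content boundary, whatever the value of includeMarkedContent, with only the pushing of marker items staying conditional. The following is a simplified sketch under that assumption, not the actual change: `collectItems`, the string op names, and the `{ op, arg }` operator shape are hypothetical stand-ins for the evaluator loop; only `flushTextContentItem` and the marker item shape mirror what is quoted in this thread.

```javascript
// Simplified, hypothetical stand-in for the operator loop in evaluator.js.
function collectItems(operators, includeMarkedContent) {
  const items = [];
  let pending = ""; // text accumulated for the current item

  const flushTextContentItem = () => {
    if (pending !== "") {
      items.push({ str: pending });
      pending = "";
    }
  };

  for (const { op, arg } of operators) {
    switch (op) {
      case "showText":
        pending += arg;
        break;
      case "beginMarkedContent":
        // Flush unconditionally so both modes split text at the same point.
        flushTextContentItem();
        if (includeMarkedContent) {
          items.push({
            type: "beginMarkedContent",
            tag: arg === undefined ? null : arg,
          });
        }
        break;
      case "endMarkedContent":
        flushTextContentItem();
        if (includeMarkedContent) {
          items.push({ type: "endMarkedContent" });
        }
        break;
    }
  }
  flushTextContentItem();
  return items;
}
```

With this shape, filtering out the marker items from the includeMarkedContent = true result yields the same sequence of text items as the includeMarkedContent = false result, which is the consistency asked for earlier in the thread.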

Snuffleupagus added the text-selection label Jun 24, 2022

calixteman self-assigned this Jun 25, 2022

calixteman added a commit to calixteman/pdf.js that referenced this issue Jun 25, 2022

Always flush the current item with MarkedContent stuff when getting t…

6c67c65

…ext (mozilla#15094)

calixteman linked a pull request Jun 25, 2022 that will close this issue

Always flush the current item with MarkedContent stuff when getting text (#15094) #15105

Merged

calixteman added a commit to calixteman/pdf.js that referenced this issue Jun 25, 2022

Always flush the current item with MarkedContent stuff when getting t…

f161929

…ext (mozilla#15094)

calixteman added a commit to calixteman/pdf.js that referenced this issue Jun 25, 2022

Always flush the current item with MarkedContent stuff when getting t…

3789dab

…ext (mozilla#15094)

Snuffleupagus closed this as completed in #15105 Jun 25, 2022

Snuffleupagus added a commit that referenced this issue Jun 25, 2022

Merge pull request #15105 from calixteman/15094

4e025e1

Always flush the current item with MarkedContent stuff when getting text (#15094)

rousek pushed a commit to signosoft/pdf.js that referenced this issue Aug 10, 2022

Always flush the current item with MarkedContent stuff when getting t…

4e4d019

…ext (mozilla#15094)

calixteman mentioned this issue May 18, 2023

getTextContent doesn't always return the right fontRef #14755

Closed

ZeroXClem mentioned this issue Aug 12, 2024

[Snyk] Upgrade pdfjs-dist from 2.9.359 to 2.16.105 ZeroXClem/metamesa#3

Closed

earthywh mentioned this issue Sep 24, 2024

[Snyk] Upgrade pdfjs-dist from 2.6.347 to 2.16.105 earthywh/filestash#1

Open
