Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

getTextContent doesn't always return the right fontRef #14755

Closed
karlak opened this issue Apr 6, 2022 · 6 comments · Fixed by #16439
Closed

getTextContent doesn't always return the right fontRef #14755

karlak opened this issue Apr 6, 2022 · 6 comments · Fixed by #16439

Comments

@karlak
Copy link

karlak commented Apr 6, 2022

Attach (recommended) or Link to PDF file here: test8.pdf

Configuration:

  • Web browser and its version: Chrome Version 98.0.4758.102 (64 bits)
  • Operating system and its version: Windows 10 (64bits)
  • PDF.js version: v2.13.216
  • Is a browser extension: No

Steps to reproduce the problem:

  1. getTextContent on the only page of the document
  2. check the fontRef of the item that contains "ou une liturgie"
let textContent = await page.getTextContent();
textContent.items.forEach((item) => {
  if (item.str.includes('ou une liturgie')) console.log(item);
});

What is the expected behavior?
On this particular example when using the chrome debugger I can see the fontRef used to render the glyphs of (part of) this item is d_d0_f3

What went wrong?
The item that get logged out has a fontRef g_d0_f4.

I think I isolated the problem to this particular pattern of OPS: (ignoring irrelevant OPS here)

- setFont d_d0_f3
- [...do stuff, showText...]
- save
- setFont d_d0_f4
- showText ") "
- restore
- showText "ou une liturgie de clairvoyance (par exemple"

This should return two different items with different fonts in my opinion (or is this PDF file broken in some way ? It renders fine though). What I get is the item with the str value of ) ou une liturgie de clairvoyance (par exemple and fontRef g_d0_f4.


I found a fix, but I'm not sure it's very elegant the way I did it.
In the function buildTextContentItem in the file src/core/evaluator.js:2633 I added at the top :

if (textContentItem.initialized && textContentItem.fontName != textState.font.loadedName) {
  flushTextContentItem();
}

With this, I get to separate items, each with it's own correct fontRef.

I think handling the restore case in the switch (same file, line 2861) would be a lot cleaner but I don't know enough of the specifics of this library to be sure. Is always trying to flushTextContentItem() when reading a restore OPS valid ?

Anyway, thanks a lot for your time !

@Snuffleupagus
Copy link
Collaborator

Is this a regression from some of the later textLayer changes, or has it always been broken in this way?

I think handling the restore case in the switch (same file, line 2861) would be a lot cleaner but I don't know enough of the specifics of this library to be sure. Is always trying to flushTextContentItem() when reading a restore OPS valid?

Always flushing the textContent on restore could lead to overall worse text-selection behaviour (even if it's technically "correct"), since there's nothing that says that the active font must change when the state is restored, hence that is really not a desirable solution here (as far as I'm concerned).

@karlak
Copy link
Author

karlak commented Apr 6, 2022

I'm pretty new to this lib (or pdf specs for that matter) but I spent a few days wrapping my head around it. Especially the text retrieval part. So I can't really tell if that is a newly introduced bug.

Oh, if you want to not split textContentItems if possible, it probably would be better to put my fix in the first condition of the function ensureTextContentItem.
So that if only spaces are in an other font just after the restore it would not create a separate item. Those spaces could be added from buildTextContentItem() before it calls ensureTextContentItem()

Wouldn't having a fontAgnostic argument to the page.getTextContent() be the way to go for search and text selection ?

@Snuffleupagus
Copy link
Collaborator

Snuffleupagus commented Apr 6, 2022

So I can't really tell if that is a newly introduced bug.

Would you be able to help with testing a few older releases, see https://github.com/mozilla/pdf.js/releases, to help answer this?
Knowing if this is a (recent) regression, or a bug that's existed for years, would definitely help here!

Oh, if you want to not split textContentItems if possible, it probably would be better to put my fix in the first condition of the function ensureTextContentItem.

Considering where that function is being called, I'm really not sure that it'd be correct to flush the textContent there!?

@calixteman Any ideas for a good solution to this issue, one that doesn't cause us to speculatively break apart text-runs when not actually necessary?


Wouldn't having a fontAgnostic argument to the page.getTextContent() be the way to go for search and text selection?

That sound pretty orthogonal to this issue, and anyway it'd not actually be correct since the textLayer (that's used in the viewer) obviously need to change depending on the current font (and its size).

@karlak
Copy link
Author

karlak commented Apr 6, 2022

Ok here are my results:
Possible wanted result ✅

[...,
{fontName: 'g_d0_f4', str: 'odem'}
{fontName: 'g_d0_f4', str: ' '},
{fontName: 'g_d0_f4', str: ')'},
{fontName: 'g_d0_f4', str: ' '},
{fontName: 'g_d0_f3', str: 'ou une liturgie de clairvoyance (par exemple'},
...]

pdfjs-2.13.216 ❌ (reason of this post)

[...,
{fontName: 'g_d0_f4', str: 'odem'},
{fontName: 'g_d0_f4', str: ' '},
{fontName: 'g_d0_f4', str: ') ou une liturgie de clairvoyance (par exemple'},
...]

pdfjs-2.12.313 ❌ (not exactly the same results, the item's string is even longer)

[...,
{fontName: 'g_d0_f4', str: 'odem) ou une liturgie de clairvoyance (par exemple'},
...]

pdfjs-2.11.338 ✅ (the text does not split in the same way as the last release but there's no obvious problem in the fontRefs)

[...,
{fontName: 'g_d0_f4', str: ''},
{fontName: 'g_d0_f4', str: 'odem) '},
{fontName: 'g_d0_f4', str: ' '},
{fontName: 'g_d0_f3', str: 'ou  une  liturgie  de  clairvoyance  (par  exemple  '},
{fontName: 'g_d0_f3', str: ''},
{fontName: 'g_d0_f3', str: ' '},
...]

@marco-c
Copy link
Contributor

marco-c commented Apr 19, 2022

Do we know which PR caused this?

@calixteman
Copy link
Contributor

For this specific pdf, the issue has been fixed thanks to:
#15094
But in general we should flush when the font changed after a restore: it's finally a kind of setFont.

@calixteman calixteman self-assigned this May 18, 2023
calixteman added a commit to calixteman/pdf.js that referenced this issue May 18, 2023
calixteman added a commit to calixteman/pdf.js that referenced this issue May 18, 2023
calixteman added a commit to calixteman/pdf.js that referenced this issue May 18, 2023
calixteman added a commit to calixteman/pdf.js that referenced this issue May 18, 2023
calixteman added a commit that referenced this issue May 18, 2023
Flush the current chunk when the font changed because of a restore op (issue #14755)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
4 participants