
[api-minor] Don't normalize the text used in the text layer. #16200

Merged 1 commit into mozilla:master on Apr 17, 2023

Conversation

@calixteman (Contributor)

Some arabic chars, like \ufe94, can be searched for in a pdf, hence the text must be normalized when creating the search query. To avoid duplicating the normalization code, everything has been moved into the find controller.
The previous code normalized text with NFKC but used a hardcoded map; it has been replaced by a call to normalize("NFKC"), which reduces the bundle size by 30kb.
While playing with the \ufe94 char, I noticed that the bidi algorithm wasn't taking some RTL unicode ranges into account, the generated font wasn't embedding the mapping for this char, and the unicode ranges in the OS/2 table weren't up-to-date.

When normalized, some chars can be replaced by several ones, which leads to extra chars in the text layer. To avoid any regression, when copying text from the text layer the copied string is normalized (NFKC) before being put in the clipboard (this is how both Acrobat and Chrome behave).
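As a quick illustration of why this matters (plain Node/browser JavaScript; the specific characters are just examples):

```javascript
// NFKC maps Unicode compatibility characters to their canonical equivalents.
// U+FE94 (ARABIC LETTER TEH MARBUTA FINAL FORM) folds to U+0629.
console.log("\uFE94".normalize("NFKC") === "\u0629"); // true

// Ligatures and fractions expand into multiple code points, which is why
// normalizing in the text layer can produce extra chars there.
console.log("\uFB01".normalize("NFKC")); // "fi" (two chars: f + i)
console.log("\u00BD".normalize("NFKC")); // "1⁄2" (1 + U+2044 FRACTION SLASH + 2)
```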

@Snuffleupagus (Collaborator)

Given that we should be pretty close to the next PDF.js release, this is probably not something that we want to land just prior.
In the meantime, I've got a couple of questions and comments based on a quick look:

  • The bundle-size decrease is really nice :-)

  • This is probably too much of an API-break as-is, and I do believe that we'll need to maintain the old behaviour (which has existed since "forever") by normalizing by default.
    So can we please add a new getTextContent/streamTextContent parameter, e.g. named disableNormalization = false and thus off by default, that causes the worker-thread to call normalize("NFKC") on the text-content it returns?
    The relevant viewer call-sites, and the browsertest Driver, would then simply call these API-methods with disableNormalization = true set to get the new behaviour.

  • How does text-selection work in documents that mix LTR and RTL text, especially on the same line?
    Previously we'd reverse e.g. arabic text inside of text-runs that mostly contain LTR text, although I suppose that probably never worked perfectly.

  • How does this work with a11y-software?
    Will it be able to "make sense" of any ligatures found in the textLayer DOM, since they were previously always available in their "expanded" form in the DOM?

  • Does this work in Safari?
    Please note: I don't mean that attempting to support Safari should in any way block or complicate things here, however if this doesn't work in Safari we might simply need to update (or possibly even remove) its support status from the FAQ.
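If that parameter were added, the worker-side logic could be sketched roughly like this (`finalizeTextItem` is a hypothetical helper and `disableNormalization` is the parameter name proposed above; neither is actual pdf.js code):

```javascript
// Sketch of how the worker could honour the proposed flag.
function finalizeTextItem(str, disableNormalization = false) {
  // Default (false): keep the old behaviour by normalizing with NFKC.
  // Opt-in (true): return the raw text for the new code-paths.
  return disableNormalization ? str : str.normalize("NFKC");
}

finalizeTextItem("\uFB01");       // old behaviour: the ligature expands to "fi"
finalizeTextItem("\uFB01", true); // new behaviour: "\uFB01" passes through untouched
```

The viewer call-sites and the browsertest `Driver` would then pass `disableNormalization: true` to `getTextContent`/`streamTextContent` to opt into the raw text.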

@calixteman (Contributor, Author)

Given that we should be pretty close to the next PDF.js release, this is probably not something that we want to land just prior. In the meantime, I've got a couple of questions and comments based on a quick look:

* The bundle-size decrease is _really_ nice :-)

* This is probably too much of an API-break as-is, and I do believe that we'll need to maintain the old behaviour (which has existed since "forever") by normalizing by _default_.
  So can we please add a new `getTextContent`/`streamTextContent` parameter, e.g. named `disableNormalization = false` and thus off by default, that causes the worker-thread to call `normalize("NFKC")` on the text-content it returns?
  The relevant viewer call-sites, and the browsertest `Driver`, would then simply call these API-methods with `disableNormalization = true` set to get the new behaviour.

Yep, I thought about that too, in order to avoid too much of a breaking change.
About the release: of course this patch can wait.

* How does text-selection work in documents that mix LTR and RTL text, especially on the same line?
  Previously we'd reverse e.g. arabic text inside of text-runs that _mostly_ contain LTR text, although I suppose that probably never worked perfectly.

We'll let the browser deal with that, so I'd expect the result to be at least identical, and maybe better.

* How does this work with a11y-software?
  Will it be able to "make sense" of any ligatures found in the textLayer DOM, since they were previously always available in their "expanded" form in the DOM?

Good question. I tried with a fi ligature and NVDA, and it doesn't work correctly:
nvaccess/nvda#14740

* Does this work in Safari?

I tested in Safari with ArabicCIDTrueType.pdf and with tracemonkey.pdf: I don't see anything wrong.
What do you think could go wrong?

  _Please note:_ I don't mean that attempting to support Safari should in any way block or complicate things here, however if this doesn't work in Safari we might simply need to update (or possibly even remove) its support status from the FAQ.

I'm not 100% sure that nothing will be broken, but we'll fix any issues we meet.

For context, I began writing a patch to use the font we generate for the canvas in the text layer as well. In that case I need the char for the fi ligature instead of an f followed by an i (and that's how I found the search issue with the arabic ligature).
The results are pretty good (the text in the text layer is in transparent red):
[image: rendered page with the text layer shown in transparent red]

@Snuffleupagus (Collaborator)

I tested in Safari with ArabicCIDTrueType.pdf and with tracemonkey.pdf: I don't see anything wrong.
What do you think could go wrong?

Thanks for checking!

I didn't have anything particular in mind; it's just that I've noticed over time that users seem to run into much more trouble with Safari than with any other browser. Additionally, Safari sometimes lags behind when it comes to implementing new web-platform features; one example that comes to mind is OffscreenCanvas.

@Snuffleupagus (Collaborator) commented Apr 11, 2023

Good question, I tried with a fi ligature and NVDA and it doesn't work correctly:
nvaccess/nvda#14740

The discussion in that issue seems to suggest that the actual bug lies elsewhere; what's the overall status of a11y support for ligatures?

The previous code to normalize text was using NFKC but with a hardcoded map, hence it has been replaced by the use of normalize("NFKC") (it helps to reduce the bundle size by 30kb).

This actually seems problematic as-is, since we'll now end up normalizing a lot more than before. Take e.g. the fraction-highlight.pdf document as an example, where copy-and-paste currently gives the following result:

some text before a fraction
½
some text after a fraction
½¼¾
more fractions!

With this patch you'll instead get:

some text before a fraction
1⁄2
some text after a fraction
1⁄21⁄43⁄4
more fractions!

This seems problematic for a couple of reasons:

  • We'd then have different copy-and-paste behaviour compared to both Adobe Reader and PDFium.
  • Normalizing those kinds of combined characters feels, in my opinion, neither particularly helpful nor like what the user likely wants here.
  • It's not clear how (or if) this works with searching: what happens if you copy one of those "expanded" characters and then use it as input when searching?

To me this feels like opening the door to problems both now and down the line, by us essentially relinquishing any control over "when" normalization happens. To that end I wonder if we could perhaps combine the current hard-coded map with using standard normalization (which should still be a reduction in code-size)?

```js
const NORMALIZE_UNICODES = new Set([
  "\u00A8",
  "\u00AF",
  // The rest of the "keys" from `getNormalizedUnicodes` goes here...
]);

function normalizeTextContent(str) {
  const buf = [];
  for (const char of str) {
    // Only normalize the characters we've explicitly opted in to.
    buf.push(NORMALIZE_UNICODES.has(char) ? char.normalize("NFKC") : char);
  }
  return buf.join("");
}
```

You'd then call that helper-function on copy-and-paste, and we'd still have control over normalization without having to hard-code everything like previously.

@calixteman (Contributor, Author)

The chars we have in getNormalizedUnicodes aren't normalized when they're copied in Acrobat.
For example, \u0132 is normalized when a string is searched, but it isn't when it's copied.
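A quick check confirms the point for U+0132 (LATIN CAPITAL LIGATURE IJ): NFKC does expand it, so unconditionally NFKC-normalizing on copy would diverge from Acrobat's copy behaviour:

```javascript
// U+0132 (LATIN CAPITAL LIGATURE IJ) is a compatibility character:
// NFKC expands it to the two letters "IJ", even though Acrobat
// leaves the ligature intact when copying.
console.log("\u0132".normalize("NFKC")); // "IJ"

// Canonical normalization (NFC) has no effect here, since U+0132
// has only a compatibility decomposition, not a canonical one.
console.log("\u0132".normalize("NFC") === "\u0132"); // true
```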

(Resolved review threads on: web/text_layer_builder.js, src/core/evaluator.js, src/display/api.js, test/unit/unicode_spec.js, test/unit/api_spec.js)
@calixteman force-pushed the dont_normalize branch 2 times, most recently from 05c2c75 to f0512b2 on April 12, 2023 19:39
@Snuffleupagus (Collaborator) left a comment

Would it be feasible to add an integration-test, using a PDF document with ligatures, that checks that the copied text is actually being normalized as expected?

(Resolved review threads on: test/unit/api_spec.js, src/shared/util.js)
@calixteman force-pushed the dont_normalize branch 2 times, most recently from 21ccfc8 to 2c9cce0 on April 17, 2023 10:39
@Snuffleupagus (Collaborator) left a comment

With the latest round of comments addressed, let's start running tests to see what the "fallout" looks like :-)

(Resolved review threads on: test/integration/copy_paste_spec.js, web/pdf_viewer.js)
@calixteman (Contributor, Author)

/botio test

@pdfjsbot

From: Bot.io (Linux m4)


Received

Command cmd_test from @calixteman received. Current queue size: 0

Live output at: http://54.241.84.105:8877/cb7302008d5c91c/output.txt

@pdfjsbot

From: Bot.io (Windows)


Received

Command cmd_test from @calixteman received. Current queue size: 0

Live output at: http://54.193.163.58:8877/0bc1fe0eb14436b/output.txt

@pdfjsbot

From: Bot.io (Windows)


Failed

Full output at http://54.193.163.58:8877/0bc1fe0eb14436b/output.txt

Total script time: 26.08 mins

  • Font tests: Passed
  • Unit tests: Passed
  • Integration Tests: FAILED
  • Regression tests: FAILED
  different ref/snapshot: 53

Image differences available at: http://54.193.163.58:8877/0bc1fe0eb14436b/reftest-analyzer.html#web=eq.log

@pdfjsbot

From: Bot.io (Linux m4)


Failed

Full output at http://54.241.84.105:8877/cb7302008d5c91c/output.txt

Total script time: 27.31 mins

  • Font tests: Passed
  • Unit tests: Passed
  • Integration Tests: FAILED
  • Regression tests: FAILED
  different ref/snapshot: 200
  different first/second rendering: 1

Image differences available at: http://54.241.84.105:8877/cb7302008d5c91c/reftest-analyzer.html#web=eq.log

@Snuffleupagus (Collaborator)

From a brief look, the textLayer movement seems fine and expected as far as I can tell.

The movement in the regular eq-tests on Linux seems strange; however, it actually looks like slight improvements to me.

/botio integrationtest

@pdfjsbot

From: Bot.io (Windows)


Received

Command cmd_integrationtest from @Snuffleupagus received. Current queue size: 0

Live output at: http://54.193.163.58:8877/edc23c4ead7301c/output.txt

@pdfjsbot

From: Bot.io (Linux m4)


Received

Command cmd_integrationtest from @Snuffleupagus received. Current queue size: 0

Live output at: http://54.241.84.105:8877/52b3ef203ed726e/output.txt

@pdfjsbot

From: Bot.io (Linux m4)


Success

Full output at http://54.241.84.105:8877/52b3ef203ed726e/output.txt

Total script time: 4.08 mins

  • Integration Tests: Passed

@pdfjsbot

From: Bot.io (Windows)


Failed

Full output at http://54.193.163.58:8877/edc23c4ead7301c/output.txt

Total script time: 13.94 mins

  • Integration Tests: FAILED

@calixteman (Contributor, Author) commented Apr 17, 2023

I don't see anything really wrong

From a brief look, the textLayer movement seems fine and expected as far as I can tell.

The movement in the regular eq-tests on Linux seems strange; however, it actually looks like slight improvements to me.

I'd say the root cause is the one mentioned by Jonathan:
#15157 (comment)
This patch modifies how we check whether a char is in the private use area:
https://github.com/mozilla/pdf.js/pull/16200/files#diff-d6af99a911b977730586b335a3c7ee702a383cac240414a011ab5534cda1ff3aR490-R492
whereas before:
https://github.com/mozilla/pdf.js/pull/16200/files#diff-d6af99a911b977730586b335a3c7ee702a383cac240414a011ab5534cda1ff3aL544
the check was wrong.
Consequently, for ﬁ (aka \uFB01) this patch changes the value we have in the extra data we add to the font, and then Jonathan's explanation applies.
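For context, Unicode defines three fixed Private Use Area ranges, so a correct membership check can be sketched like this (illustrative code, not the exact pdf.js implementation):

```javascript
// Unicode defines three Private Use Areas:
//   BMP PUA:      U+E000  - U+F8FF
//   Plane 15 PUA: U+F0000 - U+FFFFD
//   Plane 16 PUA: U+100000 - U+10FFFD
function isInPrivateUseArea(codePoint) {
  return (
    (codePoint >= 0xe000 && codePoint <= 0xf8ff) ||
    (codePoint >= 0xf0000 && codePoint <= 0xffffd) ||
    (codePoint >= 0x100000 && codePoint <= 0x10fffd)
  );
}

isInPrivateUseArea(0xe001); // true
isInPrivateUseArea(0xfb01); // false: the fi ligature is NOT in the PUA
```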

@Snuffleupagus (Collaborator) commented Apr 17, 2023

I don't see anything really wrong

Agreed, I see no real problem with that Linux-only movement; I was just surprised by the changes since I (incorrectly) assumed that only text tests would change :-)

I'll give the entire patch one more look, but this seems pretty good to land as-is (and we've got almost two weeks before the next release).

@Snuffleupagus (Collaborator) left a comment

r=me, thank you!

Once landed, can you please update PDF.js in mozilla-central so that we can get a little bit of real-world testing done before the next PDF.js library release?

@calixteman calixteman merged commit dbe0c4e into mozilla:master Apr 17, 2023
@calixteman (Contributor, Author)

/botio makeref

@pdfjsbot

From: Bot.io (Windows)


Received

Command cmd_makeref from @calixteman received. Current queue size: 1

Live output at: http://54.193.163.58:8877/02b2edb4ed3a9d9/output.txt

@pdfjsbot

From: Bot.io (Linux m4)


Received

Command cmd_makeref from @calixteman received. Current queue size: 0

Live output at: http://54.241.84.105:8877/0b8220dfa3da984/output.txt

@pdfjsbot

From: Bot.io (Linux m4)


Success

Full output at http://54.241.84.105:8877/0b8220dfa3da984/output.txt

Total script time: 22.67 mins

  • Lint: Passed
  • Make references: Passed
  • Check references: Passed

@pdfjsbot

From: Bot.io (Windows)


Success

Full output at http://54.193.163.58:8877/02b2edb4ed3a9d9/output.txt

Total script time: 22.80 mins

  • Lint: Passed
  • Make references: Passed
  • Check references: Passed
