
[api-minor] Don't normalize the text used in the text layer. #16200

Merged 1 commit into mozilla:master on Apr 17, 2023

Conversation

@calixteman (Contributor)

Some arabic chars, like \ufe94, can be searched for in a pdf, hence the text must be normalized when creating the search query. To avoid duplicating the normalization code, everything has been moved into the find controller.
The previous code normalized text with NFKC but used a hardcoded map; it has been replaced by a call to normalize("NFKC"), which reduces the bundle size by 30kb.
While playing with the \ufe94 char, I noticed that the bidi algorithm wasn't taking some RTL unicode ranges into account, the generated font wasn't embedding the mapping for this char, and the unicode ranges in the OS/2 table weren't up-to-date.

When normalized, some chars can be replaced by several ones, which leads to extra chars in the text layer. To avoid any regression, when copying text from the text layer the copied string is normalized (NFKC) before being put in the clipboard (this is how both Acrobat and Chrome behave).
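As a quick illustration of why this matters (plain Node/browser JavaScript; the specific characters are just examples):

```javascript
// NFKC maps Unicode compatibility characters to their canonical equivalents.
// U+FE94 (ARABIC LETTER TEH MARBUTA FINAL FORM) folds to U+0629.
console.log("\uFE94".normalize("NFKC") === "\u0629"); // true

// Ligatures and fractions expand into multiple code points, which is why
// normalizing in the text layer can produce extra chars there.
console.log("\uFB01".normalize("NFKC")); // "fi" (two chars: f + i)
console.log("\u00BD".normalize("NFKC")); // "1⁄2" (1 + U+2044 FRACTION SLASH + 2)
```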

@Snuffleupagus (Collaborator)

Given that we should be pretty close to the next PDF.js release, this is probably not something that we want to land just prior.
In the meantime, I've got a couple of questions and comments based on a quick look:

  • The bundle-size decrease is really nice :-)

  • This is probably too much of an API-break as-is, and I do believe that we'll need to maintain the old behaviour (which has existed since "forever") by normalizing by default.
    So can we please add a new getTextContent/streamTextContent parameter, e.g. named disableNormalization = false and thus off by default, that causes the worker-thread to call normalize("NFKC") on the text-content it returns?
    The relevant viewer call-sites, and the browsertest Driver, would then simply call these API-methods with disableNormalization = true set to get the new behaviour.

  • How does text-selection work in documents that mix LTR and RTL text, especially on the same line?
    Previously we'd reverse e.g. arabic text inside of text-runs that mostly contain LTR text, although I suppose that probably never worked perfectly.

  • How does this work with a11y-software?
    Will it be able to "make sense" of any ligatures found in the textLayer DOM, since they were previously always available in their "expanded" form in the DOM?

  • Does this work in Safari?
    Please note: I don't mean that attempting to support Safari should in any way block or complicate things here, however if this doesn't work in Safari we might simply need to update (or possibly even remove) its support status from the FAQ.
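If that parameter were added, the worker-side logic could be sketched roughly like this (`finalizeTextItem` is a hypothetical helper and `disableNormalization` is the parameter name proposed above; neither is actual pdf.js code):

```javascript
// Sketch of how the worker could honour the proposed flag.
function finalizeTextItem(str, disableNormalization = false) {
  // Default (false): keep the old behaviour by normalizing with NFKC.
  // Opt-in (true): return the raw text for the new code-paths.
  return disableNormalization ? str : str.normalize("NFKC");
}

finalizeTextItem("\uFB01");       // old behaviour: the ligature expands to "fi"
finalizeTextItem("\uFB01", true); // new behaviour: "\uFB01" passes through untouched
```

The viewer call-sites and the browsertest `Driver` would then pass `disableNormalization: true` to `getTextContent`/`streamTextContent` to opt into the raw text.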

@calixteman (Contributor, Author)

Given that we should be pretty close to the next PDF.js release, this is probably not something that we want to land just prior. In the meantime, I've got a couple of questions and comments based on a quick look:

* The bundle-size decrease is _really_ nice :-)

* This is probably too much of an API-break as-is, and I do believe that we'll need to maintain the old behaviour (which has existed since "forever") by normalizing by _default_.
  So can we please add a new `getTextContent`/`streamTextContent` parameter, e.g. named `disableNormalization = false` and thus off by default, that causes the worker-thread to call `normalize("NFKC")` on the text-content it returns?
  The relevant viewer call-sites, and the browsertest `Driver`, would then simply call these API-methods with `disableNormalization = true` set to get the new behaviour.

Yep, I thought about that too, in order to avoid too much of a breaking change.
About the release: of course this patch can wait.

* How does text-selection work in documents that mix LTR and RTL text, especially on the same line?
  Previously we'd reverse e.g. arabic text inside of text-runs that _mostly_ contain LTR text, although I suppose that probably never worked perfectly.

We'll let the browser deal with that, so I'd expect the result to be at least identical, and maybe better.

* How does this work with a11y-software?
  Will it be able to "make sense" of any ligatures found in the textLayer DOM, since they were previously always available in their "expanded" form in the DOM?

Good question. I tried with a fi ligature and NVDA, and it doesn't work correctly:
nvaccess/nvda#14740

* Does this work in Safari?

I tested in Safari with ArabicCIDTrueType.pdf and with tracemonkey.pdf: I don't see anything wrong.
What do you think could go wrong?

  _Please note:_ I don't mean that attempting to support Safari should in any way block or complicate things here, however if this doesn't work in Safari we might simply need to update (or possibly even remove) its support status from the FAQ.

I'm not 100% sure that nothing will be broken, but we'll fix any issues we meet.

For context, I began writing a patch to use the font we generate for the canvas in the text layer as well. In that case I need the char for the fi ligature instead of an f followed by an i (and that's how I found the search issue with the arabic ligature).
The results are pretty good (the text in the text layer is in transparent red):
[image: rendered page with the text layer shown in transparent red]

@Snuffleupagus (Collaborator)

I tested in Safari with ArabicCIDTrueType.pdf and with tracemonkey.pdf: I don't see anything wrong.
What do you think could go wrong?

Thanks for checking!

I didn't have anything particular in mind; it's just that I've noticed over time that users seem to run into much more trouble with Safari than with any other browser. Additionally, Safari sometimes lags behind when it comes to implementing new web-platform features; one example that comes to mind is OffscreenCanvas.

@Snuffleupagus (Collaborator) commented Apr 11, 2023

Good question, I tried with a fi ligature and NVDA and it doesn't work correctly:
nvaccess/nvda#14740

The discussion in that issue seems to suggest that the actual bug lies elsewhere; what's the overall status of a11y support for ligatures?

The previous code to normalize text was using NFKC but with a hardcoded map, hence it has been replaced by the use of normalize("NFKC") (it helps to reduce the bundle size by 30kb).

This actually seems problematic as-is, since we'll now end up normalizing a lot more than before. Take e.g. the fraction-highlight.pdf document as an example, where copy-and-paste currently gives the following result:

some text before a fraction
½
some text after a fraction
½¼¾
more fractions!

With this patch you'll instead get:

some text before a fraction
1⁄2
some text after a fraction
1⁄21⁄43⁄4
more fractions!

This seems problematic for a couple of reasons:

  • We'd then have different copy-and-paste behaviour compared to both Adobe Reader and PDFium.
  • Normalizing those kinds of combined characters feels, in my opinion, neither particularly helpful nor like what the user likely wants here.
  • It's not clear how (or if) this works with searching: what happens if you copy one of those "expanded" characters and then use it as input when searching?

To me this feels like opening the door to problems both now and down the line, by us essentially relinquishing any control over "when" normalization happens. To that end I wonder if we could perhaps combine the current hard-coded map with using standard normalization (which should still be a reduction in code-size)?

```js
const NORMALIZE_UNICODES = new Set([
  "\u00A8",
  "\u00AF",
  // The rest of the "keys" from `getNormalizedUnicodes` goes here...
]);

function normalizeTextContent(str) {
  const buf = [];
  for (const char of str) {
    // Only normalize the characters we've explicitly opted in to.
    buf.push(NORMALIZE_UNICODES.has(char) ? char.normalize("NFKC") : char);
  }
  return buf.join("");
}
```

You'd then call that helper-function on copy-and-paste, and we'd still have control over normalization without having to hard-code everything like previously.

@calixteman (Contributor, Author)

The chars we have in getNormalizedUnicodes aren't normalized when they're copied in Acrobat.
For example, \u0132 is normalized when a string is searched, but it isn't when it's copied.
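A quick check confirms the point for U+0132 (LATIN CAPITAL LIGATURE IJ): NFKC does expand it, so unconditionally NFKC-normalizing on copy would diverge from Acrobat's copy behaviour:

```javascript
// U+0132 (LATIN CAPITAL LIGATURE IJ) is a compatibility character:
// NFKC expands it to the two letters "IJ", even though Acrobat
// leaves the ligature intact when copying.
console.log("\u0132".normalize("NFKC")); // "IJ"

// Canonical normalization (NFC) has no effect here, since U+0132
// has only a compatibility decomposition, not a canonical one.
console.log("\u0132".normalize("NFC") === "\u0132"); // true
```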

(Resolved review threads on: web/text_layer_builder.js, src/core/evaluator.js, src/display/api.js, test/unit/unicode_spec.js, test/unit/api_spec.js)
@calixteman force-pushed the dont_normalize branch 2 times, most recently from 05c2c75 to f0512b2 on April 12, 2023 19:39
@Snuffleupagus (Collaborator) left a comment

Would it be feasible to add an integration-test, using a PDF document with ligatures, that checks that the copied text is actually being normalized as expected?

(Resolved review threads on: test/unit/api_spec.js, src/shared/util.js)
@calixteman force-pushed the dont_normalize branch 2 times, most recently from 21ccfc8 to 2c9cce0 on April 17, 2023 10:39
@Snuffleupagus (Collaborator) left a comment

With the latest round of comments addressed, let's start running tests to see what the "fallout" looks like :-)

(Resolved review threads on: test/integration/copy_paste_spec.js, web/pdf_viewer.js)
@calixteman (Contributor, Author)

/botio test

@pdfjsbot

From: Bot.io (Linux m4)


Received

Command cmd_test from @calixteman received. Current queue size: 0

Live output at: http://54.241.84.105:8877/cb7302008d5c91c/output.txt

@pdfjsbot

From: Bot.io (Windows)


Received

Command cmd_test from @calixteman received. Current queue size: 0

Live output at: http://54.193.163.58:8877/0bc1fe0eb14436b/output.txt

@pdfjsbot

From: Bot.io (Windows)


Failed

Full output at http://54.193.163.58:8877/0bc1fe0eb14436b/output.txt

Total script time: 26.08 mins

  • Font tests: Passed
  • Unit tests: Passed
  • Integration Tests: FAILED
  • Regression tests: FAILED
  different ref/snapshot: 53

Image differences available at: http://54.193.163.58:8877/0bc1fe0eb14436b/reftest-analyzer.html#web=eq.log

@pdfjsbot

From: Bot.io (Linux m4)


Failed

Full output at http://54.241.84.105:8877/cb7302008d5c91c/output.txt

Total script time: 27.31 mins

  • Font tests: Passed
  • Unit tests: Passed
  • Integration Tests: FAILED
  • Regression tests: FAILED
  different ref/snapshot: 200
  different first/second rendering: 1

Image differences available at: http://54.241.84.105:8877/cb7302008d5c91c/reftest-analyzer.html#web=eq.log

@Snuffleupagus (Collaborator)

From a brief look, the textLayer movement seems fine and expected as far as I can tell.

The movement in the regular eq-tests on Linux seems strange; however, it actually looks like slight improvements to me.

/botio integrationtest

@pdfjsbot

From: Bot.io (Windows)


Received

Command cmd_integrationtest from @Snuffleupagus received. Current queue size: 0

Live output at: http://54.193.163.58:8877/edc23c4ead7301c/output.txt

@pdfjsbot

From: Bot.io (Linux m4)


Received

Command cmd_integrationtest from @Snuffleupagus received. Current queue size: 0

Live output at: http://54.241.84.105:8877/52b3ef203ed726e/output.txt

@pdfjsbot

From: Bot.io (Linux m4)


Success

Full output at http://54.241.84.105:8877/52b3ef203ed726e/output.txt

Total script time: 4.08 mins

  • Integration Tests: Passed

@pdfjsbot

From: Bot.io (Windows)


Failed

Full output at http://54.193.163.58:8877/edc23c4ead7301c/output.txt

Total script time: 13.94 mins

  • Integration Tests: FAILED

@calixteman (Contributor, Author) commented Apr 17, 2023

I don't see anything really wrong

From a brief look, the textLayer movement seems fine and expected as far as I can tell.

The movement in the regular eq-tests on Linux seems strange; however, it actually looks like slight improvements to me.

I'd say the root cause is the one mentioned by Jonathan:
#15157 (comment)
This patch modifies how we check whether a char is in the private use area:
https://github.com/mozilla/pdf.js/pull/16200/files#diff-d6af99a911b977730586b335a3c7ee702a383cac240414a011ab5534cda1ff3aR490-R492
whereas before:
https://github.com/mozilla/pdf.js/pull/16200/files#diff-d6af99a911b977730586b335a3c7ee702a383cac240414a011ab5534cda1ff3aL544
the check was wrong.
Consequently, for ﬁ (aka \uFB01) this patch changes the value we have in the extra data we add to the font, and then Jonathan's explanation applies.
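For context, Unicode defines three fixed Private Use Area ranges, so a correct membership check can be sketched like this (illustrative code, not the exact pdf.js implementation):

```javascript
// Unicode defines three Private Use Areas:
//   BMP PUA:      U+E000  - U+F8FF
//   Plane 15 PUA: U+F0000 - U+FFFFD
//   Plane 16 PUA: U+100000 - U+10FFFD
function isInPrivateUseArea(codePoint) {
  return (
    (codePoint >= 0xe000 && codePoint <= 0xf8ff) ||
    (codePoint >= 0xf0000 && codePoint <= 0xffffd) ||
    (codePoint >= 0x100000 && codePoint <= 0x10fffd)
  );
}

isInPrivateUseArea(0xe001); // true
isInPrivateUseArea(0xfb01); // false: the fi ligature is NOT in the PUA
```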

@Snuffleupagus (Collaborator) commented Apr 17, 2023

I don't see anything really wrong

Agreed, I see no real problem with that Linux-only movement; I was just surprised by the changes since I (incorrectly) assumed that only text tests would change :-)

I'll give the entire patch one more look, but this seems pretty good to land as-is (and we've got almost two weeks before the next release).

@Snuffleupagus (Collaborator) left a comment

r=me, thank you!

Once landed, can you please update PDF.js in mozilla-central so that we can get a little bit of real-world testing done before the next PDF.js library release?

@calixteman calixteman merged commit dbe0c4e into mozilla:master Apr 17, 2023
@calixteman (Contributor, Author)

/botio makeref

@pdfjsbot

From: Bot.io (Windows)


Received

Command cmd_makeref from @calixteman received. Current queue size: 1

Live output at: http://54.193.163.58:8877/02b2edb4ed3a9d9/output.txt

@pdfjsbot

From: Bot.io (Linux m4)


Received

Command cmd_makeref from @calixteman received. Current queue size: 0

Live output at: http://54.241.84.105:8877/0b8220dfa3da984/output.txt

@pdfjsbot

From: Bot.io (Linux m4)


Success

Full output at http://54.241.84.105:8877/0b8220dfa3da984/output.txt

Total script time: 22.67 mins

  • Lint: Passed
  • Make references: Passed
  • Check references: Passed

@pdfjsbot

From: Bot.io (Windows)


Success

Full output at http://54.193.163.58:8877/02b2edb4ed3a9d9/output.txt

Total script time: 22.80 mins

  • Lint: Passed
  • Make references: Passed
  • Check references: Passed
