[api-minor] Don't normalize the text used in the text layer. #16200
Conversation
Given that we should be pretty close to the next PDF.js release, this is probably not something that we want to land just prior.
Yep, I thought about that too, in order to avoid too many breaking changes.
We'll let the browser deal with that so I'd expect to have something at least identical or maybe better.
Good question, I tried with a fi ligature and NVDA and it doesn't work correctly:
I tested in Safari with ArabicCIDTrueType.pdf and with tracemonkey.pdf: I don't see anything wrong.
I'm not 100% sure that nothing will be broken, but we'll fix the issues if we meet any. For context, I began to write a patch to use the font we generated for the canvas in the text layer, and in this case I need to have the char for the
Thanks for checking! I didn't have anything particular in mind; it's just that I've noticed over time that users seem to run into much more trouble with Safari than with any other browser. Additionally, Safari sometimes seems to lag behind when it comes to implementing new web-platform features; one example that comes to mind is
Force-pushed from 3b0dd3c to 9cf704e.
The discussion in that issue seems to suggest that the actual bug lies elsewhere; what's the overall status of a11y support for ligatures?
This actually seems problematic as-is, since we'll now end up normalizing a lot more than before. Take the fraction-highlight.pdf document as an example, where copy-and-paste currently gives the following result:
With this patch you'll instead get:
This seems problematic for a couple of reasons:
To me this feels like opening the door to problems both now and down the line, by us essentially relinquishing any control over "when" normalization happens. To that end I wonder if we could perhaps combine the current hard-coded map with using standard normalization (which should still be a reduction in code-size)?

```js
const NORMALIZE_UNICODES = new Set([
  "\u00A8",
  "\u00AF",
  // The rest of the "keys" from `getNormalizedUnicodes` goes here...
]);

function normalizeTextContent(str) {
  const buf = [];
  for (const char of str) {
    buf.push(NORMALIZE_UNICODES.has(char) ? char.normalize("NFKC") : char);
  }
  return buf.join("");
}
```

You'd then call that helper-function on copy-and-paste, and we'd still have control of normalization without having to hard-code everything like previously.
Force-pushed from 9cf704e to 232b7ff.
All the chars we have in
Force-pushed from 05c2c75 to f0512b2.
Would it be feasible to add an integration-test, using a PDF document with ligatures, that checks that the copied text is actually being normalized as expected?
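As a rough sketch of the core assertion such a test could make (the sample string is hypothetical, and this is plain Node rather than the project's actual integration-test harness): the text layer may contain raw ligature glyphs, but the string that ends up on the clipboard should be their NFKC expansion.

```js
// Hypothetical stand-in for the integration test's core check: text-layer
// content containing the ligatures U+FB00 (ff), U+FB01 (fi) and U+FB02 (fl)
// should be copied as the NFKC-normalized string.
const textLayerContent = "a\uFB00ect the \uFB01rst \uFB02ight";
const copied = textLayerContent.normalize("NFKC");
console.log(copied); // "affect the first flight"
```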
Force-pushed from 21ccfc8 to 2c9cce0.
With the latest round of comments addressed, let's start running tests to see what the "fallout" looks like :-)
Some arabic chars like \ufe94 can be searched for in a PDF, hence they must be normalized when creating the search query. So, to avoid duplicating the normalization code, everything has been moved into the find controller.

The previous code to normalize text was effectively using NFKC, but with a hardcoded map; it has been replaced by a call to normalize("NFKC") (this helps to reduce the bundle size by 30kb).

While playing with this \ufe94 char, I noticed that the bidi algorithm wasn't taking some RTL unicode ranges into account, the generated font wasn't embedding the mapping for this char, and the unicode ranges in the OS/2 table weren't up-to-date.

When normalized, some chars can be replaced by several ones, which led to extra chars in the text layer. To avoid any regression, when copying some text from the text layer, the copied string is normalized (NFKC) before being put in the clipboard (it works like this in both Acrobat and Chrome).
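A minimal illustration of the search-normalization point (the sample string here is made up): a query typed with the base letter U+0629 only matches page text containing the presentation form U+FE94 once both sides are NFKC-normalized.

```js
// Illustrative sketch: U+FE94 is the final presentation form of the arabic
// letter teh marbuta (U+0629). A user searching for the base letter won't
// match the presentation form unless both sides go through NFKC first.
const pageText = "\u0635\u0644\u0627\uFE94"; // ends with U+FE94, a final form
const query = "\u0629";                       // the base letter a user would type

console.log(pageText.includes(query));                  // false -- raw match fails
console.log(pageText.normalize("NFKC").includes(query)); // true -- normalized match works
```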
Force-pushed from 2c9cce0 to 117bbf7.
/botio test

From: Bot.io (Linux m4) [Received] Command cmd_test from @calixteman received. Current queue size: 0. Live output at: http://54.241.84.105:8877/cb7302008d5c91c/output.txt

From: Bot.io (Windows) [Received] Command cmd_test from @calixteman received. Current queue size: 0. Live output at: http://54.193.163.58:8877/0bc1fe0eb14436b/output.txt

From: Bot.io (Windows) [Failed] Full output at http://54.193.163.58:8877/0bc1fe0eb14436b/output.txt Total script time: 26.08 mins. Image differences available at: http://54.193.163.58:8877/0bc1fe0eb14436b/reftest-analyzer.html#web=eq.log

From: Bot.io (Linux m4) [Failed] Full output at http://54.241.84.105:8877/cb7302008d5c91c/output.txt Total script time: 27.31 mins. Image differences available at: http://54.241.84.105:8877/cb7302008d5c91c/reftest-analyzer.html#web=eq.log
From a brief look, the textLayer movement seems fine and expected as far as I can tell. The movement in regular

/botio integrationtest
From: Bot.io (Windows) [Received] Command cmd_integrationtest from @Snuffleupagus received. Current queue size: 0. Live output at: http://54.193.163.58:8877/edc23c4ead7301c/output.txt

From: Bot.io (Linux m4) [Received] Command cmd_integrationtest from @Snuffleupagus received. Current queue size: 0. Live output at: http://54.241.84.105:8877/52b3ef203ed726e/output.txt

From: Bot.io (Linux m4) [Success] Full output at http://54.241.84.105:8877/52b3ef203ed726e/output.txt Total script time: 4.08 mins

From: Bot.io (Windows) [Failed] Full output at http://54.193.163.58:8877/edc23c4ead7301c/output.txt Total script time: 13.94 mins
I don't see anything really wrong.
I'd say that the root cause is the one mentioned by Jonathan:
Agreed, I see no real problem with that Linux-only movement; I was just surprised by the changes since I (incorrectly) assumed that only

I'll give the entire patch one more look, but this seems pretty good to land as-is (and we've got almost two weeks before the next release).
r=me, thank you!
Once landed, can you please update PDF.js in mozilla-central so that we can get a little bit of real-world testing done before the next PDF.js library release?
/botio makeref

From: Bot.io (Windows) [Received] Command cmd_makeref from @calixteman received. Current queue size: 1. Live output at: http://54.193.163.58:8877/02b2edb4ed3a9d9/output.txt

From: Bot.io (Linux m4) [Received] Command cmd_makeref from @calixteman received. Current queue size: 0. Live output at: http://54.241.84.105:8877/0b8220dfa3da984/output.txt

From: Bot.io (Linux m4) [Success] Full output at http://54.241.84.105:8877/0b8220dfa3da984/output.txt Total script time: 22.67 mins

From: Bot.io (Windows) [Success] Full output at http://54.193.163.58:8877/02b2edb4ed3a9d9/output.txt Total script time: 22.80 mins