Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Remove null chars, i.e. \u0000, when getting all text (PR 16286 follow-up) #16297

Merged
merged 1 commit into from
Apr 16, 2023

Conversation

Snuffleupagus
Copy link
Collaborator

I was playing with the new "copy all text" feature, and stumbled upon one document where the copied text was truncated; see http://mirrors.ctan.org/info/lshort/english/lshort.pdf

The problem turns out to be that on page 83 the textLayer contains \u0000 and apparently copying just stops when a null char is encountered. To fix this we can simply use an existing helper function, and with this patch we're able to successfully copy all the text in that document.

…low-up)

I was playing with the new "copy all text" feature, and stumbled upon one document where the copied text was truncated; see http://mirrors.ctan.org/info/lshort/english/lshort.pdf

The problem turns out to be that on [page 83](https://ftp.acc.umu.se/mirror/CTAN/info/lshort/english/lshort.pdf#page=83) the textLayer contains `\u0000` and apparently copying just stops when a null char is encountered.
To fix this we can simply use an existing helper function, and with this patch we're able to successfully copy all the text in that document.
@Snuffleupagus
Copy link
Collaborator Author

/botio integrationtest

@pdfjsbot
Copy link

From: Bot.io (Linux m4)


Received

Command cmd_integrationtest from @Snuffleupagus received. Current queue size: 0

Live output at: http://54.241.84.105:8877/7d99ee6bcff35e6/output.txt

@pdfjsbot
Copy link

From: Bot.io (Windows)


Received

Command cmd_integrationtest from @Snuffleupagus received. Current queue size: 0

Live output at: http://54.193.163.58:8877/1cbee12ab7d0805/output.txt

@pdfjsbot
Copy link

From: Bot.io (Linux m4)


Success

Full output at http://54.241.84.105:8877/7d99ee6bcff35e6/output.txt

Total script time: 4.02 mins

  • Integration Tests: Passed

@pdfjsbot
Copy link

From: Bot.io (Windows)


Failed

Full output at http://54.193.163.58:8877/1cbee12ab7d0805/output.txt

Total script time: 14.15 mins

  • Integration Tests: FAILED

Copy link
Contributor

@calixteman calixteman left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM.

@Snuffleupagus Snuffleupagus merged commit 5f7e43a into mozilla:master Apr 16, 2023
@Snuffleupagus Snuffleupagus deleted the copy-all-null-chars branch April 16, 2023 10:45
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants