PDFBOX-5487: Remove all space characters if contained within the adjacent letters #155

Closed
wants to merge 2 commits into from

Conversation

noureldin-eg

Please see PDFBOX-5487 and the comments below.

In the PDF attached to the Jira issue, there are two space characters that overlap with the adjacent letters of two Arabic words. When sorting is enabled, each of these spaces gets shifted into the middle of a word.

This commit removes such spaces immediately after sorting.
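The idea can be illustrated with a minimal, self-contained sketch. It is not the code from this commit and does not use PDFBox classes: `Glyph`, `isContainedIn`, `removeOverlappingSpaces` and `toText` are hypothetical names, and treating "contained within the adjacent letters" as "the space's horizontal extent lies inside the extent of the previous or next glyph" is an assumption made for the example.

```java
import java.util.ArrayList;
import java.util.List;

public class OverlappingSpaceFilter {

    /** Hypothetical stand-in for a positioned glyph; not PDFBox's TextPosition. */
    public static final class Glyph {
        final String text;
        final float x;      // left edge of the glyph
        final float width;  // horizontal extent of the glyph

        public Glyph(String text, float x, float width) {
            this.text = text;
            this.x = x;
            this.width = width;
        }

        float right() {
            return x + width;
        }

        boolean isSpace() {
            return " ".equals(text);
        }
    }

    /** True if the horizontal extent of 'space' lies entirely inside that of 'other'. */
    public static boolean isContainedIn(Glyph space, Glyph other) {
        return space.x >= other.x && space.right() <= other.right();
    }

    /**
     * Expects glyphs already sorted by position. Drops every space that is
     * fully covered by the non-space glyph directly before or after it
     * (assumed interpretation of the fix described above).
     */
    public static List<Glyph> removeOverlappingSpaces(List<Glyph> sorted) {
        List<Glyph> result = new ArrayList<>();
        for (int i = 0; i < sorted.size(); i++) {
            Glyph g = sorted.get(i);
            if (g.isSpace()) {
                Glyph prev = i > 0 ? sorted.get(i - 1) : null;
                Glyph next = i + 1 < sorted.size() ? sorted.get(i + 1) : null;
                boolean inPrev = prev != null && !prev.isSpace() && isContainedIn(g, prev);
                boolean inNext = next != null && !next.isSpace() && isContainedIn(g, next);
                if (inPrev || inNext) {
                    continue; // the space sits inside a letter, so it is not a word break
                }
            }
            result.add(g);
        }
        return result;
    }

    /** Concatenates the glyph texts in list order. */
    public static String toText(List<Glyph> glyphs) {
        StringBuilder sb = new StringBuilder();
        for (Glyph g : glyphs) {
            sb.append(g.text);
        }
        return sb.toString();
    }
}
```

In this sketch a space at x=2 with width 3 that lies inside a letter spanning x=0 to x=10 is dropped, while a space that occupies its own gap between two letters is kept.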

PDFBOX-5487_ اعلامية

PDFBOX-5487_ وفضلا

@kaismh

kaismh commented Dec 7, 2024

@noureldin-eg Any known side effects for this commit?

@THausherr
Contributor

I never got any feedback in PDFBOX-5487. What I need to know is whether the contents in PDFBOX-5487-arabic.pdf-sorted-diff.txt are better in the lines with "new".

@noureldin-eg
Author

Hi @kaismh and @THausherr,
It has been quite some time since I created this PR, and I had thought it was already merged. I’m glad to revisit this and contribute to the library again. However, I will need a bit of time to set up the project and review the code.

Thank you for reminding me about this, and I’ll try to provide updates asap.

@kaismh

kaismh commented Dec 11, 2024

@THausherr The output is better for all Arabic cases I tried, but I'm not sure whether it might break other situations or other languages. It might be better to have this as an option.

@THausherr
Contributor

I'm reluctant to add a new option... it doesn't seem to be needed. I have around 100 local test files and only two were changed (only in the sorted output).
I'd still like to get some feedback from @noureldin-eg.

@noureldin-eg
Author

noureldin-eg commented Dec 14, 2024

> Any known side effects for this commit?

No known side effects for Arabic (and English) text extraction. I can't confirm its impact on other languages, but if you'd like, I could modify the implementation to apply this fix only when the Unicode code points fall within the Arabic blocks (as in PR #156).
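For illustration, a restriction to Arabic text could look roughly like the sketch below. It is not taken from PR #156 or from this commit: the class `ArabicBlockCheck` and its methods are hypothetical, and which Unicode blocks should count as "Arabic" is an assumption. Only the standard `Character.UnicodeBlock` API is used.

```java
public class ArabicBlockCheck {

    /** True if the code point belongs to one of the (assumed) Arabic Unicode blocks. */
    public static boolean isArabic(int codePoint) {
        Character.UnicodeBlock block = Character.UnicodeBlock.of(codePoint);
        return block == Character.UnicodeBlock.ARABIC
                || block == Character.UnicodeBlock.ARABIC_SUPPLEMENT
                || block == Character.UnicodeBlock.ARABIC_PRESENTATION_FORMS_A
                || block == Character.UnicodeBlock.ARABIC_PRESENTATION_FORMS_B;
    }

    /** True if any code point of the string is Arabic according to isArabic. */
    public static boolean containsArabic(String s) {
        return s.codePoints().anyMatch(ArabicBlockCheck::isArabic);
    }
}
```

With such a guard, the space removal would only run when the neighbouring glyphs contain Arabic code points, leaving the sorted output for other scripts unchanged.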

> whether the contents in PDFBOX-5487-arabic.pdf-sorted-diff.txt are better in the lines with "new"

Yes, the extracted contents are better after my commit. Specifically, the two key changes highlighted in the screenshots above and explained in the Jira issue have been addressed.

asfgit pushed a commit that referenced this pull request Dec 15, 2024
…cent letters, by Mohamed M NourElDin; closes #155

git-svn-id: https://svn.apache.org/repos/asf/pdfbox/branches/2.0@1922512 13f79535-47bb-0310-9956-ffa450edef68
asfgit pushed a commit that referenced this pull request Dec 15, 2024
…cent letters, by Mohamed M NourElDin; closes #155

git-svn-id: https://svn.apache.org/repos/asf/pdfbox/branches/3.0@1922513 13f79535-47bb-0310-9956-ffa450edef68
@asfgit asfgit closed this in 374972f Dec 15, 2024