Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add PDF import support for books (Issue #93) #119

Merged
merged 4 commits into from
Jan 10, 2024

Conversation

dgc08
Copy link
Contributor

@dgc08 dgc08 commented Jan 7, 2024

Implement issue #93

I took the PR for the epub support as template and chnaged it accordingly.

I used PyPDF2 as library, the license should be fine: https://pypi.org/project/PyPDF2/

I took the PR for the epub support as template. I used PyPDf as library, the license should be fine: https://pypi.org/project/PyPDF2/
@jzohrab
Copy link
Collaborator

jzohrab commented Jan 8, 2024

Hi @dgc08 -- this failed one of the acceptance tests, can you take a look?

Copy link
Collaborator

@jzohrab jzohrab left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Failed acceptance test.

@dgc08
Copy link
Contributor Author

dgc08 commented Jan 8, 2024

The acceptance test should work now. The sample PDF file contained an extra page number, which also got imported. Didn't run the test beforehand, sorry.

@jzohrab
Copy link
Collaborator

jzohrab commented Jan 8, 2024 via email

@dgc08
Copy link
Contributor Author

dgc08 commented Jan 8, 2024

kinda screwed up my fork, wait a moment before merging

@dgc08
Copy link
Contributor Author

dgc08 commented Jan 8, 2024

ok its fine now

Copy link
Collaborator

@jzohrab jzohrab left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One last change requested: what needs to be put into the pyproject.toml file (in project root)?

(I don't have CI checking this, couldn't sort out how to do it well.)

@dgc08
Copy link
Contributor Author

dgc08 commented Jan 8, 2024

I hope that's it. Thank you for your patience with me, Lute is the first Open Source project / larger project in general that i contribute to, pytest and that stuff is still new to me

@jzohrab
Copy link
Collaborator

jzohrab commented Jan 9, 2024 via email

@jzohrab
Copy link
Collaborator

jzohrab commented Jan 9, 2024

Hi @dgc08 -- I did an import of a large file and found a few places where the import adds some spaces, eg

image

vs the original pdf:

image

This happens fairly frequently. It's almost certain that it's due to the PDF parser library ... should check.

@jzohrab
Copy link
Collaborator

jzohrab commented Jan 9, 2024

Yes, it's the library, there's a good write up by the author(s) in these links:

Per the authors:

Getting whitespaces right is notoriously hard. @pubpub-zz is the expert in that topic; I'll leave it to him to decide if we should leave this issue open. The issue is that PDF does not (necessarily) represent the words as words internally. In the worst case, it just gives the absolute position of each character in the document.

So, I think this should be fine as it is, but we should mention somewhere that PDF imports are extremely tricky -- I'll draft that before merging this PR, as users should be aware of the limitations.

@jzohrab
Copy link
Collaborator

jzohrab commented Jan 10, 2024

Added a flash:

image

@jzohrab jzohrab merged commit a3e50a7 into LuteOrg:develop Jan 10, 2024
9 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants