Add PDF import support for books (Issue #93) #119

dgc08 · 2024-01-07T14:45:11Z

Implement issue #93

I took the PR for the epub support as a template and changed it accordingly.

I used PyPDF2 as library, the license should be fine: https://pypi.org/project/PyPDF2/

jzohrab · 2024-01-08T03:48:55Z

Hi @dgc08 -- this failed one of the acceptance tests, can you take a look?

jzohrab

Failed acceptance test.

dgc08 · 2024-01-08T13:13:36Z

The acceptance test should work now. The sample PDF file contained an extra page number, which also got imported. Didn't run the test beforehand, sorry.

jzohrab · 2024-01-08T13:27:17Z

No problem, that's what the tests are there for. Thanks for adding the test; it's a big help for stability.

dgc08 · 2024-01-08T13:36:42Z

kinda screwed up my fork, wait a moment before merging

dgc08 · 2024-01-08T13:42:08Z

ok, it's fine now

jzohrab

One last change requested: what needs to be put into the pyproject.toml file (in project root)?

(I don't have CI checking this, couldn't sort out how to do it well.)

dgc08 · 2024-01-08T16:03:45Z

I hope that's it. Thank you for your patience with me. Lute is the first open source project (and larger project in general) that I've contributed to, so pytest and that stuff is still new to me.

jzohrab · 2024-01-09T01:01:41Z

It's a super contribution, so thank you. I'll try it out soon with a bigger PDF to see how it works, and will review the code again. For a first PR it's great! 👍

jzohrab · 2024-01-09T15:41:25Z

Hi @dgc08 -- I did an import of a large file and found a few places where the import adds some spaces, eg

vs the original pdf:

This happens fairly frequently. It's almost certain that it's due to the PDF parser library ... should check.

jzohrab · 2024-01-09T15:47:59Z

Yes, it's the library, there's a good write up by the author(s) in these links:

Per the authors:

Getting whitespaces right is notoriously hard. @pubpub-zz is the expert in that topic; I'll leave it to him to decide if we should leave this issue open. The issue is that PDF does not (necessarily) represent the words as words internally. In the worst case, it just gives the absolute position of each character in the document.

So, I think this should be fine as it is, but we should mention somewhere that PDF imports are extremely tricky -- I'll draft that before merging this PR, as users should be aware of the limitations.
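To illustrate why stray spaces show up, here is a minimal sketch of rebuilding a line of text from individually positioned characters. It is not PyPDF2's actual algorithm; the `Glyph` tuple, the `join_glyphs` function, the coordinates, and the thresholds are all hypothetical. It only shows that when a parser has to guess word breaks from horizontal gaps, a slightly wide gap inside a word (kerning, justification) can be mistaken for a space.

```python
from typing import List, Tuple

# Hypothetical glyph record: (character, x of left edge, width), in page units.
Glyph = Tuple[str, float, float]


def join_glyphs(glyphs: List[Glyph], gap_threshold: float) -> str:
    """Rebuild one line of text, inserting a space wherever the horizontal
    gap between consecutive glyphs is larger than gap_threshold."""
    out = []
    prev_right = None
    for char, x, width in sorted(glyphs, key=lambda g: g[1]):
        if prev_right is not None and x - prev_right > gap_threshold:
            out.append(" ")
        out.append(char)
        prev_right = x + width
    return "".join(out)


# "import now", with a slightly wide gap (1.2 units) between "m" and "p"
# and a real word gap (4.0 units) between "t" and "n".
line = [
    ("i", 0.0, 2.0), ("m", 2.2, 5.0), ("p", 8.4, 4.0), ("o", 12.6, 4.0),
    ("r", 16.8, 3.0), ("t", 20.0, 3.0),
    ("n", 27.0, 4.0), ("o", 31.2, 4.0), ("w", 35.4, 5.0),
]

print(join_glyphs(line, gap_threshold=2.0))  # threshold above the kerning gap
print(join_glyphs(line, gap_threshold=1.0))  # threshold below the kerning gap
```

With the larger threshold the line comes back intact; with the smaller one the word is split in two. No single threshold works for every font and layout, which is why this is hard to get right in general.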

jzohrab · 2024-01-10T01:50:22Z

Added a flash:

Add PDF import support for books

def6b7d

jzohrab requested changes Jan 8, 2024

fix failed test

15dceb4

dgc08 force-pushed the pdf-import branch from 15dceb4 to def6b7d Compare January 8, 2024 13:33

fix failed acceptance test + lint should be happy now too

5cb25e7

jzohrab requested changes Jan 8, 2024

update pyproject.toml

a3e50a7

jzohrab merged commit a3e50a7 into LuteOrg:develop Jan 10, 2024
9 checks passed
