Add PDF import support for books (Issue #93) #119

Merged
merged 4 commits into from
Jan 10, 2024

4 changes: 2 additions & 2 deletions lute/book/forms.py
@@ -27,8 +27,8 @@ class NewBookForm(FlaskForm):
"Text file",
validators=[
FileAllowed(
- ["txt", "epub"],
- "Please upload a valid .txt or .epub file.",
+ ["txt", "epub", "pdf"],
+ "Please upload a valid '.txt', '.epub' or '.pdf' file.",
)
],
)
2 changes: 2 additions & 0 deletions lute/book/routes.py
@@ -62,6 +62,8 @@ def _get_file_content(filefielddata):
return service.get_textfile_content(filefielddata)
if ext == ".epub":
return service.get_epub_content(filefielddata)
if ext == ".pdf":
return service.get_pdf_content_from_form(filefielddata)
raise ValueError(f'Unknown file extension "{ext}"')
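
Review note: with a third format added, the extension check in `_get_file_content` could also be written as a lookup table. The following is only a minimal sketch under stated assumptions: the `fake_*` handlers are hypothetical stand-ins for the real `service` functions, and lowercasing the extension is an assumption, since the diff does not show how `ext` is derived.

```python
import os


# Hypothetical stand-ins for the real service functions; they only tag the input.
def fake_text_handler(data):
    return f"txt:{data}"


def fake_epub_handler(data):
    return f"epub:{data}"


def fake_pdf_handler(data):
    return f"pdf:{data}"


# Map each supported extension to the function that reads it.
HANDLERS = {
    ".txt": fake_text_handler,
    ".epub": fake_epub_handler,
    ".pdf": fake_pdf_handler,
}


def get_file_content(filename, data):
    "Pick a handler by file extension, mirroring the if-chain in the diff."
    _, ext = os.path.splitext(filename)
    handler = HANDLERS.get(ext.lower())  # assumption: extensions compare case-insensitively
    if handler is None:
        raise ValueError(f'Unknown file extension "{ext}"')
    return handler(data)
```

Adding a format then means adding one dictionary entry, and the unknown-extension error stays in one place.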


16 changes: 16 additions & 0 deletions lute/book/service.py
@@ -11,6 +11,7 @@
from bs4 import BeautifulSoup
from flask import current_app, flash
from openepub import Epub, EpubError
from pypdf import PdfReader
from werkzeug.utils import secure_filename
from lute.book.model import Book

@@ -82,6 +83,21 @@ def get_epub_content(epub_file_field_data):
return content


def get_pdf_content_from_form(pdf_file_field_data):
    "Get content as a single string from a PDF file using pypdf."
    content = ""
    try:
        pdf_reader = PdfReader(pdf_file_field_data)

        for page in pdf_reader.pages:
            content += page.extract_text()

        return content
    except Exception as e:
        msg = f"Could not parse {pdf_file_field_data.filename} (error: {str(e)})"
        raise BookImportException(message=msg, cause=e) from e
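
Review note: the loop concatenates page text with no separator, so the last word of one page may run into the first word of the next. A possible variant is sketched below. It is not part of this PR: `join_page_text` and `FakePage` are hypothetical names, the newline separator is an assumption, and the `or ""` guard is purely defensive. The function only relies on each page exposing `extract_text()`, as pypdf page objects do.

```python
def join_page_text(pages, separator="\n"):
    "Concatenate extracted page text, separating pages so words do not run together."
    parts = []
    for page in pages:
        text = page.extract_text() or ""  # defensive: treat a missing result as empty
        if text.strip():
            parts.append(text.strip())  # skip pages with no extractable text
    return separator.join(parts)


class FakePage:
    "Hypothetical stand-in for a pypdf page object, for demonstration only."

    def __init__(self, text):
        self._text = text

    def extract_text(self):
        return self._text


pages = [FakePage("Tengo un "), FakePage(""), FakePage("amigo.")]
combined = join_page_text(pages)
```

With pypdf this would presumably be called as `join_page_text(PdfReader(stream).pages)`, where `stream` is the uploaded file.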


def book_from_url(url):
"Parse the url and load a new Book."
s = None
2 changes: 1 addition & 1 deletion lute/templates/book/create_new.html
@@ -33,7 +33,7 @@
</tr>

<tr>
- <td>{{ form.textfile.label }} <i>(.txt, .epub)</i></td>
+ <td>{{ form.textfile.label }} <i>(.txt, .epub, .pdf)</i></td>
<td>{{ form.textfile() }}</td>
</tr>

3 changes: 2 additions & 1 deletion pyproject.toml
@@ -31,7 +31,8 @@ dependencies = [
"PyYAML>=6.0.1,<7",
"toml>=0.10.2,<1",
"waitress>=2.1.2,<3",
- "openepub>=0.0.6,<1"
+ "openepub>=0.0.6,<1",
"pypdf>=3.17.4"
]

[project.scripts]
7 changes: 5 additions & 2 deletions requirements.txt
@@ -1,4 +1,4 @@
- astroid==2.15.6
+ astroid==2.15.6
attrs==23.1.0
beautifulsoup4==4.12.2
black==23.10.1
@@ -8,6 +8,7 @@ cffi==1.16.0
cfgv==3.4.0
charset-normalizer==3.3.1
click==8.1.7
colorama==0.4.6
coverage==7.3.1
dill==0.3.7
distlib==0.3.7
@@ -37,6 +38,7 @@ mccabe==0.7.0
mypy-extensions==1.0.0
natto-py==1.0.1
nodeenv==1.8.0
openepub==0.0.6
outcome==1.3.0.post0
packaging==23.1
parse==1.19.1
@@ -50,6 +52,7 @@ pre-commit==3.5.0
pycparser==2.21
pyee==11.0.1
pylint==2.17.5
pypdf==3.17.4
PySocks==1.7.1
pytest==7.4.2
pytest-base-url==2.0.0
@@ -81,5 +84,5 @@ Werkzeug==2.3.7
wrapt==1.15.0
wsproto==1.2.0
WTForms==3.0.1
xmltodict==0.13.0
zipp==3.17.0
openepub==0.0.6
12 changes: 12 additions & 0 deletions tests/acceptance/book.feature
@@ -43,6 +43,18 @@ Feature: Books and stats are available
Given a Spanish book "Hola" from file invalid.epub
Then the page contains "Could not parse invalid.epub"

Scenario: I can import a PDF file.
Given I visit "/"
Given a Spanish book "Hola" from file Hola.pdf
Then the page title is Reading "Hola"
And the reading pane shows:
Tengo/ /un/ /amigo/.

Scenario: Invalid PDF files are rejected.
Given I visit "/"
Given a Spanish book "Hola" from file invalid.pdf
Then the page contains "Could not parse invalid.pdf"
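
Review note: the rejection scenario depends on parser failures being wrapped in a message that starts with "Could not parse". The wrapping pattern can be sketched in isolation as follows. `ImportErrorSketch`, `parse_or_wrap`, and `broken_parser` are hypothetical names for illustration; the real code raises `BookImportException(message=..., cause=...)`, whose definition is not shown in this diff.

```python
class ImportErrorSketch(Exception):
    "Hypothetical stand-in for BookImportException(message=..., cause=...)."

    def __init__(self, message, cause=None):
        super().__init__(message)
        self.message = message
        self.cause = cause


def parse_or_wrap(filename, parser):
    "Run parser(); on any failure, re-raise with a user-facing message."
    try:
        return parser()
    except Exception as e:
        msg = f"Could not parse {filename} (error: {e})"
        # "from e" keeps the original exception available as __cause__.
        raise ImportErrorSketch(message=msg, cause=e) from e


def broken_parser():
    "Simulates a PDF library failing on an invalid file."
    raise ValueError("not a PDF")
```

A unit test along these lines could check the message prefix without going through the browser-based acceptance suite.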

Scenario: Books and stats are shown on the first page.
Given I visit "/"
Given a Spanish book "Hola" with content:
Binary file added tests/acceptance/sample_files/Hola.pdf
Binary file not shown.
Binary file added tests/acceptance/sample_files/invalid.pdf
Binary file not shown.