Render entire PDFs instead of single pages #840

Merged: 32 commits, Oct 27, 2023
Changes from 22 commits

Commits (32)
b7d9678
Adding anchors
pamelafox Sep 15, 2023
99ea466
Merge branch 'main' of https://github.com/pamelafox/azure-search-open…
pamelafox Sep 15, 2023
a3c0023
Merge branch 'main' of https://github.com/pamelafox/azure-search-open…
pamelafox Sep 16, 2023
cd2706c
Merge branch 'main' of https://github.com/pamelafox/azure-search-open…
pamelafox Sep 20, 2023
6910e05
Merge branch 'main' of https://github.com/pamelafox/azure-search-open…
pamelafox Sep 20, 2023
f63398a
Merge branch 'main' of https://github.com/pamelafox/azure-search-open…
pamelafox Sep 22, 2023
eaef837
Merge branch 'main' of https://github.com/pamelafox/azure-search-open…
pamelafox Sep 25, 2023
3c4fe71
Show whole file
pamelafox Sep 28, 2023
60e07cc
Show whole file
pamelafox Sep 28, 2023
cf46813
Merge branch 'main' of https://github.com/pamelafox/azure-search-open…
pamelafox Sep 30, 2023
5e745a9
Merge branch 'main' of https://github.com/pamelafox/azure-search-open…
pamelafox Oct 2, 2023
1bec0b3
Merge branch 'main' of https://github.com/pamelafox/azure-search-open…
pamelafox Oct 2, 2023
f8f300f
Page number support
pamelafox Oct 2, 2023
820e4b6
Merge branch 'main' into wholefile
pamelafox Oct 2, 2023
4395e39
More experiments with whole file
pamelafox Oct 13, 2023
e0b036b
Merge branch 'main' into wholefile
pamelafox Oct 19, 2023
9d138a3
Revert unintentional changes
pamelafox Oct 19, 2023
47f9a17
Add tests
pamelafox Oct 20, 2023
786e5eb
Remove random num
pamelafox Oct 23, 2023
4d62561
Add retry_total=0 to avoid unnecessary network requests
pamelafox Oct 24, 2023
0de7037
Add comment to explain retry_total
pamelafox Oct 24, 2023
ee51b9e
Merge branch 'main' into wholefile
pamelafox Oct 24, 2023
ae1d95f
Bring back deleted file
pamelafox Oct 25, 2023
78448e9
Merge branch 'wholefile' of https://github.com/pamelafox/azure-search…
pamelafox Oct 25, 2023
f1cd17a
Merge branch 'main' into wholefile
pamelafox Oct 26, 2023
d4d9db0
Blob manager refactor after merge
pamelafox Oct 26, 2023
23e6aba
Update coverage amount
pamelafox Oct 26, 2023
3acccb8
Make mypy happy with explicit check of path
pamelafox Oct 26, 2023
17fd596
Add debug for 3.9
pamelafox Oct 27, 2023
5f11ce6
Skip in 3.9 since its silly
pamelafox Oct 27, 2023
b8202fd
Reduce fail under percentage due to 3.9
pamelafox Oct 27, 2023
de97cff
Merge branch 'main' into wholefile
pamelafox Oct 27, 2023
Files changed
7 changes: 6 additions & 1 deletion .vscode/settings.json
@@ -18,5 +18,10 @@
"search.exclude": {
"**/node_modules": true,
"static": true
}
},
"python.testing.pytestArgs": [
"tests"
],
"python.testing.unittestEnabled": false,
"python.testing.pytestEnabled": true
}
14 changes: 12 additions & 2 deletions app/backend/app.py
@@ -9,6 +9,7 @@

import aiohttp
import openai
from azure.core.exceptions import ResourceNotFoundError
from azure.identity.aio import DefaultAzureCredential
from azure.monitor.opentelemetry import configure_azure_monitor
from azure.search.documents.aio import SearchClient
@@ -69,9 +70,18 @@ async def assets(path):
# *** NOTE *** this assumes that the content files are public, or at least that all users of the app
# can access all the files. This is also slow and memory hungry.
@bp.route("/content/<path>")
async def content_file(path):
async def content_file(path: str):
# Remove page number from path, filename.pdf#page=1 -> filename.pdf
if path.find("#page=") > 0:

Collaborator commented: so this will still work with folks who didn't re-run prepdocs after this change?

Collaborator (author) replied: Yep! I just did another manual test to make sure; it's working with an old env with individual pages.

path_parts = path.rsplit("#page=", 1)
path = path_parts[0]
logging.info("Opening file %s", path)
blob_container_client = current_app.config[CONFIG_BLOB_CONTAINER_CLIENT]
blob = await blob_container_client.get_blob_client(path).download_blob()
try:
blob = await blob_container_client.get_blob_client(path).download_blob()
except ResourceNotFoundError:
logging.exception("Path not found: %s", path)
abort(404)
if not blob.properties or not blob.properties.has_key("content_settings"):
abort(404)
mime_type = blob.properties["content_settings"]["content_type"]
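To illustrate the backward compatibility discussed in the review thread above, here is a minimal sketch (not part of the PR; the helper name and sample citation paths are illustrative) of why the route serves both old per-page blobs and new whole-file citations:

def blob_path_from_citation(path: str) -> str:
    # New-style citation from this PR: "role_library.pdf#page=4" -> "role_library.pdf"
    if path.find("#page=") > 0:
        path = path.rsplit("#page=", 1)[0]
    # Old-style citation: "role_library-3.pdf" contains no "#page=" marker,
    # so it passes through unchanged and the existing per-page blob is served.
    return path

assert blob_path_from_citation("role_library.pdf#page=4") == "role_library.pdf"
assert blob_path_from_citation("role_library-3.pdf") == "role_library-3.pdf"
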
Binary file removed data/employee_handbook.pdf
Binary file not shown.
2 changes: 1 addition & 1 deletion pyproject.toml
@@ -9,7 +9,7 @@ line-length = 120

[tool.pytest.ini_options]
addopts = "-ra"
pythonpath = ["app/backend"]
pythonpath = ["app/backend", "scripts"]

[tool.coverage.paths]
source = ["scripts", "app"]
30 changes: 8 additions & 22 deletions scripts/prepdocs.py
@@ -3,7 +3,6 @@
import glob
import hashlib
import html
import io
import os
import re
import tempfile
@@ -35,7 +34,7 @@
from azure.storage.filedatalake import (
DataLakeServiceClient,
)
from pypdf import PdfReader, PdfWriter
from pypdf import PdfReader
from tenacity import (
retry,
retry_if_exception_type,
@@ -78,7 +77,7 @@ def calculate_tokens_emb_aoai(input: str):

def blob_name_from_file_page(filename, page=0):
if os.path.splitext(filename)[1].lower() == ".pdf":
return os.path.splitext(os.path.basename(filename))[0] + f"-{page}" + ".pdf"
return f"{os.path.basename(filename)}#page={page+1}"
else:
return os.path.basename(filename)
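
Since pyproject.toml above now adds "scripts" to pythonpath, test modules can import prepdocs directly. A hypothetical check of the new page-anchor naming (not one of the tests added in this PR; the sample file paths are made up) might look like:

from prepdocs import blob_name_from_file_page

def test_blob_name_uses_page_anchor():
    # PDFs now map to a single blob name plus a 1-based page fragment
    assert blob_name_from_file_page("data/role_library.pdf", page=0) == "role_library.pdf#page=1"
    # Non-PDF files keep using the bare file name
    assert blob_name_from_file_page("data/benefits.md") == "benefits.md"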

@@ -91,24 +90,11 @@ def upload_blobs(filename):
if not blob_container.exists():
blob_container.create_container()

# if file is PDF split into pages and upload each page as a separate blob
if os.path.splitext(filename)[1].lower() == ".pdf":
reader = PdfReader(filename)
pages = reader.pages
for i in range(len(pages)):
blob_name = blob_name_from_file_page(filename, i)
if args.verbose:
print(f"\tUploading blob for page {i} -> {blob_name}")
f = io.BytesIO()
writer = PdfWriter()
writer.add_page(pages[i])
writer.write(f)
f.seek(0)
blob_container.upload_blob(blob_name, f, overwrite=True)
else:
blob_name = blob_name_from_file_page(filename)
with open(filename, "rb") as data:
blob_container.upload_blob(blob_name, data, overwrite=True)
# Upload the original file
blob_name = os.path.basename(filename)
print(f"\tUploading blob for whole file -> {blob_name}")
with open(filename, "rb") as data:
blob_container.upload_blob(blob_name, data, overwrite=True)


def remove_blobs(filename):
@@ -124,7 +110,7 @@ def remove_blobs(filename):
else:
prefix = os.path.splitext(os.path.basename(filename))[0]
blobs = filter(
lambda b: re.match(f"{prefix}-\d+\.pdf", b),
lambda b: re.match(f"{prefix}-\d+\.pdf", b) or b == os.path.basename(filename),
blob_container.list_blob_names(name_starts_with=os.path.splitext(os.path.basename(prefix))[0]),
)
for b in blobs:
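A small sketch (illustrative only; the candidate blob names are assumptions) of which blobs the updated remove_blobs filter now matches, so cleanup covers both old per-page uploads and the new whole-file upload:

import os
import re

filename = "data/role_library.pdf"
prefix = os.path.splitext(os.path.basename(filename))[0]  # "role_library"

candidates = ["role_library-0.pdf", "role_library-12.pdf", "role_library.pdf", "role_library_old.pdf"]
matched = [
    b for b in candidates
    if re.match(rf"{prefix}-\d+\.pdf", b) or b == os.path.basename(filename)
]
# Old per-page blobs and the new whole-file blob match; unrelated names do not.
assert matched == ["role_library-0.pdf", "role_library-12.pdf", "role_library.pdf"]
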
99 changes: 99 additions & 0 deletions tests/test_content_file.py
@@ -0,0 +1,99 @@
import os
from collections import namedtuple

import aiohttp
import pytest
from azure.core.exceptions import ResourceNotFoundError
from azure.core.pipeline.transport import (
AioHttpTransportResponse,
AsyncHttpTransport,
HttpRequest,
)
from azure.storage.blob.aio import BlobServiceClient

import app

MockToken = namedtuple("MockToken", ["token", "expires_on"])


class MockAzureCredential:
async def get_token(self, uri):
return MockToken("mock_token", 9999999999)


@pytest.mark.asyncio
async def test_content_file(monkeypatch, mock_env):
class MockAiohttpClientResponse404(aiohttp.ClientResponse):
def __init__(self, url, body_bytes, headers=None):
self._body = body_bytes
self._headers = headers
self._cache = {}
self.status = 404
self.reason = "Not Found"
self._url = url

class MockAiohttpClientResponse(aiohttp.ClientResponse):
def __init__(self, url, body_bytes, headers=None):
self._body = body_bytes
self._headers = headers
self._cache = {}
self.status = 200
self.reason = "OK"
self._url = url

class MockTransport(AsyncHttpTransport):
async def send(self, request: HttpRequest, **kwargs) -> AioHttpTransportResponse:
if request.url.endswith("notfound.pdf"):
raise ResourceNotFoundError(MockAiohttpClientResponse404(request.url, b""))
else:
return AioHttpTransportResponse(
request,
MockAiohttpClientResponse(
request.url,
b"test content",
{
"Content-Type": "application/octet-stream",
"Content-Range": "bytes 0-27/28",
"Content-Length": "28",
},
),
)

async def __aenter__(self):
return self

async def __aexit__(self, *args):
pass

async def open(self):
pass

async def close(self):
pass

# Plug the mocked transport into the SDK client via kwargs:
blob_client = BlobServiceClient(
f"https://{os.environ['AZURE_STORAGE_ACCOUNT']}.blob.core.windows.net",
credential=MockAzureCredential(),
transport=MockTransport(),
retry_total=0, # Necessary to avoid unnecessary network requests during tests
)
blob_container_client = blob_client.get_container_client(os.environ["AZURE_STORAGE_CONTAINER"])

quart_app = app.create_app()
async with quart_app.test_app() as test_app:
quart_app.config.update({"blob_container_client": blob_container_client})

client = test_app.test_client()
response = await client.get("/content/notfound.pdf")
assert response.status_code == 404

response = await client.get("/content/role_library.pdf")
assert response.status_code == 200
assert response.headers["Content-Type"] == "application/pdf"
assert await response.get_data() == b"test content"

response = await client.get("/content/role_library.pdf#page=10")
assert response.status_code == 200
assert response.headers["Content-Type"] == "application/pdf"
assert await response.get_data() == b"test content"