-
Notifications
You must be signed in to change notification settings - Fork 4.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Render entire PDFs instead of single pages #840
Conversation
@tonybaloney @mattmsft You may both be interested in the mocking approach I used for this one, lower-level than what we're using for other ones by overriding the Transport class used for the network requests. It works well except for the fact that it's taking longer than other tests, am not sure if that's due to it making a hidden network request or waiting for some timeout somewhere. Maybe @tonybaloney and I can take a look next week. |
Update: I profiled the tests using "yappi" and figured out that the Azure Python SDK was still making network requests due to its default retry policy. I added retry_total=0 (suggestion from SDK team engineer) and now the tests run very fast as expected. So this is ready for review. |
src = ["app/backend", "scripts"] | ||
|
||
[tool.ruff.isort] | ||
known-local-folder = ["scripts"] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fyi @mattmsft This known-local-folder addition fixed the auto-import sorting that was lumping scripts together with third-party on the preplib files.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks!
Update: After bringing in Matt's prep docs refactor, I made a few changes and added full test coverage, so this now requires a re-review. |
Looks like playwright was flaky and failed with a timeout - otherwise LGTM |
async def content_file(path): | ||
async def content_file(path: str): | ||
# Remove page number from path, filename-1.txt -> filename.txt | ||
if path.find("#page=") > 0: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
so this will still work with folks who didn't re-run prepdocs after this change?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yep! I just did another manual test to make sure, its working with an old env with individual pages.
src = ["app/backend", "scripts"] | ||
|
||
[tool.ruff.isort] | ||
known-local-folder = ["scripts"] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks!
commit 3bf2d19 Author: Pamela Fox <pamelafox@microsoft.com> Date: Thu Nov 2 09:10:15 2023 -0700 Fix list file (Azure-Samples#897) commit b3c55b0 Author: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> Date: Wed Nov 1 06:42:42 2023 -0700 Bump pypdf from 3.16.3 to 3.17.0 in /scripts (Azure-Samples#890) Bumps [pypdf](https://github.com/py-pdf/pypdf) from 3.16.3 to 3.17.0. - [Release notes](https://github.com/py-pdf/pypdf/releases) - [Changelog](https://github.com/py-pdf/pypdf/blob/main/CHANGELOG.md) - [Commits](py-pdf/pypdf@3.16.3...3.17.0) --- updated-dependencies: - dependency-name: pypdf dependency-type: direct:production ... Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> Co-authored-by: Pamela Fox <pamelafox@microsoft.com> commit cfbb90b Author: Pamela Fox <pamelafox@microsoft.com> Date: Tue Oct 31 21:37:55 2023 -0700 Add more readmes/guides (Azure-Samples#889) * Add more readmes/guides * Add image * Diagram added commit 4479a2c Author: Pamela Fox <pamelafox@microsoft.com> Date: Mon Oct 30 21:38:26 2023 -0700 Handle errors better especialyl for streaming (Azure-Samples#884) commit 0d54f84 Author: Pamela Fox <pamelafox@microsoft.com> Date: Mon Oct 30 17:58:09 2023 -0700 Add exclude files (Azure-Samples#876) commit 3647826 Author: Roderic Bos <github@rooc.nl> Date: Mon Oct 30 17:35:50 2023 +0100 When using the option --storagekey for the prepdocs script the key might (Azure-Samples#866) contain `==` base64 padding at the end. This will fail to succesfully login because the script just removes the `=` signs during the split action. Copied the version from the app/start.ps1 which is better suited here. Co-authored-by: Pamela Fox <pamelafox@microsoft.com> commit 5daa934 Author: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> Date: Mon Oct 30 09:17:55 2023 -0700 Bump the github-actions group with 1 update (Azure-Samples#880) Bumps the github-actions group with 1 update: [actions/setup-node](https://github.com/actions/setup-node). - [Release notes](https://github.com/actions/setup-node/releases) - [Commits](actions/setup-node@v3...v4) --- updated-dependencies: - dependency-name: actions/setup-node dependency-type: direct:production update-type: version-update:semver-major dependency-group: github-actions ... Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> commit bfb3ee5 Author: Pamela Fox <pamelafox@microsoft.com> Date: Fri Oct 27 11:04:37 2023 -0700 Render entire PDFs instead of single pages (Azure-Samples#840) * Adding anchors * Show whole file * Show whole file * Page number support * More experiments with whole file * Revert unintentional changes * Add tests * Remove random num * Add retry_total=0 to avoid unnecessary network requests * Add comment to explain retry_total * Bring back deleted file * Blob manager refactor after merge * Update coverage amount * Make mypy happy with explicit check of path * Add debug for 3.9 * Skip in 3.9 since its silly * Reduce fail under percentage due to 3.9 commit c989048 Author: Pamela Fox <pamelafox@microsoft.com> Date: Thu Oct 26 20:01:18 2023 -0700 add screenshot (Azure-Samples#870) commit a64a12e Author: Matt <57731498+mattmsft@users.noreply.github.com> Date: Thu Oct 26 12:49:02 2023 -0700 Refactor prepdocs (Azure-Samples#862) * setting up types * setting up more types... * working on it... * prepdocs refactor * typing fixes; updating tests * more fixes; updating tests * fixing retry for embeddings * fixing adls gen2 list * more test fixes * fixes from manual runs * more fixes * more fixes * type fixes * more type fixes and test fixes * break into modules * openai embedding fix * novectors fix * fix id generation * doc strings * feedback from pr * rename feedback * trying to get imports to work * update test workflow with pamela's suggestion * fix ci again * delete __init__ * mypy configuration --------- Co-authored-by: Matt Gotteiner <magottei@microsoft.com> commit 94be632 Author: MaciejLitwiniec <MaciejLitwiniec@users.noreply.github.com> Date: Thu Oct 26 15:23:00 2023 +0200 Updated FAQ so that it reflect PR 835 (Azure-Samples#868) * Updated FAQ so that it reflect PR 835 * Update README.md * Update README.md --------- Co-authored-by: Pamela Fox <pamela.fox@gmail.com> commit d7bbf9f Author: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> Date: Wed Oct 25 14:23:05 2023 -0700 Bump werkzeug from 3.0.0 to 3.0.1 in /app/backend (Azure-Samples#863) Bumps [werkzeug](https://github.com/pallets/werkzeug) from 3.0.0 to 3.0.1. - [Release notes](https://github.com/pallets/werkzeug/releases) - [Changelog](https://github.com/pallets/werkzeug/blob/main/CHANGES.rst) - [Commits](pallets/werkzeug@3.0.0...3.0.1) --- updated-dependencies: - dependency-name: werkzeug dependency-type: indirect ... Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> Co-authored-by: Pamela Fox <pamelafox@microsoft.com> commit d02aa14 Author: Pamela Fox <pamelafox@microsoft.com> Date: Wed Oct 25 13:48:45 2023 -0700 Message builder improvements (Azure-Samples#852) commit 8f55988 Author: Pamela Fox <pamelafox@microsoft.com> Date: Tue Oct 24 19:53:07 2023 -0700 Reorder tags to optimize for sample browser (Azure-Samples#853) commit 16a61bf Author: Pamela Fox <pamelafox@microsoft.com> Date: Mon Oct 23 17:03:59 2023 -0700 Improve follow-up questions and pipe into context (Azure-Samples#832) * Add follow-up questions and parsing * Test breaking the e2e * Actually run tests * Fix runner * Add conditional * Fix the test * chat approach commit ca01af9 Author: Anthony Shaw <anthony.p.shaw@gmail.com> Date: Mon Oct 23 09:27:32 2023 +1100 Store an MD5 hash of uploaded/indexed file and check before prepdocs (Azure-Samples#835) * Hash the uploaded files locally and skip them if you provision a second time and they haven't changed * Overwrite the hash when it changes * Remove open mode parameter * fix f-string * reformat changes * Update prepdocs.py
* Adding anchors * Show whole file * Show whole file * Page number support * More experiments with whole file * Revert unintentional changes * Add tests * Remove random num * Add retry_total=0 to avoid unnecessary network requests * Add comment to explain retry_total * Bring back deleted file * Blob manager refactor after merge * Update coverage amount * Make mypy happy with explicit check of path * Add debug for 3.9 * Skip in 3.9 since its silly * Reduce fail under percentage due to 3.9
Purpose
Fixes #819
According to various user studies and requests, users would generally prefer to see the entire PDF, so that they can see the context, but of course, they'd also like the PDF to start off on the relevant page. Fortunately, browsers support a pseudo-standard of "#page=N" for jumping to a particular page.
This PR changes prepdocs.py to skip the step of uploading individual pages and link to #page=N instead of filename-N.pdf, and then makes sure the content_file route can still load those.
I added tests for the content_file route as well. They're running slowly now (though passing), am hoping to get guidance from Azure SDK team as to improving my mocking approach.
Does this introduce a breaking change?
This should still work with indices that have individual pages.
Pull Request Type
What kind of change does this Pull Request introduce?
How to Test