Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Render entire PDFs instead of single pages #840

Merged
merged 32 commits into from
Oct 27, 2023
Merged

Render entire PDFs instead of single pages #840

merged 32 commits into from
Oct 27, 2023

Conversation

pamelafox
Copy link
Collaborator

@pamelafox pamelafox commented Oct 20, 2023

Purpose

Fixes #819

According to various user studies and requests, users would generally prefer to see the entire PDF, so that they can see the context, but of course, they'd also like the PDF to start off on the relevant page. Fortunately, browsers support a pseudo-standard of "#page=N" for jumping to a particular page.

This PR changes prepdocs.py to skip the step of uploading individual pages and link to #page=N instead of filename-N.pdf, and then makes sure the content_file route can still load those.

Screenshot 2023-10-20 at 4 07 41 PM

I added tests for the content_file route as well. They're running slowly now (though passing), am hoping to get guidance from Azure SDK team as to improving my mocking approach.

Does this introduce a breaking change?

This should still work with indices that have individual pages.

[ ] Yes
[X] No

Pull Request Type

What kind of change does this Pull Request introduce?

[ ] Bugfix
[X] Feature
[ ] Code style update (formatting, local variables)
[ ] Refactoring (no functional changes, no api changes)
[ ] Documentation content changes
[ ] Other... Please describe:

How to Test

  • Test with a fresh environment, azd up, ask a question, click on citation, see it jump
  • Test with an old environment, dont re-index, ask a question, click on citation, see it load whole page

@pamelafox
Copy link
Collaborator Author

@tonybaloney @mattmsft You may both be interested in the mocking approach I used for this one, lower-level than what we're using for other ones by overriding the Transport class used for the network requests. It works well except for the fact that it's taking longer than other tests, am not sure if that's due to it making a hidden network request or waiting for some timeout somewhere. Maybe @tonybaloney and I can take a look next week.

@pamelafox
Copy link
Collaborator Author

Update: I profiled the tests using "yappi" and figured out that the Azure Python SDK was still making network requests due to its default retry policy. I added retry_total=0 (suggestion from SDK team engineer) and now the tests run very fast as expected. So this is ready for review.
@mattmsft This PR does change prepdocs, so if you want to wait for your PR, I can refactor when yours in.

data/employee_handbook.pdf Outdated Show resolved Hide resolved
src = ["app/backend", "scripts"]

[tool.ruff.isort]
known-local-folder = ["scripts"]
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fyi @mattmsft This known-local-folder addition fixed the auto-import sorting that was lumping scripts together with third-party on the preplib files.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks!

@pamelafox
Copy link
Collaborator Author

Update: After bringing in Matt's prep docs refactor, I made a few changes and added full test coverage, so this now requires a re-review.

@pamelafox pamelafox requested a review from chuwik October 27, 2023 04:03
@mattgotteiner
Copy link
Collaborator

Looks like playwright was flaky and failed with a timeout - otherwise LGTM

async def content_file(path):
async def content_file(path: str):
# Remove page number from path, filename-1.txt -> filename.txt
if path.find("#page=") > 0:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

so this will still work with folks who didn't re-run prepdocs after this change?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yep! I just did another manual test to make sure, its working with an old env with individual pages.

src = ["app/backend", "scripts"]

[tool.ruff.isort]
known-local-folder = ["scripts"]
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks!

@pamelafox pamelafox merged commit bfb3ee5 into Azure-Samples:main Oct 27, 2023
8 checks passed
@pamelafox pamelafox deleted the wholefile branch October 27, 2023 18:05
vishalgtingre added a commit to vishalgtingre/azure-search-openai-demo that referenced this pull request Nov 5, 2023
commit 3bf2d19
Author: Pamela Fox <pamelafox@microsoft.com>
Date:   Thu Nov 2 09:10:15 2023 -0700

    Fix list file (Azure-Samples#897)

commit b3c55b0
Author: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Date:   Wed Nov 1 06:42:42 2023 -0700

    Bump pypdf from 3.16.3 to 3.17.0 in /scripts (Azure-Samples#890)

    Bumps [pypdf](https://github.com/py-pdf/pypdf) from 3.16.3 to 3.17.0.
    - [Release notes](https://github.com/py-pdf/pypdf/releases)
    - [Changelog](https://github.com/py-pdf/pypdf/blob/main/CHANGELOG.md)
    - [Commits](py-pdf/pypdf@3.16.3...3.17.0)

    ---
    updated-dependencies:
    - dependency-name: pypdf
      dependency-type: direct:production
    ...

    Signed-off-by: dependabot[bot] <support@github.com>
    Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
    Co-authored-by: Pamela Fox <pamelafox@microsoft.com>

commit cfbb90b
Author: Pamela Fox <pamelafox@microsoft.com>
Date:   Tue Oct 31 21:37:55 2023 -0700

    Add more readmes/guides (Azure-Samples#889)

    * Add more readmes/guides

    * Add image

    * Diagram added

commit 4479a2c
Author: Pamela Fox <pamelafox@microsoft.com>
Date:   Mon Oct 30 21:38:26 2023 -0700

    Handle errors better especialyl for streaming (Azure-Samples#884)

commit 0d54f84
Author: Pamela Fox <pamelafox@microsoft.com>
Date:   Mon Oct 30 17:58:09 2023 -0700

    Add exclude files (Azure-Samples#876)

commit 3647826
Author: Roderic Bos <github@rooc.nl>
Date:   Mon Oct 30 17:35:50 2023 +0100

    When using the option --storagekey for the prepdocs script the key might (Azure-Samples#866)

    contain `==` base64 padding at the end. This will fail to succesfully
    login because the script just removes the `=` signs during the split
    action. Copied the version from the app/start.ps1 which is better suited here.

    Co-authored-by: Pamela Fox <pamelafox@microsoft.com>

commit 5daa934
Author: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Date:   Mon Oct 30 09:17:55 2023 -0700

    Bump the github-actions group with 1 update (Azure-Samples#880)

    Bumps the github-actions group with 1 update: [actions/setup-node](https://github.com/actions/setup-node).

    - [Release notes](https://github.com/actions/setup-node/releases)
    - [Commits](actions/setup-node@v3...v4)

    ---
    updated-dependencies:
    - dependency-name: actions/setup-node
      dependency-type: direct:production
      update-type: version-update:semver-major
      dependency-group: github-actions
    ...

    Signed-off-by: dependabot[bot] <support@github.com>
    Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

commit bfb3ee5
Author: Pamela Fox <pamelafox@microsoft.com>
Date:   Fri Oct 27 11:04:37 2023 -0700

    Render entire PDFs instead of single pages (Azure-Samples#840)

    * Adding anchors

    * Show whole file

    * Show whole file

    * Page number support

    * More experiments with whole file

    * Revert unintentional changes

    * Add tests

    * Remove random num

    * Add retry_total=0 to avoid unnecessary network requests

    * Add comment to explain retry_total

    * Bring back deleted file

    * Blob manager refactor after merge

    * Update coverage amount

    * Make mypy happy with explicit check of path

    * Add debug for 3.9

    * Skip in 3.9 since its silly

    * Reduce fail under percentage due to 3.9

commit c989048
Author: Pamela Fox <pamelafox@microsoft.com>
Date:   Thu Oct 26 20:01:18 2023 -0700

    add screenshot (Azure-Samples#870)

commit a64a12e
Author: Matt <57731498+mattmsft@users.noreply.github.com>
Date:   Thu Oct 26 12:49:02 2023 -0700

    Refactor prepdocs (Azure-Samples#862)

    * setting up types

    * setting up more types...

    * working on it...

    * prepdocs refactor

    * typing fixes; updating tests

    * more fixes; updating tests

    * fixing retry for embeddings

    * fixing adls gen2 list

    * more test fixes

    * fixes from manual runs

    * more fixes

    * more fixes

    * type fixes

    * more type fixes and test fixes

    * break into modules

    * openai embedding fix

    * novectors fix

    * fix id generation

    * doc strings

    * feedback from pr

    * rename feedback

    * trying to get imports to work

    * update test workflow with pamela's suggestion

    * fix ci again

    * delete __init__

    * mypy configuration

    ---------

    Co-authored-by: Matt Gotteiner <magottei@microsoft.com>

commit 94be632
Author: MaciejLitwiniec <MaciejLitwiniec@users.noreply.github.com>
Date:   Thu Oct 26 15:23:00 2023 +0200

    Updated FAQ so that it reflect PR 835 (Azure-Samples#868)

    * Updated FAQ so that it reflect PR 835

    * Update README.md

    * Update README.md

    ---------

    Co-authored-by: Pamela Fox <pamela.fox@gmail.com>

commit d7bbf9f
Author: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Date:   Wed Oct 25 14:23:05 2023 -0700

    Bump werkzeug from 3.0.0 to 3.0.1 in /app/backend (Azure-Samples#863)

    Bumps [werkzeug](https://github.com/pallets/werkzeug) from 3.0.0 to 3.0.1.
    - [Release notes](https://github.com/pallets/werkzeug/releases)
    - [Changelog](https://github.com/pallets/werkzeug/blob/main/CHANGES.rst)
    - [Commits](pallets/werkzeug@3.0.0...3.0.1)

    ---
    updated-dependencies:
    - dependency-name: werkzeug
      dependency-type: indirect
    ...

    Signed-off-by: dependabot[bot] <support@github.com>
    Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
    Co-authored-by: Pamela Fox <pamelafox@microsoft.com>

commit d02aa14
Author: Pamela Fox <pamelafox@microsoft.com>
Date:   Wed Oct 25 13:48:45 2023 -0700

    Message builder improvements (Azure-Samples#852)

commit 8f55988
Author: Pamela Fox <pamelafox@microsoft.com>
Date:   Tue Oct 24 19:53:07 2023 -0700

    Reorder tags to optimize for sample browser (Azure-Samples#853)

commit 16a61bf
Author: Pamela Fox <pamelafox@microsoft.com>
Date:   Mon Oct 23 17:03:59 2023 -0700

    Improve follow-up questions and pipe into context (Azure-Samples#832)

    * Add follow-up questions and parsing

    * Test breaking the e2e

    * Actually run tests

    * Fix runner

    * Add conditional

    * Fix the test

    * chat approach

commit ca01af9
Author: Anthony Shaw <anthony.p.shaw@gmail.com>
Date:   Mon Oct 23 09:27:32 2023 +1100

    Store an MD5 hash of uploaded/indexed file and check before prepdocs (Azure-Samples#835)

    * Hash the uploaded files locally and skip them if you provision a second time and they haven't changed

    * Overwrite the hash when it changes

    * Remove open mode parameter

    * fix f-string

    * reformat changes

    * Update prepdocs.py
HughRunyan pushed a commit to RMI/RMI_chatbot that referenced this pull request Mar 26, 2024
* Adding anchors

* Show whole file

* Show whole file

* Page number support

* More experiments with whole file

* Revert unintentional changes

* Add tests

* Remove random num

* Add retry_total=0 to avoid unnecessary network requests

* Add comment to explain retry_total

* Bring back deleted file

* Blob manager refactor after merge

* Update coverage amount

* Make mypy happy with explicit check of path

* Add debug for 3.9

* Skip in 3.9 since its silly

* Reduce fail under percentage due to 3.9
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Enhancing User Experience with Access to Original Files
3 participants