@SteveSandersonMS (Member) commented Feb 4, 2025

Allows .pdf file citations to open in a PDF viewer.

Highlighting relies on an exact string match: the link uses the #search=... URL parameter supported by pdfjs, which is equivalent to the user opening the "Find" feature and typing in the citation quote. As such, highlighting the citation isn't 100% guaranteed, since in some cases the LLM returns very slight variations on the text instead of quoting it character-for-character.
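
For illustration, here is a minimal sketch of how such a viewer link could be assembled. The function name, the `/pdfjs/web/viewer.html` and `/Data/example.pdf` paths, and the use of `page` and `phrase=true` alongside `search` are assumptions for this example, not necessarily what the template does.

```javascript
// Hypothetical helper: builds a pdfjs viewer URL that opens a PDF at a given
// page and runs the viewer's "Find" feature on the quoted text.
// All parameter names and paths here are illustrative assumptions.
function buildCitationViewerUrl(viewerPath, pdfPath, pageNumber, quote) {
  // The viewer reads the document location from the "file" query parameter.
  const query = new URLSearchParams({ file: pdfPath });

  // Hash parameters: jump to the page, then search for the quote.
  // "phrase=true" asks for the whole quote to be matched as one phrase.
  const hashParts = [
    `page=${pageNumber}`,
    `search=${encodeURIComponent(quote)}`,
    'phrase=true',
  ];

  return `${viewerPath}?${query}#${hashParts.join('&')}`;
}

// Example usage with made-up paths:
const url = buildCitationViewerUrl(
  '/pdfjs/web/viewer.html',
  '/Data/example.pdf',
  3,
  'hello world'
);
console.log(url);
```

Because the search is an exact match, any difference between the quote and the PDF text (even a single character) means nothing is highlighted, although the page still opens.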

If this wasn't good enough, we could:

  • Update the metadata on the ingested chunks so we track something more, for example the exact pixel region covered by each character in the ingested data
  • Implement some further code to find a best match for the LLM's quotation within the ingested chunk, perhaps minimizing edit distance or similar
  • Manually highlight the matching characters (using JavaScript) based on metadata we initially ingested

However, this would be very complex and possibly still error-prone.

Given that in most cases we will still open the correct PDF page even if we can't highlight the citation, this is a reasonable tradeoff. App developers with more stringent requirements can implement the large amount of additional code needed to pick out citations more precisely.
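
As a rough idea of what the "best match" option above might involve, here is a minimal sketch of approximate substring matching by edit distance (a standard dynamic-programming approach, not code from this PR). The function name and return shape are hypothetical, and it only locates character offsets within a chunk; mapping those offsets to positions in the rendered PDF is the part that would need the extra metadata.

```javascript
// Hypothetical sketch: find the substring of `chunk` with the smallest
// Levenshtein edit distance to `quote`. Runs in O(quote.length * chunk.length).
function findBestMatch(chunk, quote) {
  const m = quote.length;
  const n = chunk.length;

  // Row 0: matching an empty quote prefix costs 0 at any position in the
  // chunk, so a match is free to start anywhere.
  let prevDist = new Array(n + 1).fill(0);
  // prevStart[j] tracks where the best alignment ending at j began.
  let prevStart = Array.from({ length: n + 1 }, (_, j) => j);

  for (let i = 1; i <= m; i++) {
    const curDist = new Array(n + 1);
    const curStart = new Array(n + 1);
    curDist[0] = i; // quote prefix matched against nothing: i deletions
    curStart[0] = 0;

    for (let j = 1; j <= n; j++) {
      const cost = quote[i - 1] === chunk[j - 1] ? 0 : 1;

      // Match or substitute.
      let best = prevDist[j - 1] + cost;
      let start = prevStart[j - 1];
      // Quote character missing from the chunk.
      if (prevDist[j] + 1 < best) {
        best = prevDist[j] + 1;
        start = prevStart[j];
      }
      // Extra chunk character not present in the quote.
      if (curDist[j - 1] + 1 < best) {
        best = curDist[j - 1] + 1;
        start = curStart[j - 1];
      }

      curDist[j] = best;
      curStart[j] = start;
    }

    prevDist = curDist;
    prevStart = curStart;
  }

  // Pick the end position with the lowest distance (first one on ties).
  let end = 0;
  for (let j = 1; j <= n; j++) {
    if (prevDist[j] < prevDist[end]) end = j;
  }

  const start = prevStart[end];
  return { start, end, distance: prevDist[end], text: chunk.slice(start, end) };
}

// Example: the LLM misquotes "fox" as "fax".
const match = findBestMatch(
  'The quick brown fox jumps over the lazy dog.',
  'quick brown fax'
);
console.log(match.text, match.distance); // "quick brown fox" 1
```

A caller would likely also want a distance threshold, so that a quote with no plausible match in the chunk is not highlighted at all.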

Use of PDFJS

Given future plans to avoid the NPM dependency, I've included the files as actual files in wwwroot. There are unfortunately quite a lot of them (e.g., many toolbar icons), but it's all hidden away in a pdfjs directory, so it likely won't cause any problems or confusion.

Serving the cited PDFs

I've added UseStaticFiles to serve everything from the Data directory. Arguably, developers may wish to limit which files are served (e.g., only .pdf files, or only files that have been ingested), but that would substantially complicate the template logic. The Data directory is intended for a quick getting-started process, not to scale up to all use cases (e.g., you wouldn't put the entire contents of a CMS in there), so I think it's reasonable to simplify by treating it as a publicly servable directory. Obviously, this needs to be documented when we talk about adding files to that directory.

Bug workaround

In order to serve the pdfjs viewer.html file, I had to work around dotnet/aspnetcore#58940 by changing MapStaticFiles to UseStaticFiles. Hopefully we can change this back if that gets fixed in a patch.

Note that the other workaround, ReloadStaticAssetsAtRuntime: false, isn't suitable, since it would break the ability to edit any static file content (you'd have to restart the server after every file change).


@dotnet-comment-bot (Collaborator)

‼️ Found issues ‼️

| Project | Coverage Type | Expected | Actual |
| --- | --- | --- | --- |
| Microsoft.Extensions.Caching.Hybrid | Line | 86 | 78.07 🔻 |
| Microsoft.Extensions.AI.Ollama | Line | 80 | 78.25 🔻 |
| Microsoft.Gen.MetadataExtractor | Line | 98 | 57.35 🔻 |
| Microsoft.Gen.MetadataExtractor | Branch | 98 | 62.5 🔻 |

🎉 Good job! The coverage increased 🎉
Update MinCodeCoverage in the project files.

| Project | Expected | Actual |
| --- | --- | --- |
| Microsoft.Extensions.AI.Abstractions | 83 | 84 |
| Microsoft.Extensions.AI.OpenAI | 77 | 78 |
| Microsoft.Extensions.AI | 88 | 89 |

Full code coverage report: https://dev.azure.com/dnceng-public/public/_build/results?buildId=939410&view=codecoverage-tab

@SteveSandersonMS merged commit 8e1d2ee into main on Feb 4, 2025
6 checks passed
@SteveSandersonMS deleted the stevesa/pdf-citation-viewer branch on February 4, 2025 17:19
@jeffhandley added the area-ai-templates (Microsoft.Extensions.AI.Templates) label on Mar 7, 2025
@github-actions bot locked and limited conversation to collaborators on Apr 6, 2025