PDF text extraction using PDF.js #290

Closed
wants to merge 3 commits into from

RyotaUshio

Hi, thank you so much for making this awesome plugin!

The original implementation relies on the Text Extractor plugin to extract text from PDFs, but I found that it sometimes doesn't work as expected, especially for handwritten notes exported from GoodNotes.

These notes already have embedded text generated by the great OCR ability of GoodNotes, and this text is searchable by the built-in search of Obsidian's PDF viewer, as shown below.

[Image: screenshot of the built-in search in Obsidian's PDF viewer finding the embedded text]

However, Omnisearch can't find it.

[Image: screenshot of Omnisearch failing to find the same text]

So I modified the PDF text extraction part so that it simply uses PDF.js.
It can now locate the handwritten text as below.
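
For readers who want a picture of the idea, here is a minimal sketch of page-by-page text extraction in the PDF.js style. It is not the code from this PR. The `DocLike`, `PageLike`, and `TextItemLike` interfaces and the `joinTextItems` and `extractPdfText` helpers are hypothetical names introduced for illustration; they only mirror the `numPages`, `getPage()`, `getTextContent()`, `items`, `str`, and `hasEOL` members that PDF.js exposes. Concatenating `str` values directly is an assumption about how the items should be joined.

```typescript
// Hypothetical structural types covering only the PDF.js members used below.
interface TextItemLike {
  str?: string;      // text of the item; absent on marked-content entries
  hasEOL?: boolean;  // true when the item ends a line
}

interface PageLike {
  getTextContent(): Promise<{ items: TextItemLike[] }>;
  cleanup?(): unknown; // releases page resources when available
}

interface DocLike {
  numPages: number;
  getPage(pageNumber: number): Promise<PageLike>; // page numbers are 1-based
}

// Join the text items of a single page into one string.
function joinTextItems(items: TextItemLike[]): string {
  let out = "";
  for (const item of items) {
    if (typeof item.str !== "string") continue; // skip entries without text
    out += item.str;
    if (item.hasEOL) out += "\n";
  }
  return out;
}

// Extract the embedded text of every page, one page at a time.
async function extractPdfText(doc: DocLike): Promise<string> {
  const pages: string[] = [];
  for (let n = 1; n <= doc.numPages; n++) {
    const page = await doc.getPage(n);
    const content = await page.getTextContent();
    pages.push(joinTextItems(content.items));
    page.cleanup?.(); // free per-page resources before moving on
  }
  return pages.join("\n");
}
```

In a plugin, the `doc` argument would presumably be the document object returned by PDF.js after loading the file's bytes; that wiring is left out here because it depends on how PDF.js is loaded in the host application.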

[Image: screenshot of Omnisearch locating the handwritten text]

What do you think of this approach?

@scambier
Owner

First, thanks for your PR :)

Omnisearch itself doesn't manage text extraction. By design, this work has been offloaded to Text Extractor (which I also maintain).

Now, PDF.js is indeed the library that gives the best results when extracting text, and it was actually my first choice when I started working on text extraction a year ago. It works great, but there's a catch: it doesn't scale at all. Give it more than a dozen PDFs to extract back-to-back, and it will hard-crash Obsidian.

I tried your PR on my test vault, and unfortunately this problem hasn't been solved since last year, so for now I'll stay with the "half working but not crashing" library used in Text Extractor. Sorry :/

@scambier closed this on Sep 16, 2023
@RyotaUshio
Author

Thank you for the clarification. I see, that's sad news...
