enhancing this package with "OCR" and "translation" #21

Open
behrica opened this issue Nov 20, 2020 · 6 comments · May be fixed by #24

@behrica

behrica commented Nov 20, 2020

We have experimented with two commercial Azure APIs to OCR scanned PDFs and translate them into English when they are not already in English. In my opinion they work "well enough" for keyword search on the results.

I will try to integrate this functionality into this package for our purposes and will work on it in this fork:
https://github.com/openefsa/pdfsearch

I could potentially contribute this as open source here, if you are interested.
Of course, a potential user would need to bring their own Azure API token in order to use the functionality.

In case you do not want to couple the package too tightly to a commercial provider such as Azure,
it might be useful to have two extension points in this package
that allow users to plug in:

  • an OCR provider
  • a translation provider

@lebebr01
Owner

Thanks for looking at this. I think if integrated I'd prefer the latter approach to be more agnostic. For example, could use azure or tesseract OCR (which has been open-sourced). I have not followed as closely with potential translations that are open source, but if there are options, being flexible would be useful I think.

If you want to submit a PR that implements the azure piece that would be incredibly helpful. I could add the tesseract approach and generalize the implementation to use whichever the user wishes.

@behrica
Author

behrica commented Nov 20, 2020

I have a working implementation.

One piece of the code could be made agnostic. It is built on these concepts:

  • one function, ocr_pdf, which takes a PDF path as input and returns a character vector
  • one function, translate_text, which takes a character vector and returns a character vector
  • I added the concepts of "language detection" of the text and of a "target language"; translation is only called if the text language and the target language do not match. A more general case might need to handle several target languages

Because there are two Azure APIs (one for OCR, one for translation), I need to pass in credentials in some form.

I currently have the credentials "hardcoded" as function parameters, but we should do this differently.
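
One option for keeping credentials out of function parameters is to read them from an environment variable. The sketch below is only an illustration: the helper `get_api_key` and the variable name `AZURE_OCR_KEY` are hypothetical and are not defined by Azure or by this package.

```r
# Hypothetical helper: read an API key from an environment variable.
# The variable name "AZURE_OCR_KEY" is an assumption for illustration only.
get_api_key <- function(var = "AZURE_OCR_KEY") {
  key <- Sys.getenv(var, unset = "")
  if (!nzchar(key)) {
    # Fail early with a clear message instead of sending an empty key.
    stop("Environment variable ", var, " is not set", call. = FALSE)
  }
  key
}
```

A user could then set the variable once (for example in their `.Renviron` file) and the provider functions would pick it up without any key appearing in code.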

As my implementation calls slow and expensive APIs, I also implemented caching via memoization (but this is an implementation detail of the Azure provider).
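
Memoization of this kind can be sketched in base R with a closure over an environment. This is a minimal illustration, not the code used in the fork; `memoize` is a hypothetical helper, and it assumes the wrapped function takes one short character argument such as a file path. The memoise package offers a more complete alternative.

```r
# Hypothetical helper: cache results of a one-argument function.
# Assumes `x` is a short, non-empty string (e.g. a PDF path), because the
# argument itself is used as the cache key.
memoize <- function(fun) {
  cache <- new.env(parent = emptyenv())
  function(x) {
    key <- paste(x, collapse = "\n")
    if (!exists(key, envir = cache, inherits = FALSE)) {
      # First call for this key: run the slow function and store the result.
      assign(key, fun(x), envir = cache)
    }
    get(key, envir = cache, inherits = FALSE)
  }
}

# Usage with a stand-in for a slow OCR call.
slow_ocr <- function(path) {
  Sys.sleep(0.1)
  paste("text of", path)
}
cached_ocr <- memoize(slow_ocr)
cached_ocr("report.pdf")  # slow the first time
cached_ocr("report.pdf")  # served from the cache
```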

@behrica
Author

behrica commented Nov 20, 2020

I am not an expert in R. Is there a standard concept of "extension points" in R?
Or is it just done by passing a function into another function?
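
For reference, passing a function as an argument is a common idiom for this in R, since functions are ordinary values. A minimal sketch, with hypothetical names (`find_keyword` and `translate_fun` are not part of this package):

```r
# Hypothetical example: `translate_fun` is the extension point.
# The default `identity` leaves the text unchanged.
find_keyword <- function(text, keyword, translate_fun = identity) {
  text <- translate_fun(text)
  grepl(keyword, text, fixed = TRUE)
}

# A toy "translator" standing in for a real provider.
toy_translate <- function(x) sub("ein Haus", "a house", x, fixed = TRUE)

find_keyword(c("ein Haus", "a house"), "house")
find_keyword(c("ein Haus", "a house"), "house", translate_fun = toy_translate)
```

The caller decides which provider to plug in, and the package code never needs to know whether it is Azure, tesseract, or something else.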

@behrica
Author

behrica commented Dec 9, 2020

> Thanks for looking at this. I think if integrated I'd prefer the latter approach to be more agnostic. For example, could use azure or tesseract OCR (which has been open-sourced). I have not followed as closely with potential translations that are open source, but if there are options, being flexible would be useful I think.
>
> If you want to submit a PR that implements the azure piece that would be incredibly helpful. I could add the tesseract approach and generalize the implementation to use whichever the user wishes.

I am now "ready" for our internal usage.
My colleagues (non-technical users) now have a very easy way to search PDFs,
regardless of whether they contain extractable text or are scanned and/or non-English.

I changed the existing code of keyword_search slightly, in three ways:

  1. an extension point to plug in an "OCR function"
  2. an extension point to plug in a "translate" function
  3. some simple logic to decide whether 1) and 2) should be called, depending on whether:
    • pdf_text returns "empty" text (fewer than 100 characters)
    • a language detector (the franc package) decides that the current text is already in a target language
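
The decision logic described in the list above could look roughly like the following. This is a sketch under assumptions, not the code from the fork: all function names are hypothetical, `detect_lang` stands in for a language detector such as franc, and the language code "eng" is an assumed default.

```r
# Hypothetical sketch of the "should we OCR / translate?" logic.
needs_ocr <- function(text, min_chars = 100) {
  # Treat the extraction as "empty" when it yields fewer than min_chars characters.
  sum(nchar(text)) < min_chars
}

needs_translation <- function(text, target_lang, detect_lang) {
  # detect_lang is supplied by the caller and returns a language code.
  !identical(detect_lang(paste(text, collapse = " ")), target_lang)
}

prepare_text <- function(path, extract_fun, ocr_fun, translate_fun,
                         detect_lang, target_lang = "eng") {
  text <- extract_fun(path)
  if (needs_ocr(text)) {
    text <- ocr_fun(path)          # extension point 1
  }
  if (needs_translation(text, target_lang, detect_lang)) {
    text <- translate_fun(text)    # extension point 2
  }
  text
}
```

Because both providers are only called when needed, PDFs that already have extractable text in the target language never reach the slow or paid APIs.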

The code now also has an Azure-based implementation of the two extension points.
This is "quick and dirty", but very useful for us.
Its biggest task is to chunk the text into pieces small enough for the Azure API to accept.
The OCR API is a push-task-and-poll-status type of API, so I also implemented the "waiting for a result" part.
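
Both parts can be sketched in base R. The helpers below are hypothetical illustrations, not the fork's code: the 5000-character default and the state names "succeeded" and "failed" are assumptions, and `check_status` stands in for whatever call queries the remote job.

```r
# Hypothetical helper: split text into chunks of at most max_chars characters,
# breaking on whitespace. A single word longer than max_chars is kept whole.
chunk_text <- function(text, max_chars = 5000) {
  words <- strsplit(text, "\\s+")[[1]]
  chunks <- character(0)
  current <- ""
  for (w in words) {
    candidate <- if (nzchar(current)) paste(current, w) else w
    if (nchar(candidate) > max_chars && nzchar(current)) {
      chunks <- c(chunks, current)  # current chunk is full, start a new one
      current <- w
    } else {
      current <- candidate
    }
  }
  if (nzchar(current)) chunks <- c(chunks, current)
  chunks
}

# Hypothetical helper: poll a status function until the job finishes.
# check_status() must return a list with `state` and, on success, `result`.
poll_until_done <- function(check_status, interval = 1, max_tries = 30) {
  for (i in seq_len(max_tries)) {
    status <- check_status()
    if (identical(status$state, "succeeded")) return(status$result)
    if (identical(status$state, "failed")) stop("OCR job failed", call. = FALSE)
    Sys.sleep(interval)  # wait before asking again
  }
  stop("Timed out waiting for OCR result", call. = FALSE)
}
```

Capping the number of tries keeps a stuck remote job from blocking the search indefinitely.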

I would be happy to collaborate with you further to move this into the upstream version of the package.

@behrica
Author

behrica commented Dec 9, 2020

@lebebr01 please let me know if you want me to do anything on #23.

@lebebr01
Owner

lebebr01 commented Dec 9, 2020

Thanks, @behrica. I'll take a look more closely soon. Likely won't be for at least a week or so, I need to get through the end of the semester here first.

@lebebr01 lebebr01 added this to the v0.4 milestone Nov 16, 2021
@lebebr01 lebebr01 linked a pull request Nov 16, 2021 that will close this issue
@lebebr01 lebebr01 self-assigned this Nov 16, 2021