enhancing this package with "OCR" and "translation" #21
Comments
Thanks for looking at this. If this is integrated, I'd prefer the latter approach, to be more agnostic. For example, users could choose Azure or Tesseract OCR (which has been open-sourced). I have not followed open-source translation options as closely, but if there are any, being flexible would be useful. If you want to submit a PR that implements the Azure piece, that would be incredibly helpful. I could then add the Tesseract approach and generalize the implementation to use whichever the user wishes.
I have a working implementation. There is one piece of code that could be made agnostic, built around these concepts:
Since the Azure APIs (two of them, one for OCR and one for translation) require credentials, I need to pass these in in some form. I currently have the credentials "hardcoded" as function parameters, but we should do this differently. Because my implementation calls slow / expensive APIs, I also implemented caching via memoization (but this is an implementation detail of the Azure backend).
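To make the two concerns above concrete, here is a minimal base-R sketch of reading a credential from an environment variable rather than a function parameter, and of memoizing a slow call. Everything in it is an assumption for illustration: `get_api_key`, `memoize`, `slow_ocr` and the variable name `AZURE_OCR_KEY` are hypothetical and not taken from the fork, and `slow_ocr` is a stand-in that calls no real API.

```r
# Hypothetical helper: read an API key from an environment variable
# instead of hardcoding it as a function parameter.
get_api_key <- function(var = "AZURE_OCR_KEY") {
  key <- Sys.getenv(var, unset = "")
  if (!nzchar(key)) {
    stop("Environment variable ", var, " is not set", call. = FALSE)
  }
  key
}

# Minimal memoization for a function of one non-empty string
# (for example a file path): results are cached in an environment.
memoize <- function(f) {
  cache <- new.env(parent = emptyenv())
  function(x) {
    stopifnot(is.character(x), length(x) == 1, nzchar(x))
    if (!exists(x, envir = cache, inherits = FALSE)) {
      assign(x, f(x), envir = cache)
    }
    get(x, envir = cache, inherits = FALSE)
  }
}

# Stand-in for a slow / expensive API call; it only counts invocations.
calls <- 0
slow_ocr <- function(path) {
  calls <<- calls + 1
  paste("text of", path)
}

cached_ocr <- memoize(slow_ocr)
first <- cached_ocr("a.pdf")   # runs slow_ocr
second <- cached_ocr("a.pdf")  # served from the cache
```

This cache lives only for the R session; a persistent cache on disk would be a separate design decision.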
I am not an expert in R. Is there a standard concept of "extension points" in R?
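One lightweight pattern in R, though not the only one (S3 generics are another), is to accept functions as arguments with sensible defaults, much as `lapply` takes `FUN`. The sketch below assumes that approach. The names `search_with_plugins`, `ocr_fun` and `translate_fun` are hypothetical and do not reflect the actual `keyword_search` signature, and the fake plugins stand in for real OCR and translation backends.

```r
# Hypothetical sketch: two extension points passed in as functions.
# ocr_fun:       takes a file path, returns a character vector of lines.
# translate_fun: takes a character vector, returns one of the same length.
search_with_plugins <- function(path, keyword,
                                ocr_fun = function(path) readLines(path, warn = FALSE),
                                translate_fun = identity) {
  text <- translate_fun(ocr_fun(path))
  hits <- grep(keyword, text, ignore.case = TRUE)
  data.frame(line_num = hits, line_text = text[hits],
             stringsAsFactors = FALSE)
}

# Fake plugins for illustration; a real backend (Azure, Tesseract, ...)
# would be wrapped in functions with the same shape.
fake_ocr <- function(path) c("Hallo Welt", "Noch eine Zeile")
fake_translate <- function(text) {
  sub("Hallo Welt", "Hello world", text, fixed = TRUE)
}

res <- search_with_plugins("scan.pdf", "hello",
                           ocr_fun = fake_ocr,
                           translate_fun = fake_translate)
```

With this shape, the package itself needs no dependency on any particular provider; users supply whichever functions they have credentials for.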
I am now "ready" for our internal usage. I changed the existing code of keyword_search slightly, in three directions:
The code now also has an "Azure-based" implementation of the two extension points. I would be happy to collaborate with you further to move this into the upstream version of the package.
Thanks, @behrica. I'll take a closer look soon. It likely won't be for at least a week or so; I need to get through the end of the semester here first.
We have experimented with two commercial APIs from Azure to OCR scanned PDFs and translate them into English if they are not already in English. In my opinion, they work "well enough" for doing keyword search in the results.
I will try to integrate this functionality into this package for our purposes and work on it in this fork:
https://github.com/openefsa/pdfsearch
I could potentially contribute this as open source here, if you are interested.
Of course, a potential user would need to bring their own Azure API token in order to use the functionality.
In case you do not want to couple the package too tightly to a commercial provider such as Azure,
it might be useful to have two extension points in this package,
which allow users to "plug in"