enhancing this package with "OCR" and "translation" #21

behrica · 2020-11-20T09:14:35Z

We have experimented with two commercial Azure APIs to OCR scanned PDFs and translate them into English when they are not already in English. In my opinion they work "well enough" for keyword search on the results.

I will try to integrate this functionality into this package for our purposes and work on it in this fork:
https://github.com/openefsa/pdfsearch

I could potentially contribute this as open source here, if you are interested.
Of course, users would need to bring their own Azure API token to use the functionality.

If you do not want to couple the package too tightly to a commercial provider such as Azure,
it might be useful to have two extension points in this package
that allow users to plug in:

- an OCR provider
- a translation provider

lebebr01 · 2020-11-20T17:59:03Z

Thanks for looking at this. If this is integrated, I'd prefer the latter approach so the package stays more agnostic. For example, it could use Azure or Tesseract OCR (which has been open-sourced). I have not followed open-source translation options as closely, but if there are options, being flexible would be useful, I think.

If you want to submit a PR that implements the Azure piece, that would be incredibly helpful. I could add the Tesseract approach and generalize the implementation to use whichever the user wishes.

behrica · 2020-11-20T21:25:20Z

I have a working implementation.

One piece of the code could be made provider-agnostic, built around these concepts:

- one function, ocr_pdf, which takes a PDF path as input and returns a character vector
- one function, translate_text, which takes a character vector and returns a character vector
- I added the concepts of "language detection" for the text and of a "target language"; translation is only called if the text language and the target language do not match. A more general case might need to handle several target languages.

Because there are two Azure APIs (one for OCR, one for translation), I need to pass in credentials in some form.

I have the credentials "hardcoded" as function parameters, but we should do this differently.

Since my implementation calls slow and expensive APIs, I also implemented caching via memoization (but this is an implementation detail of the Azure provider).
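To illustrate the caching idea only, here is a minimal sketch in Python rather than R (in R, a package such as memoise offers something similar). `slow_translate` and `cached_translate` are hypothetical names, and the "API" is a stand-in that just tags the text; nothing here is taken from the fork.

```python
from functools import lru_cache

CALLS = {"count": 0}  # counts how often the stand-in "API" is really hit


def slow_translate(text: str, target: str) -> str:
    """Hypothetical stand-in for a slow, paid translation API call."""
    CALLS["count"] += 1
    return f"[{target}] {text}"


@lru_cache(maxsize=None)
def cached_translate(text: str, target: str = "en") -> str:
    # Arguments must be hashable (strings are), so an identical request
    # is answered from the cache instead of calling the API again.
    return slow_translate(text, target)


first = cached_translate("Guten Tag")
second = cached_translate("Guten Tag")  # served from the cache
```

This cache lives in memory only, so it is lost when the process ends; an on-disk cache would be needed to avoid repeated API costs across sessions.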

behrica · 2020-11-20T21:28:24Z

I am not an expert in R. Is there a standard concept of "extension points" in R?
Or is it just done by passing a function into another function?

behrica · 2020-12-09T16:14:59Z

> Thanks for looking at this. If this is integrated, I'd prefer the latter approach so the package stays more agnostic. For example, it could use Azure or Tesseract OCR (which has been open-sourced). I have not followed open-source translation options as closely, but if there are options, being flexible would be useful, I think.
>
> If you want to submit a PR that implements the Azure piece, that would be incredibly helpful. I could add the Tesseract approach and generalize the implementation to use whichever the user wishes.

I am now "ready" for our internal usage.
My colleagues (non-technical users) now have a very easy way to search in PDFs,
regardless of whether they have extractable text or are scanned and/or non-English.

I changed the existing code of keyword_search slightly, in three ways:

1. an extension point to plug in an "OCR" function
2. an extension point to plug in a "translate" function
3. some simple logic to decide whether 1) and 2) should be called, depending on:
   - whether pdf_text returns "empty" text (fewer than 100 characters)
   - a language detector (franc package), which decides whether the current text is already in the target language
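The decision logic described above can be sketched roughly as follows. This is a language-neutral illustration written in Python, not the code from the fork: `prepare_text`, the toy providers, the language code "eng", and stripping whitespace before counting characters are all assumptions made for the example. The two extension points are simply functions passed in as arguments.

```python
from typing import Callable

MIN_CHARS = 100  # threshold mentioned above for "empty" extracted text


def prepare_text(
    extracted: str,
    pdf_path: str,
    ocr: Callable[[str], str],
    translate: Callable[[str, str], str],
    detect_language: Callable[[str], str],
    target_language: str = "eng",
) -> str:
    """Return searchable text, calling OCR / translation only when needed."""
    text = extracted
    # 1) Too little extractable text: treat the PDF as scanned and OCR it.
    if len(text.strip()) < MIN_CHARS:
        text = ocr(pdf_path)
    # 2) Text is not in the target language: translate it.
    if detect_language(text) != target_language:
        text = translate(text, target_language)
    return text


# Toy providers, purely for demonstration (no real OCR or translation).
def fake_ocr(path: str) -> str:
    return "texte reconnu " * 10


def fake_translate(text: str, target: str) -> str:
    return "recognized text " * 10


def fake_detect(text: str) -> str:
    return "fra" if "texte" in text else "eng"


scanned = prepare_text("", "scan.pdf", fake_ocr, fake_translate, fake_detect)
english = "plain english words " * 10
untouched = prepare_text(english, "a.pdf", fake_ocr, fake_translate, fake_detect)
```

Because the providers are ordinary function arguments, an Azure-based or Tesseract-based implementation could be swapped in without changing the decision logic.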

The code now also has an Azure-based implementation of the two extension points.
This is "quick and dirty", but very useful for us.
Its "biggest task" is to chunk the text into pieces small enough for the Azure API to accept.
The OCR API is a push-task-and-poll-status type of API, so I also implemented the "waiting for a result" part.
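As a rough illustration of the two tasks just mentioned (chunking and polling), here is a generic sketch in Python. It does not reproduce the Azure API: `chunk_text`, `poll_until_done`, the size limit, and the convention that a status callback returns `None` while the job is still running are all assumptions for the example.

```python
import time
from typing import Callable, List, Optional


def chunk_text(text: str, max_chars: int) -> List[str]:
    """Split text on whitespace into pieces of at most max_chars characters."""
    chunks: List[str] = []
    current = ""
    for word in text.split():
        # Hard-split any single word that is longer than the limit.
        while len(word) > max_chars:
            if current:
                chunks.append(current)
                current = ""
            chunks.append(word[:max_chars])
            word = word[max_chars:]
        candidate = f"{current} {word}" if current else word
        if len(candidate) <= max_chars:
            current = candidate
        else:
            chunks.append(current)
            current = word
    if current:
        chunks.append(current)
    return chunks


def poll_until_done(
    get_status: Callable[[], Optional[str]],
    interval_s: float = 1.0,
    timeout_s: float = 60.0,
) -> str:
    """Call get_status() until it returns a result or the timeout expires."""
    deadline = time.monotonic() + timeout_s
    while True:
        result = get_status()
        if result is not None:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError("job did not finish in time")
        time.sleep(interval_s)


pieces = chunk_text("aa bb cc dd", max_chars=5)

# Simulated job: "still running" twice, then a result.
responses = iter([None, None, "done"])
outcome = poll_until_done(lambda: next(responses), interval_s=0)
```

Splitting on whitespace keeps words intact, which matters for keyword search; a real provider may count its limit in bytes or per request rather than in characters.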

I would be happy to collaborate with you further to move this into the upstream version of the package.

behrica · 2020-12-09T16:32:39Z

@lebebr01 please let me know if you want me to do anything on #23.

lebebr01 · 2020-12-09T20:55:53Z

Thanks, @behrica. I'll take a look more closely soon. Likely won't be for at least a week or so, I need to get through the end of the semester here first.

behrica mentioned this issue Nov 20, 2020

adding OCR and translation #23

Closed

lebebr01 added this to the v0.4 milestone Nov 16, 2021

lebebr01 linked a pull request Nov 16, 2021 that will close this issue

add OCR / translation #24

Open

lebebr01 self-assigned this Nov 16, 2021

lebebr01 added the enhancement label Nov 16, 2021
