Temporarily disable text extraction #632

Open
jeremyf opened this issue Dec 19, 2022 · 0 comments

jeremyf commented Dec 19, 2022

In reviewing the error logs, we’re seeing some errors around what appear to be invalid characters or unexpected character encoding. One place we’re seeing this is in extracting text from PDFs.

Given that we do not have a clear path for our OCR and that we often have side-car TXT files, we’re proposing disabling the current Tesseract-based full-text extraction from non-text files (e.g. images and PDFs).

The advantage of doing so is that it will speed up imports and, it appears, clear out one or more error pathways based on the existing files.

Related to

jeremyf referenced this issue in notch8/adventist-dl Dec 19, 2022
From the [tracking issue][1]:

> In reviewing the error logs, we’re seeing some errors around what
> appear to be invalid characters or unexpected character encoding. One
> place we’re seeing this is in extracting text from PDFs.
>
> Given that we do not have a clear path for our OCR and that we often
> have side-car TXT files, we’re proposing disabling the current
> Tesseract based full-text extraction from non-text files (e.g. images
> and PDF).
>
> The advantage of doing so is that it will speed up imports and looks
> like it will clear out one or more error pathways based on the
> existing files.

With this commit we're leveraging a configuration option in
[Hyrax::FileSetDerivativesService][2]:

```ruby
def extract_full_text(filename, uri)
  return unless Hyrax.config.extract_full_text?
  Hydra::Derivatives::FullTextExtract.create(filename,
                                             outputs: [{ url: uri, container: "extracted_text" }])
end
```
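
A minimal, self-contained sketch of how that guard behaves when the option is toggled. `SketchConfig` and `SketchDerivatives` are hypothetical stand-ins for `Hyrax.config` and the derivatives service, and the default of `true` is an assumption. In the application itself the option would presumably be turned off in the Hyrax initializer, which this commit message does not show.

```ruby
# Stand-in for Hyrax.config; only the predicate used by the guard is modeled.
class SketchConfig
  attr_writer :extract_full_text

  # Treated as true until explicitly set (an assumption about the default).
  def extract_full_text?
    @extract_full_text.nil? ? true : @extract_full_text
  end
end

# Stand-in for the derivatives service; records calls instead of extracting.
class SketchDerivatives
  attr_reader :extracted

  def initialize(config)
    @config = config
    @extracted = []
  end

  def extract_full_text(filename, uri)
    # Same guard clause as above: do nothing when extraction is disabled.
    return unless @config.extract_full_text?

    @extracted << { filename: filename, url: uri, container: "extracted_text" }
  end
end

config = SketchConfig.new
service = SketchDerivatives.new(config)

config.extract_full_text = false
service.extract_full_text("scan.pdf", "http://example.test/fs/1") # skipped

config.extract_full_text = true
service.extract_full_text("scan.pdf", "http://example.test/fs/1") # recorded
```

With the flag off, the call returns early and nothing is recorded; only the second call reaches the extraction step.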

Closes:

- #183

[1]: https://github.com/scientist-softserv/adventist-dl/issues/183

[2]: https://github.com/samvera/hyrax/blob/cff1ddd18764e4d14a2404d61d20ae776ea62916/app/services/hyrax/file_set_derivatives_service.rb#L103-L107
jeremyf referenced this issue in notch8/adventist-dl Dec 19, 2022
Let's say that after a while we want to reverse this. What do we need
to do for the files that were uploaded without any text extraction?

We could use the following:

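```ruby
# `ids` is assumed to be a list of FileSet IDs that skipped extraction.
ids.each do |id|
  file_set = FileSet.find(id)
  # Assumes the original file is the one to extract text from.
  file_id = file_set.original_file.id
  # The optional `filepath` argument is omitted; no local path is known here.
  filename = Hyrax::WorkingDirectory.find_or_retrieve(file_id, file_set.id)
  Hydra::Derivatives::FullTextExtract.create(
    filename,
    outputs: [{ url: file_set.uri, container: "extracted_text" }]
  )
end
```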
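
For a large backlog, we'd probably not want one bad file to abort the whole run. A sketch of one way to wrap the loop, assuming a caller-supplied `extractor` callable (hypothetical; in the application it would wrap the `FileSet.find` and `FullTextExtract.create` calls above) and collecting failures for later review:

```ruby
# Runs `extractor` for each ID, collecting failures instead of aborting.
# `extractor` is a hypothetical callable supplied by the caller.
def reextract_all(ids, extractor)
  results = { succeeded: [], failed: {} }
  ids.each do |id|
    begin
      extractor.call(id)
      results[:succeeded] << id
    rescue StandardError => e
      # Keep going; remember which ID failed and why.
      results[:failed][id] = "#{e.class}: #{e.message}"
    end
  end
  results
end

# Usage with a fake extractor that fails on one ID.
fake_extractor = lambda do |id|
  raise ArgumentError, "invalid byte sequence in UTF-8" if id == "fs-2"
end

results = reextract_all(%w[fs-1 fs-2 fs-3], fake_extractor)
```

The `failed` hash then gives us a list of IDs to investigate or retry, which matters here because the encoding errors are what prompted the change in the first place.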

@jeremyf jeremyf self-assigned this Dec 19, 2022
@kirkkwang kirkkwang transferred this issue from notch8/adventist-dl May 10, 2024
@jeremyf jeremyf removed their assignment May 30, 2024