This repository has been archived by the owner on Oct 24, 2024. It is now read-only.

Disable extracting text from non-txt files
From the [tracking issue][1]:

> In reviewing the error logs, we’re seeing some errors around what
> appears to be invalid characters or unexpected character encoding. One
> place we’re seeing this in regards to extracting text from PDFs.
>
>  Given that we do not have a clear path for our OCR and that we often
> have side-car TXT files, we’re proposing disabling the current
> Tesseract based full-text extraction from non-text files (e.g. images
> and PDF).
>
> The advantage of doing so is that it will speed up imports and looks
> like it will clear out one or more error pathways based on the
> existing files.

With this commit, we're leveraging a configuration option that
[Hyrax::FileSetDerivativesService][2] checks before extracting text:

```ruby
def extract_full_text(filename, uri)
  return unless Hyrax.config.extract_full_text?
  Hydra::Derivatives::FullTextExtract.create(filename,
                                             outputs: [{ url: uri, container: "extracted_text" }])
end
```
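
As a rough illustration of what that guard does, here is a plain-Ruby sketch
(not Hyrax code). `SketchConfig` and `SketchDerivativesService` are
hypothetical stand-ins that model only the flag and the early return:

```ruby
# Hypothetical stand-in for Hyrax.config; it models only the one flag.
class SketchConfig
  attr_writer :extract_full_text

  def initialize
    @extract_full_text = true # on by default in this sketch
  end

  def extract_full_text?
    @extract_full_text
  end
end

# Hypothetical stand-in for the derivatives service.
class SketchDerivativesService
  attr_reader :extracted

  def initialize(config)
    @config = config
    @extracted = []
  end

  def extract_full_text(filename, uri)
    return unless @config.extract_full_text?
    # The real method calls Hydra::Derivatives::FullTextExtract.create here;
    # this sketch only records that extraction would have run.
    @extracted << [filename, uri]
  end
end

config = SketchConfig.new
config.extract_full_text = false
service = SketchDerivativesService.new(config)
service.extract_full_text("scan.pdf", "http://example.org/fs/1")
```

With the flag set to `false`, the call returns early and `service.extracted`
stays empty.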

Suppose that, after a while, we want to reverse this. What would we need
to do for the files that were uploaded in the meantime without any text
extraction?

We could use the following:

```ruby
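# `ids` is assumed to hold the ids of the FileSets uploaded while
# extraction was disabled.
ids.each do |id|
  file_set = FileSet.find(id)
  # Assumption: the original file is the one to extract text from.
  file_id = file_set.original_file.id
  filename = Hyrax::WorkingDirectory.find_or_retrieve(file_id, file_set.id)
  Hydra::Derivatives::FullTextExtract.create(
    filename,
    outputs: [{ url: file_set.uri, container: "extracted_text" }]
  )
end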
```
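
Because the errors that motivated this change came from unexpected character
encodings, a backfill would probably want to keep going when a single file
fails. Here is a minimal plain-Ruby sketch of that idea. `backfill_full_text`
is a hypothetical helper, and the block stands in for the per-FileSet
extraction shown above:

```ruby
# Hypothetical helper: runs the given block for each id and collects
# failures instead of aborting on the first one.
def backfill_full_text(ids)
  failures = {}
  ids.each do |id|
    begin
      yield id
    rescue StandardError => e
      failures[id] = "#{e.class}: #{e.message}"
    end
  end
  failures
end

# Usage with a stand-in extractor that fails for one made-up id.
failures = backfill_full_text(%w[fs-1 fs-2 fs-3]) do |id|
  raise ArgumentError, "invalid byte sequence" if id == "fs-2"
end
```

The returned hash maps each failed id to its error message, so the failures
can be reviewed or retried separately.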

Closes:

- #183

[1]: https://github.com/scientist-softserv/adventist-dl/issues/183

[2]: https://github.com/samvera/hyrax/blob/cff1ddd18764e4d14a2404d61d20ae776ea62916/app/services/hyrax/file_set_derivatives_service.rb#L103-L107
jeremyf committed Dec 19, 2022
1 parent c37a42d commit cc7e7a6
Showing 1 changed file with 4 additions and 0 deletions.
4 changes: 4 additions & 0 deletions config/initializers/hyrax.rb
@@ -18,6 +18,10 @@
# Email recipient of messages sent via the contact form
config.contact_email = Settings.contact_email

# See https://github.com/scientist-softserv/adventist-dl/issues/183
# Also, we will continue to extract txt file's text using Adventist::TextFileTextExtractionService
config.extract_full_text = false

# Text prefacing the subject entered in the contact form
# config.subject_prefix = "Contact form:"
