Disable extracting text from non-txt files · notch8/adventist-dl@cc7e7a6

This repository has been archived by the owner on Oct 24, 2024. It is now read-only.

From the [tracking issue][1]:

> In reviewing the error logs, we’re seeing some errors around what
> appears to be invalid characters or unexpected character encoding. One
> place we’re seeing this in regards to extracting text from PDFs.
>
> Given that we do not have a clear path for our OCR and that we often
> have side-car TXT files, we’re proposing disabling the current
> Tesseract based full-text extraction from non-text files (e.g. images
> and PDF).
>
> The advantage of doing so is that it will speed up imports and looks
> like it will clear out one or more error pathways based on the
> existing files.

With this commit we're leveraging a configuration option that
[Hyrax::FileSetDerivativesService][2] checks before extracting text:

```ruby
def extract_full_text(filename, uri)
  return unless Hyrax.config.extract_full_text?
  Hydra::Derivatives::FullTextExtract.create(filename,
                                             outputs: [{ url: uri, container: "extracted_text" }])
end
```
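To make the effect of the flag concrete, here is a minimal, self-contained sketch of the same guard pattern. `SketchConfig`, `SketchDerivativesService`, and the `extractor` callable are hypothetical stand-ins, not Hyrax classes; the default of `true` for an unset flag is an assumption of the sketch.

```ruby
# Stand-in for the configuration object; only the one flag is modeled.
class SketchConfig
  attr_writer :extract_full_text

  # Predicate reader. This sketch treats an unset flag as true (an assumption).
  def extract_full_text?
    return true if @extract_full_text.nil?
    @extract_full_text
  end
end

# Stand-in for the derivatives service. `extractor` is a hypothetical callable
# playing the role of Hydra::Derivatives::FullTextExtract.create.
class SketchDerivativesService
  def initialize(config:, extractor:)
    @config = config
    @extractor = extractor
  end

  # Same guard as the method above: do nothing when the flag is off.
  def extract_full_text(filename, uri)
    return unless @config.extract_full_text?
    @extractor.call(filename, outputs: [{ url: uri, container: "extracted_text" }])
  end
end

# Fake extractor that records its calls instead of running any real extraction.
calls = []
extractor = lambda do |filename, outputs:|
  calls << [filename, outputs]
  :extracted
end

config = SketchConfig.new
config.extract_full_text = false # mirrors the initializer change in this commit
service = SketchDerivativesService.new(config: config, extractor: extractor)

service.extract_full_text("scan.pdf", "http://example.test/file_sets/1") # skipped

config.extract_full_text = true # flipping the flag re-enables extraction
service.extract_full_text("scan.pdf", "http://example.test/file_sets/1") # extractor runs
```

Because the guard lives in one method, the initializer change is enough to skip extraction for every newly processed file; files processed while the flag is off are not revisited when it is turned back on.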

Suppose that, after a while, we want to reverse this. What do we need to
do for the files that were uploaded while text extraction was disabled?

We could use the following:

```ruby
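# `ids` is the list of FileSet ids that were uploaded without text extraction.
ids.each do |id|
  file_set = FileSet.find(id)
  # Assumption: the text should come from the file set's original file.
  file_id = file_set.original_file.id
  filename = Hyrax::WorkingDirectory.find_or_retrieve(file_id, file_set.id)
  Hydra::Derivatives::FullTextExtract.create(
    filename,
    outputs: [{ url: file_set.uri, container: "extracted_text" }]
  )
end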
```
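Since the original problem was extraction errors on some files, a backfill like the loop above would likely want to keep going when one file fails. The following is a rough sketch of that idea, not part of this commit: `reextract_all` and the `extractor` callable are hypothetical names, and in the application the callable would wrap the per-file-set steps shown above.

```ruby
# Runs `extractor` for each id and collects failures instead of stopping.
# `extractor` is a hypothetical callable; here it is faked for illustration.
def reextract_all(ids, extractor:)
  failures = {}
  ids.each do |id|
    begin
      extractor.call(id)
    rescue StandardError => e
      # Record the failure and move on to the next id.
      failures[id] = "#{e.class}: #{e.message}"
    end
  end
  failures
end

# Fake extractor that fails on one id, simulating an encoding error.
fake_extractor = lambda do |id|
  raise ArgumentError, "invalid byte sequence in UTF-8" if id == "bad-1"
  :ok
end

failures = reextract_all(%w[good-1 bad-1 good-2], extractor: fake_extractor)
failures.each { |id, message| warn "#{id} failed: #{message}" }
```

Returning the failures as a hash keyed by id makes it straightforward to retry or inspect only the problem files afterwards.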

Closes:

- #183

[1]: https://github.com/scientist-softserv/adventist-dl/issues/183

[2]: https://github.com/samvera/hyrax/blob/cff1ddd18764e4d14a2404d61d20ae776ea62916/app/services/hyrax/file_set_derivatives_service.rb#L103-L107

jeremyf committed Dec 19, 2022

1 parent c37a42d commit cc7e7a6

config/initializers/hyrax.rb

@@ -18,6 +18,10 @@
       # Email recipient of messages sent via the contact form
       config.contact_email = Settings.contact_email
+      # See https://github.com/scientist-softserv/adventist-dl/issues/183
+      # Also, we will continue to extract txt file's text using Adventist::TextFileTextExtractionService
+      config.extract_full_text = false
       # Text prefacing the subject entered in the contact form
       # config.subject_prefix = "Contact form:"