Temporarily disable text extraction #632

jeremyf · 2022-12-19T17:46:29Z

In reviewing the error logs, we’re seeing some errors around what appear to be invalid characters or unexpected character encoding. One place we’re seeing this is when extracting text from PDFs.

Given that we do not have a clear path for our OCR and that we often have side-car TXT files, we’re proposing disabling the current Tesseract-based full-text extraction from non-text files (e.g. images and PDFs).

The advantage of doing so is that it will speed up imports and, based on the existing files, it looks like it will clear out one or more error pathways.

Related to


From the [tracking issue][1]:

> In reviewing the error logs, we’re seeing some errors around what
> appear to be invalid characters or unexpected character encoding. One
> place we’re seeing this is when extracting text from PDFs.
>
> Given that we do not have a clear path for our OCR and that we often
> have side-car TXT files, we’re proposing disabling the current
> Tesseract-based full-text extraction from non-text files (e.g. images
> and PDFs).
>
> The advantage of doing so is that it will speed up imports and, based
> on the existing files, it looks like it will clear out one or more
> error pathways.

With this commit we're leveraging a configuration option in [Hyrax::FileSetDerivativesService][2]:

```ruby
def extract_full_text(filename, uri)
  return unless Hyrax.config.extract_full_text?

  Hydra::Derivatives::FullTextExtract.create(filename,
                                             outputs: [{ url: uri, container: "extracted_text" }])
end
```

Let's say that after a while we want to reverse this. What do we need to do for the files that were uploaded without any text extraction? We could use the following:

```ruby
# `ids` is the list of FileSet IDs that were imported without text extraction.
ids.each do |id|
  file_set = FileSet.find(id)
  # Assumption: we extract from the FileSet's original file.
  file_id = file_set.original_file.id
  filename = Hyrax::WorkingDirectory.find_or_retrieve(file_id, file_set.id)
  Hydra::Derivatives::FullTextExtract.create(
    filename,
    outputs: [{ url: file_set.uri, container: "extracted_text" }]
  )
end
```

Closes:

- #183

[1]: https://github.com/scientist-softserv/adventist-dl/issues/183
[2]: https://github.com/samvera/hyrax/blob/cff1ddd18764e4d14a2404d61d20ae776ea62916/app/services/hyrax/file_set_derivatives_service.rb#L103-L107
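For anyone reading along, here is a rough, self-contained sketch of how the guard in the commit message behaves once the option is switched off. `SketchConfig` and `SketchDerivativesService` are hypothetical stand-ins written for this illustration, not Hyrax classes, and the default of `true` is an assumption. In a real application the switch would presumably be `config.extract_full_text = false` inside the `Hyrax.config` block of the Hyrax initializer; please verify that against the Hyrax version in use.

```ruby
# Hypothetical stand-in for Hyrax's configuration object; not the real class.
class SketchConfig
  attr_writer :extract_full_text

  # Assumed default: extraction stays on until explicitly disabled.
  def extract_full_text?
    @extract_full_text.nil? ? true : @extract_full_text
  end
end

# Hypothetical stand-in for the derivatives service, showing only the guard.
class SketchDerivativesService
  attr_reader :extracted

  def initialize(config)
    @config = config
    @extracted = []
  end

  # Mirrors the guard quoted above: do nothing when extraction is disabled.
  def extract_full_text(filename, uri)
    return unless @config.extract_full_text?

    # The real service calls Hydra::Derivatives::FullTextExtract.create here;
    # this sketch only records what would have been requested.
    @extracted << { filename: filename, url: uri, container: "extracted_text" }
  end
end

config = SketchConfig.new
config.extract_full_text = false

service = SketchDerivativesService.new(config)
service.extract_full_text("scan.pdf", "http://example.test/file_sets/1")
puts service.extracted.empty?
```

With the flag off, the call returns early and nothing is recorded, which is the behavior we are relying on to skip the Tesseract step during imports. Setting the flag back to `true` restores extraction for new uploads only; files imported in the meantime would still need a backfill.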

jeremyf mentioned this issue Dec 19, 2022

Disable extracting text from non-txt files notch8/adventist-dl#184

Merged

jeremyf self-assigned this Dec 19, 2022

kirkkwang transferred this issue from notch8/adventist-dl May 10, 2024

jeremyf removed their assignment May 30, 2024
