This repository has been archived by the owner on Oct 24, 2024. It is now read-only.
Disable extracting text from non-txt files
From the [tracking issue][1]:

> In reviewing the error logs, we’re seeing some errors around what
> appears to be invalid characters or unexpected character encoding. One
> place we’re seeing this in regards to extracting text from PDFs.
>
> Given that we do not have a clear path for our OCR and that we often
> have side-car TXT files, we’re proposing disabling the current
> Tesseract based full-text extraction from non-text files (e.g. images
> and PDF).
>
> The advantage of doing so is that it will speed up imports and looks
> like it will clear out one or more error pathways based on the
> existing files.

With this commit we're leveraging a configuration option in
[Hyrax::FileSetDerivativesService][2]:

```ruby
def extract_full_text(filename, uri)
  # Skip extraction entirely when the configuration option is off.
  return unless Hyrax.config.extract_full_text?

  Hydra::Derivatives::FullTextExtract.create(filename,
                                             outputs: [{ url: uri, container: "extracted_text" }])
end
```

Suppose that, after a while, we want to reverse this. What would we need
to do for the files that were uploaded without any text extraction? We
could use the following:

```ruby
# `ids` is the list of FileSet IDs to backfill.
# NOTE: `file_id` and `filepath` are not defined in this snippet; they
# must be set for each file set (the repository file's ID and a local
# path to it) before this will run.
ids.each do |id|
  file_set = FileSet.find(id)
  filename = Hyrax::WorkingDirectory.find_or_retrieve(file_id, file_set.id, filepath)
  Hydra::Derivatives::FullTextExtract.create(
    filename,
    outputs: [{ url: file_set.uri, container: "extracted_text" }]
  )
end
```

Closes:

- #183

[1]: https://github.com/scientist-softserv/adventist-dl/issues/183
[2]: https://github.com/samvera/hyrax/blob/cff1ddd18764e4d14a2404d61d20ae776ea62916/app/services/hyrax/file_set_derivatives_service.rb#L103-L107
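
As a rough illustration of how the guard clause in `extract_full_text` behaves, here is a self-contained sketch that runs without Hyrax. `StubConfig` and `StubDerivativesService` are hypothetical stand-ins, not Hyrax classes, and the lambda only imitates the call shape of `Hydra::Derivatives::FullTextExtract.create`. The default of "on unless switched off" is an assumption made for the sketch.

```ruby
# Hypothetical stand-in for Hyrax.config; it models only this one flag.
class StubConfig
  attr_writer :extract_full_text

  # Assumption for this sketch: extraction is on unless explicitly disabled.
  def extract_full_text?
    @extract_full_text.nil? ? true : @extract_full_text
  end
end

# Hypothetical stand-in for the derivatives service, with the same guard.
class StubDerivativesService
  def initialize(config, extractor)
    @config = config
    @extractor = extractor
  end

  def extract_full_text(filename, uri)
    # Return early (nil) when the flag is off, so no extraction is attempted.
    return unless @config.extract_full_text?

    @extractor.call(filename, outputs: [{ url: uri, container: "extracted_text" }])
  end
end

# Record each extraction request instead of running Tesseract.
calls = []
extractor = ->(filename, outputs:) { calls << [filename, outputs] }

config = StubConfig.new
service = StubDerivativesService.new(config, extractor)

config.extract_full_text = false
skipped = service.extract_full_text("scan.pdf", "http://example.test/file_sets/1")

config.extract_full_text = true
service.extract_full_text("scan.pdf", "http://example.test/file_sets/1")
```

With the flag off, the method returns `nil` and the extractor is never invoked; flipping it back on restores the original behavior, which is why previously skipped files would need a separate backfill pass.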