Temporarily disable text extraction #632
Comments
jeremyf referenced this issue in notch8/adventist-dl on Dec 19, 2022
From the [tracking issue][1]:

> In reviewing the error logs, we’re seeing some errors around what
> appears to be invalid characters or unexpected character encoding. One
> place we’re seeing this in regards to extracting text from PDFs.
>
> Given that we do not have a clear path for our OCR and that we often
> have side-car TXT files, we’re proposing disabling the current
> Tesseract based full-text extraction from non-text files (e.g. images
> and PDF).
>
> The advantage of doing so is that it will speed up imports and looks
> like it will clear out one or more error pathways based on the
> existing files.

With this commit we're leveraging a configuration option in
[Hyrax::FileSetDerivativesService][2]

```ruby
def extract_full_text(filename, uri)
  return unless Hyrax.config.extract_full_text?

  Hydra::Derivatives::FullTextExtract.create(filename,
                                             outputs: [{ url: uri, container: "extracted_text" }])
end
```

Closes:

- #183

[1]: https://github.com/scientist-softserv/adventist-dl/issues/183
[2]: https://github.com/samvera/hyrax/blob/cff1ddd18764e4d14a2404d61d20ae776ea62916/app/services/hyrax/file_set_derivatives_service.rb#L103-L107
jeremyf referenced this issue in notch8/adventist-dl on Dec 19, 2022
From the [tracking issue][1]:

> In reviewing the error logs, we’re seeing some errors around what
> appears to be invalid characters or unexpected character encoding. One
> place we’re seeing this in regards to extracting text from PDFs.
>
> Given that we do not have a clear path for our OCR and that we often
> have side-car TXT files, we’re proposing disabling the current
> Tesseract based full-text extraction from non-text files (e.g. images
> and PDF).
>
> The advantage of doing so is that it will speed up imports and looks
> like it will clear out one or more error pathways based on the
> existing files.

With this commit we're leveraging a configuration option in
[Hyrax::FileSetDerivativesService][2]

```ruby
def extract_full_text(filename, uri)
  return unless Hyrax.config.extract_full_text?

  Hydra::Derivatives::FullTextExtract.create(filename,
                                             outputs: [{ url: uri, container: "extracted_text" }])
end
```

Suppose that, after a while, we want to reverse this. What would we need
to do for the files that were uploaded without any text extraction? We
could use something like the following:

```ruby
# `ids` is the list of FileSet ids that were imported while extraction was off.
ids.each do |id|
  file_set = FileSet.find(id)
  # Assumption: the original file's id is the repository file id expected here.
  file_id = file_set.original_file.id
  # Fetch (or reuse) a local working copy of the file.
  filename = Hyrax::WorkingDirectory.find_or_retrieve(file_id, file_set.id)
  Hydra::Derivatives::FullTextExtract.create(
    filename,
    outputs: [{ url: file_set.uri, container: "extracted_text" }]
  )
end
```

Closes:

- #183

[1]: https://github.com/scientist-softserv/adventist-dl/issues/183
[2]: https://github.com/samvera/hyrax/blob/cff1ddd18764e4d14a2404d61d20ae776ea62916/app/services/hyrax/file_set_derivatives_service.rb#L103-L107
In reviewing the error logs, we’re seeing some errors around what appear to be invalid characters or unexpected character encodings. One place we’re seeing this is in extracting text from PDFs.
Given that we do not have a clear path for our OCR and that we often have sidecar TXT files, we’re proposing disabling the current Tesseract-based full-text extraction from non-text files (e.g. images and PDFs).
The advantage of doing so is that it will speed up imports, and it looks like it will clear out one or more error pathways triggered by the existing files.
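As a rough illustration of the proposal, here is a minimal, self-contained sketch of how a boolean flag can short-circuit extraction. It mirrors the shape of the `extract_full_text?` guard quoted in the referencing commits, but `SketchConfig`, the `recorder` lambda, the file name, and the example URL are hypothetical stand-ins and not part of Hyrax. Treating an unset flag as enabled is also an assumption.

```ruby
# Hypothetical stand-in for the application's configuration object;
# only the one flag relevant to this issue is modeled.
class SketchConfig
  attr_writer :extract_full_text

  # Assumption: an unset flag is treated as "extraction enabled".
  def extract_full_text?
    @extract_full_text.nil? ? true : @extract_full_text
  end
end

# Same shape as the guard in the quoted method: return early (nil)
# when extraction is disabled, otherwise hand off to the extractor.
def extract_full_text(config, filename, uri, extractor)
  return unless config.extract_full_text?

  extractor.call(filename, outputs: [{ url: uri, container: "extracted_text" }])
end

# A fake extractor that only records how it was called.
calls = []
recorder = lambda do |filename, outputs:|
  calls << [filename, outputs]
  :extracted
end

config = SketchConfig.new

# With the flag off, the extractor is never invoked.
config.extract_full_text = false
disabled_result = extract_full_text(config, "scan.pdf", "http://example.test/fs/1", recorder)

# With the flag back on, extraction runs as before.
config.extract_full_text = true
enabled_result = extract_full_text(config, "scan.pdf", "http://example.test/fs/1", recorder)
```

In the real application the flag would presumably be set once in the Hyrax initializer rather than toggled at runtime; the exact setter should be checked against the Hyrax version in use.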
Related to