This repository has been archived by the owner on Oct 24, 2024. It is now read-only.
Commit
This commit does not belong to any branch on this repository, and may belong to a fork outside of the repository.
Adding logic to extract text from plain text files
This is a first pass of logic that passes unit tests. I am putting this up to close out my work week.

Prior to this commit, we were not handling uploaded plain text files. The impact was that we were not adding the content of the text file to the field we use for full text search results.

With this commit, we insert a derivative service class for handling text files. This service class short-circuits the logic that spans Hyrax and Hydra::Derivatives, which relied on Solr's text extraction service.

Acceptance Criteria

- [ ] In the application, attach a plain text file (I recommend one that ends in `.txt`). That plain text file should have some unique text. Then, after the file has been uploaded, search for the unique text.

Note: As of <2022-10-14 Fri 17:47> I have not tested locally but wanted to get this up for review. If someone has time to pull this down and test locally, that would be wonderful. If not, I will pick this up later on and amend the commit message.

Closes #147
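To illustrate what "inserting a derivative service class" could mean in practice, here is a minimal sketch of first-match service selection. It assumes the application tries an ordered list of services and uses the first one whose `valid?` returns true. Every name below (`SketchFileSet`, `PlainTextService`, `DefaultService`, `SERVICES`, `service_for`) is a hypothetical stand-in, not Hyrax's or this commit's code.

```ruby
# Hypothetical stand-in for a file set; only the MIME type matters here.
SketchFileSet = Struct.new(:mime_type)

# Hypothetical catch-all service that accepts every file set.
class DefaultService
  def initialize(file_set)
    @file_set = file_set
  end

  def valid?
    true
  end
end

# Hypothetical plain-text service that only accepts text/plain file sets.
class PlainTextService
  def initialize(file_set)
    @file_set = file_set
  end

  def valid?
    @file_set.mime_type.to_s.start_with?("text/plain")
  end
end

# Order matters: the specialised service is listed before the catch-all.
SERVICES = [PlainTextService, DefaultService].freeze

# Return the first service instance that reports itself valid.
def service_for(file_set)
  SERVICES.map { |service| service.new(file_set) }.find(&:valid?)
end
```

With this ordering, a `text/plain` file set is handled by the plain-text service and never reaches the catch-all, which is the short-circuit the message describes.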
Showing 3 changed files with 108 additions and 0 deletions.
57 changes: 57 additions & 0 deletions
app/services/adventist/text_file_text_extraction_service.rb
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,57 @@ | ||
# frozen_string_literal: true | ||
|
||
module Adventist | ||
# This class conforms to the interface of a Hyrax::DerivativeService | ||
# | ||
# @see https://github.com/samvera/hyrax/blob/cff1ddd18764e4d14a2404d61d20ae776ea62916/app/services/hyrax/derivative_service.rb#L1 v2.9.5 implementation of Hyrax::DerivativeService | ||
class TextFileTextExtractionService | ||
VALID_MIME_TYPES = ["text/plain"] | ||
attr_reader :file_set | ||
delegate :mime_type, :uri, to: :file_set | ||
def initialize(file_set) | ||
# require "byebug"; byebug | ||
# require "debug"; binding.break | ||
|
||
@file_set = file_set | ||
end | ||
|
||
def cleanup_derivatives; end | ||
|
||
# This is a short-circuit and amalgamation of logic for the Hyrax::DerivativeService ecosystem. | ||
# | ||
# It follows the logic of the following points in the shared code-base: | ||
# | ||
# - https://github.com/samvera/hyrax/blob/cff1ddd18764e4d14a2404d61d20ae776ea62916/app/services/hyrax/file_set_derivatives_service.rb#L99-L107 | ||
# - https://github.com/samvera/hyrax/blob/cff1ddd18764e4d14a2404d61d20ae776ea62916/app/models/concerns/hyrax/file_set/derivatives.rb#L46 | ||
# - https://github.com/samvera/hyrax/blob/cff1ddd18764e4d14a2404d61d20ae776ea62916/app/services/hyrax/persist_directly_contained_output_file_service.rb | ||
# - https://github.com/samvera/hydra-derivatives/blob/f781d112e05155c90d3de9c6bc05308864ecb1cf/lib/hydra/derivatives/processors/full_text.rb#L1 | ||
# - https://github.com/samvera/hydra-derivatives/blob/f781d112e05155c90d3de9c6bc05308864ecb1cf/lib/hydra/derivatives/services/persist_basic_contained_output_file_service.rb#L16-L23 | ||
# - https://github.com/samvera/hydra-derivatives/blob/f781d112e05155c90d3de9c6bc05308864ecb1cf/lib/hydra/derivatives/services/persist_output_file_service.rb | ||
# - https://github.com/samvera/hydra-derivatives/blob/f781d112e05155c90d3de9c6bc05308864ecb1cf/lib/hydra/derivatives/runners/full_text_extract.rb | ||
# | ||
# But avoids the trip to Solr for the extracted text. | ||
# | ||
# @see https://github.com/samvera/hyrax/blob/cff1ddd18764e4d14a2404d61d20ae776ea62916/app/services/hyrax/file_set_derivatives_service.rb#L99-L107 Hyrax::FileSetDerivatives#extract_full_text | ||
def create_derivatives(filename) | ||
file_set.build_extracted_text.tap do |extracted_text| | ||
extracted_text.content = File.read(filename) | ||
extracted_text.mime_type = mime_type | ||
extracted_text.original_name = filename | ||
end | ||
file_set.save | ||
end | ||
|
||
# @note This does not appear to be a necessary method for the interface. | ||
def derivative_url(_destination_name) | ||
"" | ||
end | ||
|
||
def valid? | ||
return true if VALID_MIME_TYPES.detect do |valid_mime_type| | ||
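# Character encoding may be part of the mime_type, so we want both "text/plain" and | ||
# "text/plain;charset=UTF-8" to be valid types. The file set's mime_type is the | ||
# receiver so that a trailing ";charset=..." still matches; #to_s guards against nil. | ||
mime_type.to_s.start_with?(valid_mime_type) | ||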
end | ||
end | ||
end | ||
end |
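The comment in `valid?` says that both "text/plain" and "text/plain;charset=UTF-8" should count as valid. A standalone sketch of that prefix check follows, with the reported MIME type as the receiver of `start_with?`. `plain_text_mime_type?` is a hypothetical helper name used only for illustration, not part of the commit.

```ruby
VALID_MIME_TYPES = ["text/plain"].freeze

# True when the reported MIME type begins with one of the accepted types,
# so a trailing parameter such as ";charset=UTF-8" does not cause a rejection.
# #to_s turns a missing (nil) MIME type into "" rather than raising.
def plain_text_mime_type?(mime_type)
  VALID_MIME_TYPES.any? { |valid| mime_type.to_s.start_with?(valid) }
end
```

The direction of the comparison matters: `"text/plain".start_with?("text/plain;charset=UTF-8")` is false, whereas `"text/plain;charset=UTF-8".start_with?("text/plain")` is true.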
44 changes: 44 additions & 0 deletions
spec/services/adventist/text_file_text_extraction_service_spec.rb
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,44 @@ | ||
# frozen_string_literal: true | ||
require "rails_helper" | ||
|
||
# From Hyrax | ||
require "hyrax/specs/shared_specs/derivative_service" | ||
|
||
RSpec.describe Adventist::TextFileTextExtractionService do | ||
let(:valid_file_set) do | ||
FileSet.new.tap do |f| | ||
allow(f).to receive(:mime_type).and_return(described_class::VALID_MIME_TYPES.first) | ||
end | ||
end | ||
|
||
let(:invalid_file_set) do | ||
FileSet.new.tap do |f| | ||
allow(f).to receive(:mime_type).and_return("image/jpeg") | ||
end | ||
end | ||
|
||
subject { described_class.new(valid_file_set) } | ||
|
||
it_behaves_like "a Hyrax::DerivativeService" | ||
|
||
describe '#valid?' do | ||
context 'when given a non-text format' do | ||
subject { described_class.new(invalid_file_set) } | ||
it { is_expected.not_to be_valid } | ||
end | ||
end | ||
|
||
describe '#create_derivatives' do | ||
let(:filename) { __FILE__ } | ||
let(:valid_file_set) do | ||
FactoryBot.create(:file_set, file: filename).tap do |f| | ||
allow(f).to receive(:mime_type).and_return(described_class::VALID_MIME_TYPES.first) | ||
end | ||
end | ||
|
||
it 'assigns the extracted text to the file_set', aggregate_failures: true do | ||
expect(subject.create_derivatives(filename)).to be_truthy | ||
expect(valid_file_set.extracted_text.content).to eq(File.read(filename)) | ||
end | ||
end | ||
end |
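The behaviour the `#create_derivatives` example asserts can also be tried without a running repository. The sketch below is a rough approximation under stated assumptions: `ExtractedTextStub`, `InMemoryFileSet`, and `extract_plain_text` are hypothetical in-memory stand-ins, and `save` simply returns true instead of persisting anything.

```ruby
require "tempfile"

# Hypothetical stand-in for the extracted-text child object.
ExtractedTextStub = Struct.new(:content, :mime_type, :original_name)

# Hypothetical in-memory file set; nothing is persisted.
class InMemoryFileSet
  attr_reader :mime_type, :extracted_text

  def initialize(mime_type)
    @mime_type = mime_type
  end

  def build_extracted_text
    @extracted_text = ExtractedTextStub.new
  end

  # Pretend persistence always succeeds.
  def save
    true
  end
end

# Same shape as the commit's #create_derivatives: read the file from disk
# and attach its content directly, with no extraction service involved.
def extract_plain_text(file_set, filename)
  file_set.build_extracted_text.tap do |extracted_text|
    extracted_text.content = File.read(filename)
    extracted_text.mime_type = file_set.mime_type
    extracted_text.original_name = filename
  end
  file_set.save
end

# Usage: write a temporary .txt file and extract its content.
file_set = InMemoryFileSet.new("text/plain")
Tempfile.create(["sample", ".txt"]) do |file|
  file.write("some unique text")
  file.flush
  extract_plain_text(file_set, file.path)
end
```

After this runs, `file_set.extracted_text.content` holds the file's text, which mirrors the second expectation in the spec.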