This repository has been archived by the owner on Oct 24, 2024. It is now read-only.

Commit

Adding logic to extract text from plain text files
This is a first pass of logic that passes unit tests.  I am putting this
up to close out my work week.

Prior to this commit, we were not handling uploaded plain text files.
The impact was that we were not adding the content of the text file to
the field we use for full-text search results.

With this commit, we insert a derivative service class for handling text
files.  This service class short-circuits the logic that spans Hyrax
and Hydra::Derivatives, which relied on Solr's text extraction service.

Acceptance Criteria

- [ ] In the application, attach a plain text file (I recommend one that
ends in `.txt`).  That plain text file should have some unique text.
Then after the file has been uploaded, search for the unique text.

Note: As of <2022-10-14 Fri 17:47> I have not tested locally but wanted to
get this up for review.  If someone has time to pull this down and test
locally, that would be wonderful.  If not, I will pick this up later on
and amend the commit message.

Closes #147
jeremyf committed Oct 18, 2022
1 parent 5845f38 commit 1d3e1a9
Showing 3 changed files with 108 additions and 0 deletions.
57 changes: 57 additions & 0 deletions app/services/adventist/text_file_text_extraction_service.rb
@@ -0,0 +1,57 @@
# frozen_string_literal: true

module Adventist
  # This class conforms to the interface of a Hyrax::DerivativeService
  #
  # @see https://github.com/samvera/hyrax/blob/cff1ddd18764e4d14a2404d61d20ae776ea62916/app/services/hyrax/derivative_service.rb#L1 v2.9.5 implementation of Hyrax::DerivativeService
  class TextFileTextExtractionService
    VALID_MIME_TYPES = ["text/plain"]
    attr_reader :file_set
    delegate :mime_type, :uri, to: :file_set
    def initialize(file_set)
      # require "byebug"; byebug
      # require "debug"; binding.break

      @file_set = file_set
    end

    def cleanup_derivatives; end

    # This is a short-circuit and amalgamation of logic for the Hyrax::DerivativeService ecosystem.
    #
    # It follows the logic of the following points in the shared code-base:
    #
    # - https://github.com/samvera/hyrax/blob/cff1ddd18764e4d14a2404d61d20ae776ea62916/app/services/hyrax/file_set_derivatives_service.rb#L99-L107
    # - https://github.com/samvera/hyrax/blob/cff1ddd18764e4d14a2404d61d20ae776ea62916/app/models/concerns/hyrax/file_set/derivatives.rb#L46
    # - https://github.com/samvera/hyrax/blob/cff1ddd18764e4d14a2404d61d20ae776ea62916/app/services/hyrax/persist_directly_contained_output_file_service.rb
    # - https://github.com/samvera/hydra-derivatives/blob/f781d112e05155c90d3de9c6bc05308864ecb1cf/lib/hydra/derivatives/processors/full_text.rb#L1
    # - https://github.com/samvera/hydra-derivatives/blob/f781d112e05155c90d3de9c6bc05308864ecb1cf/lib/hydra/derivatives/services/persist_basic_contained_output_file_service.rb#L16-L23
    # - https://github.com/samvera/hydra-derivatives/blob/f781d112e05155c90d3de9c6bc05308864ecb1cf/lib/hydra/derivatives/services/persist_output_file_service.rb
    # - https://github.com/samvera/hydra-derivatives/blob/f781d112e05155c90d3de9c6bc05308864ecb1cf/lib/hydra/derivatives/runners/full_text_extract.rb
    #
    # But avoids the trip to Solr for the extracted text.
    #
    # @see https://github.com/samvera/hyrax/blob/cff1ddd18764e4d14a2404d61d20ae776ea62916/app/services/hyrax/file_set_derivatives_service.rb#L99-L107 Hyrax::FileSetDerivatives#extract_full_text
    def create_derivatives(filename)
      file_set.build_extracted_text.tap do |extracted_text|
        extracted_text.content = File.read(filename)
        extracted_text.mime_type = mime_type
        extracted_text.original_name = filename
      end
      file_set.save
    end

    # @note This does not appear to be a necessary method for the interface.
    def derivative_url(_destination_name)
      ""
    end

    def valid?
      # Character encoding may be part of the mime_type, so we want both "text/plain" and
      # "text/plain;charset=UTF-8" to be valid.  A nil mime_type is treated as not valid.
      VALID_MIME_TYPES.any? do |valid_mime_type|
        mime_type.to_s.start_with?(valid_mime_type)
      end
    end
  end
end
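
The comment in `valid?` says that both "text/plain" and "text/plain;charset=UTF-8" should be accepted. Below is a minimal, framework-free sketch of a prefix check with that behavior. The constant `ACCEPTED_MIME_TYPES` and the helper `plain_text_mime_type?` are hypothetical names used only for illustration; they are not part of this commit.

```ruby
# Hypothetical constant and helper for illustration; not part of this commit.
ACCEPTED_MIME_TYPES = ["text/plain"].freeze

# Returns true when the reported MIME type begins with an accepted type, so a
# trailing parameter such as ";charset=UTF-8" does not cause a rejection.
# `to_s` turns a nil MIME type into "", which matches nothing.
def plain_text_mime_type?(mime_type)
  ACCEPTED_MIME_TYPES.any? { |accepted| mime_type.to_s.start_with?(accepted) }
end

plain_text_mime_type?("text/plain")               # true
plain_text_mime_type?("text/plain;charset=UTF-8") # true
plain_text_mime_type?("image/jpeg")               # false
plain_text_mime_type?(nil)                        # false
```

The direction of the comparison matters: the reported type is the longer string, so it has to be the receiver of `start_with?`, with the accepted type as the prefix.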
7 changes: 7 additions & 0 deletions config/application.rb
@@ -38,6 +38,13 @@ class Application < Rails::Application
# authenticity token errors.
Hyrax::Admin::AppearancesController.form_class = AppearanceForm

# See https://gitlab.com/notch8/adventist-dl/-/issues/147
#
# By default plain text files are not processed for text extraction. In adding
# Adventist::TextFileTextExtractionService to the beginning of the services array we are
# enabling text extraction from plain text files.
Hyrax::DerivativeService.services.unshift(Adventist::TextFileTextExtractionService)

# Allows us to use decorator files
Dir.glob(File.join(File.dirname(__FILE__), "../app/**/*_decorator*.rb")).sort.each do |c|
Rails.configuration.cache_classes ? require(c) : load(c)
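
Adding the service to the beginning of the array matters if the lookup returns the first service whose `valid?` is true, which is how `Hyrax::DerivativeService.for` appears to behave; treat that as an assumption rather than a statement about the Hyrax API. The sketch below shows the ordering effect with hypothetical stand-in classes (`FakeFileSet`, `DefaultService`, `PlainTextService`, `ServiceRegistry`), none of which are Hyrax classes.

```ruby
# Hypothetical stand-ins for illustration; these are not the Hyrax classes.
FakeFileSet = Struct.new(:mime_type)

# A catch-all service that accepts every file set.
class DefaultService
  def initialize(file_set)
    @file_set = file_set
  end

  def valid?
    true
  end
end

# A service that only accepts plain text file sets.
class PlainTextService
  def initialize(file_set)
    @file_set = file_set
  end

  def valid?
    @file_set.mime_type.to_s.start_with?("text/plain")
  end
end

# A registry that returns the first service reporting itself valid.
class ServiceRegistry
  def initialize(services)
    @services = services
  end

  # Put a service at the front so it is consulted before the others.
  def unshift(service)
    @services.unshift(service)
    self
  end

  def for(file_set)
    @services.map { |service| service.new(file_set) }.find(&:valid?)
  end
end

registry = ServiceRegistry.new([DefaultService])
registry.unshift(PlainTextService)

registry.for(FakeFileSet.new("text/plain")).class # PlainTextService
registry.for(FakeFileSet.new("image/jpeg")).class # DefaultService
```

Had `PlainTextService` been appended instead, the catch-all `DefaultService` would have been chosen for every file set in this sketch.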
44 changes: 44 additions & 0 deletions spec/services/adventist/text_file_text_extraction_service_spec.rb
@@ -0,0 +1,44 @@
# frozen_string_literal: true
require "rails_helper"

# From Hyrax
require "hyrax/specs/shared_specs/derivative_service"

RSpec.describe Adventist::TextFileTextExtractionService do
  let(:valid_file_set) do
    FileSet.new.tap do |f|
      allow(f).to receive(:mime_type).and_return(described_class::VALID_MIME_TYPES.first)
    end
  end

  let(:invalid_file_set) do
    FileSet.new.tap do |f|
      allow(f).to receive(:mime_type).and_return("image/jpeg")
    end
  end

  subject { described_class.new(valid_file_set) }

  it_behaves_like "a Hyrax::DerivativeService"

  describe '#valid?' do
    context 'when given a non-text format' do
      subject { described_class.new(invalid_file_set) }
      it { is_expected.not_to be_valid }
    end
  end

  describe '#create_derivatives' do
    let(:filename) { __FILE__ }
    let(:valid_file_set) do
      FactoryBot.create(:file_set, file: filename).tap do |f|
        allow(f).to receive(:mime_type).and_return(described_class::VALID_MIME_TYPES.first)
      end
    end

    it 'assigns the extracted text to the file_set', aggregate_failures: true do
      expect(subject.create_derivatives(filename)).to be_truthy
      expect(valid_file_set.extracted_text.content).to eq(File.read(filename))
    end
  end
end
