This repository has been archived by the owner on Oct 24, 2024. It is now read-only.

Ensuring that extracted text is encoded
Prior to this commit, I assumed that all extracted text was properly
encoded.  However, in reviewing the underlying data in the application,
I found cases where the extracted text's content was not properly
encoded.  The primary culprit was the Byte Order Mark (BOM) character.

I'm not entirely certain why the original encoding doesn't cover the
problem, but when I was testing in the console, I was getting the
encoding error on the `all_text_timv` Solr field, and it was the BOM
that was causing the problem.

My conjecture is one of three things: Rails's `to_json` is not
recognizing the BOM as valid UTF-8 (which I believe it is), we're
getting something garbled back from Fedora, or the encode method was
not quite right.

Regardless, with this commit, I'm forcing encoding of that plain text
content and removing the BOM character.  Testing this is also a
particular challenge because all of our existing tools for copy/paste
and typing tend to do some hidden encoding antics on our behalf.

Below is a naive example of using `Hyku.utf_8_encode` to strip the
BOM.

```ruby
irb(main):001:0> "\xEF\xBB\xBFHello"
=> "Hello"
irb(main):002:0> "\xEF\xBB\xBFHello" == "Hello"
=> false
irb(main):003:0> Hyku.utf_8_encode("\xEF\xBB\xBFHello") == "Hello"
=> true
```
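
Because editors and clipboards can silently drop or add the BOM, one
way to make a test input trustworthy is to build it from explicit byte
values.  This is a sketch under that assumption; `bom` and `with_bom`
are names introduced here for illustration, and `String#delete` stands
in for `Hyku.utf_8_encode` so the snippet runs outside the Rails app.

```ruby
# Build the BOM from raw byte values so nothing is lost to copy/paste.
bom = [0xEF, 0xBB, 0xBF].pack("C*").force_encoding(Encoding::UTF_8)

with_bom = bom + "Hello"

# The BOM is one character but three bytes.
puts with_bom.length       # 6
puts with_bom.bytesize     # 8
puts with_bom == "Hello"   # false

# Stand-in for Hyku.utf_8_encode: remove every BOM character.
stripped = with_bom.delete(bom)
puts stripped == "Hello"   # true
```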

Closes:

- https://github.com/scientist-softserv/adventist-dl/issues/181

Related to:

- samvera/bulkrax#689
- samvera/bulkrax#688
- https://github.com/scientist-softserv/adventist-dl/issues/179
jeremyf committed Dec 19, 2022
1 parent c37a42d commit 335a55d
Showing 3 changed files with 26 additions and 5 deletions.
2 changes: 1 addition & 1 deletion app/indexers/hyrax/file_set_indexer.rb
@@ -16,7 +16,7 @@ def generate_solr_document
solr_doc['file_format_tesim'] = file_format
solr_doc['file_format_sim'] = file_format
solr_doc['file_size_lts'] = object.file_size[0]
-solr_doc['all_text_timv'] = object.extracted_text.content if object.extracted_text.present?
+solr_doc['all_text_timv'] = Hyku.utf_8_encode(object.extracted_text.content) if object.extracted_text.present?
solr_doc['height_is'] = Integer(object.height.first) if object.height.present?
solr_doc['width_is'] = Integer(object.width.first) if object.width.present?
solr_doc['visibility_ssi'] = object.visibility
5 changes: 1 addition & 4 deletions app/services/adventist/text_file_text_extraction_service.rb
@@ -20,10 +20,7 @@ def self.assign_extracted_text(file_set:, text:, original_file_name:)
#
# Given that we still have the original, and this is a derivative, the forced encoding
# should be acceptable.
-extracted_text.content = text.encode(
-  Encoding.find('UTF-8'),
-  invalid: :replace, undef: :replace, replace: "?"
-)
+extracted_text.content = Hyku.utf_8_encode(text)
extracted_text.mime_type = file_set.mime_type
extracted_text.original_name = original_file_name
end
24 changes: 24 additions & 0 deletions config/application.rb
@@ -10,6 +10,30 @@
Bundler.require(*groups)

module Hyku
+# Providing a common method to ensure consistent UTF-8 encoding. Also removing the tricksy Byte
+# Order Mark (BOM) character, which is an invisible zero-width character.
+#
+# @note In testing, we encountered errors with the file's character encoding
+# (e.g. `Encoding::UndefinedConversionError`). The following will force the encoding to
+# UTF-8 and replace any invalid or undefined characters from the original encoding with a
+# "?".
+#
+# Given that we still have the original, and this is a derivative, the forced encoding
+# should be acceptable.
+#
+# @param string [String]
+# @return [String]
+#
+# @see https://sentry.io/organizations/scientist-inc/issues/3773392603/?project=6745020&query=is%3Aunresolved&referrer=issue-stream
+# @see https://github.com/samvera-labs/bulkrax/pull/689
+# @see https://github.com/samvera-labs/bulkrax/issues/688
+# @see https://github.com/scientist-softserv/adventist-dl/issues/179
+def self.utf_8_encode(string)
+  string
+    .encode(Encoding.find('UTF-8'), invalid: :replace, undef: :replace, replace: "?")
+    .delete("\xEF\xBB\xBF")
+end
+
class Application < Rails::Application
# Settings in config/environments/* take precedence over those specified here.
# Application configuration should go into files in config/initializers
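
To try the `utf_8_encode` logic from `config/application.rb` without booting the Rails app, the method body can be copied into a top-level method. This is a sketch for experimentation only; the two sample inputs are invented here and are not taken from the application's data. The first carries a leading BOM, and the second is a binary-tagged string whose high byte has no defined conversion to UTF-8, which is the kind of input that would otherwise raise `Encoding::UndefinedConversionError`.

```ruby
# Standalone copy of the method body from the diff above, so it can run
# in plain Ruby outside the Hyku module.
def utf_8_encode(string)
  string
    .encode(Encoding.find('UTF-8'), invalid: :replace, undef: :replace, replace: "?")
    .delete("\xEF\xBB\xBF")
end

# Invented sample inputs (assumptions for illustration).
bom_text    = "\uFEFFHello"   # UTF-8 text with a leading BOM
binary_text = "caf\xE9".b     # ASCII-8BIT bytes; 0xE9 is undefined when converting to UTF-8

puts utf_8_encode(bom_text)             # Hello
puts utf_8_encode(binary_text)          # caf?
puts utf_8_encode(binary_text).encoding # UTF-8
```

The replacement with "?" is lossy, which matches the note in the method's comment that the original file is still available and the extracted text is only a derivative.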
