Ensuring that extracted text is encoded

Prior to this commit, I assumed that all extracted text was properly encoded. However, in reviewing the underlying data in the application, I found cases where the extracted text's content was not properly encoded. The primary culprit was the Byte Order Mark (BOM) character.

I'm not entirely certain why the original encoding doesn't cover the problem, but when I was testing in the console, I was getting the encoding error on the `all_text_timv` Solr field, and it was the BOM that was causing the problem.

My conjecture is one of the following: Rails's `to_json` is not recognizing the BOM as valid UTF-8 (which I believe it is); or we're getting something garbled back from Fedora; or the encode method was not quite right.

Regardless, with this commit, I'm forcing encoding of that plain text content and removing the BOM character.

Testing this is also a particular challenge because all of our existing tools for copy/paste and typing tend to do some hidden encoding antics on our behalf.

Below is a naive example of using `Hyku.utf_8_encode` for the BOM stripping. The first result only looks like "Hello"; it still carries the invisible BOM, which is why the comparison on the second line is false.

```ruby
irb(main):001:0> "\xEF\xBB\xBFHello"
=> "Hello"
irb(main):002:0> "\xEF\xBB\xBFHello" == "Hello"
=> false
irb(main):003:0> Hyku.utf_8_encode("\xEF\xBB\xBFHello") == "Hello"
=> true
```

Closes:

- https://github.com/scientist-softserv/adventist-dl/issues/181

Related to:

- samvera/bulkrax#689
- samvera/bulkrax#688
- https://github.com/scientist-softserv/adventist-dl/issues/179
scientist-softserv · Dec 19, 2022 · 335a55d
1 parent c37a42d
commit 335a55d

Showing 3 changed files with 26 additions and 5 deletions.
diff --git a/app/indexers/hyrax/file_set_indexer.rb b/app/indexers/hyrax/file_set_indexer.rb
@@ -16,7 +16,7 @@ def generate_solr_document
         solr_doc['file_format_tesim'] = file_format
         solr_doc['file_format_sim']   = file_format
         solr_doc['file_size_lts'] = object.file_size[0]
-        solr_doc['all_text_timv'] = object.extracted_text.content if object.extracted_text.present?
+        solr_doc['all_text_timv'] = Hyku.utf_8_encode(object.extracted_text.content) if object.extracted_text.present?
         solr_doc['height_is'] = Integer(object.height.first) if object.height.present?
         solr_doc['width_is']  = Integer(object.width.first) if object.width.present?
         solr_doc['visibility_ssi'] = object.visibility

diff --git a/app/services/adventist/text_file_text_extraction_service.rb b/app/services/adventist/text_file_text_extraction_service.rb
@@ -20,10 +20,7 @@ def self.assign_extracted_text(file_set:, text:, original_file_name:)
         #
         # Given that we still have the original, and this is a derivative, the forced encoding
         # should be acceptable.
-        extracted_text.content = text.encode(
-          Encoding.find('UTF-8'),
-          invalid: :replace, undef: :replace, replace: "?"
-        )
+        extracted_text.content = Hyku.utf_8_encode(text)
         extracted_text.mime_type = file_set.mime_type
         extracted_text.original_name = original_file_name
       end

diff --git a/config/application.rb b/config/application.rb
@@ -10,6 +10,30 @@
 Bundler.require(*groups)
 
 module Hyku
+  # Providing a common method to ensure consistent UTF-8 encoding.  Also removing the tricksy Byte
+  # Order Mark (BOM) character, which is an invisible zero-width character.
+  #
+  # @note In testing, we encountered errors with the file's character encoding
+  #       (e.g. `Encoding::UndefinedConversionError`).  The following will force the encoding to
+  #       UTF-8 and replace any invalid or undefined characters from the original encoding with a
+  #       "?".
+  #
+  #       Given that we still have the original, and this is a derivative, the forced encoding
+  #       should be acceptable.
+  #
+  # @param [String]
+  # @return [String]
+  #
+  # @see https://sentry.io/organizations/scientist-inc/issues/3773392603/?project=6745020&query=is%3Aunresolved&referrer=issue-stream
+  # @see https://github.com/samvera-labs/bulkrax/pull/689
+  # @see https://github.com/samvera-labs/bulkrax/issues/688
+  # @see https://github.com/scientist-softserv/adventist-dl/issues/179
+  def self.utf_8_encode(string)
+    string
+      .encode(Encoding.find('UTF-8'), invalid: :replace, undef: :replace, replace: "?")
+      .delete("\xEF\xBB\xBF")
+  end
+
   class Application < Rails::Application
     # Settings in config/environments/* take precedence over those specified here.
     # Application configuration should go into files in config/initializers
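
Because copy/paste tends to hide or drop the BOM, one way to exercise the helper is to build the test strings from explicit byte values. The following is a minimal sketch, not part of the commit: `EncodingSketch` is a hypothetical stand-in that copies the body of `Hyku.utf_8_encode` so the example runs outside the Rails app, and it writes the BOM as `"\uFEFF"` rather than `"\xEF\xBB\xBF"` so the literal is always a single UTF-8 character.

```ruby
# Hypothetical stand-in for Hyku.utf_8_encode (assumption: same behavior as
# the method added in config/application.rb, copied here to run standalone).
module EncodingSketch
  # U+FEFF, the Byte Order Mark; its UTF-8 bytes are EF BB BF.
  BOM = "\uFEFF"

  def self.utf_8_encode(string)
    string
      .encode(Encoding::UTF_8, invalid: :replace, undef: :replace, replace: "?")
      .delete(BOM) # String#delete treats its argument as a set of characters
  end
end

# Build inputs from raw bytes so no editor or clipboard can alter them.
with_bom = [0xEF, 0xBB, 0xBF].pack("C*").force_encoding(Encoding::UTF_8) + "Hello"
latin1   = [0x63, 0x61, 0x66, 0xE9].pack("C*").force_encoding(Encoding::ISO_8859_1)
binary   = [0x48, 0x69, 0xFF].pack("C*") # tagged ASCII-8BIT by pack

cleaned_bom    = EncodingSketch.utf_8_encode(with_bom) # BOM removed
cleaned_latin1 = EncodingSketch.utf_8_encode(latin1)   # transcoded to UTF-8
cleaned_binary = EncodingSketch.utf_8_encode(binary)   # 0xFF has no mapping, becomes "?"

puts with_bom.length      # 6: the BOM counts as one character
puts cleaned_bom.length   # 5
puts cleaned_latin1       # café
puts cleaned_binary       # Hi?
```

Comparing `String#length` or `String#bytes` before and after the call is a more reliable check than eyeballing console output, since the BOM renders as nothing.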