This repository has been archived by the owner on Oct 24, 2024. It is now read-only.
Ensuring that extracted text is encoded
Prior to this commit, I assumed that all extracted text was properly encoded. However, in reviewing the underlying data in the application, I found cases where the extracted text's content was not properly encoded. The primary culprit was the Byte Order Mark (BOM) character.

I'm not entirely certain why the original encoding doesn't cover the problem, but when I was testing in the console, I was getting the encoding error on the `all_text_timv` SOLR field, and it was the BOM that was causing the problem.

My conjecture is that one of three things is happening: Rails's `to_json` is not recognizing the BOM as valid UTF-8 (which I believe it is), we're getting something garbled back from Fedora, or the encode method was not quite right.

Regardless, with this commit, I'm forcing encoding of that plain text content and removing the BOM character.

Testing this is also a particular challenge because all of our existing tools for copy/paste and typing tend to do some hidden encoding antics on our behalf.

Below is a naive example of using `Hyku.utf_8_encode` for the BOM stripping. Note that the BOM is invisible in the first result, even though it is still present in the string.

```ruby
irb(main):001:0> "\xEF\xBB\xBFHello"
=> "Hello"
irb(main):002:0> "\xEF\xBB\xBFHello" == "Hello"
=> false
irb(main):003:0> Hyku.utf_8_encode("\xEF\xBB\xBFHello") == "Hello"
=> true
```

Closes:

- https://github.com/scientist-softserv/adventist-dl/issues/181

Related to:

- samvera/bulkrax#689
- samvera/bulkrax#688
- https://github.com/scientist-softserv/adventist-dl/issues/179
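
For readers unfamiliar with the approach, here is a minimal sketch of what forcing the encoding and stripping the BOM could look like in plain Ruby. This is not the actual `Hyku.utf_8_encode` implementation, which is not shown in this message. The helper name `strip_bom_and_encode` is hypothetical, and the choices to drop invalid bytes and to strip only a leading BOM are assumptions made for illustration.

```ruby
# Hypothetical helper for illustration only; NOT the real Hyku.utf_8_encode.
# U+FEFF is the BOM; in UTF-8 it is the byte sequence EF BB BF.
BOM = "\uFEFF"

def strip_bom_and_encode(text)
  # Work on a copy and label its bytes as UTF-8 without transcoding them.
  utf8 = text.dup.force_encoding(Encoding::UTF_8)
  # Assumption: bytes that are not valid UTF-8 are simply dropped.
  utf8 = utf8.scrub("")
  # Assumption: only a leading BOM is removed.
  utf8.delete_prefix(BOM)
end

strip_bom_and_encode("\xEF\xBB\xBFHello") == "Hello" # => true
```

Writing the BOM with escape sequences, as above, sidesteps the copy/paste problem because the bytes are spelled out explicitly rather than typed or pasted.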