Normalizing serialized data for BOM characters #689
Merged
Prior to this commit, when someone uploaded a CSV, it might have a column name that included the tricksy, invisible [Byte Order Mark][1]. The manifestation was that you could look at the raw metadata and see a key called `"file"`; however, when checking `raw_metadata.key?("file")` the result was `false`.

To find this, I ran `entry.raw_metadata.keys.first.chars.map(&:chr)` and got `["", "f", "i", "l", "e"]`, where the first value of that array was a [Byte Order Mark][1].

With this commit, we're handling both how we persist and how we load persisted serialized data. This follows the documentation for the `ActiveRecord::Base.serialize` method.

It's envisioned that we might have more characters to sanitize.

Closes: #688

Related to: scientist-softserv/adventist-dl#179

[1]: https://en.wikipedia.org/wiki/Byte_order_mark
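To illustrate the "persist and load" idea, here is a minimal sketch of a coder object that strips the BOM on both `dump` and `load`. The module name `BomNormalizingCoder` and its JSON storage format are assumptions made for this example; this is not the code from the pull request, only one way the approach could look.

```ruby
require "json"

# Hypothetical coder for illustration only (not the Bulkrax implementation).
# ActiveRecord's serialize accepts an object that responds to `dump` and `load`.
module BomNormalizingCoder
  BOM = "\uFEFF"

  # Called when persisting: clean the value, then serialize it as JSON.
  def self.dump(value)
    JSON.generate(normalize(value))
  end

  # Called when reading: parse the stored JSON, then clean it again so that
  # rows persisted before the fix are also repaired.
  def self.load(string)
    return {} if string.nil? || string.empty?

    normalize(JSON.parse(string))
  end

  # Recursively remove the BOM from every string, including hash keys.
  def self.normalize(value)
    case value
    when String then value.gsub(BOM, "")
    when Array  then value.map { |item| normalize(item) }
    when Hash   then value.each_with_object({}) { |(k, v), h| h[normalize(k)] = normalize(v) }
    else value
    end
  end
end

raw = { "\uFEFFfile" => "a.pdf" }
round_tripped = BomNormalizingCoder.load(BomNormalizingCoder.dump(raw))
puts raw.key?("file")
puts round_tripped.key?("file")
```

Depending on the Rails version, a coder like this would be handed to `serialize` either as the second positional argument or through the `coder:` keyword. Normalizing in `load` as well as `dump` is what lets already-persisted rows be read correctly without a data migration.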
jeremyf force-pushed the jeremyf---688 branch from 5ea3dba to 8a4e86d on December 15, 2022 22:37
alishaevn reviewed Dec 16, 2022
alishaevn approved these changes Dec 16, 2022
jeremyf added a commit to scientist-softserv/adventist-dl that referenced this pull request Dec 16, 2022
The [Byte Order Mark (BOM)][1] is an invisible character that interferes with field mapping.

In this case we had a CSV with a column named "file" and a mapping with a field named "file". On visual examination, these should have matched. However, the CSV had a BOM character at the beginning of the "file" column name, so the string comparison failed because of the invisible character. The manifestation was a failure to upload and attach files.

Related to:

- samvera/bulkrax#689
- scientist-softserv/adventist-dl#179

[1]: https://en.wikipedia.org/wiki/Byte_order_mark
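The mismatch described above can be reproduced in a few lines. This is a sketch under stated assumptions: the header line, the mapping, and the helper `normalize_headers` are made up for illustration and are not taken from the application.

```ruby
BYTE_ORDER_MARK = "\uFEFF"

# Hypothetical helper: drop a leading BOM from each header name.
def normalize_headers(headers)
  headers.map { |header| header.delete_prefix(BYTE_ORDER_MARK) }
end

# A header line as it might arrive from a file saved as "UTF-8 with BOM".
header_line = "#{BYTE_ORDER_MARK}file,title"
headers = header_line.split(",")
mapping = ["file", "title"]

# The first header looks like "file" when printed, but it does not compare equal.
puts headers.first == "file"

# Inspecting code points reveals the hidden U+FEFF character.
puts headers.first.codepoints.first == 0xFEFF

# After normalization every mapped field is found among the headers.
puts (mapping - normalize_headers(headers)).empty?
```

When reading from a file rather than a string, Ruby can also skip the mark at open time with an external encoding of `bom|utf-8`, for example `File.open(path, "r:bom|utf-8")`.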
jeremyf added a commit to scientist-softserv/adventist-dl that referenced this pull request Dec 19, 2022
Prior to this commit, I assumed that all extracted text was properly encoded. However, in reviewing the underlying data in the application, I found cases where the extracted text's content was not properly encoded. The primary culprit was the Byte Order Mark (BOM) character.

I'm not entirely certain why the original encoding doesn't cover the problem, but when I was testing in the console, I was getting the encoding error on the `all_text_timv` SOLR field, and it was the BOM that was causing the problem.

My conjecture is that either Rails's `to_json` is not recognizing the BOM as correctly encoded UTF-8 (which I believe it is), or we're getting something garbled back from Fedora, or the encode method was not quite right.

Regardless, with this commit, I'm forcing the encoding of that plain text content and removing the BOM character.

Testing this is also a particular challenge because all of our existing tools for copy/paste and typing tend to do some hidden encoding antics on our behalf. Below is a naive example of using `Hyku.utf_8_encode` for the BOM stripping.

```ruby
irb(main):001:0> "\xEF\xBB\xBFHello"
=> "Hello"
irb(main):002:0> "\xEF\xBB\xBFHello" == "Hello"
=> false
irb(main):003:0> Hyku.utf_8_encode("\xEF\xBB\xBFHello") == "Hello"
=> true
```

Closes:

- https://github.com/scientist-softserv/adventist-dl/issues/181

Related to:

- samvera/bulkrax#689
- samvera/bulkrax#688
- https://github.com/scientist-softserv/adventist-dl/issues/179
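The body of `Hyku.utf_8_encode` is not shown in this conversation, so the following is only a guess at the general shape of such a helper. The method name `force_utf_8_without_bom` is hypothetical, and the choice of `String#scrub` to drop invalid bytes is an assumption rather than the commit's actual approach.

```ruby
BOM_CHARACTER = "\uFEFF"

# Hypothetical helper (not the Hyku implementation): treat the bytes as UTF-8,
# drop any byte sequences that are invalid in UTF-8, then remove every BOM.
def force_utf_8_without_bom(text)
  text.dup
      .force_encoding(Encoding::UTF_8)
      .scrub("")
      .gsub(BOM_CHARACTER, "")
end

# Writing the BOM as escaped bytes avoids relying on copy/paste preserving it.
with_bom = "\xEF\xBB\xBFHello"

puts with_bom == "Hello"
puts force_utf_8_without_bom(with_bom) == "Hello"
```

Spelling the BOM as `"\xEF\xBB\xBF"` or `"\uFEFF"` in test fixtures sidesteps the copy/paste problem mentioned above, because the invisible character never has to survive an editor or clipboard.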
This was referenced Dec 19, 2022
Related to: scientist-softserv/adventist_knapsack#634