Normalizing serialized data for BOM characters #689

Merged
merged 3 commits into from
Dec 16, 2022

jeremyf (Contributor) commented Dec 15, 2022

Prior to this commit, an uploaded CSV might have a column name that included the tricksy, invisible Byte Order Mark (BOM). The symptom was that the raw metadata appeared to contain a key called `"file"`, yet `raw_metadata.key?("file")` returned `false`.

To find this, I ran `entry.raw_metadata.keys.first.chars.map(&:chr)` and got `["", "f", "i", "l", "e"]`, where the first value of that array was a Byte Order Mark.
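A minimal sketch of this failure mode in plain Ruby, outside of the application. The `bom_key_report` helper and the sample hash are hypothetical and exist only for illustration; printing code points makes the invisible character visible.

```ruby
BOM_CHAR = "\uFEFF" # U+FEFF; encoded in UTF-8 as the bytes EF BB BF

# Hypothetical helper: reports why a lookup for `expected_key` fails even
# though the first key looks identical when printed.
def bom_key_report(metadata, expected_key)
  first_key = metadata.keys.first
  {
    key_found: metadata.key?(expected_key),
    first_key_codepoints: first_key.codepoints.map { |cp| format("U+%04X", cp) },
    starts_with_bom: first_key.start_with?(BOM_CHAR)
  }
end

# A stand-in for raw metadata parsed from a CSV whose first header has a BOM.
raw_metadata = { "#{BOM_CHAR}file" => "image.png" }

report = bom_key_report(raw_metadata, "file")
puts report[:key_found]                  # => false
puts report[:first_key_codepoints].first # => U+FEFF
```

Inspecting `codepoints` (or `bytes`) is more reliable than inspecting `chars`, because the BOM renders as nothing in most terminals.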

With this commit, we normalize serialized data both when we persist it and when we load it. This follows the documentation for the `ActiveRecord::Base.serialize` method.

It's envisioned that we might have more characters to sanitize.
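As a rough sketch of the idea (not the code merged in this pull request), `ActiveRecord::Base.serialize` accepts a coder object that responds to `load` and `dump`, so normalization can happen in both directions. The module name `NormalizedJsonCoder`, the use of JSON as the storage format, and the `UNWANTED_CHARACTERS` list are all assumptions made for illustration.

```ruby
require "json"

# Hypothetical coder: strips unwanted characters from every string (keys and
# values) both when dumping to the database and when loading from it.
module NormalizedJsonCoder
  # Only the BOM for now; more characters could be appended later.
  UNWANTED_CHARACTERS = ["\uFEFF"].freeze

  module_function

  def dump(value)
    JSON.generate(normalize(value))
  end

  def load(serialized)
    return {} if serialized.nil? || serialized.empty?

    normalize(JSON.parse(serialized))
  end

  def normalize(value)
    case value
    when Hash
      value.each_with_object({}) { |(k, v), hash| hash[normalize(k)] = normalize(v) }
    when Array
      value.map { |element| normalize(element) }
    when String
      UNWANTED_CHARACTERS.reduce(value) { |text, char| text.gsub(char, "") }
    else
      value
    end
  end
end

# In a model this might be wired up as follows (the exact signature depends
# on the Rails version, so treat this as an assumption):
#   serialize :raw_metadata, NormalizedJsonCoder

loaded = NormalizedJsonCoder.load("{\"\uFEFFfile\":\"image.png\"}")
puts loaded.key?("file") # => true
```

Normalizing on load as well as on dump means records persisted before the fix are also cleaned when they are next read.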

Closes: #688
Related to: scientist-softserv/adventist_knapsack#634

@jeremyf jeremyf added the patch-ver for release notes label Dec 15, 2022
@jeremyf jeremyf marked this pull request as draft December 15, 2022 22:49
@jeremyf jeremyf marked this pull request as ready for review December 16, 2022 14:46
@jeremyf jeremyf merged commit f486233 into main Dec 16, 2022
@jeremyf jeremyf deleted the jeremyf---688 branch December 16, 2022 20:59
jeremyf added a commit to scientist-softserv/adventist-dl that referenced this pull request Dec 16, 2022
The [Byte Order Mark (BOM)][1] is an invisible character that
interferes with field mapping.  In this case we had a CSV with a
column named "file" and a mapping with a field named "file".

On visual examination, these should have matched.  However, the CSV had
a BOM character at the beginning of the "file" column name.  Thus the
string comparison failed because of the invisible character.

The result was a failure to upload and attach files.
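
The mismatch can be reproduced with a small, self-contained sketch. It
assumes a comma-separated file with no quoted fields; the helper names are
hypothetical. Ruby's `BOM|UTF-8` open mode discards a leading BOM while
reading, which is one way to avoid the problem at the source.

```ruby
require "tempfile"

# Reads the first line of a CSV-like file and splits it into column names.
# Naive split on commas; enough for this illustration.
def header_row(path, mode)
  File.open(path, mode) { |io| io.readline.chomp.split(",") }
end

# Writes a file that starts with the UTF-8 BOM bytes, then reads its
# header row with and without BOM handling.
def headers_with_and_without_bom
  Tempfile.create(["bom_example", ".csv"]) do |file|
    file.binmode
    file.write("\xEF\xBB\xBFfile,title\nimage.png,An image\n".b)
    file.flush
    {
      raw: header_row(file.path, "r:UTF-8"),
      stripped: header_row(file.path, "r:BOM|UTF-8")
    }
  end
end

headers = headers_with_and_without_bom
mapping_field = "file"
puts headers[:raw].include?(mapping_field)      # => false
puts headers[:stripped].include?(mapping_field) # => true
```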

Related to:

- samvera/bulkrax#689
- scientist-softserv/adventist-dl#179

[1]: https://en.wikipedia.org/wiki/Byte_order_mark
jeremyf added a commit to scientist-softserv/adventist-dl that referenced this pull request Dec 19, 2022
Prior to this commit, I was thinking that all extracted text was
properly encoded.  However, in reviewing the underlying data in the
application I found cases where the extracted text's content was not
properly encoded.  The primary culprit was the Byte Order Marker (BOM)
character.

I'm not entirely certain why the original encoding doesn't cover the
problem, but when I was testing in the console, I was getting the
encoding error on the `all_text_timv` SOLR field; and it was the BOM
that was causing the problem.

My conjecture is that either Rails's `to_json` is not recognizing the
BOM as correctly encoded UTF-8 (which I believe it is), or we're getting
something garbled back from Fedora, or the encode method was not quite
right.

Regardless, with this commit, I'm forcing encoding of that plain text
content and removing the BOM character.  Testing this is also a
particular challenge because all of our existing tools for copy/paste
and typing tend to do some hidden encoding antics on our behalf.

Below is a naive example of using `Hyku.utf_8_encode` for BOM
stripping.

```ruby
irb(main):001:0> "\xEF\xBB\xBFHello"
=> "Hello"
irb(main):002:0> "\xEF\xBB\xBFHello" == "Hello"
=> false
irb(main):003:0> Hyku.utf_8_encode("\xEF\xBB\xBFHello") == "Hello"
=> true
```
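
For readers without the application loaded, the following standalone
sketch approximates the behaviour described above. It is not the actual
`Hyku.utf_8_encode` implementation; the method name `utf8_without_bom` and
the choice to drop invalid bytes rather than replace them are assumptions.

```ruby
BYTE_ORDER_MARK = "\uFEFF"

# Hypothetical stand-in: treats the input as UTF-8, drops any bytes that
# are invalid in UTF-8, and removes every BOM character.
def utf8_without_bom(text)
  text.to_s.dup
      .force_encoding(Encoding::UTF_8)
      .scrub("")
      .gsub(BYTE_ORDER_MARK, "")
end

puts utf8_without_bom("\xEF\xBB\xBFHello") == "Hello" # => true
```

Building the test input from `\x` escapes, as the console example does,
sidesteps editors and clipboards that silently drop or add the BOM.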

Closes:

- https://github.com/scientist-softserv/adventist-dl/issues/181

Related to:

- samvera/bulkrax#689
- samvera/bulkrax#688
- https://github.com/scientist-softserv/adventist-dl/issues/179
Labels
bug-fix patch-ver for release notes
Development

Successfully merging this pull request may close these issues.

Handle invisible characters are included in CSV column names
2 participants