Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support binary data type in build_struct_array. #702

Merged
merged 3 commits into from
Aug 21, 2021

Conversation

zijie0
Copy link
Contributor

@zijie0 zijie0 commented Aug 20, 2021

Which issue does this PR close?

Closes #701

Rationale for this change

We encounter this issue in delta-rs project where partitionValues in delta spec could contain binary type values.

What changes are included in this PR?

Add binary type support and corresponding test case.

Are there any user-facing changes?

No.

@github-actions github-actions bot added the arrow Changes to the arrow crate label Aug 20, 2021
@houqp houqp requested review from nevi-me and Dandandan August 20, 2021 04:13
@codecov-commenter
Copy link

codecov-commenter commented Aug 20, 2021

Codecov Report

Attention: Patch coverage is 91.30435% with 2 lines in your changes missing coverage. Please review.

Project coverage is 82.46%. Comparing base (38e5d22) to head (29e63da).
Report is 2837 commits behind head on master.

Files with missing lines Patch % Lines
arrow/src/json/reader.rs 90.90% 2 Missing ⚠️
Additional details and impacted files
@@           Coverage Diff           @@
##           master     #702   +/-   ##
=======================================
  Coverage   82.45%   82.46%           
=======================================
  Files         168      168           
  Lines       47397    47419   +22     
=======================================
+ Hits        39083    39103   +20     
- Misses       8314     8316    +2     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

Copy link
Member

@jorgecarleitao jorgecarleitao left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks!

Out of curiosity, isn't json always valid utf8?

@@ -668,7 +668,7 @@ mod tests {
"entry",
DataType::Struct(vec![
Field::new("key", DataType::Utf8, false),
Field::new("key", DataType::Int32, true),
Field::new("value", DataType::Int32, true),
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why this?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

An oversight on my end, should have been "value"

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Haha, learning from previous PR and found this typo ;)

arrow/src/json/reader.rs Outdated Show resolved Hide resolved
Copy link
Contributor

@nevi-me nevi-me left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@@ -668,7 +668,7 @@ mod tests {
"entry",
DataType::Struct(vec![
Field::new("key", DataType::Utf8, false),
Field::new("key", DataType::Int32, true),
Field::new("value", DataType::Int32, true),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

An oversight on my end, should have been "value"

rows.iter()
.map(|row| {
let maybe_value = row.get(field.name());
maybe_value.and_then(|value| value.as_str())
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I thought value.as_str() would return an error if the binary data can't be converted to UTF8, but then we probably wouldn't be able to read the file in that case.

Just an observation, no action needed

@nevi-me
Copy link
Contributor

nevi-me commented Aug 20, 2021

Thanks!

Out of curiosity, isn't json always valid utf8?

Yeah, I've only ever base64 encoded to a string, so I wonder if there'd be some data loss if we don't do some conversion to a string representation

@alamb
Copy link
Contributor

alamb commented Aug 20, 2021

I pushed a change to fix a cargo fmt nitpick and once this PR passes CI I plan to merge it in. Thanks again @zijie0 !

@zijie0
Copy link
Contributor Author

zijie0 commented Aug 21, 2021

Thank you all! Feeling excited to become a new contributor.

@alamb alamb merged commit 8308615 into apache:master Aug 21, 2021
@alamb
Copy link
Contributor

alamb commented Aug 21, 2021

Looks great -- thank you again @zijie0 -- I'll plan to include this as part of arrow 5.3.0 (eta release in about 10 days time)

alamb added a commit that referenced this pull request Aug 21, 2021
* Support binary data type in `build_struct_array`.

* Modify test case.

* cargo fmt

Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>
@houqp
Copy link
Member

houqp commented Aug 22, 2021

Yeah, I've only ever base64 encoded to a string, so I wonder if there'd be some data loss if we don't do some conversion to a string representation

I think writer needs to take care of the encoding and the reader needs to decode accordingly. For example encoding arbitrary binary into base64 is one way to do this. Another way is to encode binary as 2 bytes unicode code point. Whatever the encoding/decoding strategy gets picked, the writer is responsible to encode the binary into valid utf8 string. So it's really just a convention that the writer and reader needs to follow.

alamb added a commit that referenced this pull request Aug 23, 2021
* Support binary data type in `build_struct_array`. (#702)

* Support binary data type in `build_struct_array`.

* Modify test case.

* cargo fmt

Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>

* Remove accidentally included test

Co-authored-by: Yuan Zhou <myoilbox@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
arrow Changes to the arrow crate
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Support reading json string into binary data type.
6 participants