-
Notifications
You must be signed in to change notification settings - Fork 1.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Support null values in Avro string columns #6307
Conversation
Thank you @nenorbot !
I suggest that you
I know that is a bit of a pain -- I can help with the merging if you can do steps 1 and 2 🙏 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you @nenorbot -- this looks good to me. I think the next steps are to get a PR to arrow-testing and then we'll get this one ready to go
Value::Bytes(bytes) => String::from_utf8(bytes.to_vec()) | ||
.map_err(AvroError::ConvertToUtf8) | ||
.map(Some), | ||
Value::Null => Ok(None), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks reasonable to me. 👍
let exec = | ||
get_exec(&state, "alltypes_nulls_plain.avro", projection, None).await?; | ||
|
||
let batches = collect(exec, task_ctx).await?; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It might also be worth checking out the https://docs.rs/datafusion/latest/datafusion/macro.assert_batches_eq.html macro to verify the rows / columns in a more easy to maintain wai
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I tried that as well, however since we're explicitly checking for null values, the expected value would be something like
let expected = vec![
"+------------+",
"| string_col |",
"+------------+",
"| |",
"+------------+",
];
... making it hard to differentiate between an empty string and null, so I opted to explicitly test via Array#is_null
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you @nenorbot -- this looks good to me. I think the next steps are to get a PR to arrow-testing and then we'll get this one ready to go
Thanks for the help @alamb . I created an additional PR at apache/arrow-testing#90. Do I pin arrow-testing by modifying |
Yes, I think so -- maybe using the |
To be used for the changes in apache/datafusion#6307 file contents: ``` ❯ avro-tools tojson data/avro/alltypes_nulls_plain.avro {"string_col":null,"int_col":null,"bool_col":null,"bigint_col":null,"float_col":null,"double_col":null,"bytes_col":null} ```
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good to me -- thank you @nenorbot
@nenorbot github says this PR has merge conflicts, for reasons I don't understand. Can you possibly merge up from main to fix this? ![]() |
I think it must have been something to do with pulling new commits from |
I believe github is having issues -- https://www.githubstatus.com/ I restarted the failed check, so hopefully it will pass next ![]() |
I still see some failing checks, but |
I agree - I think github was indeed acting up. I retriggered the failed jobs and hopefully they'll pass this time |
Thanks again @nenorbot |
Which issue does this PR close?
Closes #6305 .
What changes are included in this PR?
Add null handling for string values when converting Avro to Arrow.
Are these changes tested?
I have added unit-tests to check the conversion happens properly (for string columns as well as for several other data types). To do so I had to generate a new Avro file containing null values under
testing/data/avro
. Since this file resides in thearrow-testing
repo, it is not part of this PR (and therefore the tests will fail) -- if these changes are approved I'll follow-up with an additional PR to add the new file toarrow-testing
. Please let me know if there's a better way to go about this other than splitting to separate PRs.Are there any user-facing changes?
No