Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support null values in Avro string columns #6307

Merged
merged 6 commits into from
May 12, 2023

Conversation

nenorbot
Copy link
Contributor

@nenorbot nenorbot commented May 9, 2023

Which issue does this PR close?

Closes #6305 .

What changes are included in this PR?

Add null handling for string values when converting Avro to Arrow.

Are these changes tested?

I have added unit-tests to check the conversion happens properly (for string columns as well as for several other data types). To do so I had to generate a new Avro file containing null values under testing/data/avro. Since this file resides in the arrow-testing repo, it is not part of this PR (and therefore the tests will fail) -- if these changes are approved I'll follow-up with an additional PR to add the new file to arrow-testing. Please let me know if there's a better way to go about this other than splitting to separate PRs.

Are there any user-facing changes?

No

@github-actions github-actions bot added the core Core DataFusion crate label May 9, 2023
@alamb
Copy link
Contributor

alamb commented May 10, 2023

Thank you @nenorbot !

Please let me know if there's a better way to go about this other than splitting to separate PRs.

I suggest that you

  1. make a PR to arrow-testing with the new files
  2. Temporarily (for review) update the pin of arrow-testing to our new branch
  3. Once approved, we'll merge the changes to arrow-testing and then update this PR with the new final pin

I know that is a bit of a pain -- I can help with the merging if you can do steps 1 and 2 🙏

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you @nenorbot -- this looks good to me. I think the next steps are to get a PR to arrow-testing and then we'll get this one ready to go

Value::Bytes(bytes) => String::from_utf8(bytes.to_vec())
.map_err(AvroError::ConvertToUtf8)
.map(Some),
Value::Null => Ok(None),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks reasonable to me. 👍

let exec =
get_exec(&state, "alltypes_nulls_plain.avro", projection, None).await?;

let batches = collect(exec, task_ctx).await?;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It might also be worth checking out the https://docs.rs/datafusion/latest/datafusion/macro.assert_batches_eq.html macro to verify the rows / columns in a more easy to maintain wai

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I tried that as well, however since we're explicitly checking for null values, the expected value would be something like

        let expected = vec![
            "+------------+",
            "| string_col |",
            "+------------+",
            "|            |",
            "+------------+",
        ];

... making it hard to differentiate between an empty string and null, so I opted to explicitly test via Array#is_null

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you @nenorbot -- this looks good to me. I think the next steps are to get a PR to arrow-testing and then we'll get this one ready to go

@nenorbot
Copy link
Contributor Author

2. Temporarily (for review) update the pin of arrow-testing to our new branch

Thanks for the help @alamb . I created an additional PR at apache/arrow-testing#90. Do I pin arrow-testing by modifying .gitsubmodules to point to my fork?

@alamb
Copy link
Contributor

alamb commented May 10, 2023

Thanks for the help @alamb . I created an additional PR at apache/arrow-testing#90. Do I pin arrow-testing by modifying .gitsubmodules to point to my fork?

Yes, I think so -- maybe using the git submodule command

alamb added a commit to apache/arrow-testing that referenced this pull request May 10, 2023
To be used for the changes in
apache/datafusion#6307

file contents:

```
❯ avro-tools tojson data/avro/alltypes_nulls_plain.avro
{"string_col":null,"int_col":null,"bool_col":null,"bigint_col":null,"float_col":null,"double_col":null,"bytes_col":null}
```
Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me -- thank you @nenorbot

@alamb
Copy link
Contributor

alamb commented May 10, 2023

@nenorbot github says this PR has merge conflicts, for reasons I don't understand. Can you possibly merge up from main to fix this?

Screenshot 2023-05-10 at 11 57 58 AM

@nenorbot
Copy link
Contributor Author

@nenorbot github says this PR has merge conflicts, for reasons I don't understand. Can you possibly merge up from main to fix this?

I think it must have been something to do with pulling new commits from arrow-testing. Merged with main now, conflicts seem to have disappeared.

@alamb
Copy link
Contributor

alamb commented May 10, 2023

I believe github is having issues -- https://www.githubstatus.com/

I restarted the failed check, so hopefully it will pass next

Screenshot 2023-05-10 at 12 52 24 PM

@nenorbot
Copy link
Contributor Author

I restarted the failed check, so hopefully it will pass next

I still see some failing checks, but cargo test doc for example works fine locally. Maybe it was github acting up and we should rerun again?

@alamb
Copy link
Contributor

alamb commented May 12, 2023

I still see some failing checks, but cargo test doc for example works fine locally. Maybe it was github acting up and we should rerun again?

I agree - I think github was indeed acting up. I retriggered the failed jobs and hopefully they'll pass this time

@alamb alamb merged commit b9e5c07 into apache:main May 12, 2023
@alamb
Copy link
Contributor

alamb commented May 12, 2023

Thanks again @nenorbot

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
core Core DataFusion crate
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Error when reading Avro containing null in a string column
2 participants