Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support correct output column names and struct field names when consuming/producing Substrait #10829

Merged

Conversation

Blizzara
Copy link
Contributor

@Blizzara Blizzara commented Jun 7, 2024

Which issue does this PR close?

Closes #10817

Rationale for this change

What changes are included in this PR?

  • fix Substrait producer to include also nested names in the root plan names list
  • add support into Substrait consumer to use that list to enforce column and inner struct field names

Are these changes tested?

Tested by a new test specifically for testing Alias (" AS "), as well as tightening existing tests.

Are there any user-facing changes?

Plans produced by DataFusion before this change are not compatible with plans produced after this change, if the plan contains inner struct fields, since the length of the "names" field does not match. The previous behavior was wrong, but still it's a breaking change. If that's considered important, it is possible to add an extra handling for those cases, but I'd prefer not to unless necessary.

Also some output column names might change if people are depending on the old behavior that did not rename them.

@Blizzara Blizzara marked this pull request as ready for review June 7, 2024 15:09
Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks great to me -- thank you @Blizzara (kudos for the tests especially)

I think this PR has some merge conflicts, but once those are resolved we can merge it in

\n TableScan: data projection=[a]",
false // NULL vs Int64(NULL)
true
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🎉

false, // "List(..)" vs "make_array(..)"
)
.await
roundtrip("SELECT [[1,2,3], [], NULL, [NULL]] FROM data").await
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I verified that roundtrip actually is more stringent than assert_expected_plan as it checks the plan before and after roundtripping 👍

@Blizzara Blizzara force-pushed the avo/substrait-consumer-fix-output-column-names branch from e4b4b3d to 6c1bb88 Compare June 11, 2024 06:45
(assert_eq gives nicer error messages than assert)
@Blizzara
Copy link
Contributor Author

Looks great to me -- thank you @Blizzara (kudos for the tests especially)

I think this PR has some merge conflicts, but once those are resolved we can merge it in

Thanks for the review @alamb - I've fixed the conflicts, should be good to merge!

@alamb
Copy link
Contributor

alamb commented Jun 11, 2024

🚀 -- thanks again @Blizzara

@alamb alamb merged commit 1b89da4 into apache:main Jun 11, 2024
23 checks passed
@Blizzara Blizzara deleted the avo/substrait-consumer-fix-output-column-names branch June 11, 2024 18:40
findepi pushed a commit to findepi/datafusion that referenced this pull request Jul 16, 2024
…ming/producing Substrait (apache#10829)

* produce flattened list of names including inner struct fields

* add a (failing) test

* rename output columns (incl. inner struct fields) according to the given list of names

* fix a test

* add column names project to the new TPC-H test and fix case

(assert_eq gives nicer error messages than assert)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Substrait consumer doesn't respect final output column names
2 participants