Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Implement CardinalityAwareRowConverter while doing streaming merge #7401

Merged
merged 61 commits into from
Sep 18, 2023
Merged

Implement CardinalityAwareRowConverter while doing streaming merge #7401

merged 61 commits into from
Sep 18, 2023

Conversation

JayjeetAtGithub
Copy link
Contributor

Which issue does this PR close?

Closes #7200.

Rationale for this change

This PR uses the CardinalityAwareRowConverter implemented in arrow-row in this PR.

What changes are included in this PR?

Are these changes tested?

Are there any user-facing changes?

@github-actions github-actions bot added the core Core DataFusion crate label Aug 24, 2023
@JayjeetAtGithub JayjeetAtGithub changed the title Use CardinalityAwareRow converter Use CardinalityAwareRow converter while doing streaming merge Aug 24, 2023
@JayjeetAtGithub JayjeetAtGithub changed the title Use CardinalityAwareRow converter while doing streaming merge Use CardinalityAwareRowConverter while doing streaming merge Aug 25, 2023
@JayjeetAtGithub JayjeetAtGithub marked this pull request as ready for review August 29, 2023 20:35
Copy link
Contributor

@tustvold tustvold left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Left some comments, I think it would also be good to check the benchmarks to see how this impacts performance

datafusion/core/src/physical_plan/wrapper.rs Outdated Show resolved Hide resolved
datafusion/core/src/physical_plan/wrapper.rs Outdated Show resolved Hide resolved
datafusion/core/src/physical_plan/wrapper.rs Outdated Show resolved Hide resolved
datafusion/core/src/physical_plan/mod.rs Outdated Show resolved Hide resolved
@JayjeetAtGithub
Copy link
Contributor Author

Left some comments, I think it would also be good to check the benchmarks to see how this impacts performance

@tustvold I benchmarked the current implementation of CardinalityAwareRowConverter with RowConverter and didn't see any performance regression. See the results here.

@alamb
Copy link
Contributor

alamb commented Sep 5, 2023

I plan to review this PR shortly

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @JayjeetAtGithub -- this PR is looking close. I left a few comments and there appears to be a CI test failure and clippy issues that need to be resolved.

Also, is there an end to end test you can add showing how this PR avoids a memory explosion while merging?

datafusion/core/src/physical_plan/wrapper.rs Outdated Show resolved Hide resolved
datafusion/core/src/physical_plan/wrapper.rs Outdated Show resolved Hide resolved
datafusion/core/src/physical_plan/wrapper.rs Outdated Show resolved Hide resolved
datafusion/core/src/physical_plan/wrapper.rs Outdated Show resolved Hide resolved
@alamb
Copy link
Contributor

alamb commented Sep 8, 2023

In order to move this PR along, I plan to work on it later this morning. I'll plan to push some changes

@JayjeetAtGithub
Copy link
Contributor Author

@alamb I am working on testing out the wrapper in IOx over the OOM'ed Jaegar queries and adding some changes necessary to make that work.

@alamb
Copy link
Contributor

alamb commented Sep 13, 2023

Update: I think this PR is ready to go except for figuring out what the proper value for the LOW_CARDINALITY cutoff is

I believe @tustvold is checking setting it to zero via apache/arrow-rs#4811. @JayjeetAtGithub can you look into seeing what the performance threshold is?

The idea would be to test the performance of merge on a column of different cardinalities -- maybe cardinality 4, 8, 12, 20, 50 and 100. Maybe there is an existing benchmark that could be used in https://github.com/apache/arrow-rs/blob/master/arrow/benches/row_format.rs

@JayjeetAtGithub
Copy link
Contributor Author

Sure, I can do that

@JayjeetAtGithub
Copy link
Contributor Author

JayjeetAtGithub commented Sep 15, 2023

The idea would be to test the performance of merge on a column of different cardinalities -- maybe cardinality 4, 8, 12, 20, 50 and 100

chart

Row conversion duration vs cardinality for dict preserving on/off. The absolute numbers are in microseconds. See sheet.

This chart shows the durations (in microseconds) taken to convert a RecordBatch (dict<int,utf8>, int) to the Row format. We sweep across the cardinality of the dictionary encoded field from 1 to 500000 and turn dictionary preserving on/off. We just measure the time for RowConverter::convert_columns. See b05919f.

From the graph, it looks like 500 is a good threshold to change to preserve_dict=false, so we set the value of LOW_CARDINALITY_THRESHOLD in the PR to be 512.

@github-actions github-actions bot added sql SQL Planner logical-expr Logical plan and expressions physical-expr Physical Expressions optimizer Optimizer rules sqllogictest SQL Logic Tests (.slt) and removed sql SQL Planner logical-expr Logical plan and expressions physical-expr Physical Expressions optimizer Optimizer rules sqllogictest SQL Logic Tests (.slt) labels Sep 15, 2023
@alamb
Copy link
Contributor

alamb commented Sep 18, 2023

I am going to resolve the merge conflicts and get this PR ready to go

@alamb
Copy link
Contributor

alamb commented Sep 18, 2023

@tustvold and I talked about this PR -- it is likely that when arrow-rs 47 is released (with apache/arrow-rs#4819) this code will become outdated, we think it is an improvement over the current main. Thus I plan to merge this PR when the tests pass

@alamb alamb merged commit f4c4ee1 into apache:main Sep 18, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
core Core DataFusion crate
Projects
None yet
Development

Successfully merging this pull request may close these issues.

RowConverter keeps growing in size while merging streams on high-cardinality dictionary fields
3 participants