[FEAT]: Sql joins with duplicate cols #3241

universalmind303
Collaborator

Adds support for the UNION, UNION ALL, and EXCEPT set operations, and fixes an issue when performing joins with duplicate columns (#3194).
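
To illustrate the duplicate-column case: in SQL, joining two tables that share a column name produces a result set in which that name appears twice. The sketch below uses Python's standard-library `sqlite3` module rather than Daft, and the tables `a` and `b` are hypothetical. It only shows the SQL-side behavior a planner has to accommodate, not how this PR implements it.

```python
import sqlite3

# In-memory database with two hypothetical tables sharing the column "id".
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE a (id INTEGER, x TEXT);
    CREATE TABLE b (id INTEGER, y TEXT);
    INSERT INTO a VALUES (1, 'left');
    INSERT INTO b VALUES (1, 'right');
    """
)

# SELECT * over a join keeps the columns of both sides, duplicates included.
cursor = conn.execute("SELECT * FROM a JOIN b ON a.id = b.id")
column_names = [col[0] for col in cursor.description]
rows = cursor.fetchall()

print(column_names)  # "id" is listed once per side of the join
print(rows)
```

A dataframe schema that requires unique column names cannot represent this result directly, which is why one side has to be renamed or qualified somewhere in the plan.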

github-actions bot added the "enhancement" (New feature or request) label on Nov 7, 2024

codspeed-hq bot commented Nov 7, 2024

CodSpeed Performance Report

Merging #3241 will improve performance by ×2

Comparing universalmind303:sql-joins-with-duplicate-cols (ebf9644) with main (bdd25b6)

Summary

⚡ 1 improvements
✅ 16 untouched benchmarks

Benchmarks breakdown

| Benchmark | main | universalmind303:sql-joins-with-duplicate-cols | Change |
| --- | --- | --- | --- |
| test_show[100 Small Files] | 48.3 ms | 24.1 ms | ×2 |

Contributor

@advancedxy advancedxy left a comment

@universalmind303 I think the Union/Except set operation support could go in a separate PR, and if wanted, I can help contribute it.

Two review threads on src/daft-sql/src/planner.rs (outdated, resolved)
@universalmind303
Collaborator Author

> I think the Union/Except set operation support could go in a separate PR, and if wanted, I can help contribute it.

Yeah, that makes sense, will split them out!

universalmind303 changed the title from "[FEAT]: Sql joins with duplicate cols and EXCEPT set operation" to "[FEAT]: Sql joins with duplicate cols" on Nov 7, 2024
@kevinzwang
Member

kevinzwang commented Nov 7, 2024

@universalmind303 I feel like join behavior diverges quite a bit between SQL and dataframe. Should we somehow split them up instead of tacking the SQL parser on top of the dataframe one, as we are currently doing?

Right now I create a project on the right table before the join to handle common column names for dataframe. We could move that out of the actual join op creation and have separate implementations of that project for each API.
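
A minimal sketch of that pre-join projection, assuming the right-side join keys are merged into the left ones and that clashing non-key columns get a prefix and/or suffix. The function name `project_right_for_join` and the default prefix `"right."` are hypothetical and not taken from the Daft codebase.

```python
from typing import Dict, Iterable, Set


def project_right_for_join(
    left_cols: Iterable[str],
    right_cols: Iterable[str],
    join_keys: Set[str],
    prefix: str = "right.",  # assumed default, for illustration only
    suffix: str = "",
) -> Dict[str, str]:
    """Map each right-side column to the name it should have after the join."""
    left = set(left_cols)
    mapping: Dict[str, str] = {}
    for name in right_cols:
        if name in join_keys:
            # Assumption: the key column is merged with the left key, so the
            # right copy is dropped from the output.
            continue
        if name in left:
            # Name clash with a left column: disambiguate the right one.
            mapping[name] = f"{prefix}{name}{suffix}"
        else:
            mapping[name] = name
    return mapping


renames = project_right_for_join(["id", "v"], ["id", "v", "w"], {"id"})
print(renames)
```

Keeping this as a standalone function is one way each API could supply its own policy: the dataframe path could rename as above, while a SQL path could keep both copies under qualified names.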

@universalmind303
Collaborator Author

> @universalmind303 I feel like join behavior diverges quite a bit between SQL and dataframe. Should we somehow split them up instead of tacking the SQL parser on top of the dataframe one, as we are currently doing?
>
> Right now I create a project on the right table before the join to handle common column names for dataframe. We could move that out of the actual join op creation and have separate implementations of that project for each API.

Yeah, I'm not a huge fan of all the extra work needed to replicate SQL-style joins in our engine. I think this likely involves a larger refactor of how we handle joins and LogicalPlans, though. The LogicalPlan doesn't have a concept of a relation/table name, which causes us to do all of this extra work.
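
For context, here is a rough sketch of what a schema with relation names could look like. It is an illustration only: `QualifiedField` and `resolve` are hypothetical names, not part of Daft's LogicalPlan.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class QualifiedField:
    """A column that remembers which relation (table or alias) it came from."""
    relation: Optional[str]
    name: str


def resolve(fields: List[QualifiedField], ref: str) -> QualifiedField:
    """Resolve "name" or "relation.name" against a list of qualified fields."""
    relation, _, name = ref.rpartition(".")
    matches = [
        f for f in fields
        if f.name == name and (relation == "" or f.relation == relation)
    ]
    if not matches:
        raise LookupError(f"unknown column: {ref}")
    if len(matches) > 1:
        raise LookupError(f"ambiguous column: {ref}")
    return matches[0]


# Schema of a hypothetical `a JOIN b`: both sides keep their "id" column.
joined = [
    QualifiedField("a", "id"), QualifiedField("a", "x"),
    QualifiedField("b", "id"), QualifiedField("b", "y"),
]

print(resolve(joined, "b.id"))
print(resolve(joined, "y"))
```

With relation names in the schema, `a.id` and `b.id` can coexist without renaming, and an unqualified `id` can be reported as ambiguous instead of being silently resolved.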

@kevinzwang
Member

> > @universalmind303 I feel like join behavior diverges quite a bit between SQL and dataframe. Should we somehow split them up instead of tacking the SQL parser on top of the dataframe one, as we are currently doing?
> > Right now I create a project on the right table before the join to handle common column names for dataframe. We could move that out of the actual join op creation and have separate implementations of that project for each API.
>
> Yeah, I'm not a huge fan of all the extra work needed to replicate SQL-style joins in our engine. I think this likely involves a larger refactor of how we handle joins and LogicalPlans, though. The LogicalPlan doesn't have a concept of a relation/table name, which causes us to do all of this extra work.

Would you like to start on that refactor in this PR, or do you think it's worth merging this fix in for now? If the latter, I can give this a review.

@advancedxy
Contributor

> @universalmind303 I feel like join behavior diverges quite a bit between SQL and dataframe. Should we somehow split them up instead of tacking the SQL parser on top of the dataframe one, as we are currently doing?

I feel the same way. logical_plan::Join::try_new already passes join_suffix and join_prefix to distinguish non-common join keys. Maybe we should leverage that to cover common join keys as well? At least then we would have consistent semantics between the SQL and DataFrame APIs.

Member

@kevinzwang kevinzwang left a comment

LGTM. One thing I think we should do at some point is move the project/renaming logic out of Join::try_new, so that it can be tested independently and other places that create joins, such as LogicalPlan::with_new_children, do not need to worry about it. But the code here is fine to merge in.

@universalmind303 universalmind303 merged commit f290f40 into Eventual-Inc:main Nov 9, 2024
40 checks passed