Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[FEAT] Support intersect as a DataFrame API #3134

Merged
merged 2 commits into from
Nov 12, 2024

Conversation

advancedxy
Copy link
Contributor

@advancedxy advancedxy commented Oct 28, 2024

This commit leverages null safe equal support in joins(see #3069 and #3161) to support intersect API.

Partially fixes #3122.

@github-actions github-actions bot added the enhancement New feature or request label Oct 28, 2024
StructArray::new(struct_field, values, None).into_series()
}
}
}

pub fn to_series(&self) -> Series {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Previous to_series doesn't work with Struct, as struct has its own field names but to_series always generate field with "literal".

#[cfg(feature = "python")]
DataType::Python => {
use pyo3::prelude::*;
Self::Python(PyObjectWrapper(Python::with_gil(|py| py.None())))
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure whether None is the right choice for Python type.

Copy link

codspeed-hq bot commented Oct 28, 2024

CodSpeed Performance Report

Merging #3134 will not alter performance

Comparing advancedxy:intersect_operation (dcddfb5) with main (baca61e)

Summary

✅ 17 untouched benchmarks

@advancedxy
Copy link
Contributor Author

@kevinzwang @universalmind303 appreciated if you can take a look at this.

@@ -133,6 +134,39 @@ def lit(value: object) -> Expression:
return Expression._from_pyexpr(lit_value)


def zero_lit(dt: DataType) -> Expression:
Copy link
Contributor Author

@advancedxy advancedxy Oct 29, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It might be useful to expose this to Python side as well.

I can remove this if it's not desired.

@kevinzwang kevinzwang self-requested a review October 29, 2024 17:04
Copy link
Member

@kevinzwang kevinzwang left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @advancedxy, thank you for working on this! Really appreciate the work that you've done.

I don't have specific comments about the code yet, but from a cursory look at the PR, I don't see the need for the zero_lit expression. If you take a look at our join functions, we use arrow2's build_multi_array_is_equal to construct an equality check, and that function takes an argument nulls_equal.

let is_equal = build_multi_array_is_equal(
lkeys.columns.as_slice(),
rkeys.columns.as_slice(),
false,
false,
)?;

I believe if we propagate a variable to set that, a null-safe join would automatically work, since the hashes used by the probe are already properly constructed for nulls as well.

Could you give that a try?

@advancedxy
Copy link
Contributor Author

I believe if we propagate a variable to set that, a null-safe join would automatically work, since the hashes used by the probe are already properly constructed for nulls as well.

Hey @kevinzwang kevin, that was exactly my first thought as well. When I created the original issue, I think we can add null equal safe joins first and then leverage that to support the intersect operation. However, passing the parameter from the python side all the way down to the Rust's physical join plan, it seems it might touch a lot of code and I referenced other(a.k.a Spark) query engine's implementation and noticed that null safe equality could be effective rewrote as nvl + is_null, thus this PR. It might have other benefits to open more optimization opportunities when the null safe equality is rewritten. However, that's not the case for Arrow/Daft right now. I'm ok to not introduce that expression then.

Could you give that a try?

Of course, and taking a step back from here, I think we can add null safe equal in joins in the Rust side first, then leverage that to support this PR's intersection operator and then finish Python side's API. How does that sound to you?

@kevinzwang
Copy link
Member

Of course, and taking a step back from here, I think we can add null safe equal in joins in the Rust side first, then leverage that to support this PR's intersection operator and then finish Python side's API. How does that sound to you?

Yep, that sounds like a good plan. Thank you again for taking this on!

@advancedxy
Copy link
Contributor Author

Hey @kevinzwang @universalmind303 PTAL at this after #3161 is merged, thanks.

@advancedxy
Copy link
Contributor Author

The CI failure seems unrelated. Close and re-open to trigger a new CI run.

@advancedxy
Copy link
Contributor Author

Since #3161 is merged, I think this is ready for review.

Comment on lines +509 to +514
pub fn intersect(&self, other: &Self, is_all: bool) -> DaftResult<Self> {
let logical_plan: LogicalPlan =
ops::Intersect::try_new(self.plan.clone(), other.plan.clone(), is_all)?
.to_optimized_join()?;
Ok(self.with_new_plan(logical_plan))
}
Copy link
Collaborator

@universalmind303 universalmind303 Nov 8, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

instead of immediately converting it to a join, I think we should instead defer this until translate stage.

If we want to have a concept of a logical Intersect as discussed here, then I don't think we should immediately convert it to a logical join, but only convert it once to a physical join during the logical -> physical translation.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should instead defer this until translate stage.

That's definitely an option. And DuckDB does that indeed: https://github.com/duckdb/duckdb/blob/a2dce8b1c9fa6039c82e9a32bfcc4c49b03ca871/src/execution/physical_plan/plan_set_operation.cpp#L53. The problem is like you stated in #3241 (comment), it will make the optimization rules more complex: a lot of rules should be amend to be set-operations aware.

On the other side, Spark doesn't defer this to translate stage, the intersect/except operators are optimized during the optimization phase, which makes the optimization phase simpler(at least for set-operation handling).

From my perspective, I agree with you that we should convert set operations to join early to avoid complexing optimization phase currently. I just want to decouple the logical into a separate func/struct and make sure it's extensible for long term plan. I can avoid define these operations if that doesn't sound right to you. WDYT?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok, Lets go with what's in here now. We can always revisit it needed 😅

@universalmind303 universalmind303 merged commit a547bd3 into Eventual-Inc:main Nov 12, 2024
73 of 75 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

Successfully merging this pull request may close these issues.

SQL: Add INTERSECT support
3 participants