-
Notifications
You must be signed in to change notification settings - Fork 807
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Push SortOptions into DynComparator Allowing Nested Comparisons (#5426) #5792
Conversation
/// # use arrow_ord::cmp; | ||
/// # use arrow_ord::ord::make_comparator; | ||
/// # use arrow_schema::{ArrowError, SortOptions}; | ||
/// fn eq(a: &dyn Array, b: &dyn Array) -> Result<BooleanArray, ArrowError> { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I debated adding SortOptions to the existing comparison kernels, but that would not only be breaking change, but is also somewhat incoherent. Whilst I accept postgres and some other systems do this, it is rather surprising, at least to me.
Having them separate also makes it more obvious what has an optimised vectorised kernel and what doesn't
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree with the API choice 👍
I recommend adding the rationale to the docstring at the stop ("this function does not support overrriding NULL handling to ensure optimized kernels, see make_comparator
for more control over ordering)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The docstring talks about comparsions and total order and this function implement eq
-- woudl it be better ti implement cmp
to better match the text?
598c2f1
to
15b180e
Compare
15b180e
to
10ce321
Compare
fn compare_boolean(left: &dyn Array, right: &dyn Array) -> DynComparator { | ||
let left: BooleanArray = left.as_boolean().clone(); | ||
let right: BooleanArray = right.as_boolean().clone(); | ||
fn compare_impl<const NULLS_FIRST: bool, const DESCENDING: bool, F>( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What is the benefit over using boolean parameters?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Avoids conditional branches in the comparator closure
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see. 💯
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The way I think of this is that the <const ...>
template parameters basically force the compiler to make special versions of this function -- so the downside is potentially more code (you now get a new copy of the function for each distinct set of type parameters) but the upside is that each of those copies has more optimization opportunities (like the arguments are constant)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The code looks good to me -- thank you @tustvold . I think it is a very elegant way to solve the problems
Maybe I missed it, but I didn't see any new tests for comparing structs / lists with nested nullability, nor did I see any changes to existing tests.
I think it is important to add tests to both document precisely what the expected behavior as well as ensure we don't introduce a regression in some future refactot
Can you please also run the benchmarks to ensure this doesn't cause any regressions?
fn compare_boolean(left: &dyn Array, right: &dyn Array) -> DynComparator { | ||
let left: BooleanArray = left.as_boolean().clone(); | ||
let right: BooleanArray = right.as_boolean().clone(); | ||
fn compare_impl<const NULLS_FIRST: bool, const DESCENDING: bool, F>( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The way I think of this is that the <const ...>
template parameters basically force the compiler to make special versions of this function -- so the downside is potentially more code (you now get a new copy of the function for each distinct set of type parameters) but the upside is that each of those copies has more optimization opportunities (like the arguments are constant)
arrow-ord/src/ord.rs
Outdated
/// | ||
/// # Postgres-compatible Nested Comparison | ||
/// | ||
/// Whilst SQL prescribes ternary logic for nulls, that is comparing a value against a null yields |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I find the use of terms ternary logic / total ordering/etc somewhat hard to grok exactly what this is saying
Perhaps we can offer an example here -- like
If you want to compare nested nulls in your structures so that
{ a: 1, b: null }
null
{ a: 2, b: 1 }
Sorts like
{ a: 2, b: 1 }
{ a: 1, b: null }
null
if you specify DESC NULL LAST
(if I got that right)
/// assert_eq!(std::cmp::Ordering::Less, cmp(0, 1)); | ||
/// assert_eq!(cmp(0, 1), Ordering::Less); | ||
/// | ||
/// let array1 = Int32Array::from(vec![Some(1), None]); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think making this a separate section / example might be clearer
/// let array1 = Int32Array::from(vec![Some(1), None]); | |
/// | |
/// # Example ordering with NULL | |
/// let array1 = Int32Array::from(vec![Some(1), None]); |
Also it might be helpful here to add an example of SortOptions that are not the default (to show the difference)
/// # use arrow_ord::cmp; | ||
/// # use arrow_ord::ord::make_comparator; | ||
/// # use arrow_schema::{ArrowError, SortOptions}; | ||
/// fn eq(a: &dyn Array, b: &dyn Array) -> Result<BooleanArray, ArrowError> { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree with the API choice 👍
I recommend adding the rationale to the docstring at the stop ("this function does not support overrriding NULL handling to ensure optimized kernels, see make_comparator
for more control over ordering)
/// # use arrow_ord::cmp; | ||
/// # use arrow_ord::ord::make_comparator; | ||
/// # use arrow_schema::{ArrowError, SortOptions}; | ||
/// fn eq(a: &dyn Array, b: &dyn Array) -> Result<BooleanArray, ArrowError> { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The docstring talks about comparsions and total order and this function implement eq
-- woudl it be better ti implement cmp
to better match the text?
/// | ||
/// If `nulls_first` is true `NULL` values will be considered less than any non-null value, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it might help to point people at the faster kernels too
like "see kernels in ord::cmp
for fast kernels
Tests are written but waiting on #5811 (as they found a bug in the row format) |
/// of nested nulls. That is nulls within nested types are either greater than any value, | ||
/// or less than any value (Spark). This could be implemented as below | ||
/// Whilst SQL prescribes ternary logic for nulls, that is comparing a value against a NULL yields | ||
/// a NULL, many systems, including postgres, instead apply a total ordering to comparison of |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍
Which issue does this PR close?
Closes #5426
Closes #5199
Closes #5559
Rationale for this change
See tickets
What changes are included in this PR?
This pushes null handling out of LexicographicalComparator and into DynComparator. This not only closes off a potential footgun, but also allows supporting nested types in comparison operators.
This probably needs a few more tests prior to merge, in particular StructArray was not supported by LexicographicalComparator before and therefore needs new tests, but the broad meat of the PR I think is ready for review
Are there any user-facing changes?