Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Move date_part, date_trunc, date_bin functions to datafusion-functions #9435

Merged
merged 13 commits into from
Mar 10, 2024

Conversation

Omega359
Copy link
Contributor

@Omega359 Omega359 commented Mar 3, 2024

Which issue does this PR close?

Closes #9421

Rationale for this change

Function migration.

What changes are included in this PR?

Code, tests moved to new location, references updated.

Are these changes tested?

Yes.

Are there any user-facing changes?

Functions are moved to new location, otherwise none.

@github-actions github-actions bot added sql SQL Planner logical-expr Logical plan and expressions physical-expr Physical Expressions labels Mar 3, 2024
@Omega359
Copy link
Contributor Author

Omega359 commented Mar 3, 2024

I have no idea why the tpch test would be failing. When I run tpch locally it's q1 that fails first. Any suggestions are welcome

…lain text to reflect the change. The physical plan is unchanged.
# Conflicts:
#	datafusion/proto/src/logical_plan/from_proto.rs
@github-actions github-actions bot added the sqllogictest SQL Logic Tests (.slt) label Mar 3, 2024
@Omega359
Copy link
Contributor Author

Omega359 commented Mar 4, 2024

blocked by #9450

@Omega359 Omega359 marked this pull request as ready for review March 5, 2024 02:38
Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you so much for this @Omega359 -- it is looking very good. I am a little worried about some of the test changes, but I have some ideas on how to fix them.

Sorry for the delay in reviewing

@@ -0,0 +1,765 @@
// Licensed to the Apache Software Foundation (ASF) under one
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is so nice that the date_bin function has been moved into its own module

datafusion/functions/src/datetime/mod.rs Outdated Show resolved Hide resolved
datafusion/sqllogictest/test_files/tpch/q8.slt.part Outdated Show resolved Hide resolved
@@ -761,10 +735,7 @@ mod tests {
// expected
vec![
// [a_new ASC, date_bin_res ASC]
// Please note that result is not [a_new ASC, date_bin_res ASC, b_new ASC]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In general I think this is important test coverage (it ensures that date_bin is properly being treated as monotonic). Indeed perhaps that montonic marking was not brought over to the the UDF 🤔

Though on the other hand, I think this case is already covered by .slt test (a plan test, specifically). @mustafasrepo do you remember offhand? Otherwise I will dig around

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The previous version still have the test in there .. but then I had dependencies on the function crate. I saw some things with date_bin and ordering in ordering.slt that I thought might cover it

Copy link
Contributor

@mustafasrepo mustafasrepo Mar 6, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I double checked whether this functionality is covered by the .slt tests. It seems that, .slt tests covers cases in the form where date_bin argument (such as ts) is leading ordering (e.g first ordering in lexicographical ordering expression). Then after date_bin is applied, date_bin result itself is ordered. In this sense, monotonicity behaviour is covered.

However, this unit test, tests when table is ordered as [a ASC, ts ASC],after projection a, date_bin(ts) result should have ordering: [a, date_bin(ts)]. Meaning, when date_bin argument (e.g ts) is not leading ordering, lexicographical ordering is still preserved. As long as previous expressions are present in the projection (in the example a).

I checked whether this behaviour is preserved in this PR. By running following test

# create an unbounded table that contains name, timestamp.
# where table is ordered by name DESC, ts DESC
statement ok
CREATE UNBOUNDED EXTERNAL TABLE unbounded_csv_with_timestamps2 (
  name VARCHAR,
  ts TIMESTAMP
)
STORED AS CSV
WITH ORDER (name DESC, ts DESC)
LOCATION '../core/tests/data/timestamps.csv'

query TT
EXPLAIN SELECT name, date_bin('15 minutes', ts) as time_chunks
  FROM unbounded_csv_with_timestamps2
  ORDER BY name DESC, time_chunks DESC
  LIMIT 5;
----
logical_plan
Limit: skip=0, fetch=5
--Sort: unbounded_csv_with_timestamps2.name DESC NULLS FIRST, time_chunks DESC NULLS FIRST, fetch=5
----Projection: unbounded_csv_with_timestamps2.name, date_bin(IntervalMonthDayNano("900000000000"), unbounded_csv_with_timestamps2.ts) AS time_chunks
------TableScan: unbounded_csv_with_timestamps2 projection=[name, ts]
physical_plan
GlobalLimitExec: skip=0, fetch=5
--SortPreservingMergeExec: [name@0 DESC,time_chunks@1 DESC], fetch=5
----ProjectionExec: expr=[name@0 as name, date_bin(900000000000, ts@1) as time_chunks]
------RepartitionExec: partitioning=RoundRobinBatch(8), input_partitions=1
--------StreamingTableExec: partition_sizes=1, projection=[name, ts], infinite_source=true, output_ordering=[name@0 DESC, ts@1 DESC]

I think, resulting plan is correct. (e.g after projection ordering [name DESC, time_chunks DESC] is still satisfied). So, we have desired behaviour but not test coverage currently.

However, it would be nice to have unit test. Given that in case of failure, error context is much more visible. It would really speed up debugging. With end to end test coverage, considerable time goes to understanding the root cause of the error.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It looks like this test has been added by @Omega359 so I think we are good to go. Thank you @mustafasrepo

@mustafasrepo
Copy link
Contributor

I think, plan change comes from optimize_projections logical plan rule. where ScalarFunctionDefinition enum changes from ScalarFunctionDefinition::Builtin variant to ScalarFunctionDefinition::Name variant (to be resolved in physical planning). I think somehow this triggers the plan change. But I am not sure about the mechanism. Writing this in case, it is helpful.

@github-actions github-actions bot added the optimizer Optimizer rules label Mar 6, 2024
@mustafasrepo
Copy link
Contributor

I think, plan change comes from optimize_projections logical plan rule. where ScalarFunctionDefinition enum changes from ScalarFunctionDefinition::Builtin variant to ScalarFunctionDefinition::Name variant (to be resolved in physical planning). I think somehow this triggers the plan change. But I am not sure about the mechanism. Writing this in case, it is helpful.

I have dubugged this further. It seemed like a small fix. Hence I sent a commit to fix this change. With this change, I also added the test in my previous comment to make sure we have desired coverage.

@Omega359
Copy link
Contributor Author

Omega359 commented Mar 6, 2024

I have dubugged this further. It seemed like a small fix. Hence I sent a commit to fix this change. With this change, I also added the test in my previous comment to make sure we have desired coverage.

Thank you very much for your assistance in resolving this issue in this PR - it's very much appreciated

@@ -419,6 +419,11 @@ impl ScalarFunction {
args,
}
}

/// Create a new ScalarFunction expression with a user-defined function (UDF)
pub fn new_func_def(func_def: ScalarFunctionDefinition, args: Vec<Expr>) -> Self {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it better to inline the arguments directly?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could do that, sure.

@alamb
Copy link
Contributor

alamb commented Mar 8, 2024

Thanks @Omega359 -- I plan to review this one in the next day or two

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for sticking with this one @Omega359 -- I think this PR looks great.

@@ -761,10 +735,7 @@ mod tests {
// expected
vec![
// [a_new ASC, date_bin_res ASC]
// Please note that result is not [a_new ASC, date_bin_res ASC, b_new ASC]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It looks like this test has been added by @Omega359 so I think we are good to go. Thank you @mustafasrepo

@alamb
Copy link
Contributor

alamb commented Mar 9, 2024

I took the liberty of merging up from main to resolve conflicts (and then reapplied the changes from #9486 manually here)

@alamb
Copy link
Contributor

alamb commented Mar 10, 2024

And another merge to resolve conflicts

@alamb alamb merged commit acddecb into apache:main Mar 10, 2024
23 checks passed
@alamb
Copy link
Contributor

alamb commented Mar 10, 2024

Thanks again @Omega359 and @mustafasrepo -- sorry this one took so long to get in.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
logical-expr Logical plan and expressions optimizer Optimizer rules physical-expr Physical Expressions sql SQL Planner sqllogictest SQL Logic Tests (.slt)
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Move date_part, date_trunc, date_bin functions to datafusion-functions
4 participants