Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Proposal: Add date_bin function #3015

Closed
stuartcarnie opened this issue Aug 2, 2022 · 7 comments · Fixed by #3034
Closed

Proposal: Add date_bin function #3015

stuartcarnie opened this issue Aug 2, 2022 · 7 comments · Fixed by #3034
Labels
datafusion Changes in the datafusion crate enhancement New feature or request

Comments

@stuartcarnie
Copy link
Contributor

stuartcarnie commented Aug 2, 2022

Introduction

This proposal suggests adding a new scalar function, date_bin, to DataFusion, for transforming timestamp values to arbitrary intervals for the purpose of grouping and aggregating time-series data.

Motivation

Time-series data is typically analysed in aggregate where one axis is almost always time. DataFusion's date_trunc is modelled after the PostgreSQL date_trunc function, which allows truncating a timestamp column for the purpose of grouping, however, the intervals are limited to an enumeration, such as second, minute, hour, day, week, month, quarter and year. To address this limitation, PostgreSQL 14 introduced the date_bin function, which can bin or adjust the input timestamp to arbitrary intervals.

Describe the solution you'd like

Add a new function, date_bin to DataFusion with the same semantics as the PostgreSQL function.

Name: date_bin(stride, source, origin)

Per the PostgreSQL 14 docs

The function date_bin “bins” the input timestamp into the specified interval (the stride) aligned with a specified origin.

Required arguments

Name Type Description
stride INTERVAL time interval specifying the width of each bin.
source TIMESTAMP The timestamp to transform.
origin TIMESTAMP Acts as an offset applied to the transformed timestamp

Example Usage

Demonstrate date_bin1:

SELECT 
  DATE_BIN(INTERVAL '15' minute, CAST(time as TIMESTAMP), TIMESTAMP '2001-01-01') AS time, 
  val 
FROM (
  VALUES 
    ('2021-06-10 17:05:00Z', 0.5),
    ('2021-06-10 17:19:10Z', 0.3)
  ) as t (time, val);

producing the following output:

+---------------------+-----+
| time                | val |
+---------------------+-----+
| 2021-06-10 17:00:00 | 0.5 |
| 2021-06-10 17:15:00 | 0.3 |
+---------------------+-----+
2 rows in set. Query took 0.002 seconds.

Example Usage: time offset for origin

SELECT 
  DATE_BIN(INTERVAL '15' minute, CAST(time as TIMESTAMP), TIMESTAMP '2001-01-01 00:02:30') AS time, 
  val 
FROM (
  VALUES 
    ('2021-06-10 17:05:00Z', 0.5),
    ('2021-06-10 17:19:10Z', 0.3)
  ) as t (time, val);

producing the following output:

+---------------------+-----+
| time                | val |
+---------------------+-----+
| 2021-06-10 17:02:30 | 0.5 |
| 2021-06-10 17:17:30 | 0.3 |
+---------------------+-----+
2 rows in set. Query took 0.002 seconds.

Describe alternatives you've considered

date_trunc, as mentioned, provides limited support for binning timestamps, but there is no alternative but to provide a native function.

Footnotes

  1. DataFusion does not support "typed string" literals in a VALUES statement, like VALUES ((TIMESTAMP '2021-06-10 17:05:00Z')), but feat: Enable typed strings expressions for VALUES clause #3018 will address that

@stuartcarnie stuartcarnie added the enhancement New feature or request label Aug 2, 2022
@alamb
Copy link
Contributor

alamb commented Aug 3, 2022

I think following postgres is a good idea 👌 thanks @stuartcarnie

cc @avantgardnerio

@alamb alamb added the datafusion Changes in the datafusion crate label Aug 3, 2022
@alamb alamb changed the title Proposal: Add date_bin function Proposal: Add date_bin function Aug 3, 2022
@alamb
Copy link
Contributor

alamb commented Aug 3, 2022

@ovr , @andygrove any concerns or thoughts?

@avantgardnerio
Copy link
Contributor

I think following postgres is a good idea 👌 thanks @stuartcarnie

cc @avantgardnerio

Love it!

@Dandandan
Copy link
Contributor

Sounds like a great feature 👍

@jacobmarble
Copy link

I don't know how far DataFusion likes to deviate from PostgreSQL-flavored SQL, but here's an idea for this feature.

Suggestion 1

Make the default value of the origin argument 1970-01-01 00:00:00 UTC.

It seems there is no default value specified by PostgreSQL. Certainly there exists a need for origin values like XXXX-XX-XX 00:02:30 UTC, but the far more common uses of date_bin are certainly with stride values like 1s, 10s, 30s, 1m, 5m, 10m, 15m, 20m, 30m, 1h, 2h, 3h, 6h, etc, with the reasonable expectation that the result of date_bin align with a top-of-the-hour origin.

The result is that 99% of queries written with this function will be shorter and less error prone.

Some proposed, equivalent statements:

-- as specified by PostgreSQL - origin at Unix epoch
select date_bin('15 minutes', timestamp '2022-08-03 14:49:50 PST', timestamp '1970-01-01 00:00:00 UTC');
-- 2022-08-03 14:45:00

-- proposed
select date_bin('15 minutes', timestamp '2022-08-03 14:49:50 PST');

Suggestion 2

Allow origin values with type timestamp, date, time or interval. Again, PostgreSQL doesn't have any prior art here.

The result enables better readability.

Some proposed, equivalent statements:

-- as specified by PostgreSQL - origin shifts bins forward 5 minutes
select date_bin('15 minutes', timestamp '2022-08-03 14:49:50 PST', timestamp '1970-01-01 00:05:00 UTC');
-- 2022-08-03 14:35:00

-- proposed
select date_bin('15 minutes', timestamp '2022-08-03 14:49:50 PST', time '00:05:00');
select date_bin('15 minutes', timestamp '2022-08-03 14:49:50 PST', interval '5 minutes');
-- as specified by PostgreSQL - origin shifts bins back 5 minutes
select date_bin('15 minutes', timestamp '2022-08-03 14:49:50 PST', timestamp '1969-12-31 23:55:00 UTC');
-- 2022-08-03 14:40:00

-- proposed
select date_bin('15 minutes', timestamp '2022-08-03 14:49:50 PST', - time '00:05:00');
select date_bin('15 minutes', timestamp '2022-08-03 14:49:50 PST', interval '- 5 minutes');
-- as specified by PostgreSQL - origin shifts bins forward one day
select date_bin('7 days', timestamp '2022-08-03 14:49:50 PST', timestamp '1970-01-02 00:00:00 UTC');
-- 2022-07-29 00:00:00

-- proposed
select date_bin('7 days', timestamp '2022-08-03 14:49:50 PST', date '1970-01-02');
select date_bin('7 days', timestamp '2022-08-03 14:49:50 PST', interval '1 day');

@stuartcarnie
Copy link
Contributor Author

stuartcarnie commented Aug 4, 2022

@jacobmarble these are great suggestions.

  • PostgreSQL: The date_bin implementation accepts a string or an INTERVAL type for the first argument. If it is a string, it is parsed as an INTERVAL, which saves the user from having to specify the INTERVAL prefix.
    • ✅ Will do that as part of this issue, as it remains compatible with PostgreSQL.
  • PostgreSQL: Supports several formats for INTERVAL input expressions, including ISO 8601 intervals, as noted in the documentation
    • Fractional seconds support PT0.nnnnnnS for microsecond precision. We would extend that to 9 places for nanosecond precision.
  • DataFusion: no support for ISO 8601 interval types, so we can't use shorthand syntax like PT5S for 5 seconds.
    • 🗒️ Will need to write a separate issue to add additional INTERVAL syntax.
  • DataFusion: We can also support the other data types for the origin argument
    • 🙏🏻 I'd like to propose we recommend these extensions in a follow up issue, including the optional origin argument

stuartcarnie added a commit to stuartcarnie/arrow-datafusion that referenced this issue Aug 4, 2022
@avantgardnerio
Copy link
Contributor

optional origin argument

This definitely seems like a worthwhile extension, but it also makes sense to not expand the scope of this PR.

alamb pushed a commit that referenced this issue Aug 8, 2022
* feat: Add date_bin built-in function

Helps #3015

* chore: Add unit tests for date_bin

* chore: Remove unsupported signature

* chore: Remove alternative `datebin`

* chore: Placate cargo clippy

* chore: Respond to PR feedback
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
datafusion Changes in the datafusion crate enhancement New feature or request
Projects
None yet
Development

Successfully merging a pull request may close this issue.

5 participants