-
-
Notifications
You must be signed in to change notification settings - Fork 2.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
fix: ensure first datapoint is always included in group_by_dynamic #15312
Conversation
91cecb3
to
f81f51a
Compare
/// For example, if we have: | ||
/// | ||
/// - first datapoint is `2020-01-01 01:00` | ||
/// - `every` is `'1d'` | ||
/// - `period` is `'2d'` | ||
/// - `offset` is `'6h'` | ||
/// | ||
/// then truncating the earliest datapoint by `every` and adding `offset` results | ||
/// in the window `[2020-01-01 06:00, 2020-01-03 06:00)`. To give the earliest datapoint | ||
/// a chance of being included, we then shift the window back by `every` to | ||
/// `[2019-12-31 06:00, 2020-01-02 06:00)`. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Demo:
In [4]: df = pl.DataFrame({'t': [datetime(2020, 1, 1, 1)], 'i': [0]})
In [5]: df.group_by_dynamic('t', every='1d', period='2d', offset='6h', include_boundaries=True).agg('i')
Out[5]:
shape: (1, 4)
┌─────────────────────┬─────────────────────┬─────────────────────┬───────────┐
│ _lower_boundary ┆ _upper_boundary ┆ t ┆ i │
│ --- ┆ --- ┆ --- ┆ --- │
│ datetime[μs] ┆ datetime[μs] ┆ datetime[μs] ┆ list[i64] │
╞═════════════════════╪═════════════════════╪═════════════════════╪═══════════╡
│ 2019-12-31 06:00:00 ┆ 2020-01-02 06:00:00 ┆ 2019-12-31 06:00:00 ┆ [0] │
└─────────────────────┴─────────────────────┴─────────────────────┴───────────┘
Whereas, on the latest release:
In [18]: df.group_by_dynamic('t', every='1d', period='2d', offset='6h', include_boundaries=True).agg('i')
Out[18]:
shape: (0, 4)
┌─────────────────┬─────────────────┬──────────────┬───────────┐
│ _lower_boundary ┆ _upper_boundary ┆ t ┆ i │
│ --- ┆ --- ┆ --- ┆ --- │
│ datetime[μs] ┆ datetime[μs] ┆ datetime[μs] ┆ list[i64] │
╞═════════════════╪═════════════════╪══════════════╪═══════════╡
└─────────────────┴─────────────────┴──────────────┴───────────┘
assert_eq!(groups[0], [1, 2]); // 00:00:00 -> 00:30:00 | ||
assert_eq!(groups[1], [3, 2]); // 01:00:00 -> 01:30:00 | ||
assert_eq!(groups[2], [5, 2]); // 02:00:00 -> 02:30:00 | ||
assert_eq!(groups[0], [0, 1]); // (2021-12-15 23:30, 2021-12-16 00:00] | ||
assert_eq!(groups[1], [1, 2]); // (2021-12-16 00:00, 2021-12-16 00:30] | ||
assert_eq!(groups[2], [3, 2]); // (2021-12-16 00:30, 2021-12-16 01:00] | ||
assert_eq!(groups[3], [5, 2]); // (2021-12-16 01:00, 2021-12-16 01:30] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
here's an example of where the earliest datapoint is currently being excluded
because closed='right'
, the 00:00:00
point isn't currently included in any windows. This PR fixes that
9c85e2a
to
68c8b94
Compare
let (from, to, offset): ( | ||
let (from, to, offset_fn): ( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
drive-by: renaming to offset_fn
, partially because that what it's called in other places, and also because otherwise it risks being confused with the offset
parameter of group_by_dynamic
14471e5
to
b96123b
Compare
assert_eq!(groups.len(), 2); | ||
assert_eq!(groups[1], [2, 2]); | ||
assert_eq!(groups.len(), 3); | ||
assert_eq!(groups[1], [1, 1]); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
same story, earliest datapoint not included in any group due to closed='right'
assert_eq!(groups[0], [1, 2]); // 00:00:00 -> 00:30:00 | ||
assert_eq!(groups[1], [3, 2]); // 01:00:00 -> 01:30:00 | ||
assert_eq!(groups[2], [5, 2]); // 02:00:00 -> 02:30:00 | ||
assert_eq!(groups[0], [0, 1]); // (2021-12-15 23:30, 2021-12-16 00:00] | ||
assert_eq!(groups[1], [1, 2]); // (2021-12-16 00:00, 2021-12-16 00:30] | ||
assert_eq!(groups[2], [3, 2]); // (2021-12-16 00:30, 2021-12-16 01:00] | ||
assert_eq!(groups[3], [5, 2]); // (2021-12-16 01:00, 2021-12-16 01:30] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
here too
b96123b
to
7a693f5
Compare
closes #15241
With the example from that issue, the output becomes:
which is their expected output
This doesn't break any existing tested behaviour, but does make group-by-dynamic more user-friendly / expected to users using
offset
Perf impact: there's a little extra computation for finding the first window, but that's only for the first window - after that, the windows just keep getting updated by adding
every
(no change)