fix: ensure first datapoint is always included in group_by_dynamic #15312

Merged
merged 1 commit into from
Mar 28, 2024

Conversation

@MarcoGorelli MarcoGorelli commented Mar 26, 2024

closes #15241

With the example from that issue, the output becomes:

shape: (2, 2)
┌─────────────────────────┬──────┐
│ t                       ┆ v    │
│ ---                     ┆ ---  │
│ datetime[ms, UTC]       ┆ i64  │
╞═════════════════════════╪══════╡
│ 2024-03-21 05:00:00 UTC ┆ 11   │
│ 2024-03-22 05:00:00 UTC ┆ 1100 │
└─────────────────────────┴──────┘

which is their expected output

This doesn't break any existing tested behaviour, but it does make group_by_dynamic behave the way users passing offset would expect.

Perf impact: there's a little extra computation for finding the first window, but only for the first window. After that, each window is still obtained by adding every to the previous one (no change).
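To illustrate the perf remark, here is a minimal Python sketch of how windows advance once the first one is known. `iter_windows` is a hypothetical helper, not a Polars function, and it assumes fixed-length durations (`timedelta`), whereas Polars also supports calendar-aware durations.

```python
from datetime import datetime, timedelta
from itertools import islice


def iter_windows(first_start, every, period):
    """Yield (start, stop) pairs; hypothetical helper, not part of Polars."""
    start = first_start
    while True:
        yield start, start + period
        # Each subsequent window is the previous start plus `every`.
        start += every


windows = list(
    islice(
        iter_windows(datetime(2019, 12, 31, 6), timedelta(days=1), timedelta(days=2)),
        3,
    )
)
for lo, hi in windows:
    print(lo, "->", hi)
```

Only the choice of `first_start` is affected by this PR; the loop body is the unchanged part.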

@MarcoGorelli MarcoGorelli force-pushed the include-earliest branch 8 times, most recently from 91cecb3 to f81f51a Compare March 28, 2024 11:19
Comment on lines +13 to +23
/// For example, if we have:
///
/// - first datapoint is `2020-01-01 01:00`
/// - `every` is `'1d'`
/// - `period` is `'2d'`
/// - `offset` is `'6h'`
///
/// then truncating the earliest datapoint by `every` and adding `offset` results
/// in the window `[2020-01-01 06:00, 2020-01-03 06:00)`. To give the earliest datapoint
/// a chance of being included, we then shift the window back by `every` to
/// `[2019-12-31 06:00, 2020-01-02 06:00)`.
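The arithmetic in the doc comment above can be checked with a short standard-library sketch. `first_window` is a hypothetical name, durations are assumed to be fixed-length `timedelta`s, windows are assumed to be left-closed, and the "shift back while the window starts after the datapoint" loop is an assumption about the intent, not a copy of the Rust implementation.

```python
from datetime import datetime, timedelta

EPOCH = datetime(1970, 1, 1)


def first_window(first, every, period, offset):
    """Return (start, stop) of the earliest window [start, stop).

    Hypothetical helper: a simplified model of the behaviour described
    in the doc comment, not the actual Polars code.
    """
    # Truncate the first datapoint down to a multiple of `every` since the epoch.
    truncated = EPOCH + ((first - EPOCH) // every) * every
    start = truncated + offset
    # Assumption: step back by `every` while the window starts after the datapoint.
    while start > first:
        start -= every
    return start, start + period


start, stop = first_window(
    first=datetime(2020, 1, 1, 1),
    every=timedelta(days=1),
    period=timedelta(days=2),
    offset=timedelta(hours=6),
)
print(start, stop)  # 2019-12-31 06:00:00 2020-01-02 06:00:00
```

Without the shift, the window would start at 2020-01-01 06:00, after the 01:00 datapoint, which would then fall into no window at all.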

Demo:

In [4]: df = pl.DataFrame({'t': [datetime(2020, 1, 1, 1)], 'i': [0]})

In [5]: df.group_by_dynamic('t', every='1d', period='2d', offset='6h', include_boundaries=True).agg('i')
Out[5]: 
shape: (1, 4)
┌─────────────────────┬─────────────────────┬─────────────────────┬───────────┐
│ _lower_boundary     ┆ _upper_boundary     ┆ t                   ┆ i         │
│ ---                 ┆ ---                 ┆ ---                 ┆ ---       │
│ datetime[μs]        ┆ datetime[μs]        ┆ datetime[μs]        ┆ list[i64] │
╞═════════════════════╪═════════════════════╪═════════════════════╪═══════════╡
│ 2019-12-31 06:00:00 ┆ 2020-01-02 06:00:00 ┆ 2019-12-31 06:00:00 ┆ [0]       │
└─────────────────────┴─────────────────────┴─────────────────────┴───────────┘

Whereas, on the latest release:

In [18]: df.group_by_dynamic('t', every='1d', period='2d', offset='6h', include_boundaries=True).agg('i')
Out[18]:
shape: (0, 4)
┌─────────────────┬─────────────────┬──────────────┬───────────┐
│ _lower_boundary ┆ _upper_boundary ┆ t            ┆ i         │
│ ---             ┆ ---             ┆ ---          ┆ ---       │
│ datetime[μs]    ┆ datetime[μs]    ┆ datetime[μs] ┆ list[i64] │
╞═════════════════╪═════════════════╪══════════════╪═══════════╡
└─────────────────┴─────────────────┴──────────────┴───────────┘

Comment on lines -343 to +350
- assert_eq!(groups[0], [1, 2]); // 00:00:00 -> 00:30:00
- assert_eq!(groups[1], [3, 2]); // 01:00:00 -> 01:30:00
- assert_eq!(groups[2], [5, 2]); // 02:00:00 -> 02:30:00
+ assert_eq!(groups[0], [0, 1]); // (2021-12-15 23:30, 2021-12-16 00:00]
+ assert_eq!(groups[1], [1, 2]); // (2021-12-16 00:00, 2021-12-16 00:30]
+ assert_eq!(groups[2], [3, 2]); // (2021-12-16 00:30, 2021-12-16 01:00]
+ assert_eq!(groups[3], [5, 2]); // (2021-12-16 01:00, 2021-12-16 01:30]

Here's an example of where the earliest datapoint is currently being excluded.

Because closed='right', the 00:00:00 point isn't currently included in any window. This PR fixes that.
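A small Python sketch of why a right-closed window misses a point sitting exactly on its lower boundary. `in_window` is a hypothetical helper used only for illustration; the boundaries are taken from the test comments above.

```python
from datetime import datetime


def in_window(t, lower, upper, closed):
    """Membership test for a single window; hypothetical helper, not Polars API."""
    if closed == "right":
        return lower < t <= upper  # (lower, upper]
    if closed == "left":
        return lower <= t < upper  # [lower, upper)
    raise ValueError(f"unsupported closed={closed!r}")


first = datetime(2021, 12, 16, 0, 0)

# Old first window: the point sits on the open lower boundary, so it is excluded.
old = in_window(first, datetime(2021, 12, 16, 0, 0), datetime(2021, 12, 16, 0, 30), "right")
# New extra window one step earlier: the point sits on the closed upper boundary.
new = in_window(first, datetime(2021, 12, 15, 23, 30), datetime(2021, 12, 16, 0, 0), "right")
print(old, new)  # False True
```

This matches the new `groups[0]` assertion, where the earlier window `(2021-12-15 23:30, 2021-12-16 00:00]` picks up the first row.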

Comment on lines -167 to +240
- let (from, to, offset): (
+ let (from, to, offset_fn): (

Drive-by: renaming to offset_fn, partly because that's what it's called in other places, and partly because otherwise it risks being confused with the offset parameter of group_by_dynamic.

@MarcoGorelli MarcoGorelli force-pushed the include-earliest branch 2 times, most recently from 14471e5 to b96123b Compare March 28, 2024 12:34
Comment on lines -151 to +152
- assert_eq!(groups.len(), 2);
- assert_eq!(groups[1], [2, 2]);
+ assert_eq!(groups.len(), 3);
+ assert_eq!(groups[1], [1, 1]);

Same story: the earliest datapoint was not included in any group due to closed='right'.

Comment on lines -654 to +668
- assert_eq!(groups[0], [1, 2]); // 00:00:00 -> 00:30:00
- assert_eq!(groups[1], [3, 2]); // 01:00:00 -> 01:30:00
- assert_eq!(groups[2], [5, 2]); // 02:00:00 -> 02:30:00
+ assert_eq!(groups[0], [0, 1]); // (2021-12-15 23:30, 2021-12-16 00:00]
+ assert_eq!(groups[1], [1, 2]); // (2021-12-16 00:00, 2021-12-16 00:30]
+ assert_eq!(groups[2], [3, 2]); // (2021-12-16 00:30, 2021-12-16 01:00]
+ assert_eq!(groups[3], [5, 2]); // (2021-12-16 01:00, 2021-12-16 01:30]

here too

@MarcoGorelli MarcoGorelli changed the title wip ensure first datapoint is always included in group_by_dynamic fix: ensure first datapoint is always included in group_by_dynamic Mar 28, 2024
@github-actions github-actions bot added fix Bug fix python Related to Python Polars rust Related to Rust Polars labels Mar 28, 2024
@MarcoGorelli MarcoGorelli marked this pull request as ready for review March 28, 2024 13:37
@ritchie46 ritchie46 merged commit 85a5e38 into pola-rs:main Mar 28, 2024
25 checks passed
Development

Successfully merging this pull request may close these issues.

First rows get ignored by group_by_dynamic when using offset
2 participants