ENH: 'Efficient' resampling with custom non-uniform non-overlapping time intervals #41212
Comments
Hi,
The two cases are actually clearly separate:
For case 1 (DataFrame with rows that may fall outside non-contiguous intervals),
For case 2 (contiguous intervals, with every row of the DataFrame falling in an interval),
So the requests from the initial ticket above remain:
Thanks in advance for your feedback on these last two questions.
pls replace the word demand with request
Hi @jreback, sorry, I just looked up the definition of 'demand' in English. I am sorry, that is indeed not what I meant. In French, 'demande' does not carry the peremptory connotation that 'demand' has in English. Sorry again.
Hi,
To support the above requests, here are the consolidated use cases, with working code based on existing features and a proposal for how the current function signatures could be extended.

Dataset:

import pandas as pd
import random
# Dataset
ts_raw = pd.DatetimeIndex([
    pd.Timestamp('2021/01/01 00:37'),
    pd.Timestamp('2021/01/01 00:40'),
    pd.Timestamp('2021/01/01 00:48'),
    pd.Timestamp('2021/01/01 01:00'),
    pd.Timestamp('2021/01/01 03:45'),
    pd.Timestamp('2021/01/01 03:59'),
    pd.Timestamp('2021/01/01 04:55'),
    pd.Timestamp('2021/01/01 05:20')])
length = len(ts_raw)
random.seed(1)
val = random.sample(range(1, length+1), length)
df_raw = pd.DataFrame({'val': val, 'time': ts_raw})

In [50]: df_raw
Out[50]:
   val                time
0    3 2021-01-01 00:37:00
1    5 2021-01-01 00:40:00
2    1 2021-01-01 00:48:00
3    8 2021-01-01 01:00:00
4    6 2021-01-01 03:45:00
5    2 2021-01-01 03:59:00
6    7 2021-01-01 04:55:00
7    4 2021-01-01 05:20:00

Case 1: not necessarily contiguous intervals defined by a 'start' and an 'end'

# Use case 1 // with not necessarily contiguous intervals defined by a 'start' and an 'end'.
# Definition of custom non-contiguous intervals from 'start' and 'end' arrays.
ts_start = pd.DatetimeIndex([
    pd.Timestamp('2021/01/01 00:40'),
    pd.Timestamp('2021/01/01 01:55'),
    pd.Timestamp('2021/01/01 03:45'),
    pd.Timestamp('2021/01/01 04:55')])
ts_end = pd.DatetimeIndex([
    pd.Timestamp('2021/01/01 01:00'),
    pd.Timestamp('2021/01/01 02:00'),
    pd.Timestamp('2021/01/01 04:00'),
    pd.Timestamp('2021/01/01 05:00')])
intervals = pd.IntervalIndex.from_arrays(ts_start, ts_end, closed='left')
# Keep the interval of each row in 'df_raw', labelled by its right bin edge.
df_raw['interval'] = pd.IntervalIndex(pd.cut(df_raw['time'], bins=intervals)).right  # this line would not be kept with the requested modifications
result = df_raw.groupby('interval', observed=False).agg({'val': 'sum'})  # this line would be changed with the requested modifications

Results

In [43]: result
Out[43]:
                     val
interval
2021-01-01 01:00:00    6
2021-01-01 04:00:00    8
2021-01-01 05:00:00    7

Requests here are (proposals):

With the proposed modifications, the last two lines of code would become:

grouper = pd.Grouper(by=intervals, key='time', label='right')
result = df_raw.groupby(grouper, observed=False, groups_as_contiguous_rows=True).agg({'val': 'sum'})

Case 2: contiguous intervals defined by a single sequence

# Definition of custom contiguous intervals from a single array.
ts_seg = pd.DatetimeIndex([
    pd.Timestamp('2021/01/01 00:40'),
    pd.Timestamp('2021/01/01 01:55'),
    pd.Timestamp('2021/01/01 03:45'),
    pd.Timestamp('2021/01/01 04:55')])
# Steps that could be computed under the hood.
ts_start = ts_seg[:-1]
ts_end = ts_seg[1:]
intervals = pd.IntervalIndex.from_arrays(ts_start, ts_end, closed='left')  # we now have an 'IntervalIndex' made from a sequence of scalars
# Remaining lines of code are similar to the previous use case, but this time with a 'cumsum'.
df_raw['interval'] = pd.IntervalIndex(pd.cut(df_raw['time'], bins=intervals)).right
df_raw['cumsum'] = df_raw.groupby('interval', observed=False).agg({'val': 'cumsum'})
df_raw = df_raw.drop(columns='interval')  # column 'interval' is a temporary variable that is dropped in the end.

Results

In [48]: df_raw
Out[48]:
   val                time  cumsum
0    3 2021-01-01 00:37:00      -1
1    5 2021-01-01 00:40:00       5
2    1 2021-01-01 00:48:00       6
3    8 2021-01-01 01:00:00      14
4    6 2021-01-01 03:45:00       6
5    2 2021-01-01 03:59:00       8
6    7 2021-01-01 04:55:00      -1
7    4 2021-01-01 05:20:00      -1

For this 2nd use case, a complementary request is thus:

With the proposed modification, the intermediate 'under the hood' computations in the middle of the example become:

intervals = pd.IntervalIndex.from_array(ts_seg, closed='left')  # new method 'from_array' without 's' ?

It may seem that the requests focus mostly on function signatures, because that is what the use cases illustrate. But an important motivation of the request is to know (that is, to ask for details regarding) if

Thanks in advance for your feedback,
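Regarding the `from_array` constructor proposed in case 2: as far as I can tell, pandas already ships `IntervalIndex.from_breaks`, which turns a single sequence of break points into contiguous intervals. The short sketch below only checks that it yields the same index as the manual 'start' / 'end' slicing; the variable names `manual` and `same` are mine.

```python
import pandas as pd

# Break points taken from case 2.
ts_seg = pd.DatetimeIndex([
    '2021-01-01 00:40', '2021-01-01 01:55',
    '2021-01-01 03:45', '2021-01-01 04:55'])

# Existing constructor: consecutive break points become contiguous intervals.
intervals = pd.IntervalIndex.from_breaks(ts_seg, closed='left')

# Manual construction from case 2, kept here only for comparison.
manual = pd.IntervalIndex.from_arrays(ts_seg[:-1], ts_seg[1:], closed='left')
same = intervals.equals(manual)

print(intervals)
print(same)
```

If this holds on the pandas version in use, the 'under the hood' slicing step from case 2 is not needed.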
Is your feature request related to a problem?
I would like to resample a sorted time series with a sorted list of non-uniform, non-overlapping and not necessarily contiguous time intervals.
Taking advantage of the fact that...
...reading through the time series needs to be done only once.
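One possible way to exploit the sort order with existing tools is sketched below, under stated assumptions: `assign_intervals` is a hypothetical helper (not a pandas API), intervals are treated as left-closed, and `numpy.searchsorted` performs a binary search per row rather than a strict single pass over the data. The data reuses the values from the use cases in the comments.

```python
import numpy as np
import pandas as pd

def assign_intervals(times, starts, ends):
    """Hypothetical helper: for each timestamp, return the position of the
    left-closed interval [start, end) containing it, or -1 if there is none.
    Assumes 'starts' and 'ends' are sorted and the intervals do not overlap."""
    times = np.asarray(times, dtype='datetime64[ns]')
    starts = np.asarray(starts, dtype='datetime64[ns]')
    ends = np.asarray(ends, dtype='datetime64[ns]')
    # Position of the last interval whose start is <= t (binary search).
    pos = np.searchsorted(starts, times, side='right') - 1
    # A hit also requires t to be strictly before that interval's end.
    inside = (pos >= 0) & (times < ends[np.clip(pos, 0, None)])
    return np.where(inside, pos, -1)

times = pd.to_datetime([
    '2021-01-01 00:37', '2021-01-01 00:40', '2021-01-01 00:48',
    '2021-01-01 01:00', '2021-01-01 03:45', '2021-01-01 03:59',
    '2021-01-01 04:55', '2021-01-01 05:20'])
vals = np.array([3, 5, 1, 8, 6, 2, 7, 4])
starts = pd.to_datetime(['2021-01-01 00:40', '2021-01-01 01:55',
                         '2021-01-01 03:45', '2021-01-01 04:55'])
ends = pd.to_datetime(['2021-01-01 01:00', '2021-01-01 02:00',
                       '2021-01-01 04:00', '2021-01-01 05:00'])

codes = assign_intervals(times, starts, ends)
inside = codes >= 0
# Sum of 'vals' per interval; intervals without rows get 0.
sums = np.bincount(codes[inside], weights=vals[inside], minlength=len(starts))
print(codes)
print(sums)
```

Unlike the `groupby` version, this keeps the empty interval in the result (with a sum of 0), which may or may not be what is wanted.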
After posting my question on SO, using `groupby` just to show the intended result and asking whether this could be done 'more efficiently' in pandas, I was advised to use Numba.

Describe the solution you'd like
I would propose:
- `IntervalIndex` to define non-uniform, sorted and non-overlapping intervals. This property can then conveniently be checked with `is_non_overlapping_monotonic`.
- `label` in `IntervalIndex` to define how each interval should be named (left or right bound, similar to the `resample` API).
- `resample` accepting `IntervalIndex`.
- `closed` and `label` forwarded from `IntervalIndex` to `resample`.
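To make the proposal above more concrete, here is a rough emulation built only from existing features. It is a sketch, not a proposed implementation: `aggregate_by_intervals` and its `label` / `how` parameters are hypothetical names, `closed` is simply read from the `IntervalIndex`, and rows outside every interval are dropped.

```python
import pandas as pd

def aggregate_by_intervals(df, key, intervals, label='right', how='sum'):
    """Hypothetical helper (not a pandas API): aggregate the columns of 'df'
    over the intervals of an IntervalIndex, labelling each group by an edge."""
    if not intervals.is_non_overlapping_monotonic:
        raise ValueError("intervals must be sorted and non-overlapping")
    # Position of the containing interval for each row, -1 if none.
    codes = intervals.get_indexer(pd.Index(df[key]))
    inside = codes >= 0
    edges = intervals.right if label == 'right' else intervals.left
    labels = pd.Series(edges.take(codes[inside]),
                       index=df.index[inside], name='interval')
    return df.loc[inside].drop(columns=key).groupby(labels).agg(how)

df = pd.DataFrame({
    'val': [3, 5, 1, 8, 6, 2, 7, 4],
    'time': pd.to_datetime([
        '2021-01-01 00:37', '2021-01-01 00:40', '2021-01-01 00:48',
        '2021-01-01 01:00', '2021-01-01 03:45', '2021-01-01 03:59',
        '2021-01-01 04:55', '2021-01-01 05:20'])})
intervals = pd.IntervalIndex.from_arrays(
    pd.to_datetime(['2021-01-01 00:40', '2021-01-01 01:55',
                    '2021-01-01 03:45', '2021-01-01 04:55']),
    pd.to_datetime(['2021-01-01 01:00', '2021-01-01 02:00',
                    '2021-01-01 04:00', '2021-01-01 05:00']),
    closed='left')

result = aggregate_by_intervals(df, 'time', intervals, label='right')
print(result)
```

With `label='left'` the same groups would be named by their left bound instead, mirroring the `label` argument of `resample`.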
Additionally, `resample` could also accept a single sequence of timestamps to define sorted, non-uniform, contiguous time intervals, similar to what `cut` can accept for its `bins` parameter.

API breaking implications
Use case example.
Describe alternatives you've considered
`rolling`. Am I right in thinking that `resample` is a convenience method calling `rolling` under the hood?

Additional context
I am assuming that `resample` already takes advantage of the time series being sorted according to its DatetimeIndex. Am I right?