
ENH: 'Efficient' resampling with custom non-uniform non-overlapping time intervals #41212


Closed
yohplala opened this issue Apr 29, 2021 · 5 comments
Labels
Enhancement · Needs Triage (issue that has not been reviewed by a pandas team member)

Comments


yohplala commented Apr 29, 2021

Is your feature request related to a problem?

I would like to resample a sorted time series with a sorted list of non-uniform, non-overlapping and not necessarily contiguous time intervals.
Taking advantage of the fact that...

  • the time series is sorted,
  • and the non-uniform, not necessarily contiguous time intervals do not overlap,

...the time series needs to be read through only once.

When I posted my question on SO (using groupby just to show the intended result) and asked whether this could be done 'more efficiently' in pandas, Numba was suggested.

Describe the solution you'd like

I would propose to:

  • re-use IntervalIndex to define non-uniform, sorted and non-overlapping intervals. This property can then conveniently be checked with is_non_overlapping_monotonic
  • add a new parameter label to IntervalIndex to define how each interval should be named (left or right bound, similar to the resample API)
  • have resample accept an IntervalIndex
  • have the closed and label parameters forwarded from IntervalIndex to resample

Additionally, resample could also accept a single sequence of timestamps to define sorted, non-uniform, contiguous time intervals, similar to what cut accepts for its bins parameter.

API breaking implications

Use case example.

import pandas as pd

# Time series.
length = 14
ts = pd.date_range(start='2021/01/01 08:00', periods=length, freq='30T')
tseries = pd.DataFrame({'val':range(length)}, index=ts)

# Custom disjoint intervals.
hours_start = ['7:46', '9:25',  '9:35', '10:15', '10:55', '13:10']
hours_end  = ['8:43', '9:32', '10:09', '10:42', '12:55', '13:31']
days = ['2021/01/01']*len(hours_start)
start = pd.DatetimeIndex(map(pd.Timestamp, map(' '.join, zip(days, hours_start))))
end = pd.DatetimeIndex(map(pd.Timestamp, map(' '.join, zip(days, hours_end))))
custom_intervals = pd.IntervalIndex.from_arrays(start, end, closed='left')

# New API here, with 'closed' and 'label' parameters forwarded from 'IntervalIndex' to 'resample'.
tseries['cumsum'] = tseries['val'].resample(custom_intervals).agg({'val': 'cumsum'})['val']
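For comparison, the intended result can already be obtained with existing features. A minimal sketch (not the proposed API) using pd.cut to label each row with its interval, then a grouped cumsum; rows falling outside every interval get NaN:

```python
import pandas as pd

# Sorted time series: one row every 30 minutes.
length = 14
ts = pd.date_range(start='2021/01/01 08:00', periods=length, freq='30min')
tseries = pd.DataFrame({'val': range(length)}, index=ts)

# Custom disjoint [start, end) intervals.
hours_start = ['7:46', '9:25', '9:35', '10:15', '10:55', '13:10']
hours_end = ['8:43', '9:32', '10:09', '10:42', '12:55', '13:31']
start = pd.to_datetime([f'2021/01/01 {h}' for h in hours_start])
end = pd.to_datetime([f'2021/01/01 {h}' for h in hours_end])
custom_intervals = pd.IntervalIndex.from_arrays(start, end, closed='left')

# Label each row with its interval; rows outside every interval get NaN.
bins = pd.cut(tseries.index, bins=custom_intervals)

# Cumulative sum restarted within each interval (NaN outside).
tseries['cumsum'] = tseries['val'].groupby(bins).cumsum()
```

This still scans the intervals via searchsorted inside cut, so it does not by itself exploit the single-pass property the request is about.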

Describe alternatives you've considered

  • using Numba of course
  • using rolling. Am I right in thinking that resample is a convenience method calling rolling under the hood?

Additional context

I am assuming that resample already takes advantage of the time series being sorted by its DatetimeIndex. Am I right?


yohplala commented Apr 30, 2021

Hi,
I am sorry, there is a mistake in the above request for the case where the time intervals are not contiguous.
What should be done with rows that do not fall in any segment?

The two cases are actually clearly separate:

  • 1 with non-contiguous segments, an 'apply' subsequent step is necessary, and the rows of the initial DataFrame 'collapse' during aggregation, resulting in a number of rows equal to the number of time intervals.
  • 2 with contiguous segments, either an 'apply' or a 'transform' subsequent step is possible.

For case 1 (a DataFrame with rows that may fall outside the non-contiguous intervals),
I am currently using a Python loop, sped up with Numba.
I know of no pandas function able to operate only on a user-provided set of windows. Is there any?
I took the time to implement a custom BaseIndexer to be provided to rolling, having in mind that 'start' and 'end' would provide the indexes of the windows. Only afterwards did I realize that the returned arrays have to be the size of the initial DataFrame index, defining one window per row of the initial DataFrame. With this approach, no row can fall outside a window.
And there is no cumsum; it would have to be emulated with a sum, so clearly not efficient.
=> Here I guess my initial request can be discarded. While I think this feature is useful, it seems to require a new pandas command?
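As an aside on case 1: when both the series and the windows are sorted, each window's row slice can be located with one binary search per boundary instead of a Python loop. A minimal sketch with NumPy/pandas searchsorted (not an existing pandas feature; the data and names are illustrative):

```python
import numpy as np
import pandas as pd

# Sorted time series.
ts = pd.date_range('2021/01/01 00:30', periods=8, freq='15min')
vals = np.arange(1, 9)

# Sorted, non-overlapping, non-contiguous [start, end) windows.
starts = pd.DatetimeIndex(['2021/01/01 00:40', '2021/01/01 01:55'])
ends = pd.DatetimeIndex(['2021/01/01 01:00', '2021/01/01 02:00'])

# One binary search per boundary yields each window's row slice.
lo = ts.searchsorted(starts, side='left')
hi = ts.searchsorted(ends, side='left')

# Aggregate each window; rows outside every window are simply skipped.
sums = [int(vals[i:j].sum()) for i, j in zip(lo, hi)]
```

This costs O(m log n) for m windows over n rows, and naturally handles rows that belong to no window.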

For case 2 (contiguous intervals, with all rows of the DataFrame falling in an interval),
I will try the following for the moment:

  • use IntervalIndex to define custom bins from the 'start' and 'end' arrays,
  • apply these bins to the DatetimeIndex of my DataFrame with cut, generating a new column holding the interval label of each row,
  • then apply a groupby on this new column,
  • and finally compute a cumsum.
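The four steps above can be sketched as follows ('df', 'start' and 'end' are illustrative names, not from the original post):

```python
import pandas as pd

df = pd.DataFrame({
    'time': pd.to_datetime(['2021/01/01 00:45', '2021/01/01 00:50',
                            '2021/01/01 01:10']),
    'val': [3, 5, 1],
})

# 1. Custom bins from 'start' and 'end' arrays.
start = pd.to_datetime(['2021/01/01 00:40', '2021/01/01 01:00'])
end = pd.to_datetime(['2021/01/01 01:00', '2021/01/01 02:00'])
bins = pd.IntervalIndex.from_arrays(start, end, closed='left')

# 2. Interval label for each row, via cut.
df['interval'] = pd.cut(df['time'], bins=bins)

# 3. and 4. groupby on the label, then cumsum.
df['cumsum'] = df.groupby('interval', observed=True)['val'].cumsum()
```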

So the requests from the initial ticket above remain:

  • can this workflow be shortened by allowing direct use of IntervalIndex in either groupby or resample, as stated above?
  • also, can groupby or resample take advantage, during the grouping operation, of the rows of the initial DataFrame being sorted, which ensures that groups are de facto made of contiguous rows?

Thanks in advance for your feedback on these two last questions.
Sorry for the unclear request at the start.
Best regards


jreback commented Apr 30, 2021

pls replace the word demand with request


yohplala commented May 1, 2021

pls replace the word demand with request

Hi @jreback, sorry, I just looked up the definition of 'demand' in English. Indeed, that was not the idea. The French 'demande' does not carry the peremptory aspect of the English word 'demand'. Sorry again.


yohplala commented May 1, 2021

Hi,

To support the above requests, here are the consolidated use cases, with working code using existing features, and a proposal for how the current function signatures could be 'extended'.

Dataset:

import pandas as pd
import random

# Dataset
ts_raw = pd.DatetimeIndex([
                           pd.Timestamp('2021/01/01 00:37'),
                           pd.Timestamp('2021/01/01 00:40'),
                           pd.Timestamp('2021/01/01 00:48'),
                           pd.Timestamp('2021/01/01 01:00'),
                           pd.Timestamp('2021/01/01 03:45'),
                           pd.Timestamp('2021/01/01 03:59'),
                           pd.Timestamp('2021/01/01 04:55'),
                           pd.Timestamp('2021/01/01 05:20')])
length = len(ts_raw)
random.seed(1)
val = random.sample(range(1, length+1), length)
df_raw = pd.DataFrame({'val' : val, 'time': ts_raw})
In [50]: df_raw
Out[50]: 
   val                time
0    3 2021-01-01 00:37:00
1    5 2021-01-01 00:40:00
2    1 2021-01-01 00:48:00
3    8 2021-01-01 01:00:00
4    6 2021-01-01 03:45:00
5    2 2021-01-01 03:59:00
6    7 2021-01-01 04:55:00
7    4 2021-01-01 05:20:00

Case 1: not necessarily contiguous intervals defined by a 'start' and an 'end'

# Use case 1 // with not necessarily contiguous intervals defined by a 'start' and an 'end'.
# Definition of custom non-contiguous intervals from 'start' and 'end' arrays.
ts_start = pd.DatetimeIndex([
                             pd.Timestamp('2021/01/01 00:40'),
                             pd.Timestamp('2021/01/01 01:55'),
                             pd.Timestamp('2021/01/01 03:45'),
                             pd.Timestamp('2021/01/01 04:55')])
ts_end = pd.DatetimeIndex([
                             pd.Timestamp('2021/01/01 01:00'),
                             pd.Timestamp('2021/01/01 02:00'),
                             pd.Timestamp('2021/01/01 04:00'),
                             pd.Timestamp('2021/01/01 05:00')])
intervals = pd.IntervalIndex.from_arrays(ts_start, ts_end, closed='left')
# Keeping the interval of each row in 'df_raw', renamed with its right edge.
df_raw['interval'] = pd.IntervalIndex(pd.cut(df_raw['time'], bins=intervals)).right   # this line would not be kept with the requested modifications
result = df_raw.groupby('interval').agg({'val': 'sum'})  # this line would be changed with the requested modifications

Results

In [43]: result
Out[43]: 
                     val
interval                
2021-01-01 01:00:00    6
2021-01-01 04:00:00    8
2021-01-01 05:00:00    7

Requests here are (proposals):

  • 'Grouper' to accept an 'IntervalIndex' (through a new 'by' parameter?)
  • 'groupby' to accept a parameter 'groups_as_contiguous_rows=True/False', so computation can take advantage of groups being made of contiguous rows (or not).

With the proposed modifications, the last two lines would become:

grouper = pd.Grouper(by=intervals, key='time', label='right')
result = df_raw.groupby(grouper, observed=False, groups_as_contiguous_rows=True).agg({'val': 'sum'})

Case 2: contiguous intervals defined by a single sequence

# Definition of custom contiguous intervals from a single array.
ts_seg = pd.DatetimeIndex([
                           pd.Timestamp('2021/01/01 00:40'),
                           pd.Timestamp('2021/01/01 01:55'),
                           pd.Timestamp('2021/01/01 03:45'),
                           pd.Timestamp('2021/01/01 04:55')])

# Rows to conduct computations that could be made under the hood.
ts_start = ts_seg[:-1]
ts_end = ts_seg[1:]
intervals = pd.IntervalIndex.from_arrays(ts_start, ts_end, closed='left') # we now have an 'IntervalIndex' made from a sequence of scalars

# Remaining lines of code similar to the previous use case, but this time with a 'cumsum'.
df_raw['interval'] = pd.IntervalIndex(pd.cut(df_raw['time'], bins=intervals)).right
df_raw['cumsum'] = df_raw.groupby('interval')['val'].cumsum()
df_raw = df_raw.drop(columns='interval')    # column 'interval' is a temporary variable, dropped at the end.

Results

In [48]: df_raw
Out[48]: 
   val                time  cumsum
0    3 2021-01-01 00:37:00     NaN
1    5 2021-01-01 00:40:00     5.0
2    1 2021-01-01 00:48:00     6.0
3    8 2021-01-01 01:00:00    14.0
4    6 2021-01-01 03:45:00     6.0
5    2 2021-01-01 03:59:00     8.0
6    7 2021-01-01 04:55:00     NaN
7    4 2021-01-01 05:20:00     NaN

For this second use case, a complementary request is thus:

  • 'IntervalIndex' to accept a single sequence of scalars defining both the starts and the ends of the intervals (similar to the sequence of scalars that 'cut' accepts through its 'bins' parameter)

With the proposed modification, the intermediate 'under the hood' computations in the middle of the example become:

intervals = pd.IntervalIndex.from_array(ts_seg, closed='left')   # new method 'from_array' without 's' ?
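For what it is worth, the existing pd.IntervalIndex.from_breaks classmethod already appears to cover this: it builds n - 1 contiguous intervals from a single sequence of n breakpoints:

```python
import pandas as pd

# Four breakpoints -> three contiguous [left, right) intervals.
ts_seg = pd.to_datetime(['2021/01/01 00:40', '2021/01/01 01:55',
                         '2021/01/01 03:45', '2021/01/01 04:55'])
intervals = pd.IntervalIndex.from_breaks(ts_seg, closed='left')
```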

It may seem like the requests mostly focus on function signatures. This is what the use cases illustrate.

But an important motivation of the request is to know whether groupby can be made more efficient when the user states that groups are made of contiguous rows.
That is, once the code has dealt with a group / realizes, while reading the array, that the group has changed, it does not have to expect finding rows further in the array that belong to past groups.

Thanks in advance for your feedback,
Best regards

@yohplala

Hi there,
I realize this is a duplicate of #37949, though with more detailed examples perhaps?
Anyway, closing, and 'thumbing up' #37949.
