
ENH: 'Efficient' resampling with custom non-uniform non-overlapping time intervals #41212


Closed
yohplala opened this issue Apr 29, 2021 · 5 comments
Labels
Enhancement · Needs Triage (issue that has not been reviewed by a pandas team member)

Comments


yohplala commented Apr 29, 2021

Is your feature request related to a problem?

I would like to resample a sorted time series with a sorted list of non-uniform, non-overlapping and not necessarily contiguous time intervals.
Taking advantage of the fact that...

  • the time series is sorted,
  • and the non-uniform, not necessarily contiguous time intervals do not overlap,

...the time series needs to be read through only once.

When I posted my question on SO (using groupby just to show the intended result) and asked whether this could be done 'more efficiently' in pandas, Numba was suggested.

Describe the solution you'd like

I would propose to:

  • re-use IntervalIndex to define non-uniform, sorted and non-overlapping intervals. This property can then conveniently be checked with is_non_overlapping_monotonic
  • add a new parameter label to IntervalIndex to define how each interval should be named (left or right bound, similar to the resample API)
  • have resample accept an IntervalIndex
  • have the closed and label parameters forwarded from IntervalIndex to resample

Additionally, resample could also accept a single sequence of timestamps to define sorted, non-uniform, contiguous time intervals, similar to what cut accepts for its bins parameter.

API breaking implications

Use case example.

import pandas as pd

# Time series.
length = 14
ts = pd.date_range(start='2021/01/01 08:00', periods=length, freq='30T')
tseries = pd.DataFrame({'val':range(length)}, index=ts)

# Custom disjoint intervals.
hours_start = ['7:46', '9:25',  '9:35', '10:15', '10:55', '13:10']
hours_end  = ['8:43', '9:32', '10:09', '10:42', '12:55', '13:31']
days = ['2021/01/01']*len(hours_start)
start = pd.DatetimeIndex(map(pd.Timestamp, map(' '.join, zip(days, hours_start))))
end = pd.DatetimeIndex(map(pd.Timestamp, map(' '.join, zip(days, hours_end))))
custom_intervals = pd.IntervalIndex.from_arrays(start, end, closed='left')

# New API here, with 'closed' and 'label' parameters forwarded from 'IntervalIndex' to 'resample'.
tseries['cumsum'] = tseries['val'].resample(custom_intervals).agg({'val': 'cumsum'})['val']
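For comparison, the intended result can already be obtained with existing features. A minimal sketch (not the proposed API) using pd.cut to label each row with its interval, then a grouped cumsum; rows falling outside every interval get NaN:

```python
import pandas as pd

# Sorted time series: one row every 30 minutes.
length = 14
ts = pd.date_range(start='2021/01/01 08:00', periods=length, freq='30min')
tseries = pd.DataFrame({'val': range(length)}, index=ts)

# Custom disjoint [start, end) intervals.
hours_start = ['7:46', '9:25', '9:35', '10:15', '10:55', '13:10']
hours_end = ['8:43', '9:32', '10:09', '10:42', '12:55', '13:31']
start = pd.to_datetime([f'2021/01/01 {h}' for h in hours_start])
end = pd.to_datetime([f'2021/01/01 {h}' for h in hours_end])
custom_intervals = pd.IntervalIndex.from_arrays(start, end, closed='left')

# Label each row with its interval; rows outside every interval get NaN.
bins = pd.cut(tseries.index, bins=custom_intervals)

# Cumulative sum restarted within each interval (NaN outside).
tseries['cumsum'] = tseries['val'].groupby(bins).cumsum()
```

This still scans the intervals via searchsorted inside cut, so it does not by itself exploit the single-pass property the request is about.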

Describe alternatives you've considered

  • using Numba of course
  • using rolling. Am I right in thinking that resample is a convenience method calling rolling under the hood?

Additional context

I am assuming that resample already takes advantage of the time series being sorted by its DatetimeIndex. Am I right?


yohplala commented Apr 30, 2021

Hi,
I am sorry, there is a mistake in the above request for the case where the time intervals are not contiguous.
What should be done with rows that do not fall in any segment?

The two cases are actually clearly separate:

  • 1 with non-contiguous segments, an 'apply' subsequent step is necessary, and the rows of the initial DataFrame 'collapse' during aggregation, resulting in a number of rows equal to the number of time intervals.
  • 2 with contiguous segments, either an 'apply' or a 'transform' subsequent step is possible.

For case 1 (a DataFrame with rows that may fall outside the non-contiguous intervals),
I am currently using a Python loop, sped up with Numba.
I know of no pandas function able to operate only on a user-provided set of windows. Is there any?
I took the time to implement a custom BaseIndexer to be provided to rolling, having in mind that 'start' and 'end' would provide the indexes of the windows. Only afterwards did I realize that the returned arrays have to be the size of the initial DataFrame index, defining one window per row of the initial DataFrame. With this approach, no row can fall outside a window.
And there is no cumsum; it would have to be emulated with a sum, so clearly not efficient.
=> Here I guess my initial request can be discarded. While I think this feature is useful, it seems to require a new pandas command?
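As an aside on case 1: when both the series and the windows are sorted, each window's row slice can be located with one binary search per boundary instead of a Python loop. A minimal sketch with NumPy/pandas searchsorted (not an existing pandas feature; the data and names are illustrative):

```python
import numpy as np
import pandas as pd

# Sorted time series.
ts = pd.date_range('2021/01/01 00:30', periods=8, freq='15min')
vals = np.arange(1, 9)

# Sorted, non-overlapping, non-contiguous [start, end) windows.
starts = pd.DatetimeIndex(['2021/01/01 00:40', '2021/01/01 01:55'])
ends = pd.DatetimeIndex(['2021/01/01 01:00', '2021/01/01 02:00'])

# One binary search per boundary yields each window's row slice.
lo = ts.searchsorted(starts, side='left')
hi = ts.searchsorted(ends, side='left')

# Aggregate each window; rows outside every window are simply skipped.
sums = [int(vals[i:j].sum()) for i, j in zip(lo, hi)]
```

This costs O(m log n) for m windows over n rows, and naturally handles rows that belong to no window.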

For case 2 (contiguous intervals, with all rows of the DataFrame falling in an interval),
I will try the following for the moment:

  • use IntervalIndex to define custom bins from the 'start' and 'end' arrays,
  • apply these bins to the DatetimeIndex of my DataFrame with cut, generating a new column holding the interval label of each row,
  • then apply a groupby on this new column,
  • and finally compute a cumsum.
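The four steps above can be sketched as follows ('df', 'start' and 'end' are illustrative names, not from the original post):

```python
import pandas as pd

df = pd.DataFrame({
    'time': pd.to_datetime(['2021/01/01 00:45', '2021/01/01 00:50',
                            '2021/01/01 01:10']),
    'val': [3, 5, 1],
})

# 1. Custom bins from 'start' and 'end' arrays.
start = pd.to_datetime(['2021/01/01 00:40', '2021/01/01 01:00'])
end = pd.to_datetime(['2021/01/01 01:00', '2021/01/01 02:00'])
bins = pd.IntervalIndex.from_arrays(start, end, closed='left')

# 2. Interval label for each row, via cut.
df['interval'] = pd.cut(df['time'], bins=bins)

# 3. and 4. groupby on the label, then cumsum.
df['cumsum'] = df.groupby('interval', observed=True)['val'].cumsum()
```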

So the requests from the initial ticket above remain:

  • can this workflow be shortened by allowing direct use of IntervalIndex in either groupby or resample, as stated above?
  • also, can groupby or resample take advantage, during the grouping operation, of the rows of the initial DataFrame being sorted, which ensures that groups are de facto made of contiguous rows?

Thanks in advance for your feedback on these two last questions.
Sorry for the unclear request at the start.
Best regards


jreback commented Apr 30, 2021

pls replace the word demand with request


yohplala commented May 1, 2021

pls replace the word demand with request

Hi @jreback, sorry, I just looked up the definition of 'demand' in English. Indeed, that was not the idea. The French 'demande' does not carry the peremptory aspect of the English word 'demand'. Sorry again.


yohplala commented May 1, 2021

Hi,

To support the above requests, here are the consolidated use cases, with working code using existing features, and a proposal for how the current function signatures could be 'extended'.

Dataset:

import pandas as pd
import random

# Dataset
ts_raw = pd.DatetimeIndex([
                           pd.Timestamp('2021/01/01 00:37'),
                           pd.Timestamp('2021/01/01 00:40'),
                           pd.Timestamp('2021/01/01 00:48'),
                           pd.Timestamp('2021/01/01 01:00'),
                           pd.Timestamp('2021/01/01 03:45'),
                           pd.Timestamp('2021/01/01 03:59'),
                           pd.Timestamp('2021/01/01 04:55'),
                           pd.Timestamp('2021/01/01 05:20')])
length = len(ts_raw)
random.seed(1)
val = random.sample(range(1, length+1), length)
df_raw = pd.DataFrame({'val' : val, 'time': ts_raw})
In [50]: df_raw
Out[50]: 
   val                time
0    3 2021-01-01 00:37:00
1    5 2021-01-01 00:40:00
2    1 2021-01-01 00:48:00
3    8 2021-01-01 01:00:00
4    6 2021-01-01 03:45:00
5    2 2021-01-01 03:59:00
6    7 2021-01-01 04:55:00
7    4 2021-01-01 05:20:00

Case 1: not necessarily contiguous intervals defined by a 'start' and an 'end'

# Use case 1 // with not necessarily contiguous intervals defined by a 'start' and an 'end'.
# Definition of custom non-contiguous intervals from 'start' and 'end' arrays.
ts_start = pd.DatetimeIndex([
                             pd.Timestamp('2021/01/01 00:40'),
                             pd.Timestamp('2021/01/01 01:55'),
                             pd.Timestamp('2021/01/01 03:45'),
                             pd.Timestamp('2021/01/01 04:55')])
ts_end = pd.DatetimeIndex([
                             pd.Timestamp('2021/01/01 01:00'),
                             pd.Timestamp('2021/01/01 02:00'),
                             pd.Timestamp('2021/01/01 04:00'),
                             pd.Timestamp('2021/01/01 05:00')])
intervals = pd.IntervalIndex.from_arrays(ts_start, ts_end, closed='left')
# Keeping the interval of each row in 'df_raw', renamed with its right edge.
df_raw['interval'] = pd.IntervalIndex(pd.cut(df_raw['time'], bins=intervals)).right   # this line would not be kept with the requested modifications
result = df_raw.groupby('interval').agg({'val': 'sum'})  # this line would be changed with the requested modifications

Results

In [43]: result
Out[43]: 
                     val
interval                
2021-01-01 01:00:00    6
2021-01-01 04:00:00    8
2021-01-01 05:00:00    7

Requests here are (proposals):

  • 'Grouper' to accept an 'IntervalIndex' (through a new 'by' parameter?)
  • 'groupby' to accept a parameter 'groups_as_contiguous_rows=True/False', so computation can take advantage of groups being made of contiguous rows (or not).

With the proposed modifications, the last two lines would become:

grouper = pd.Grouper(by=intervals, key='time', label='right')
result = df_raw.groupby(grouper, observed=False, groups_as_contiguous_rows=True).agg({'val': 'sum'})

Case 2: contiguous intervals defined by a single sequence

# Definition of custom contiguous intervals from a single array.
ts_seg = pd.DatetimeIndex([
                           pd.Timestamp('2021/01/01 00:40'),
                           pd.Timestamp('2021/01/01 01:55'),
                           pd.Timestamp('2021/01/01 03:45'),
                           pd.Timestamp('2021/01/01 04:55')])

# Rows to conduct computations that could be made under the hood.
ts_start = ts_seg[:-1]
ts_end = ts_seg[1:]
intervals = pd.IntervalIndex.from_arrays(ts_start, ts_end, closed='left') # we now have an 'IntervalIndex' made from a sequence of scalars

# Remaining lines of code similar to the previous use case, but this time with a 'cumsum'.
df_raw['interval'] = pd.IntervalIndex(pd.cut(df_raw['time'], bins=intervals)).right
df_raw['cumsum'] = df_raw.groupby('interval')['val'].cumsum()
df_raw = df_raw.drop(columns='interval')    # column 'interval' is a temporary variable, dropped at the end.

Results

In [48]: df_raw
Out[48]: 
   val                time  cumsum
0    3 2021-01-01 00:37:00     NaN
1    5 2021-01-01 00:40:00     5.0
2    1 2021-01-01 00:48:00     6.0
3    8 2021-01-01 01:00:00    14.0
4    6 2021-01-01 03:45:00     6.0
5    2 2021-01-01 03:59:00     8.0
6    7 2021-01-01 04:55:00     NaN
7    4 2021-01-01 05:20:00     NaN

For this second use case, a complementary request is thus:

  • 'IntervalIndex' to accept a single sequence of scalars defining both the starts and the ends of the intervals (similar to the sequence of scalars that 'cut' accepts through its 'bins' parameter)

With the proposed modification, the intermediate 'under the hood' computations in the middle of the example become:

intervals = pd.IntervalIndex.from_array(ts_seg, closed='left')   # new method 'from_array' without 's' ?
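For what it is worth, the existing pd.IntervalIndex.from_breaks classmethod already appears to cover this: it builds n - 1 contiguous intervals from a single sequence of n breakpoints:

```python
import pandas as pd

# Four breakpoints -> three contiguous [left, right) intervals.
ts_seg = pd.to_datetime(['2021/01/01 00:40', '2021/01/01 01:55',
                         '2021/01/01 03:45', '2021/01/01 04:55'])
intervals = pd.IntervalIndex.from_breaks(ts_seg, closed='left')
```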

It may seem like the requests mostly focus on function signatures. This is what the use cases illustrate.

But an important motivation of the request is to know whether groupby can be made more efficient when the user states that groups are made of contiguous rows.
That is, once the code has dealt with a group / realizes, while reading the array, that the group has changed, it does not have to expect finding rows further in the array that belong to past groups.

Thanks in advance for your feedback,
Best regards

@yohplala

Hi there,
I realize this is a duplicate of #37949, though with more detailed examples perhaps?
Anyway, closing, and 'thumbing up' #37949.
