Description
Currently, the only way to query a time series object in the form of start_time < t < end_time is by using a boolean mask array, as using loc[slice] includes both the start and end of the range.
Querying with a boolean array is very slow compared to a slice on large dataframes, and it makes little sense: the mask is always the size of the whole dataframe, even when my query only looks at a fraction of the data (which is the reason for indexing in the first place).
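To make the difference concrete, here is a minimal sketch of the two approaches. The series, its dates, and the variable names are made up for illustration and are not taken from any real dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical data: one value per business day, starting 2000-01-03.
idx = pd.date_range("2000-01-03", periods=10, freq="B")
ts = pd.Series(np.arange(10.0), index=idx)

start = pd.Timestamp("2000-01-04")
end = pd.Timestamp("2000-01-10")

# Label-based slicing with .loc is closed on both sides:
# both start and end are part of the result.
closed = ts.loc[start:end]

# Strict inequalities need a boolean mask, and the mask has
# one entry per row of the whole index, not just the queried range.
mask = (ts.index > start) & (ts.index < end)
open_range = ts[mask]
```

Here closed contains both endpoints, while open_range drops them at the cost of building a mask as long as the index.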
How does df.query perform range queries?
Is there another method for closed/open ranges?
I'm looking for something similar to df.between_time that allows optionally including the start/end.
[from @TomAugspurger]
I think we're reluctant to add more complexity / options to indexing with .loc and friends, but this would be a good doc example of how to achieve it using lower-level methods:
In [24]: import pandas.util.testing as tm
In [25]: ts = tm.makeTimeSeries()
In [26]: ts
Out[26]:
2000-01-03 0.804101
2000-01-04 0.042160
2000-01-05 -0.580078
2000-01-06 0.757864
2000-01-07 -0.349766
2000-01-10 -0.058222
2000-01-11 -0.274172
2000-01-12 -1.539538
2000-01-13 0.505398
2000-01-14 0.665445
2000-01-17 0.998438
...
Freq: B, dtype: float64
Say you want to slice [2000-01-04, 2000-01-10) (so excluding the right endpoint):
In [27]: lo = ts.index.get_slice_bound("2000-01-04", "left", "loc")
In [28]: hi = ts.index.get_slice_bound("2000-01-10", "left", "loc")
In [29]: ts.iloc[lo:hi]
Out[29]:
2000-01-04 0.042160
2000-01-05 -0.580078
2000-01-06 0.757864
2000-01-07 -0.349766
Freq: B, dtype: float64
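The same idea can be wrapped in a small helper that makes each endpoint optional. This is only a sketch: slice_range and its include_start / include_end parameters are hypothetical names, not part of pandas. It assumes the index is sorted in increasing order, and it swaps in Index.searchsorted for get_slice_bound, since newer pandas versions may not accept the third ("loc") argument used above.

```python
import numpy as np
import pandas as pd

def slice_range(obj, start, end, include_start=True, include_end=False):
    """Hypothetical helper: slice a Series/DataFrame by position.

    Assumes obj.index is a DatetimeIndex sorted in increasing order.
    """
    index = obj.index
    # side="left" returns the first position whose label is >= the bound,
    # side="right" returns the first position whose label is > the bound.
    lo = index.searchsorted(pd.Timestamp(start),
                            side="left" if include_start else "right")
    hi = index.searchsorted(pd.Timestamp(end),
                            side="right" if include_end else "left")
    # Positional slicing returns only the rows in range; no full-size mask.
    return obj.iloc[lo:hi]

# Made-up series on the same business-day dates as the example above.
idx = pd.date_range("2000-01-03", periods=10, freq="B")
ts = pd.Series(np.arange(10.0), index=idx)

half_open = slice_range(ts, "2000-01-04", "2000-01-10")
fully_open = slice_range(ts, "2000-01-04", "2000-01-10", include_start=False)
fully_closed = slice_range(ts, "2000-01-04", "2000-01-10", include_end=True)
```

With the defaults this selects 2000-01-04 through 2000-01-07, the same dates as the iloc[lo:hi] result above. Because both bounds come from a binary search on the index, the cost does not depend on building a mask over every row.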