Skip to content

API: allow negative steps for label-based indexing #8753

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 1 commit into from
Nov 19, 2014

Conversation

immerrr
Copy link
Contributor

@immerrr immerrr commented Nov 8, 2014

This should fix #8716.

Some of this refactoring may be useful for #8613, so I'd like someone to look through this.

cc'ing @shoyer, @jreback and @jorisvandenbossche .

TODO:

  • add tests for label slicing with negative step (ix, loc)
    • int64index
    • float64index
    • objectindex
  • fix datetime/period indices
  • add tests for label slicing with negative step (datetime, string, native pandas scalar) x (ix, loc)
  • datetimeindex
  • periodindex
  • timedeltaindex
  • clean up old partial string code?
  • benchmark

raise

if isinstance(slc, np.ndarray):
# get_loc may return a boolean array or an array of indices, which
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a nice addition.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like it needs tests?

@shoyer
Copy link
Member

shoyer commented Nov 8, 2014

generally seems like a nice refactor to me 👍

@immerrr
Copy link
Contributor Author

immerrr commented Nov 8, 2014

This PR will consolidate Index.slice_locs with that of Datetime-/Period-/TimedeltaIndex, which means that the non-present lookups on is_monotonically_decreasing will start working on those classes.

@immerrr immerrr force-pushed the refactor-slice-locs branch 2 times, most recently from 8b63e9d to 6085288 Compare November 8, 2014 17:50
@jreback jreback added Bug Indexing Related to indexing on series/frames, not to indexes themselves labels Nov 9, 2014
@immerrr immerrr force-pushed the refactor-slice-locs branch 7 times, most recently from cbe858e to 30e7dc1 Compare November 10, 2014 07:54
@immerrr
Copy link
Contributor Author

immerrr commented Nov 10, 2014

Mostly done here, except for

  • maybe old partial string slice code can be cleaned up
  • per-level slices on MultiIndex don't work and I don't yet see a clear way to support steps for it
  • have to check performance


.. note::

Value of `side` parameter should be validated in caller.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Small comment (it does not really matter, as this will not get formatted in the online docs), but you can use a Notes section like:

Notes
-----
Value of ....

@shoyer
Copy link
Member

shoyer commented Nov 10, 2014

Again, this looks very nice. Thanks!

@jorisvandenbossche
Copy link
Member

@immerrr maybe you could have a look at https://github.com/pydata/pandas/blob/master/doc/source/internals.rst to see if this is still up to date after your PR (although I am not fully sure this is the best way to document this complex indexing part of the codebase)

slc = self.get_loc(label)
except KeyError:
if self.is_monotonic_increasing:
return self.searchsorted(label, side=side)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this not what we are discussing in #8613? (so it seems this is explicitly and intentionally implemented?)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The blocking out of bounds indexing is actually done in pandas.core.indexing._LocIndexer. So the functionality here is unchanged (though I agree the checks do make more sense here than in the indexing module)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I see, and the IXIndexer does not implement those checks, so ix can hit this codepath.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Moving the checks here should be trivial with adding the third side=='exact' or side='strict' that disables this branch of execution (and probably renaming the function and parameter to make more sense with that third value).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But I was really hoping that we'll come to dropping those checks altogether.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@immerrr agreed, but I do think we'll want at least a type check (not in this PR), e.g., to ensure pd.Index([1, 2, 3]).slice_indexer('a') raises on Python 2.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This one is easy: you just need to add something that throws on invalid type into _maybe_cast_slice_bound. The inverse is tricky (e.g. pd.Index(list('abc')).slice_indexer(1)), but should be doable once StringIndex lands (either by adding another index type or by adding a string-object numpy dtype).

@jreback jreback added this to the 0.15.2 milestone Nov 11, 2014
@jreback
Copy link
Contributor

jreback commented Nov 11, 2014

going to have a look tomorrow.

@immerrr
Copy link
Contributor Author

immerrr commented Nov 11, 2014

I've been doing some benchmarking and found the strangest thing: I couldn't get a stable result from panel_methods.panel_pct_change_major benchmark, it reported anything from 2000ms to 8000ms. It appears that the task is memory intensive (easily several gbs of ram) and leaves a lot of objects in its wake because timeit disables garbage collection during the benchmarks. The problem is that it doesn't do any cleanup afterwards and this slows down subsequent runs because malloc has harder time finding chunks of memory when requested.

Given usually there's no such thing as "run code before each iteration, but don't benchmark it" in benchmark libraries, I wonder what's the proper way to fix this... Maybe we should just reduce panel sizes in panel benchmarks.

@jreback
Copy link
Contributor

jreback commented Nov 11, 2014

@immerrr yep that vbench is odd, never really had time to look, feel free to submit a pr to reduce it in size.

return (Timestamp(st, tz=self.tz),
Timestamp(Timestamp(st + offsets.Second(),
tz=self.tz).value - 1))
elif reso == 'microsecond':
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you add a test for this (in the appropriate section where the partial string resos are tested)

@jreback
Copy link
Contributor

jreback commented Nov 11, 2014

@immerrr looks good. minor case that needs testing. Which might be an open issue in any event (e.g. supporting < second, e.g. microsecond reso in partial_strings for DatetimeIndex).

@immerrr
Copy link
Contributor Author

immerrr commented Nov 18, 2014

I devoted some time to this at the weekend:

  • enabled second/microsecond-resolution partial based indexing.
  • partial string lookups are going to stay for now, they're used in get_loc in seemingly a bit inconsistent way and fixing it may alter the API, so the cleanup will be a separate PR.

@immerrr immerrr force-pushed the refactor-slice-locs branch 2 times, most recently from 5fdd168 to 4034a79 Compare November 18, 2014 04:18
@@ -69,7 +69,7 @@ Bug Fixes
- ``io.data.Options`` now raises ``RemoteDataError`` when no expiry dates are available from Yahoo (:issue:`8761`).
- ``Timedelta`` kwargs may now be numpy ints and floats (:issue:`8757`).
- ``sql_schema`` now generates dialect appropriate ``CREATE TABLE`` statements (:issue:`8697`)

- Fix negative step support for label-based slices (:issue:`8753`)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you give a previous (use code-block) and current behavior mini example here. otherwise looks good.

@immerrr immerrr force-pushed the refactor-slice-locs branch from 6a81185 to 2f4b61f Compare November 19, 2014 07:49
INT: make Index.slice_locs step-aware

BUG: fix PeriodIndex.searchsorted to accept Periods

INT: refactor time-related indices to use step-aware slice_locs

INT: refactor MultiIndex to use step-aware slice_locs

INT: enable second/microsecond partial string slicing
@immerrr immerrr force-pushed the refactor-slice-locs branch from 2f4b61f to b735ffc Compare November 19, 2014 11:08
@immerrr
Copy link
Contributor Author

immerrr commented Nov 19, 2014

ready & green

jreback added a commit that referenced this pull request Nov 19, 2014
API: allow negative steps for label-based indexing
@jreback jreback merged commit 3274f27 into pandas-dev:master Nov 19, 2014
@jreback
Copy link
Contributor

jreback commented Nov 19, 2014

thanks!

keep-em coming!

@shoyer
Copy link
Member

shoyer commented Nov 19, 2014

Thank you @immerrr !

@immerrr immerrr deleted the refactor-slice-locs branch November 19, 2014 22:15
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Bug Indexing Related to indexing on series/frames, not to indexes themselves
Projects
None yet
Development

Successfully merging this pull request may close these issues.

BUG: label-based slices don't support negative step
4 participants