
WIP: Rework IndexEngine to better support hashtable-free algorithms. #14436


Closed · wants to merge 3 commits

Conversation

@ssanderson (Contributor) commented Oct 17, 2016

Initial work toward #14273. This makes `__contains__` behave uniformly across all index types by having it simply defer to `get_loc` (we can, and probably should, go back and add a fast path for the case where a hash table has already been populated). Assuming this general approach seems reasonable, the next step would be to update `get_indexer` to do binary-search-based lookups on large indices.
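
Roughly the shape I have in mind, as a simplified Python sketch (the real code lives in Cython in `pandas/index.pyx`; the class and attribute names here are illustrative, not the engine's actual internals):

```python
import numpy as np


class IndexEngineSketch:
    """Illustrative stand-in for IndexEngine, not the actual implementation."""

    def __init__(self, values, avoid_hashtable=False):
        self.values = values              # underlying ndarray of index values
        self.avoid_hashtable = avoid_hashtable
        self.mapping = None               # hash table, built lazily (or never)

    def __contains__(self, key):
        # Fast path: if a hash table has already been populated, use it.
        if self.mapping is not None:
            return key in self.mapping
        # Otherwise defer to get_loc so membership behaves uniformly
        # across index types and lookup strategies.
        try:
            self.get_loc(key)
            return True
        except KeyError:
            return False

    def get_loc(self, key):
        # Placeholder: the real engine would choose between a hash-table
        # lookup and binary search here (see the benchmarks further down).
        locs = np.nonzero(self.values == key)[0]
        if len(locs) == 0:
            raise KeyError(key)
        return int(locs[0])
```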

A few high-level design questions:

  • I haven't benchmarked yet to see how big the difference is, but these changes theoretically trade slower lookups for a smaller memory footprint. Depending on users' usage patterns and resource constraints, that tradeoff may or may not be an improvement for them. I added an `avoid_hashtable` keyword to `IndexEngine` to make it easier to test the new behavior interactively without having to build giant indices. Do we want to support a user-configurable toggle like this for cases where people know they really do (or don't) want a hash table? If so, where would they pass that option? Alternatively, should `_SIZE_CUTOFF` be configurable, perhaps via `pandas.set_option`?
  • One odd behavior I ran into while working on this is that `Int64Index.__contains__` silently truncates floats to integers before doing lookups, while `Int64Index.get_loc` does not. On this branch, that coercion is only preserved for integral-valued floats (coercing integral floats to ints was explicitly tested). The same thing happens for `DatetimeIndex`. Is there any concern about "breaking" the behavior of `Int64Index.__contains__` for non-integral floats? (A sketch of the coercion rule follows the traceback below.)
In [11]: pd.__version__
Out[11]: u'0.19.0'

In [12]: i
Out[12]: Int64Index([1, 2, 3, 5], dtype='int64')

In [13]: 1.5 in i
Out[13]: True

In [14]: i.get_loc(1.5)
---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
<ipython-input-14-c37f93fbd72a> in <module>()
----> 1 i.get_loc(1.5)

/home/ssanderson/.virtualenvs/pandas-19/local/lib/python2.7/site-packages/pandas/indexes/base.pyc in get_loc(self, key, method, tolerance)
   2104                 return self._engine.get_loc(key)
   2105             except KeyError:
-> 2106                 return self._engine.get_loc(self._maybe_cast_indexer(key))
   2107
   2108         indexer = self.get_indexer([key], method=method, tolerance=tolerance)

pandas/index.pyx in pandas.index.IndexEngine.get_loc (pandas/index.c:4160)()

pandas/index.pyx in pandas.index.IndexEngine.get_loc (pandas/index.c:3983)()

pandas/index.pyx in pandas.index.Int64Engine._check_type (pandas/index.c:7773)()

KeyError: 1.5
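
For concreteness, the coercion rule preserved on this branch could look something like the following sketch (illustrative only; in the real code this sits behind `_maybe_cast_indexer`, which appears in the traceback above):

```python
def maybe_cast_int_key(key):
    # Coerce only integral-valued floats (e.g. 1.0 -> 1), so membership
    # checks like `1.0 in idx` keep working, while non-integral floats
    # such as 1.5 are no longer silently truncated to 1.
    if isinstance(key, float) and key.is_integer():
        return int(key)
    return key
```

With that rule, `1.0 in i` stays `True` for the index above, while `1.5 in i` would become `False`, matching `i.get_loc(1.5)` raising `KeyError`.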
- [ ] closes #14273
- [ ] tests added / passed
- [ ] passes `git diff upstream/master | flake8 --diff`
- [ ] whatsnew entry

@chris-b1 (Contributor) commented

Some fairly simple/naive benchmarking suggests performance actually improves once the index is reasonably large. So, assuming this example is representative, you could probably always do binary search above a cutoff, without needing to expose that option.

In [1]: idx = pd.Index(np.arange(0, 100000000, step=2))

In [2]: idx.get_loc(250000)
Out[2]: 125000

In [3]: %timeit idx.get_loc(250000)
The slowest run took 4.52 times longer than the fastest. This could mean that an intermediate result is being cached.
100000 loops, best of 3: 8.64 µs per loop

In [4]: %timeit np.searchsorted(idx.values, 250000)
The slowest run took 34.45 times longer than the fastest. This could mean that an intermediate result is being cached.
1000000 loops, best of 3: 1.38 µs per loop

In [5]: %timeit idx.get_indexer(np.arange(0, 100000000, step=10))
1 loop, best of 3: 4.43 s per loop

In [6]: %timeit np.searchsorted(idx.values, np.arange(0, 100000000, step=10))
1 loop, best of 3: 416 ms per loop
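
For reference, the cutoff-based dispatch could be as simple as the following sketch (the threshold value, the small-index fallback, and the assumption of a sorted, unique index are all placeholders, not what the engine actually does):

```python
import numpy as np

SIZE_CUTOFF = 1000000  # placeholder threshold; could be exposed via pd.set_option


def get_loc_sketch(values, key):
    """Locate `key` in a sorted, unique ndarray, avoiding a hash table when large."""
    if len(values) >= SIZE_CUTOFF:
        # Binary search: O(log n) per lookup, and no O(n) hash table in memory.
        loc = np.searchsorted(values, key)
        if loc < len(values) and values[loc] == key:
            return int(loc)
        raise KeyError(key)
    # Small index: building a hash table (here just a dict) is cheap and
    # gives O(1) repeated lookups.
    mapping = {v: i for i, v in enumerate(values)}
    if key not in mapping:
        raise KeyError(key)
    return mapping[key]
```

Under these (assumed) rules, the large-index case never materializes the hash table, which is the memory saving this PR is after.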

@jreback added the Indexing (Related to indexing on series/frames, not to indexes themselves) and Performance (Memory or execution speed performance) labels Oct 19, 2016
@jreback (Contributor) commented Oct 19, 2016

I think we need a set of tests that validate whether the hash tables get instantiated, e.g. something simple like this could work:

    def get_state(self):
        """ return my internal state """
        return {attr: getattr(self, attr) for attr in ['initialized', 'monotonic_check',
                                                       'unique', 'unique_check']}

Then you can have a set of tests that assert the internal state of the IndexEngine as you perform operations.
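
Something along these lines, for instance — a sketch assuming the `get_state` helper above has been added to the engine, and that `'initialized'` is the flag tracking whether the hash table was built (exact attribute names depend on the engine internals):

```python
import numpy as np
import pandas as pd


def test_contains_avoids_hashtable():
    # Large monotonic index: membership checks should take the
    # binary-search path and never populate the hash table.
    idx = pd.Index(np.arange(2000000))

    assert 10 in idx
    assert 2000001 not in idx

    state = idx._engine.get_state()
    assert not state['initialized']
```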

@jreback (Contributor) commented Dec 6, 2016

@ssanderson haven't looked at this in a while.

can you rebase / update?

@ssanderson (Contributor, Author) commented

@jreback I'm still interested in this, just haven't been able to dedicate a solid block of time to it. I'll try to at least rebase against master this week.

@jreback (Contributor) commented Dec 30, 2016

Note that index.pyx was mostly moved to templates, so this is in fact more generic now.

@jreback (Contributor) commented Feb 27, 2017

Closing for now, @ssanderson, but feel free to update if you can.

@jreback closed this Feb 27, 2017
@jreback added this to the No action milestone Feb 27, 2017
Labels: Indexing · Performance
Linked issue: Consider extending hashtable-free indexing algorithms for large sorted indexes (#14273)