
WIP: Rework IndexEngine to better support hashtable-free algorithms. #14436


Closed · wants to merge 3 commits

Conversation

@ssanderson (Contributor) commented Oct 17, 2016

Initial work toward #14273. This makes `__contains__` behave uniformly across all index types by having it simply defer to `get_loc` (we can, and probably should, go back and add a fast path for the case where a hash table has already been populated). Assuming this general approach seems reasonable, the next step would be to update `get_indexer` to do binary-search-based lookups on large indices.
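
Roughly the shape I have in mind, as a simplified Python sketch (the real code lives in Cython in `pandas/index.pyx`; the class and attribute names here are illustrative, not the engine's actual internals):

```python
import numpy as np


class IndexEngineSketch:
    """Illustrative stand-in for IndexEngine, not the actual implementation."""

    def __init__(self, values, avoid_hashtable=False):
        self.values = values              # underlying ndarray of index values
        self.avoid_hashtable = avoid_hashtable
        self.mapping = None               # hash table, built lazily (or never)

    def __contains__(self, key):
        # Fast path: if a hash table has already been populated, use it.
        if self.mapping is not None:
            return key in self.mapping
        # Otherwise defer to get_loc so membership behaves uniformly
        # across index types and lookup strategies.
        try:
            self.get_loc(key)
            return True
        except KeyError:
            return False

    def get_loc(self, key):
        # Placeholder: the real engine would choose between a hash-table
        # lookup and binary search here (see the benchmarks further down).
        locs = np.nonzero(self.values == key)[0]
        if len(locs) == 0:
            raise KeyError(key)
        return int(locs[0])
```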

A few high-level design questions:

  • I haven't benchmarked yet to see how big the difference is, but these changes theoretically trade slower lookups for a smaller memory footprint. Depending on users' usage patterns and resource constraints, that tradeoff may or may not be an improvement for them. I added an `avoid_hashtable` keyword to `IndexEngine` to make it easier to test the new behavior interactively without having to build giant indices. Do we want to support a user-configurable toggle like this for cases where people know they really do (or don't) want a hash table? If so, where would they pass that option? Alternatively, should `_SIZE_CUTOFF` be configurable, perhaps via `pandas.set_option`?
  • One odd behavior I ran into while working on this is that `Int64Index.__contains__` silently truncates floats to integers before doing lookups, while `Int64Index.get_loc` does not. On this branch, that coercion is only preserved for integral-valued floats (coercing integral floats to ints was explicitly tested). The same thing happens for `DatetimeIndex`. Is there any concern about "breaking" the behavior of `Int64Index.__contains__` for non-integral floats? (A sketch of the coercion rule follows the traceback below.)
In [11]: pd.__version__
Out[11]: u'0.19.0'

In [12]: i
Out[12]: Int64Index([1, 2, 3, 5], dtype='int64')

In [13]: 1.5 in i
Out[13]: True

In [14]: i.get_loc(1.5)
---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
<ipython-input-14-c37f93fbd72a> in <module>()
----> 1 i.get_loc(1.5)

/home/ssanderson/.virtualenvs/pandas-19/local/lib/python2.7/site-packages/pandas/indexes/base.pyc in get_loc(self, key, method, tolerance)
   2104                 return self._engine.get_loc(key)
   2105             except KeyError:
-> 2106                 return self._engine.get_loc(self._maybe_cast_indexer(key))
   2107
   2108         indexer = self.get_indexer([key], method=method, tolerance=tolerance)

pandas/index.pyx in pandas.index.IndexEngine.get_loc (pandas/index.c:4160)()

pandas/index.pyx in pandas.index.IndexEngine.get_loc (pandas/index.c:3983)()

pandas/index.pyx in pandas.index.Int64Engine._check_type (pandas/index.c:7773)()

KeyError: 1.5
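
For concreteness, the coercion rule preserved on this branch could look something like the following sketch (illustrative only; in the real code this sits behind `_maybe_cast_indexer`, which appears in the traceback above):

```python
def maybe_cast_int_key(key):
    # Coerce only integral-valued floats (e.g. 1.0 -> 1), so membership
    # checks like `1.0 in idx` keep working, while non-integral floats
    # such as 1.5 are no longer silently truncated to 1.
    if isinstance(key, float) and key.is_integer():
        return int(key)
    return key
```

With that rule, `1.0 in i` stays `True` for the index above, while `1.5 in i` would become `False`, matching `i.get_loc(1.5)` raising `KeyError`.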
- [ ] closes #14273
- [ ] tests added / passed
- [ ] passes `git diff upstream/master | flake8 --diff`
- [ ] whatsnew entry

@chris-b1 (Contributor) commented

Some fairly simple/naive benchmarking suggests performance actually improves once the index is reasonably large. So, assuming this example is representative, you could probably always do binary search above a cutoff, without needing to expose that option.

In [1]: idx = pd.Index(np.arange(0, 100000000, step=2))

In [2]: idx.get_loc(250000)
Out[2]: 125000

In [3]: %timeit idx.get_loc(250000)
The slowest run took 4.52 times longer than the fastest. This could mean that an intermediate result is being cached.
100000 loops, best of 3: 8.64 µs per loop

In [4]: %timeit np.searchsorted(idx.values, 250000)
The slowest run took 34.45 times longer than the fastest. This could mean that an intermediate result is being cached.
1000000 loops, best of 3: 1.38 µs per loop

In [5]: %timeit idx.get_indexer(np.arange(0, 100000000, step=10))
1 loop, best of 3: 4.43 s per loop

In [6]: %timeit np.searchsorted(idx.values, np.arange(0, 100000000, step=10))
1 loop, best of 3: 416 ms per loop
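
For reference, the cutoff-based dispatch could be as simple as the following sketch (the threshold value, the small-index fallback, and the assumption of a sorted, unique index are all placeholders, not what the engine actually does):

```python
import numpy as np

SIZE_CUTOFF = 1000000  # placeholder threshold; could be exposed via pd.set_option


def get_loc_sketch(values, key):
    """Locate `key` in a sorted, unique ndarray, avoiding a hash table when large."""
    if len(values) >= SIZE_CUTOFF:
        # Binary search: O(log n) per lookup, and no O(n) hash table in memory.
        loc = np.searchsorted(values, key)
        if loc < len(values) and values[loc] == key:
            return int(loc)
        raise KeyError(key)
    # Small index: building a hash table (here just a dict) is cheap and
    # gives O(1) repeated lookups.
    mapping = {v: i for i, v in enumerate(values)}
    if key not in mapping:
        raise KeyError(key)
    return mapping[key]
```

Under these (assumed) rules, the large-index case never materializes the hash table, which is the memory saving this PR is after.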

@jreback added the Indexing (Related to indexing on series/frames, not to indexes themselves) and Performance (Memory or execution speed performance) labels Oct 19, 2016
@jreback (Contributor) commented Oct 19, 2016

I think we need a set of tests that validate whether the hash tables get instantiated, e.g. something simple like this could work:

    def get_state(self):
        """ return my internal state """
        return {attr: getattr(self, attr) for attr in ['initialized', 'monotonic_check',
                                                       'unique', 'unique_check']}

Then you can have a set of tests that assert the internal state of the IndexEngine as you perform operations.
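
Something along these lines, for instance — a sketch assuming the `get_state` helper above has been added to the engine, and that `'initialized'` is the flag tracking whether the hash table was built (exact attribute names depend on the engine internals):

```python
import numpy as np
import pandas as pd


def test_contains_avoids_hashtable():
    # Large monotonic index: membership checks should take the
    # binary-search path and never populate the hash table.
    idx = pd.Index(np.arange(2000000))

    assert 10 in idx
    assert 2000001 not in idx

    state = idx._engine.get_state()
    assert not state['initialized']
```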

@jreback (Contributor) commented Dec 6, 2016

@ssanderson haven't looked at this in a while.

can you rebase / update?

@ssanderson (Contributor, Author) commented

@jreback I'm still interested in this, just haven't been able to dedicate a solid block of time to it. I'll try to at least rebase against master this week.

@jreback (Contributor) commented Dec 30, 2016

Note that index.pyx was mostly moved to templates, so this is in fact more generic now.

@jreback (Contributor) commented Feb 27, 2017

Closing for now, @ssanderson, but feel free to update if you can.

@jreback closed this Feb 27, 2017
@jreback added this to the No action milestone Feb 27, 2017
Labels: Indexing · Performance
Linked issue: Consider extending hashtable-free indexing algorithms for large sorted indexes (#14273)