WIP: Rework IndexEngine to better support hashtable-free algorithms. #14436
Initial work toward #14273. This makes `__contains__` behave uniformly across all index types by making it simply defer to `get_loc` (we can and probably should go back and put in a fast path for the case where a hash table has already been populated). Assuming this general path seems reasonable, the next step would be to update `get_indexer` to do binary-search-based lookups on large indices.

A few high-level design questions:
- This branch adds an `avoid_hashtable` keyword to IndexEngine to make it easier to test the new behavior interactively without having to work with giant indices. Do we want to support a user-configurable toggle like this for cases where people know that they really do or don't want a hash table? If so, where would they pass that option? Alternatively, should `_SIZE_CUTOFF` be configurable, perhaps via `pandas.set_option`?
- `Int64Index.__contains__` silently truncates floats to integers before doing lookups, while `Int64Index.get_loc` does not. On this branch, that behavior is only preserved for integral-valued floats (the behavior of coercing integral floats to ints was explicitly tested). This also happens for `DatetimeIndex`. Is there any concern about "breaking" the behavior of `Int64Index.__contains__` for non-integral floats?
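
To make the questions above concrete, here is a minimal pure-Python sketch of the proposed behavior. It is not the Cython code in this PR: `ToyIndexEngine`, its `_coerce` helper, and the cutoff value are hypothetical names and numbers chosen for illustration, and the sketch assumes the values are sorted and unique so that binary search is valid.

```python
from bisect import bisect_left


class ToyIndexEngine:
    """Hypothetical stand-in for IndexEngine; not the pandas implementation."""

    # Assumed cutoff for illustration only; the real _SIZE_CUTOFF may differ.
    _SIZE_CUTOFF = 1_000_000

    def __init__(self, values, avoid_hashtable=False):
        # Assumes `values` is sorted and unique.
        self.values = list(values)
        self.avoid_hashtable = avoid_hashtable
        self._mapping = None  # hash table, built lazily on first use

    def _use_binary_search(self):
        return self.avoid_hashtable or len(self.values) > self._SIZE_CUTOFF

    @staticmethod
    def _coerce(key):
        # Only integral-valued floats are coerced to ints; 2.5 stays 2.5.
        if isinstance(key, float) and key.is_integer():
            return int(key)
        return key

    def get_loc(self, key):
        key = self._coerce(key)
        if self._use_binary_search():
            # Hashtable-free path: O(log n) lookup, no extra memory.
            pos = bisect_left(self.values, key)
            if pos < len(self.values) and self.values[pos] == key:
                return pos
            raise KeyError(key)
        # Hash table path: build the mapping once, then O(1) lookups.
        if self._mapping is None:
            self._mapping = {v: i for i, v in enumerate(self.values)}
        return self._mapping[key]

    def __contains__(self, key):
        # Defer to get_loc so both methods agree on every key.
        try:
            self.get_loc(key)
        except (KeyError, TypeError):
            return False
        return True


engine = ToyIndexEngine([1, 2, 3, 5, 8], avoid_hashtable=True)
print(2 in engine, 2.0 in engine, 2.5 in engine)  # True True False
```

In this sketch `2.5 in engine` is `False`, whereas truncating the float before the lookup would have matched `2`. That difference is the behavior change the second question asks about.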