- 
          
- 
                Notifications
    You must be signed in to change notification settings 
- Fork 19.2k
PERF: CategoricalIndex.get_loc should avoid expensive cast of .codes to int64 #21699
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
| Codecov Report
 @@            Coverage Diff             @@
##           master   #21699      +/-   ##
==========================================
+ Coverage   92.18%   92.19%   +<.01%     
==========================================
  Files         169      169              
  Lines       50820    50821       +1     
==========================================
+ Hits        46850    46852       +2     
+ Misses       3970     3969       -1
 
 Continue to review full report at Codecov. 
 | 
| so 2) is the prefered solution here. you can simply use the templating like we do in most modules which generates code. its not trivial but not super complicated either. | 
| @topper-123 what issues were you running into with Cython? As far as generating the indexes goes I would think you could just add the appropriate types below: 
 Not sure how you were planning to dynamically select the appropriate engine in  | 
| Ok, I think I understand the design now, at least. What is the templating language used and how is the pxi-file generated? So would the correct approach be to: 
 | 
| Ok, got algos.common_helper.pxi.in compiling and works fine. index_helper.pxi throws a compiler error. I added to the dtype list in index_helper.pxi.in: but get this error when compiling: cythoning pandas\_libs/index.pyx to pandas\_libs\index.c
Error compiling Cython file:
------------------------------------------------------------
...
    cdef _maybe_get_bool_indexer(self, object val):
        cdef:
            ndarray[uint8_t, ndim=1, cast=True] indexer
            ndarray[intp_t, ndim=1] found
            ndarray[int8_t] values
                         ^
------------------------------------------------------------
pandas\_libs\index_class_helper.pxi:179:26: Invalid type.It appears the compiler does not like the int8_t type. Using  Any ideas what's happening? | 
| you need to do the imports of  | 
| Thanks, got everything compiling with Int8Engine, which is good. However, when running, it complains there is no Int8HashTable. Seems like there isn't an easy way to create an Int8HashTable, is that correct? ~\Documents\Python\pandasdev\pandasdev\pandas\_libs\index_class_helper.pxi in pandas._libs.index.Int8Engine._make_hash_table()
    212
    213     cdef _make_hash_table(self, n):
--> 214         return _hash.Int8HashTable(n)
AttributeError: module 'pandas._libs.hashtable' has no attribute 'Int8HashTable'I tried looking into using a Int64HashTable in Int8Engine, but that seems to bring a different set of tradeoffs... Is there an easy way to create a Int8HashTable, or is it reasonable to use a Int64HashTable in Int8Engine? | 
| you might need to create the other hash tables  (definitions) as well | 
fbb711f    to
    db5c447      
    Compare
  
    | Ok, I've tried creating the int8 hash table, but got some issues. Everything compiles up till and including " add Int8Index", but that runs into the issue that no Int8HashTable existst. I've tried creating a Int8HashTable (see the commit "add Int8HashTable"), but I can't get to compile. It gives this compilation error: I've tried messing with khash.pxd to get that kh_*_int8 working, but can't get it running. What does khash.pxd do? It seems to in turn depend on khash_python.h, which only seems to have int64, int32 anf float64 version available. Must that be changed before getting a Int8HashTable working? Guidance would be very much appreciated. | 
| hmm, i guess khash doesn't support this either. ok, so maybe let's do a precursor PR to do this properly | 
| Ok, just to confirm, so you/someone else will do the PR to get Int8HashTable etc. working, and I'll wait for that and afterwards I will implement Int8Engine etc? | 
| @topper-123 well I suppose someone can do a PR to make other types of hashtables work. cc @pandas-dev/pandas-core | 
| That would be great. If someone makes the hash tables I will take this the rest of the way. | 
e28b353    to
    6e1adff      
    Compare
  
    | I've made a solution for the problem of not having Int8HashTable, Int16HashTable and Int32HashTable by using Int64HashTable in Int8Engine etc. All the tests pass and performance is the same as above. Is this an ok approach? I could then add some tests and ASVs as needed. A nice side effect BTW is that we now have and can use  >>> n = 1_000_000
>>> ci = pd.CategoricalIndex(list('a' * n + 'b' * n + 'c' * n))
# the below is called in ci.is_monotonic etc. in master
>>> %timeit pd._libs.algos.is_monotonic_int64(ci.codes.astype('int64'), False)
21.6 ms  # master
# the below is called in ci.is_monotonic etc. in this PR
>>> %timeit pd._libs.algos.is_monotonic_int8(ci.codes, False)
5.6 ms  # this PRSo all code that calls CategoricalIndex.is_monotonic/.is_unique will benefit also. | 
| """ | ||
| codes = self.categories.get_loc(key) | ||
| if (codes == -1): | ||
| raise KeyError(key) | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
codes can never be -1. If key is not in self.categories, a KeyError was always raised.
| def _engine(self): | ||
|  | ||
| # we are going to look things up with the codes themselves | ||
| return self._engine_type(lambda: self.codes.astype('i8'), len(self)) | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think you need any changes in cython if you simply so: .astype('i8', copy=False), which is the cause of the perf issue, no?
its still not as nice as actually using type specific hashtables though (which this PR is not addressing)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I tried changing that. Still is slow, 14 ms.
In index_class_helper.pxi.in, I also tried changing
    cdef _get_index_values(self):
        return algos.ensure_{{dtype}}(self.vgetter())to
    cdef _get_index_values(self):
        return self.vgetter()But also slow, 14 ms.
I agree that type specific hash tables would be nicer, but I've tried and I failed making it work. If someone could contribute those, I could change this PR to use type specific hash tables.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
then you have something else going on
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The numpy docs say about the copy parameter of array.astype:
By default, astype always returns a newly allocated array. If this is set
to false, and the dtype, order, and subok requirements are satisfied,
the input array is returned instead of a copy.
So, calling astype(..., copy=False) will only avoid returning a copy when the dtype of codes is int64, i.e. in practice never for Categoricals.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
so try using algos.ensure_int64(values) I find this extremely odd that you have to do the if inside cython.
        
          
                pandas/core/indexes/category.py
              
                Outdated
          
        
      | raise KeyError(key) | ||
| return self._engine.get_loc(codes) | ||
| code = self.categories.get_loc(key) | ||
| # dtype must be same as dtype for codes for searchsorted not lose speed | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
blank line between
6e1adff    to
    912d81b      
    Compare
  
    | Hey,  I don't see any good way forward other than someone commiting to make an Int8/16/32HashTable at some point. The question is thought if this PR needs to wait for that, or can be pulled in now and the relevant lines be changed when the hash table is implemented. I argue pulling this in now, as index operations are critical for performance and that would allow me to pursue a few more index-related performance issues. | 
| def _engine(self): | ||
|  | ||
| # we are going to look things up with the codes themselves | ||
| return self._engine_type(lambda: self.codes.astype('i8'), len(self)) | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
so try using algos.ensure_int64(values) I find this extremely odd that you have to do the if inside cython.
912d81b    to
    c4f74cb      
    Compare
  
    | 
 A technically possible solution to avoid that if statement while keeping the speed improvements would be to input a int64 array into _engine:     def _engine(self):
        codes = self.codes.astype('int64')
        return self._engine_type(lambda: codes, len(self))But that would keep an int64 array in memory, in addition to an int8 array, so doesn't make any sense memorywise. We want to just use the existing int8 array. I made another solution, where I've moved this specific implementation to index_class_helper.pxi.in. This makes its location more specific to the problem it solves. | 
        
          
                doc/source/whatsnew/v0.24.0.txt
              
                Outdated
          
        
      | - Improved performance of membership checks in :class:`Categorical` and :class:`CategoricalIndex` | ||
| (i.e. ``x in cat``-style checks are much faster). :meth:`CategoricalIndex.contains` | ||
| is likewise much faster (:issue:`21369`, :issue:`21508`) | ||
| - Improved performance of :func:`Series.describe` in case of numeric dtypes (:issue:`21274`) | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
generally pls try to avoid changing orthogonal things.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I can see how that may be confusing. I'll limit doing this from now on, but this was just a spelling error, that I corrected, as it was adjacent to my text..
        
          
                pandas/tests/indexes/conftest.py
              
                Outdated
          
        
      | 'UInt64', 'UInt32', 'UInt16', 'UInt8', | ||
| 'Float64', 'Float32', | ||
| ]) | ||
| def num_engine(request): | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
make this more descriptive, numeric_indexing_engine
| class TestCategoricalIndexEngine(object): | ||
|  | ||
| @pytest.mark.parametrize('nbits', [8, 16, 32, 64]) | ||
| def test_engine_type(self, nbits): | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
woa, this is very complicated. why is this necessary? can you make much more clear
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, I've made it simpler.
        
          
                pandas/tests/indexes/test_engine.py
              
                Outdated
          
        
      |  | ||
| class TestNumericEngine(object): | ||
|  | ||
| @pytest.mark.parametrize('data', [[0, 1, 2]]) | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why is this parameterized over data?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed. You start out with a certain structure and forget to check if is sensible for the end product...
        
          
                pandas/tests/indexes/test_engine.py
              
                Outdated
          
        
      |  | ||
| @pytest.mark.parametrize('data', [[0, 1, 2]]) | ||
| def test_is_monotonic_ordered(self, data, num_engine): | ||
| codes = np.array(data, dtype=num_engine._dtype) | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why do you have multiple tests for monotonic? and not monotonic?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, there just one data to check for. Fixed.
adac8f4    to
    9836666      
    Compare
  
    | Changes according to comments. | 
| Is this ok to merge in? | 
| ping @jreback | 
| will have a look | 
| not sure of the state of thIS PR, needs a rebase | 
| @jreback : I believe this PR is ready for review see above. However, time as a resource is limited, and this one got overlooked unfortunately. @topper-123 : Sorry for the delay, but can you rebase this? | 
9836666    to
    1a438a3      
    Compare
  
    | Hello @topper-123! Thanks for updating the PR. 
 | 
| Rebased and green. EDIT: ok, the whatsnew disappeared in the rebase. I've added it again, while the code itself unchanged. | 
1a438a3    to
    6aa94e9      
    Compare
  
    | Hello @topper-123! Thanks for updating the PR. 
 | 
| @topper-123 : Thanks! Can you double check if you addressed all comments? IINM, I see a few that might need attention... Feel free to resolve those that you think have completed discussion. | 
| I am not convinced this is actually a good solution here. I think there was a much simpler version that worked. This is adding quite a bit of code and complexity. I would like to see the performance tests split off into a new PR (and the codes can never be -1) bug fix as a first step. | 
| Closed in favor of #23235. | 
git diff upstream/master -u -- "*.py" | flake8 --diffThis is the the final puzzle for #20395.
The problem with the current implementation is that
CategoricalIndex._engineconstantly recodes int8 arrays to int64 arrays:pandas/pandas/core/indexes/category.py
Lines 365 to 369 in dc45fba
Notice the
lambda: self.codes.astype('i8')part, which means that every time_engine.vgetteris called, an int64-array is created. This is very expensive and we would ideally want to just use the original array dtype (int8, int16 or whatever) always and avoid this conversion.A complicating issue is that it is not enough to just avoid the int64 transformation, as
array.searchsortedapparantly needs a dtype-compatible input or else it is also very slow:Solution options
As CategoricalIndex.codes may be int8, int16, etc, the solution must be to (1) have an indexing engine for each integer dtype or an indexing engine that accepts all int types, not just int64 and (2) that the key must be translated into the same dtype as the codes array before calling searchsorted. So either:
algos.is_monotonic_int64for checking monotonicity)I assume option 1 is not desired, and option 3 assumedly likewise. In the updated PR I've made a proposal in Cython, that attains the needed speed.
Benchmarks from
asv_bench/indexing.py