- 
          
- 
                Notifications
    You must be signed in to change notification settings 
- Fork 19.2k
BUG: IntervalIndex.get_loc/get_indexer wrong return value / error #25090
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Revert pandas-dev#24048 change that caused bug.
| Codecov Report
 @@           Coverage Diff           @@
##           master   #25090   +/-   ##
=======================================
  Coverage   92.37%   92.37%           
=======================================
  Files         166      166           
  Lines       52420    52420           
=======================================
  Hits        48423    48423           
  Misses       3997     3997
 
 Continue to review full report at Codecov. 
 | 
| Codecov Report
 @@            Coverage Diff             @@
##           master   #25090      +/-   ##
==========================================
+ Coverage   91.48%   91.89%   +0.41%     
==========================================
  Files         175      175              
  Lines       52885    52495     -390     
==========================================
- Hits        48380    48241     -139     
+ Misses       4505     4254     -251
 
 Continue to review full report at Codecov. 
 | 
Write tests first
When tested with a variable that has the wrong dtype, this raises an exception instead of False
When supplied a variable with the wrong type get_loc should raise a KeyError (not type error). Otherwise things like checking if a variable is in an index will fail.
| Hello @samuelsinayoko! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found: There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻 Comment last updated at 2019-04-10 07:20:10 UTC | 
This is enough for making the test pass but it's not the right implementation
        
          
                pandas/core/indexes/interval.py
              
                Outdated
          
        
      | try: | ||
| start, stop = self._find_non_overlapping_monotonic_bounds(key) | ||
| except TypeError: | ||
| # get loc should raise KeyError | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Typo: .get_loc().
        
          
                pandas/core/indexes/interval.py
              
                Outdated
          
        
      | # get loc should raise KeyError | ||
| # if key is hashable but | ||
| # of an incorrect type | ||
| raise KeyError | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This needs to be raise from, which is a Python 3 construct. To stay version-agnostic use six.raise_from.
| raise KeyError | |
| import six | |
| ... | |
| try: | |
| start, stop = self._find_non_overlapping_monotonic_bounds(key) | |
| except TypeError as exc: | |
| six.raise_from(KeyError('Key is hashable, but of an incorrect type'), exc) | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is six.raise_from really necessary?  I don't see us doing this anywhere else in the codebase?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we don't use python 3 only constructs yet, the existing is ok.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK, I've left the code as is but have improved the reported message as suggested by @rs2
        
          
                pandas/core/indexes/interval.py
              
                Outdated
          
        
      | ) | ||
| except TypeError: | ||
| # This is probably wrong | ||
| # but not sure what I should do here | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
¯\_(ツ)_/¯
        
          
                pandas/core/indexes/interval.py
              
                Outdated
          
        
      | except TypeError: | ||
| # This is probably wrong | ||
| # but not sure what I should do here | ||
| return np.array([-1]) | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please comment on the choice of -1.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This needs to be the same length as target and of intp dtype: np.repeat(np.intp(-1), len(target))
        
          
                pandas/core/indexes/interval.py
              
                Outdated
          
        
      | except TypeError: | ||
| # This is probably wrong | ||
| # but not sure what I should do here | ||
| return np.array([-1]) | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This needs to be the same length as target and of intp dtype: np.repeat(np.intp(-1), len(target))
        
          
                pandas/core/indexes/interval.py
              
                Outdated
          
        
      | # get loc should raise KeyError | ||
| # if key is hashable but | ||
| # of an incorrect type | ||
| raise KeyError | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is six.raise_from really necessary?  I don't see us doing this anywhere else in the codebase?
| result = index.symmetric_difference(other, sort=sort) | ||
| tm.assert_index_equal(result, expected) | ||
|  | ||
| def test_interval_range_get_indexer_with_different_input_type(self): | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you rename to test_get_indexer_errors and move to around line 618 where the other get_indexer tests are?
|  | ||
| def test_interval_range_get_indexer_with_different_input_type(self): | ||
| # not sure about this one | ||
| index = pd.interval_range(0, 1) | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you parametrize over index and include an non-monotonic/overlapping IntervalIndex, e.g. pd.IntervalIndex.from_tuples([(1, 3), (2, 4), (0, 2)])
| index = pd.interval_range(0, 1) | ||
| # behaviour should be the same as Int64Index and return an | ||
| # array with values of -1 | ||
| assert np.all(index.get_indexer(['gg']) == np.array([-1])) | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
use tm.assert_numpy_array_equal and make sure your expected is 'intp' dtype.
        
          
                pandas/tests/indexing/test_loc.py
              
                Outdated
          
        
      | """ GH25087, test get_loc returns key error for interval indexes""" | ||
| idx = pd.interval_range(0, 1.0) | ||
| with pytest.raises(KeyError): | ||
| idx.get_loc('gg') | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Instead of testing here, can you add this as a test case to test_get_loc_value in pandas/tests/indexes/interval/test_interval.py:
| def test_get_loc_value(self): | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good spot, I must say I wasn't completely clear about the distinction between indexes and indexing with regards to tests.
I've implemented your suggestion in 93f75ea.
| key = self._maybe_cast_slice_bound(key, 'left', None) | ||
|  | ||
| start, stop = self._find_non_overlapping_monotonic_bounds(key) | ||
| try: | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You'll also need to do something similar in the else branch to cover the overlapping/non-monotonic case, e.g. I think something like pd.IntervalIndex.from_tuples([(1, 3), (2, 4), (0, 2)]).get_loc('foo') will still fail.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe it should be the engine that should properly raise a KeyError? (eg the int64 engine does that)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I still need to look into @jorisvandenbossche's comment on raising the error in the engine itself (especially if that's the behaviour for int64), but I think I've addressed everything else.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jorisvandenbossche : There is code in place within the engine that raises a KeyError, but strings queries fail before it gets there since the engine is expecting a scalar_t type (fused type consisting of numeric types) for key:
pandas/pandas/_libs/intervaltree.pxi.in
Line 104 in 2e38d55
| def get_loc(self, scalar_t key): | 
I'm not super well versed in Cython.  Is there a graceful way to force this to raise a KeyError within the Cython code?  Removing the scalar_t type gets a step further but still raises a TypeError as the code expects things to be comparable (probably some perf implications to removing it too).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, the other engines have the key as object typed, and then afterwards do a check of that.
But for me fine as well to leave that for now, and do the check here in the level above that. But on the long term would still be good to make the behaviour consistent throughout the different engines.
| Thanks!  To provide some additional context here: the indexing methods for  That being said, this is still something that'd be viable for a 0.24.x change, since the new specs require breaking changes and would need to go in a major release (aiming for 0.25.0). | 
        
          
                pandas/core/indexes/interval.py
              
                Outdated
          
        
      | # get loc should raise KeyError | ||
| # if key is hashable but | ||
| # of an incorrect type | ||
| raise KeyError | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we don't use python 3 only constructs yet, the existing is ok.
Include some non-monotonic/overlapping IntervalIndex. This triggers another bug, due to the fact that self.get_loc(i) is called on an unexpected key.
| key = self._maybe_cast_slice_bound(key, 'left', None) | ||
|  | ||
| start, stop = self._find_non_overlapping_monotonic_bounds(key) | ||
| try: | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe it should be the engine that should properly raise a KeyError? (eg the int64 engine does that)
Move the test from indexing/test_loc to index/interval/test_interval.
target was missing from call to _find_non_overlapping_monotonic_bounds
| try: | ||
| start, stop = self._find_non_overlapping_monotonic_bounds(key) | ||
| except TypeError: | ||
| # get_loc should raise KeyError | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you add here a comment as Tom proposed (TODO(py3): use raise from.) ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
you can now use raise from (as we are PY3 only)
        
          
                pandas/core/indexes/interval.py
              
                Outdated
          
        
      | start, stop = self._find_non_overlapping_monotonic_bounds(key) | ||
| except TypeError: | ||
| # get_loc should raise KeyError | ||
| raise KeyError('key is hashable but of incorrect type') | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think for key errors we typically just pass the key itself as message, so might be good to be consistent with that. Or at least I would include the key in the message, something like: "Key {0} is of the incorrect type".format(key)
        
          
                pandas/core/indexes/interval.py
              
                Outdated
          
        
      | try: | ||
| return self._engine.get_loc(key) | ||
| except TypeError: | ||
| raise KeyError('No engine for key {!r}'.format(key)) | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No need to mention "engine" here (that is something internal to pandas, while this error message will be visible for users)
        
          
                pandas/core/indexes/interval.py
              
                Outdated
          
        
      | self._find_non_overlapping_monotonic_bounds(target) | ||
| ) | ||
| except TypeError: | ||
| return np.repeat(np.int(-1), len(target)) | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jschendel I am not fully sure this will cover all cases (but what was there before also not).
target can in principle a mixture of valid and invalid keys. So maybe the easiest would be to fall back to the else branch that iterates through the elements separately in case this raises a TypeError.
That could be done by putting a pass here, and putting the return value of three lines below in the else part of a try/except/else.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry getting a bit confused here. You're saying that we should pass on line 833 here and left the code run to the else branch starting on line 851 (non IntervalIndex) where it loops over each element and appends -1 to the list if a KeyError is raised (I would probably add TypeError too).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, that is basically what I meant I think.
So that if you do idx.get_indexer(['a', 1]) (where 1 is a valid key), you get [-0, 1] as result instead of [-1, -1] (assuming that idx.get_loc(1) would return 1)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK I've just pushed a commit implementing this. I had to make a few tweaks to the "non IntervalIndex" else branch (starting line 849 in core/indexes/interval.py) to make my two new tests pass, but hopefully haven't broken anything. See 2c48272
| ]) | ||
| def test_get_indexer_errors(self, index): | ||
| expected = np.array([-1], dtype='intp') | ||
| assert tm.assert_numpy_array_equal(index.get_indexer(['gg']), expected) | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you here also test multiple values and a mixture of values? Like get_indexer(['a', 'b'] and get_indexer([1, 'a'])
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you do this one?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure will look into this over the weekend.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
see 02127ff
| idx = IntervalIndex.from_arrays([0, 2], [1, 3]) | ||
| pytest.raises(KeyError, idx.get_loc, 1.5) | ||
|  | ||
| # GH25087, test get_loc returns key error for interval indexes | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you put this in a new test? (you can leave it in this place, but just put a def test_get_loc_invalid_key(self) above this line)
Reason is that the other test is commented to be replaced, but this new test we want to keep.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you do this one?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Have got a few broken tests to fix. Hoping to make the build green in the coming days.
Instead of returning [-1, -1, -1] when the middle value is incorrect type, return [a, -1, b].
Add mix of invalid and valid values
Fixes test_with_overlaps test
interval.get_indexer() should still raise a TypeError in cases where the types are unorderable. This is needed for DataFrame.append for example, which was breaking tests in test_concat.
| Any tips on making the tests pass on macOS, windows and Linux py27? | 
| @samuelsinayoko do those tests fail for you locally? If not, I can take a look later. | 
| will take another look at this tomorrow… On Thu, 7 Mar 2019 at 18:00, Tom Augspurger ***@***.***> wrote:
 @samuelsinayoko <https://github.com/samuelsinayoko> do those tests fail
 for you locally? If not, I can take a look later.
 —
 You are receiving this because you were mentioned.
 Reply to this email directly, view it on GitHub
 <#25090 (comment)>,
 or mute the thread
 <https://github.com/notifications/unsubscribe-auth/AQuwvnphRf-gB8HLdg_PKKyjG_Yfqc15ks5vUVOkgaJpZM4afkZ_>
 .
 -- 
*Sam Sinayoko* | Data Scientist
*BMLL Technologies Ltd.*
10th Floor – Portland House
Bressenden Place
London
SW1E 5RS
+44(0)20 3828 9020 | 
| @TomAugspurger yes the tests pass locally on my linux machine. Not sure what's happening on Windows and OSX, so would be grateful for any help on this. | 
| @samuelsinayoko can you merge master? Recently dropped Py2 support so should make things easier | 
| Ok will do | 
| @samuelsinayoko can you merge mater and update | 
| closing as stale, if you'd like to continue, pls ping. | 
git diff upstream/master -u -- "*.py" | flake8 --diff