pd.Series.loc.__getitem__ promotes to float64 instead of raising KeyError #25927

Closed
allComputableThings opened this issue Mar 29, 2019 · 16 comments · Fixed by #37687
Labels
Bug Indexing Related to indexing on series/frames, not to indexes themselves
Milestone

Comments

@allComputableThings

allComputableThings commented Mar 29, 2019

Code Sample, a copy-pastable example if possible

For:


result = a2b.loc[vals]       # pd.Series()[ np.array ]

If a2b is a series that maps {int64:int64} and vals is an int64 array, the result should be a series that maps {int64:int64}, or a KeyError should be thrown

Pasteable repro:

       import pandas as pd
       import numpy as np
       a2b = pd.Series(
           index = np.array([ 9724501000001103,  9724701000001109,  9725101000001107,
                     9725301000001109,  9725601000001103,  9725801000001104,
                     9730701000001104, 10049011000001109, 10328511000001105]),
           data = np.array([999000011000001104, 999000011000001104, 999000011000001104,
                        999000011000001104, 999000011000001104, 999000011000001104,
                        999000011000001104, 999000011000001104, 999000011000001104])
       )
       assert a2b.dtype==np.int64
       assert a2b.index.dtype==np.int64
       key = np.array([ 9724501000001103,  9724701000001109,  9725101000001107,
                             9725301000001109,  9725601000001103,  9725801000001104,
                             9730701000001104,
                             10047311000001102, # Missing in a2b.index
                             10049011000001109,
                             10328511000001105])
       result = a2b.loc[key]
       result
       assert result.dtype==np.int64
       assert result.index.dtype==np.int64

What happens:

In [2]:         import pandas as pd
   ...:         import numpy as np
   ...:         a2b = pd.Series(
   ...:             index = np.array([ 9724501000001103,  9724701000001109,  9725101000001107,
   ...:                       9725301000001109,  9725601000001103,  9725801000001104,
   ...:                       9730701000001104, 10049011000001109, 10328511000001105]),
   ...:             data = np.array([999000011000001104, 999000011000001104, 999000011000001104,
   ...:                          999000011000001104, 999000011000001104, 999000011000001104,
   ...:                          999000011000001104, 999000011000001104, 999000011000001104])
   ...:         )
   ...:         assert a2b.dtype==np.int64
   ...:         assert a2b.index.dtype==np.int64
   ...:         key = np.array([ 9724501000001103,  9724701000001109,  9725101000001107,
   ...:                               9725301000001109,  9725601000001103,  9725801000001104,
   ...:                               9730701000001104,
   ...:                               10047311000001102, # Missing in a2b.index
   ...:                               10049011000001109,
   ...:                               10328511000001105])
   ...:         result = a2b.loc[key]
   ...:         result
   ...: 
Out[2]: 
 9.990000e+17   NaN
 9.990000e+17   NaN
 9.990000e+17   NaN
 9.990000e+17   NaN
 9.990000e+17   NaN
 9.990000e+17   NaN
 9.990000e+17   NaN
NaN             NaN
 9.990000e+17   NaN
 9.990000e+17   NaN
dtype: float64

In [3]:         assert result.dtype==np.int64
   ...:         assert result.index.dtype==np.int64
   ...: 
---------------------------------------------------------------------------
AssertionError                            Traceback (most recent call last)
<ipython-input-3-be86ec17a393> in <module>()
----> 1 assert result.dtype==np.int64
      2 assert result.index.dtype==np.int64

AssertionError: 

Problem description

I don't like this behavior because:

  • I have quietly lost all my data due to cast to float64
  • in other calls to getitem a KeyError is raised if a value is not found in the index.
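The data loss mentioned in the first point can be checked directly. The following is a minimal sketch, using one value and one label from the report above, that round-trips each through float64 (the cast that NaN forces) and compares the result with the original integer. The variable names are illustrative only.

```python
import numpy as np

# One data value and one index label taken from the report above.
value = np.int64(999000011000001104)
label = np.int64(9724501000001103)

# Round-trip through float64, which is what the NaN-driven upcast does.
value_back = np.int64(np.float64(value))
label_back = np.int64(np.float64(label))

# Both integers are larger than 2**53, so float64 cannot hold them exactly.
print(value_back == value)  # False
print(label_back == label)  # False

# A code within the int32 range survives the same round trip.
small = np.int64(2**31 - 1)
small_back = np.int64(np.float64(small))
print(small_back == small)  # True
```

float64 has a 53-bit significand, so integers above 2**53 are only representable if they happen to fall on the float grid. The identifiers in this report do not.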

Expected Output

Asserts should not fail.

Output of pd.show_versions()

In [4]: pd.show_versions()

INSTALLED VERSIONS

commit: None
python: 2.7.15.candidate.1
python-bits: 64
OS: Linux
OS-release: 4.15.0-46-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: None.None

pandas: 0.22.0
pytest: None
pip: 18.1
setuptools: 40.6.2
Cython: 0.29.1
numpy: 1.16.1
scipy: 1.2.0
pyarrow: None
xarray: None
IPython: 5.0.0
sphinx: None
patsy: 0.5.1
dateutil: 2.6.0
pytz: 2016.10
blosc: None
bottleneck: None
tables: None
numexpr: 2.6.8
feather: None
matplotlib: 2.1.0
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: 4.6.0
html5lib: 0.9999999
sqlalchemy: 1.2.17
pymysql: None
psycopg2: 2.7.7 (dt dec pq3 ext lo64)
jinja2: 2.10
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None

@allComputableThings
Author

allComputableThings commented Mar 29, 2019

Same in: 0.23.4, 0.24.2

@jreback
Contributor

jreback commented Mar 30, 2019

use .iloc, as that is what is designed for selecting by position, as the docs indicate: http://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#selection-by-position

getitem is falling back here, as described: http://pandas.pydata.org/pandas-docs/stable/user_guide/advanced.html#miscellaneous-indexing-faq

this is expected behavior and is not likely to change

@jreback jreback closed this as completed Mar 30, 2019
@allComputableThings
Author

allComputableThings commented Mar 30, 2019

My series has integer labels, not billions of rows.
.loc would be correct (not .iloc).
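To make the distinction concrete, here is a small sketch (with made-up data, not the series from the report) of how .loc and .iloc differ on a Series whose labels are integers:

```python
import pandas as pd

# Integer labels that deliberately do not match the positions.
s = pd.Series([10, 20, 30], index=[2, 0, 1])

by_label = s.loc[0]      # row whose label is 0
by_position = s.iloc[0]  # first row, whatever its label

print(by_label)     # 20
print(by_position)  # 10
```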

It fails for both .loc and __getitem__.

This is a bug. Please reopen.

@allComputableThings allComputableThings changed the title from "pd.Series.__getitem__ promotes to float64 instead of raising KeyError" to "pd.Series.loc.__getitem__ promotes to float64 instead of raising KeyError" Mar 30, 2019
@allComputableThings
Author

I have updated the title and sample to make it clear.

.loc is the correct indexer (integer label, and integer index).

Please reopen.

@jreback
Contributor

jreback commented Mar 30, 2019

this is on master

In [7]: pd.__version__
Out[7]: '0.25.0.dev0+337.g1d4c89f4d'

In [6]: a2b.loc[key]
/Users/jreback/miniconda3/envs/pandas/bin/ipython:1: FutureWarning: 
Passing list-likes to .loc or [] with any missing label will raise
KeyError in the future, you can use .reindex() as an alternative.

See the documentation here:
https://pandas.pydata.org/pandas-docs/stable/indexing.html#deprecate-loc-reindex-listlike
  #!/Users/jreback/miniconda3/envs/pandas/bin/python
Out[6]: 
9724501000001103     9.990000e+17
9724701000001109     9.990000e+17
9725101000001107     9.990000e+17
9725301000001109     9.990000e+17
9725601000001103     9.990000e+17
9725801000001104     9.990000e+17
9730701000001104     9.990000e+17
10047311000001102             NaN
10049011000001109    9.990000e+17
10328511000001105    9.990000e+17
dtype: float64
In [14]: a2b.index.isin(key)
Out[14]: array([ True,  True,  True,  True,  True,  True,  True,  True,  True])

prob a bug somewhere, would need investigation by the community

@jreback jreback reopened this Mar 30, 2019
@jreback jreback added Bug Indexing Related to indexing on series/frames, not to indexes themselves labels Mar 30, 2019
@jreback jreback added this to the Contributions Welcome milestone Mar 30, 2019
@cbertinato
Contributor

cbertinato commented Apr 3, 2019

This doesn't really look like a bug. Once a KeyError is thrown, you'll never get this far anyway. But suppose you do, and the expected behavior is a NaN for the extra index label. The Series is dtype int64 and NaN is not compatible with that dtype, so the values get cast to float64. If you want to avoid the cast, use the nullable Int64 dtype instead.
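As a rough illustration of the Int64 suggestion, the sketch below converts a two-row stand-in for a2b to the nullable Int64 dtype and then reindexes it with one missing label. It assumes a pandas version that ships the nullable integer dtype; the names codes and wanted are invented for the example.

```python
import numpy as np
import pandas as pd

# Two-row stand-in for a2b, converted to the nullable integer dtype.
codes = pd.Series(
    np.array([999000011000001104, 999000011000001104], dtype=np.int64),
    index=np.array([9724501000001103, 9724701000001109], dtype=np.int64),
).astype("Int64")

# The second label is the one that is missing from the index in the report.
wanted = np.array([9724501000001103, 10047311000001102], dtype=np.int64)

# reindex fills the missing label with <NA> and keeps the integer dtype.
result = codes.reindex(wanted)
print(result.dtype)
print(result)
```

Because the missing entry is stored in a separate mask rather than as NaN, the values that are present are never converted to float.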

@jreback
Contributor

jreback commented Apr 3, 2019

no, this is an issue I think; see how the value is in the index; but there is a disconnect somewhere after the get_indexer call (way before the erroneous KeyError)

@cbertinato
Contributor

Was there an erroneous KeyError? The key 10047311000001102 isn't in a2b, so the warning given for a2b.loc[key] seems appropriate.
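One way to confirm which labels are absent is to run isin in the other direction, from the key to the index. The sketch below uses a shortened index and key (three of the labels from the report) purely for illustration:

```python
import numpy as np
import pandas as pd

# Shortened versions of the index and key from the report.
index = pd.Index(np.array([9724501000001103, 10049011000001109], dtype=np.int64))
key = np.array(
    [9724501000001103, 10047311000001102, 10049011000001109], dtype=np.int64
)

# Every stored label is requested, so this direction is all True.
index_in_key = index.isin(key)

# The reverse direction shows which requested labels actually exist.
key_in_index = pd.Index(key).isin(index)
missing = key[~key_in_index]

print(index_in_key)  # [ True  True]
print(key_in_index)  # [ True False  True]
print(missing)       # [10047311000001102]
```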

@allComputableThings
Author

allComputableThings commented Apr 3, 2019 via email

@cbertinato
Contributor

I see the KeyError with __getitem__, but the message given when trying a2b.loc[key] indicates that, while no error is thrown now, it will be in the future. It seems to me that, while the current behavior is not ideal, it is expected.

But I'm still trying to sleuth out whether there's another issue here. @jreback do you mean that because one of the labels in key is not in a2b, then that label shouldn't show up in the index of a2b.loc[key]?

@allComputableThings
Author

@cbertinato My apologies - I missed that a warning was being printed, and I had to do some digging to find the case where KeyError was thrown:

# Scalar key
for k in key:
    print a2b.loc[k]   # KeyError: 10047311000001102

# List like key
for k in key:
    print a2b.loc[[k]]  # KeyError: u'None of [[10047311000001102]] are in the [index]'

k = 10047311000001102  # Not in a2b.index
print a2b.loc[np.array([k])]  # KeyError: u'None of [[10047311000001102]] are in the [index]'
print a2b.loc[np.array([k,k])]  # KeyError: u'None of [[10047311000001102]] are in the [index]'

print a2b.loc[np.array([k,k, 9724501000001103])] # result.values is promoted to float
print a2b.loc[np.array([9724501000001103, k,k, ])] # result.values is promoted to float
print a2b.loc[key]  # result.values is promoted to float

So it seems the behavior is: a KeyError is thrown only if NONE of the query labels are in the series's index. Otherwise, the values are promoted to float. Is this expected? Not sure. But it sure is strange behavior:

  1. It leaves the caller unsure whether to expect: a KeyError, or result.values to be a float column.

  2. Automatic promotion to float is a special kind of footgun when the promoted ints were category labels. They are rarely compatible with floats (particularly for >int32 ints), and the conversion should strictly be treated as data corruption. In my example they are medical codes -- suddenly I found fewer people with heart disease than expected -- the disease category codes in the int32 range worked fine.

Automatic value type promotion in select-like operations is also inconsistent with a Series being a 'datacolumn' that is 'kinda' like an SQL column + index. SQL doesn't do it (precisely because it's not safe and leads to data corruption).
The result could return the non-NULL rows (as SQL does), though this may silently break things for people.
KeyError seems preferable since there's no good way to know what to place in the result's values for missing labels (.reindex allows a default parameter, which is why it's preferred).
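For reference, a minimal sketch of the .reindex route with an explicit default is shown below. The sentinel -1 is an assumption chosen for the example, not something proposed in this thread, and the series is a two-row stand-in for a2b.

```python
import numpy as np
import pandas as pd

# Two-row stand-in for a2b.
a2b_small = pd.Series(
    np.array([999000011000001104, 999000011000001104], dtype=np.int64),
    index=np.array([9724501000001103, 9724701000001109], dtype=np.int64),
)

# Second label is absent from the index.
wanted = np.array([9724501000001103, 10047311000001102], dtype=np.int64)

# An integer fill_value avoids NaN, so the values stay int64.
filled = a2b_small.reindex(wanted, fill_value=-1)
print(filled.dtype)
print(filled)
```

The caller has to pick a sentinel that cannot collide with a real code, which is the trade-off compared with raising.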

Is there any way to force a KeyError on missing labels for .loc? (since warnings easily slip through CI testing).
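Until .loc raises on its own, one possible workaround is a small wrapper that checks the labels first. The helper name loc_strict is hypothetical and not part of pandas; this is only a sketch of the idea.

```python
import numpy as np
import pandas as pd


def loc_strict(series, labels):
    """Label-based selection that raises if any label is missing.

    Hypothetical helper, not a pandas API.
    """
    labels = pd.Index(labels)
    missing = labels[~labels.isin(series.index)]
    if len(missing) > 0:
        raise KeyError("labels not in index: {}".format(list(missing)))
    return series.loc[labels]


# Two-row stand-in for a2b.
a2b_small = pd.Series(
    np.array([999000011000001104, 999000011000001104], dtype=np.int64),
    index=np.array([9724501000001103, 9724701000001109], dtype=np.int64),
)

# All labels present: the result keeps its int64 dtype.
print(loc_strict(a2b_small, [9724501000001103]).dtype)

# One label missing: the wrapper raises instead of promoting to float.
try:
    loc_strict(a2b_small, [9724501000001103, 10047311000001102])
except KeyError as exc:
    print("KeyError:", exc)
```

Another option, if the test suite can be configured, is to turn the FutureWarning into an error with the standard warnings filters so that it cannot slip through CI.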

@cbertinato
Contributor

cbertinato commented Apr 3, 2019

Is there any way to force a KeyError on missing labels for .loc? (since warnings easily slip through CI testing).

For the cases where the result.values is promoted to float, then I think the answer is "yes", eventually a KeyError will be thrown. It's just not implemented, but as the warning states, it will be in the future.

https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#indexing-with-list-with-missing-labels-is-deprecated

Not sure when exactly this will turn from a warning into throwing a KeyError.

@allComputableThings
Author

Related? (seems similar but different #22252 )

@cbertinato
Contributor

It is similar. In both cases, the promotion to float is expected, but the message in 1 of that issue does seem misplaced.

@phofl
Member

phofl commented Nov 7, 2020

A KeyError is thrown now. I think we can close this issue?

@jreback
Contributor

jreback commented Nov 7, 2020

can u see if we have a test for this; ok to add this one in a similar place
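A regression test along these lines might look like the sketch below. It is not the test that was actually added to pandas; the function name is made up, and it assumes pandas 1.0 or later, where list-like keys with a missing label raise.

```python
import numpy as np
import pandas as pd


def test_loc_listlike_with_missing_label_raises():
    # Shortened version of the series from the original report.
    ser = pd.Series(
        np.array([999000011000001104, 999000011000001104], dtype=np.int64),
        index=np.array([9724501000001103, 10049011000001109], dtype=np.int64),
    )
    # The middle label is not in ser.index.
    key = np.array(
        [9724501000001103, 10047311000001102, 10049011000001109],
        dtype=np.int64,
    )

    try:
        ser.loc[key]
    except KeyError:
        raised = True
    else:
        raised = False

    # Assumption: pandas >= 1.0 raises instead of returning float64.
    assert raised
```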

@jreback jreback modified the milestones: Contributions Welcome, 1.2 Nov 7, 2020