REGR: Setting values with 'loc' and boolean mask mixes up values (all-True mask) #48701

Closed
sohailc opened this issue Sep 22, 2022 · 3 comments · Fixed by #48711
Labels: Indexing (related to indexing on series/frames, not to indexes themselves), Regression (functionality that used to work in a prior pandas version)

@sohailc

sohailc commented Sep 22, 2022

According to the documentation, loc accepts Boolean arrays. However, the following assignment does not leave the data unchanged as expected:

import pandas as pd

n = 17

data = pd.DataFrame({
    "name": n * ["a"], 
    "x": range(n), 
    "y": range(n)
})

copy = data.copy()

idx = n * [True]
data.loc[idx, ["x", "y"]] = data[["x", "y"]]
assert data.equals(copy)   # Raises assertion error

The weird thing is that if n is smaller then the code works. This was tested on pandas 1.5.0.

@martinfleis
Contributor

I believe this is a regression in pandas 1.5.0, as it works correctly on 1.4.4. We ran into the same bug downstream yesterday: geopandas/geopandas#2558.

jorisvandenbossche added this to the 1.5.1 milestone Sep 22, 2022
jorisvandenbossche added the Indexing and Regression labels Sep 22, 2022
jorisvandenbossche changed the title from "Setting values with 'loc' and Boolean arrays: possible bug" to "REGR: Setting values with 'loc' and boolean mask mixes up values (all-True mask)" Sep 22, 2022
@jorisvandenbossche
Member

The reproducer from that issue:

In [1]: df = pd.DataFrame({'idx': range(20), 'col': np.arange(20, dtype="float")})

In [2]: df.loc[np.ones(len(df), dtype=bool), 'col'] = df.col.values

In [3]: df
Out[3]: 
    idx   col
0     0   0.0
1     1  17.0
2     2  16.0
3     3  15.0
4     4  14.0
5     5  13.0
6     6  12.0
7     7  11.0
8     8  10.0
9     9   9.0
10   10   8.0
11   11   7.0
12   12   6.0
13   13   5.0
14   14   4.0
15   15   3.0
16   16   2.0
17   17   1.0
18   18  18.0
19   19  19.0

> The weird thing is that if n is smaller then the code works.

It indeed needs a minimum number of values: np.argsort's default sort is not stable, and it only starts to return a non-identity order for tied values once there are enough True values:

In [4]: np.argsort([True]*10)
Out[4]: array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])

In [5]: np.argsort([True]*17)
Out[5]: array([ 0, 14, 13, 12, 11, 10,  9, 15,  8,  6,  5,  4,  3,  2,  1,  7, 16])
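
To illustrate the point about ties, here is a small sketch using NumPy only. An all-True mask contains nothing but equal keys, so any permutation is a valid result for the default `kind="quicksort"`, which NumPy documents as not stable. The sketch shows two ways to get a predictable order instead: asking for a stable sort, or converting the mask to integer positions first. The variable names are illustrative and are not taken from the pandas code.

```python
import numpy as np

# An all-True mask has only tied keys, so the default (unstable) argsort
# is free to return them in any order.
mask = np.ones(17, dtype=bool)

# kind="stable" guarantees that tied elements keep their original order.
stable_order = np.argsort(mask, kind="stable")

# Integer positions of the True entries, i.e. what an integer indexer
# equivalent to this mask would contain.
positions = np.flatnonzero(mask)

# Sorting distinct, already-ordered integers cannot reorder anything.
order_from_positions = np.argsort(positions)

print(stable_order)
print(order_from_positions)
```

Both printed arrays are the identity permutation, regardless of the mask length.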

This argsort is used here:

value = value[np.argsort(pi)]

That code assumes that pi is already an integer indexer, not a boolean one. Previously, a boolean indexer was converted to integers in _LocIndexer._convert_to_indexer, but that conversion was removed in #45501 (location in diff).
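
As a rough sketch of why the integer assumption matters (this is not the pandas implementation, nor necessarily how the fix works), the hypothetical helper `reorder_values_for_indexer` below mimics the `value[np.argsort(pi)]` step and converts a boolean mask to integer positions first. With integer positions, `argsort` yields the permutation that puts the values into row order; with a raw boolean mask it would only be sorting ties.

```python
import numpy as np


def reorder_values_for_indexer(values, indexer):
    """Hypothetical helper: return `values` rearranged into row order.

    `indexer[i]` is taken to be the row that should receive `values[i]`.
    A boolean mask is converted to integer positions first (an assumption
    of this sketch, mirroring the conversion described above).
    """
    indexer = np.asarray(indexer)
    if indexer.dtype == bool:
        # Boolean mask -> positions of the True entries.
        indexer = np.flatnonzero(indexer)
    return values[np.argsort(indexer)]


values = np.arange(20, dtype="float")
all_true = np.ones(len(values), dtype=bool)

# With the conversion, an all-True mask leaves the values untouched.
unchanged = reorder_values_for_indexer(values, all_true)

# With a genuine integer indexer, the values are permuted into row order:
# row 2 gets 10.0, row 0 gets 20.0, row 1 gets 30.0.
permuted = reorder_values_for_indexer(np.array([10.0, 20.0, 30.0]), [2, 0, 1])
```

Here `unchanged` equals the original `values`, and `permuted` is `[20.0, 30.0, 10.0]`.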

I haven't verified to be 100% sure, but so given the code path and the change in #45501, I assume that's the cause.

@MarcoGorelli
Member

> I haven't verified to be 100% sure, but so given the code path and the change in #45501, I assume that's the cause.

yup, git bisect confirms
