- 
          
- 
                Notifications
    You must be signed in to change notification settings 
- Fork 19.2k
BUG-26988 implement replace for categorical blocks #27026
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
| Codecov Report
 @@            Coverage Diff             @@
##           master   #27026      +/-   ##
==========================================
- Coverage   91.99%   91.98%   -0.02%     
==========================================
  Files         180      180              
  Lines       50774    50790      +16     
==========================================
+ Hits        46712    46719       +7     
- Misses       4062     4071       +9
 
 Continue to review full report at Codecov. 
 | 
| Codecov Report
 @@            Coverage Diff             @@
##           master   #27026      +/-   ##
==========================================
- Coverage   93.07%   91.86%   -1.22%     
==========================================
  Files         192      180      -12     
  Lines       49562    50833    +1271     
==========================================
+ Hits        46130    46697     +567     
- Misses       3432     4136     +704
 
 Continue to review full report at Codecov. 
 | 
        
          
                doc/source/whatsnew/v0.25.0.rst
              
                Outdated
          
        
      |  | ||
| - Bug in :func:`DataFrame.at` and :func:`Series.at` that would raise exception if the index was a :class:`CategoricalIndex` (:issue:`20629`) | ||
| - Fixed bug in comparison of ordered :class:`Categorical` that contained missing values with a scalar which sometimes incorrectly resulted in ``True`` (:issue:`26504`) | ||
| - Bug in :func:`DataFrame.replace` and :func:`Series.replace` that would give unusual results on categorical data (:issue:`26988`) | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"unusual" --> "incorrect"
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fixed
        
          
                pandas/core/internals/blocks.py
              
                Outdated
          
        
      |  | ||
| def replace(self, to_replace, value, inplace=False, filter=None, | ||
| regex=False, convert=True): | ||
| if filter is None or not self.mgr_locs.isin(filter).any(): | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this block looks unrelated to the linked issue. what case is it solving? needs a separate test?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The first case is used when the replace function applies to the entire block, so the replace can be carried out just by manipulating categories. Most calls to Series.replace should be covered with this case (including the new test in series/test_replace.py). If there's a filter, this might not be possible so we fix the categories then use the original implementation (second case)
I added a test in frame/test_replace.py to cover the second case, where a filter is used. It improves upon the previous behaviour, which would have casted the frame to object
| Failure unrelated #26842 | 
        
          
                pandas/tests/frame/test_replace.py
              
                Outdated
          
        
      | # GH 26988 | ||
| df = DataFrame([[1, 1], [2, 2]], columns=['a', 'b'], dtype='category') | ||
| expected = DataFrame(final_data, columns=['a', 'b'], dtype='category') | ||
| expected['a'].cat.set_categories([1, 2, 3], inplace=True) | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
try not to use inplace in tests, its non-idiomatic.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed the merge conflict. LGTM.
| Are there tests for the non-None  | 
| @jbrockmendel yes, the dataframe tests take that branch while the series tests do not | 
| 
 If it goes on Categorical, it should be generalized to EA. | 
| 
 probably. But I think its much better then jamming things in the blocks (now we already have replace on the blocks, this PR might be ok actually), but I don't fully remember. | 
| I don't think we (or downstream EA authors) can reliably implement ExtensionArray.replace since it's such a complicated method. We're also relying on Block.replace to do the heavy lifting. An ExtensionArray.replace wouldn't have that luxury. | 
| can you merge master | 
| will take a look this afternoon | 
        
          
                doc/source/whatsnew/v1.0.0.rst
              
                Outdated
          
        
      | ^^^^^^^^^^^ | ||
|  | ||
| - Added test to assert the :func:`fillna` raises the correct ValueError message when the value isn't a value from categories (:issue:`13628`) | ||
| - Bug in :func:`DataFrame.replace` and :func:`Series.replace` that would give incorrect results on categorical data (:issue:`26988`) | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"func" -> "meth"
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fixed
        
          
                pandas/core/arrays/categorical.py
              
                Outdated
          
        
      | cat.remove_categories(to_replace, inplace=True) | ||
| else: | ||
| index = categories.index(to_replace) | ||
| categories[index] = value | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
what if value is already in categories?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fixed with new case
| Only major question is whether we need to add  | 
        
          
                pandas/core/arrays/categorical.py
              
                Outdated
          
        
      | inplace = validate_bool_kwarg(inplace, "inplace") | ||
| cat = self if inplace else self.copy() | ||
| categories = cat.categories.tolist() | ||
| if to_replace in categories: | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
what happens if to_replace is not hashable?  I think at the Categorical level we are OK raising, but at the Block level we would cast to object right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would casting to object help?
>>> import pandas as pd
>>> s=pd.Series([1,2]).astype(object)
>>> s.replace(1,[1,2,3])
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "pandas/core/series.py", line 4303, in replace
    method=method,
  File "pandas/core/generic.py", line 6819, in replace
    raise TypeError(msg)  # pragma: no cover
TypeError: Invalid "to_replace" type: 'int'
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jreback I've merged master and fixed all comments except this one for the reason above
| this looked pretty good, can you merge master and address any remaining comments. | 
| can you merge master | 
| @JustinZhengBC can you merge master. itd be nice to get this in | 
| @jbrockmendel done | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
question about an error condition
| code_values = code_values[null_mask | (code_values >= 0)] | ||
| return algorithms.isin(self.codes, code_values) | ||
|  | ||
| def replace(self, to_replace, value, inplace: bool = False): | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you type these more specifically (can be a followon PR)
| @jreback how should parameters like  | 
| 
 right, but they are something like Union[Scalar, AnyArrayLike] I think? (again this is likely tricky so should do as a followup) | 
| Fixed the merge conflict. I think we decided to leave typing as a followup, so this should be good to go? | 
| result.values.add_categories(value, inplace=True) | ||
| return super(CategoricalBlock, result).replace( | ||
| to_replace, value, inplace, filter, regex, convert | ||
| ) | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can any of the logic in this method generalize to ExtensionBlock?  (presumably we'd need to add replace to the EA interface?)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is a good point, but we can certainly do as a followup; can you open an issue for discussion.
| @JustinZhengBC if you can address @jbrockmendel comments to add some comments this lgtm and we can merge. (alternatvely @jbrockmendel could do as a followup) | 
| @jreback @jbrockmendel I added the comment | 
| Seems like this is good to merge. | 
| thanks for the patch @JustinZhengBC very nice! | 
git diff upstream/master -u -- "*.py" | flake8 --diffThe replace functions used in the code path that caused the bug report appeared to densify the series. This PR adds a replace function to
CategoricalBlockthat takes advantage of the categorical data type when possible