Skip to content

_ensure_type should use issubclass #32416

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 2 commits into from

Conversation

ericchansen
Copy link

Commit 6fd326d in pull request #30613 added _ensure_type, which utilizes isinstance. However, it is reasonable to assume that someone may want to create a DataFrame subclass. Therefore, _ensure_type should use issubclass.

whatsnew entry isn't necessary?

…1925)

Commit pandas-dev/pandas@6fd326d in
pull request pandas-dev#30613 added `_ensure_type`, which
utilizes `isinstance`. However, it is reasonable to assume that someone
may want to create a DataFrame subclass. Therefore, `_ensure_type`
should use `issubclass`.
@TomAugspurger
Copy link
Contributor

I don't completely follow the changes here. isinstance typically considers subclasses to be instances of the parent's type.

In [1]: class A:
   ...:     pass
   ...:

In [2]: class B(A):
   ...:     pass
   ...:

In [3]: isinstance(B(), A)
Out[3]: True

but not the other way around.

Why do we have this check in the first place? If it's just for static typing, then I'd prefer to remove the assert and handle it some other way. cc @topper-123 @simonjayhawkins

@simonjayhawkins
Copy link
Member

Why do we have this check in the first place?

to avoid adopting Literal and overloads until the inplace parameter of Series/DataFrame methods is deprecated or repetitively adding asserts following calls with inplace=False so that the return type is either DataFrame or Series and not None.

If it's just for static typing, then I'd prefer to remove the assert and handle it some other way.

agreed

@ericchansen
Copy link
Author

Going from pandas 0.25.* to >= 1.0 breaks any libraries that create subclasses of DataFrame. For those that know the pandas code base better than I do, I'm sure there are other, probably better ways to handle this. However, this PR will quickly resolve any breaking issues without breaking anything else.

@simonjayhawkins
Copy link
Member

However, it is reasonable to assume that someone may want to create a DataFrame subclass.

Agreed.

if the type annotations, in places, do not use typevars with generic self then this is another issue (but should be fixed).

i.e.
def reindex(self, *args, **kwargs) -> "DataFrame":
should be
def reindex(self: FrameOrSeries, *args, **kwargs) -> FrameOrSeries:

can you give an example of where you have errors

@WillAyd
Copy link
Member

WillAyd commented Mar 3, 2020

I don't totally follow the issue or the fix; can someone provide a full traceback of what the actual issue is? I didn't see that in the OP

FWIW the tests are subclassing non-public classes, and the OP doesn't correctly subclass a DataFrame per the guide:

https://pandas.pydata.org/pandas-docs/stable/development/extending.html#extending-subclassing-pandas

Wondering if the fact that the subclasses don't define _constructor is the actual root cause

@simonjayhawkins
Copy link
Member

def reindex(self: FrameOrSeries, *args, **kwargs) -> FrameOrSeries:

this approach always assumes _constructor is type(self)

@@ -93,7 +93,7 @@ def _ensure_type(self: T, obj) -> T:

Used by type checkers.
"""
assert isinstance(obj, type(self)), type(obj)
assert issubclass(type(obj), type(self)), type(obj)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The purpose of ensure_type is to ensure that we have a type and not None for methods with an inplace parameter where the return type is Optional. does this work?

Suggested change
assert issubclass(type(obj), type(self)), type(obj)
assert obj is not None

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

will probably also need to change
def _ensure_type(self: T, obj) -> T: to def _ensure_type(self, obj: Optional[T]) -> T:

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If @simonjayhawkins 's suggestion works, I think that would be the mest fool-proof, though I'm not sure I understand the issue.

Another thing we probably should consider is never return multiple types, so e.g. no Optional[FrameOrSeries] until we get to python 3.8 as the lowest python version. If we did that _ensure_type could be pulled from the code base.

@WillAyd
Copy link
Member

WillAyd commented Mar 3, 2020

So just to confirm, if the OP is modified to provide a _constructor then everything seems OK:

import pandas as pd
import pandas.testing as pdt

class MyDataFrame(pd.DataFrame):
    @property
    def _constructor(self):
        return MyDataFrame

mdf = MyDataFrame()
df = pd.DataFrame()
mdf.reindex_like(df)
pdt.assert_frame_equal(mdf, df, check_like=True)

Without the assert wouldn't we be returning a DataFrame instead of the subclass? I think this is just a red herring

@ericchansen
Copy link
Author

@WillAyd Yes, the library doesn't override _constructor. Here's a minimal working example.

from pandas import DataFrame

data = {
    "Column 1": [1, 2, 3, 4],
    "Column 2": [10, 11, 12, 13],
    "Column 3": [100, 101, 102, 103]
}

dataframe = DataFrame(data=data)
dataframe.drop(1, inplace=True)

class DataFrameSubclass(DataFrame):
    def useful_new_function(self):
        pass

dataframe_subclass = DataFrameSubclass(data=data)
dataframe_subclass.drop(1, inplace=True)

Here's the traceback.

Traceback (most recent call last):
  File "C:\Users\eric.a.hansen\AppData\Local\Continuum\anaconda3\envs\isclone\lib\runpy.py", line 193, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "C:\Users\eric.a.hansen\AppData\Local\Continuum\anaconda3\envs\isclone\lib\runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "c:\Users\eric.a.hansen\.vscode-insiders\extensions\ms-python.python-2020.2.64397\pythonFiles\lib\python\new_ptvsd\no_wheels\ptvsd\__main__.py", line 45, in <module>
    cli.main()
  File "c:\Users\eric.a.hansen\.vscode-insiders\extensions\ms-python.python-2020.2.64397\pythonFiles\lib\python\new_ptvsd\no_wheels\ptvsd/..\ptvsd\server\cli.py", line 361, in main
    run()
  File "c:\Users\eric.a.hansen\.vscode-insiders\extensions\ms-python.python-2020.2.64397\pythonFiles\lib\python\new_ptvsd\no_wheels\ptvsd/..\ptvsd\server\cli.py", line 203, in run_file
    runpy.run_path(options.target, run_name="__main__")
  File "C:\Users\eric.a.hansen\AppData\Local\Continuum\anaconda3\envs\isclone\lib\runpy.py", line 263, in run_path
    return _run_module_code(code, init_globals, run_name,
  File "C:\Users\eric.a.hansen\AppData\Local\Continuum\anaconda3\envs\isclone\lib\runpy.py", line 96, in _run_module_code
    _run_code(code, mod_globals, init_globals,
  File "C:\Users\eric.a.hansen\AppData\Local\Continuum\anaconda3\envs\isclone\lib\runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "c:\Users\eric.a.hansen\source\repos\pandasexample\example.py", line 17, in <module>
    dataframe_subclass.drop(1, inplace=True)
  File "C:\Users\eric.a.hansen\AppData\Local\Continuum\anaconda3\envs\isclone\lib\site-packages\pandas\core\frame.py", line 3990, in drop
    return super().drop(
  File "C:\Users\eric.a.hansen\AppData\Local\Continuum\anaconda3\envs\isclone\lib\site-packages\pandas\core\generic.py", line 3936, in 
drop
    obj = obj._drop_axis(labels, axis, level=level, errors=errors)
  File "C:\Users\eric.a.hansen\AppData\Local\Continuum\anaconda3\envs\isclone\lib\site-packages\pandas\core\generic.py", line 3971, in 
_drop_axis
    result = self.reindex(**{axis_name: new_axis})
  File "C:\Users\eric.a.hansen\AppData\Local\Continuum\anaconda3\envs\isclone\lib\site-packages\pandas\util\_decorators.py", line 227, 
in wrapper
    return func(*args, **kwargs)
  File "C:\Users\eric.a.hansen\AppData\Local\Continuum\anaconda3\envs\isclone\lib\site-packages\pandas\core\frame.py", line 3856, in reindex
    return self._ensure_type(super().reindex(**kwargs))
  File "C:\Users\eric.a.hansen\AppData\Local\Continuum\anaconda3\envs\isclone\lib\site-packages\pandas\core\base.py", line 93, in _ensure_type
    assert isinstance(obj, type(self)), type(obj)
AssertionError: <class 'pandas.core.frame.DataFrame'>

@simonjayhawkins
Copy link
Member

from that output, it looks like the return type of DataFrame.reindex should use a typevar to allow subclassing #32416 (comment)

@simonjayhawkins
Copy link
Member

simonjayhawkins commented Mar 3, 2020

from that output, it looks like the return type of DataFrame.reindex should use a typevar to allow subclassing #32416 (comment)

changing that, and running mypy gives pandas\core\frame.py:3633: error: Argument 2 for "super" not an instance of argument 1

none the wiser atm and can't seem to find where the DataFrame object is created instead of the subclass

the AssertionError implies that super().reindex(**kwargs) is a DataFrame not the subclass

@jbrockmendel jbrockmendel mentioned this pull request Mar 4, 2020
@WillAyd
Copy link
Member

WillAyd commented Mar 6, 2020

I don't think this fix actually works, but probably not ideal anyway.

Maybe we should just get rid of _ensure_type altogether? I think that is what the team is leaning towards, though @topper-123 you might disagree.

I will admit I'm not entirely clear on the purpose of the function, even reading through the PR that introduced it, so might just be a lack of understanding on my end

@topper-123
Copy link
Contributor

though @topper-123 you might disagree

I'm fine with that, as hinted above, though functions that return multiple types will have their return annotation removed (which I'm ok with if it solves this annoying problem).

@TomAugspurger
Copy link
Contributor

Actually... looking back at the original issue, isn't it example invalid? Do we expect that for an NDFrame subclass, that the type of _constructor is a callable returning type(self)?

import pandas as pd

class MyDataFrame(pd.DataFrame):
    pass

That said, I still don't think that we should have runtime consequences for static type-checking stuff. But it'd be nice to find a better example for a test case than something that's (potentially) flawed).

@TomAugspurger
Copy link
Contributor

We could use a decision here. IIUC, #32416 (comment) provides a more realistic example where things fail.

Two options:

  1. Change DataFrame._constructor to return the type of the subclass by default
diff --git a/pandas/core/frame.py b/pandas/core/frame.py
index 106128004f..0718860e82 100644
--- a/pandas/core/frame.py
+++ b/pandas/core/frame.py
@@ -404,7 +404,8 @@ class DataFrame(NDFrame):
 
     @property
     def _constructor(self) -> Type["DataFrame"]:
-        return DataFrame
+        return type(self)
+        # return DataFrame
 
     _constructor_sliced: Type[Series] = Series
     _deprecations: FrozenSet[str] = NDFrame._deprecations | frozenset([])
  1. Remove the calls to _ensure_type

I think 1 may be a worthwhile change on its own, but I'm not comfortable doing it in a bugfix release (subclasses may have a different signature in their init). For now I'd prefer removing ensure_type until we can get @simonjayhawkins's suggestion in #32416 (comment) working.

@WillAyd
Copy link
Member

WillAyd commented Mar 10, 2020

So I am still +1 to remove at least for 1.0.2 . Can always add back in later if there is a more comprehensive solution found

@jorisvandenbossche jorisvandenbossche added this to the 1.0.2 milestone Mar 11, 2020
@jorisvandenbossche
Copy link
Member

jorisvandenbossche commented Mar 11, 2020

Yes, I would also just remove the type checking for 1.0.2, we can discuss further changes related to _constructor for 1.x (that needs some more investigation of the implications, I think), and then possibly re-introduce the type checking later

@TomAugspurger
Copy link
Contributor

Can someone more familiar with mypy (@simonjayhawkins?) remove the ensure_type stuff? I'm getting mypy errors with the following diff from master

diff --git a/pandas/core/base.py b/pandas/core/base.py
index 478b83f538..40ff0640a5 100644
--- a/pandas/core/base.py
+++ b/pandas/core/base.py
@@ -9,7 +9,6 @@ from typing import Dict, FrozenSet, List, Optional, Union
 import numpy as np
 
 import pandas._libs.lib as lib
-from pandas._typing import T
 from pandas.compat import PYPY
 from pandas.compat.numpy import function as nv
 from pandas.errors import AbstractMethodError
@@ -87,15 +86,6 @@ class PandasObject(DirNamesMixin):
         # no memory_usage attribute, so fall back to object's 'sizeof'
         return super().__sizeof__()
 
-    def _ensure_type(self: T, obj) -> T:
-        """
-        Ensure that an object has same type as self.
-
-        Used by type checkers.
-        """
-        assert isinstance(obj, type(self)), type(obj)
-        return obj
-
 
 class NoNewAttributesMixin:
     """
diff --git a/pandas/core/frame.py b/pandas/core/frame.py
index 30abcafb56..d391f1be2c 100644
--- a/pandas/core/frame.py
+++ b/pandas/core/frame.py
@@ -3637,7 +3637,7 @@ class DataFrame(NDFrame):
         # Pop these, since the values are in `kwargs` under different names
         kwargs.pop("axis", None)
         kwargs.pop("labels", None)
-        return self._ensure_type(super().reindex(**kwargs))
+        return super().reindex(**kwargs)
 
     def drop(
         self,
@@ -3955,8 +3955,8 @@ class DataFrame(NDFrame):
 
     @Appender(_shared_docs["shift"] % _shared_doc_kwargs)
     def shift(self, periods=1, freq=None, axis=0, fill_value=None) -> "DataFrame":
-        return self._ensure_type(
-            super().shift(periods=periods, freq=freq, axis=axis, fill_value=fill_value)
+        return super().shift(
+            periods=periods, freq=freq, axis=axis, fill_value=fill_value
         )
 
     def set_index(
@@ -8409,14 +8409,12 @@ Wild         185.0
             from pandas.core.reshape.concat import concat
 
             values = collections.defaultdict(list, values)
-            return self._ensure_type(
-                concat(
-                    (
-                        self.iloc[:, [i]].isin(values[col])
-                        for i, col in enumerate(self.columns)
-                    ),
-                    axis=1,
-                )
+            return concat(
+                (
+                    self.iloc[:, [i]].isin(values[col])
+                    for i, col in enumerate(self.columns)
+                ),
+                axis=1,
             )
         elif isinstance(values, Series):
             if not values.index.is_unique:
diff --git a/pandas/core/generic.py b/pandas/core/generic.py
index f531351747..388a0c0b46 100644
--- a/pandas/core/generic.py
+++ b/pandas/core/generic.py
@@ -8442,9 +8442,7 @@ class NDFrame(PandasObject, SelectionMixin, indexing.IndexingMixin):
         )
 
         if method is not None:
-            left = self._ensure_type(
-                left.fillna(method=method, axis=fill_axis, limit=limit)
-            )
+            left = left.fillna(method=method, axis=fill_axis, limit=limit)
             right = right.fillna(method=method, axis=fill_axis, limit=limit)
 
         # if DatetimeIndex have different tz, convert to UTC
@@ -9977,9 +9975,7 @@ class NDFrame(PandasObject, SelectionMixin, indexing.IndexingMixin):
         if fill_method is None:
             data = self
         else:
-            data = self._ensure_type(
-                self.fillna(method=fill_method, axis=axis, limit=limit)
-            )
+            data = self.fillna(method=fill_method, axis=axis, limit=limit)
 
         rs = data.div(data.shift(periods=periods, freq=freq, axis=axis, **kwargs)) - 1
         if freq is not None:
diff --git a/pandas/core/reshape/pivot.py b/pandas/core/reshape/pivot.py
index 61aa34f724..0f3b8737cc 100644
--- a/pandas/core/reshape/pivot.py
+++ b/pandas/core/reshape/pivot.py
@@ -150,7 +150,7 @@ def pivot_table(
         table = table.sort_index(axis=1)
 
     if fill_value is not None:
-        table = table._ensure_type(table.fillna(fill_value, downcast="infer"))
+        table = table.fillna(fill_value, downcast="infer")
 
     if margins:
         if dropna:
diff --git a/pandas/io/formats/format.py b/pandas/io/formats/format.py
index c879eaeda6..f011293273 100644
--- a/pandas/io/formats/format.py
+++ b/pandas/io/formats/format.py
@@ -283,9 +283,7 @@ class SeriesFormatter:
                 series = series.iloc[:max_rows]
             else:
                 row_num = max_rows // 2
-                series = series._ensure_type(
-                    concat((series.iloc[:row_num], series.iloc[-row_num:]))
-                )
+                series = concat((series.iloc[:row_num], series.iloc[-row_num:]))
             self.tr_row_num = row_num
         else:
             self.tr_row_num = None
pandas/core/generic.py:8445: error: Incompatible types in assignment (expression has type "Optional[NDFrame]", variable has type "NDFrame")
pandas/core/generic.py:9978: error: Incompatible types in assignment (expression has type "Optional[FrameOrSeries]", variable has type "FrameOrSeries")
pandas/core/reshape/pivot.py:153: error: Incompatible types in assignment (expression has type "Optional[DataFrame]", variable has type "DataFrame")

@TomAugspurger
Copy link
Contributor

@ericchansen #32633 removed _ensure_type. Is there anything else to do here?

@TomAugspurger TomAugspurger removed this from the 1.0.2 milestone Mar 11, 2020
@ericchansen
Copy link
Author

@TomAugspurger No, nothing else to do. This will fix the issue I was experiencing.

I can't speak to why _ensure_type was originally implemented and whether or not it's a good idea.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

pandas.DataFrame.reindex_like throws if called from a derived class instance
6 participants