`_ensure_type` should use `issubclass` #32416

ericchansen · 2020-03-03T18:12:19Z

Commit 6fd326d in pull request #30613 added _ensure_type, which utilizes isinstance. However, it is reasonable to assume that someone may want to create a DataFrame subclass. Therefore, _ensure_type should use issubclass.

closes pandas.DataFrame.reindex_like throws if called from a derived class instance #31925
tests added / passed
passes black pandas
passes git diff upstream/master -u -- "*.py" | flake8 --diff
whatsnew entry

whatsnew entry isn't necessary?

…1925) Commit pandas-dev/pandas@6fd326d in pull request pandas-dev#30613 added `_ensure_type`, which utilizes `isinstance`. However, it is reasonable to assume that someone may want to create a DataFrame subclass. Therefore, `_ensure_type` should use `issubclass`.

TomAugspurger · 2020-03-03T18:20:46Z

I don't completely follow the changes here. isinstance typically considers subclasses to be instances of the parent's type.

In [1]: class A:
   ...:     pass
   ...:

In [2]: class B(A):
   ...:     pass
   ...:

In [3]: isinstance(B(), A)
Out[3]: True

but not the other way around.

Why do we have this check in the first place? If it's just for static typing, then I'd prefer to remove the assert and handle it some other way. cc @topper-123 @simonjayhawkins

simonjayhawkins · 2020-03-03T18:30:38Z

Why do we have this check in the first place?

to avoid adopting Literal and overloads until the inplace parameter of Series/DataFrame methods is deprecated or repetitively adding asserts following calls with inplace=False so that the return type is either DataFrame or Series and not None.

If it's just for static typing, then I'd prefer to remove the assert and handle it some other way.

agreed

ericchansen · 2020-03-03T18:44:26Z

Going from pandas 0.25.* to >= 1.0 breaks any libraries that create subclasses of DataFrame. For those that know the pandas code base better than I do, I'm sure there are other, probably better ways to handle this. However, this PR will quickly resolve any breaking issues without breaking anything else.

simonjayhawkins · 2020-03-03T18:58:37Z

However, it is reasonable to assume that someone may want to create a DataFrame subclass.

Agreed.

if the type annotations, in places, do not use typevars with generic self then this is another issue (but should be fixed).

i.e.
def reindex(self, *args, **kwargs) -> "DataFrame":
should be
def reindex(self: FrameOrSeries, *args, **kwargs) -> FrameOrSeries:

can you give an example of where you have errors

WillAyd · 2020-03-03T18:58:47Z

I don't totally follow the issue or the fix; can someone provide a full traceback of what the actual issue is? I didn't see that in the OP

FWIW the tests are subclassing non-public classes, and the OP doesn't correctly subclass a DataFrame per the guide:

https://pandas.pydata.org/pandas-docs/stable/development/extending.html#extending-subclassing-pandas

Wondering if the fact that the subclasses don't define _constructor is the actual root cause

simonjayhawkins · 2020-03-03T19:02:33Z

def reindex(self: FrameOrSeries, *args, **kwargs) -> FrameOrSeries:

this approach always assumes _constructor is type(self)

simonjayhawkins · 2020-03-03T19:18:48Z

pandas/core/base.py

@@ -93,7 +93,7 @@ def _ensure_type(self: T, obj) -> T:

        Used by type checkers.
        """
-        assert isinstance(obj, type(self)), type(obj)
+        assert issubclass(type(obj), type(self)), type(obj)


The purpose of ensure_type is to ensure that we have a type and not None for methods with an inplace parameter where the return type is Optional. does this work?

Suggested change

assert issubclass(type(obj), type(self)), type(obj)

assert obj is not None

will probably also need to change
def _ensure_type(self: T, obj) -> T: to def _ensure_type(self, obj: Optional[T]) -> T:

If @simonjayhawkins 's suggestion works, I think that would be the mest fool-proof, though I'm not sure I understand the issue.

Another thing we probably should consider is never return multiple types, so e.g. no Optional[FrameOrSeries] until we get to python 3.8 as the lowest python version. If we did that _ensure_type could be pulled from the code base.

WillAyd · 2020-03-03T20:53:11Z

So just to confirm, if the OP is modified to provide a _constructor then everything seems OK:

import pandas as pd
import pandas.testing as pdt

class MyDataFrame(pd.DataFrame):
    @property
    def _constructor(self):
        return MyDataFrame

mdf = MyDataFrame()
df = pd.DataFrame()
mdf.reindex_like(df)
pdt.assert_frame_equal(mdf, df, check_like=True)

Without the assert wouldn't we be returning a DataFrame instead of the subclass? I think this is just a red herring

ericchansen · 2020-03-03T21:01:06Z

@WillAyd Yes, the library doesn't override _constructor. Here's a minimal working example.

from pandas import DataFrame

data = {
    "Column 1": [1, 2, 3, 4],
    "Column 2": [10, 11, 12, 13],
    "Column 3": [100, 101, 102, 103]
}

dataframe = DataFrame(data=data)
dataframe.drop(1, inplace=True)

class DataFrameSubclass(DataFrame):
    def useful_new_function(self):
        pass

dataframe_subclass = DataFrameSubclass(data=data)
dataframe_subclass.drop(1, inplace=True)

Here's the traceback.

Traceback (most recent call last):
  File "C:\Users\eric.a.hansen\AppData\Local\Continuum\anaconda3\envs\isclone\lib\runpy.py", line 193, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "C:\Users\eric.a.hansen\AppData\Local\Continuum\anaconda3\envs\isclone\lib\runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "c:\Users\eric.a.hansen\.vscode-insiders\extensions\ms-python.python-2020.2.64397\pythonFiles\lib\python\new_ptvsd\no_wheels\ptvsd\__main__.py", line 45, in <module>
    cli.main()
  File "c:\Users\eric.a.hansen\.vscode-insiders\extensions\ms-python.python-2020.2.64397\pythonFiles\lib\python\new_ptvsd\no_wheels\ptvsd/..\ptvsd\server\cli.py", line 361, in main
    run()
  File "c:\Users\eric.a.hansen\.vscode-insiders\extensions\ms-python.python-2020.2.64397\pythonFiles\lib\python\new_ptvsd\no_wheels\ptvsd/..\ptvsd\server\cli.py", line 203, in run_file
    runpy.run_path(options.target, run_name="__main__")
  File "C:\Users\eric.a.hansen\AppData\Local\Continuum\anaconda3\envs\isclone\lib\runpy.py", line 263, in run_path
    return _run_module_code(code, init_globals, run_name,
  File "C:\Users\eric.a.hansen\AppData\Local\Continuum\anaconda3\envs\isclone\lib\runpy.py", line 96, in _run_module_code
    _run_code(code, mod_globals, init_globals,
  File "C:\Users\eric.a.hansen\AppData\Local\Continuum\anaconda3\envs\isclone\lib\runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "c:\Users\eric.a.hansen\source\repos\pandasexample\example.py", line 17, in <module>
    dataframe_subclass.drop(1, inplace=True)
  File "C:\Users\eric.a.hansen\AppData\Local\Continuum\anaconda3\envs\isclone\lib\site-packages\pandas\core\frame.py", line 3990, in drop
    return super().drop(
  File "C:\Users\eric.a.hansen\AppData\Local\Continuum\anaconda3\envs\isclone\lib\site-packages\pandas\core\generic.py", line 3936, in 
drop
    obj = obj._drop_axis(labels, axis, level=level, errors=errors)
  File "C:\Users\eric.a.hansen\AppData\Local\Continuum\anaconda3\envs\isclone\lib\site-packages\pandas\core\generic.py", line 3971, in 
_drop_axis
    result = self.reindex(**{axis_name: new_axis})
  File "C:\Users\eric.a.hansen\AppData\Local\Continuum\anaconda3\envs\isclone\lib\site-packages\pandas\util\_decorators.py", line 227, 
in wrapper
    return func(*args, **kwargs)
  File "C:\Users\eric.a.hansen\AppData\Local\Continuum\anaconda3\envs\isclone\lib\site-packages\pandas\core\frame.py", line 3856, in reindex
    return self._ensure_type(super().reindex(**kwargs))
  File "C:\Users\eric.a.hansen\AppData\Local\Continuum\anaconda3\envs\isclone\lib\site-packages\pandas\core\base.py", line 93, in _ensure_type
    assert isinstance(obj, type(self)), type(obj)
AssertionError: <class 'pandas.core.frame.DataFrame'>

simonjayhawkins · 2020-03-03T21:10:08Z

from that output, it looks like the return type of DataFrame.reindex should use a typevar to allow subclassing #32416 (comment)

simonjayhawkins · 2020-03-03T21:30:39Z

from that output, it looks like the return type of DataFrame.reindex should use a typevar to allow subclassing #32416 (comment)

changing that, and running mypy gives pandas\core\frame.py:3633: error: Argument 2 for "super" not an instance of argument 1

none the wiser atm and can't seem to find where the DataFrame object is created instead of the subclass

the AssertionError implies that super().reindex(**kwargs) is a DataFrame not the subclass

WillAyd · 2020-03-06T05:08:12Z

I don't think this fix actually works, but probably not ideal anyway.

Maybe we should just get rid of _ensure_type altogether? I think that is what the team is leaning towards, though @topper-123 you might disagree.

I will admit I'm not entirely clear on the purpose of the function, even reading through the PR that introduced it, so might just be a lack of understanding on my end

topper-123 · 2020-03-06T18:00:06Z

though @topper-123 you might disagree

I'm fine with that, as hinted above, though functions that return multiple types will have their return annotation removed (which I'm ok with if it solves this annoying problem).

TomAugspurger · 2020-03-06T21:45:37Z

Actually... looking back at the original issue, isn't it example invalid? Do we expect that for an NDFrame subclass, that the type of _constructor is a callable returning type(self)?

import pandas as pd

class MyDataFrame(pd.DataFrame):
    pass

That said, I still don't think that we should have runtime consequences for static type-checking stuff. But it'd be nice to find a better example for a test case than something that's (potentially) flawed).

TomAugspurger · 2020-03-10T18:43:40Z

We could use a decision here. IIUC, #32416 (comment) provides a more realistic example where things fail.

Two options:

Change DataFrame._constructor to return the type of the subclass by default

diff --git a/pandas/core/frame.py b/pandas/core/frame.py
index 106128004f..0718860e82 100644
--- a/pandas/core/frame.py
+++ b/pandas/core/frame.py
@@ -404,7 +404,8 @@ class DataFrame(NDFrame):
 
     @property
     def _constructor(self) -> Type["DataFrame"]:
-        return DataFrame
+        return type(self)
+        # return DataFrame
 
     _constructor_sliced: Type[Series] = Series
     _deprecations: FrozenSet[str] = NDFrame._deprecations | frozenset([])

Remove the calls to _ensure_type

I think 1 may be a worthwhile change on its own, but I'm not comfortable doing it in a bugfix release (subclasses may have a different signature in their init). For now I'd prefer removing ensure_type until we can get @simonjayhawkins's suggestion in #32416 (comment) working.

WillAyd · 2020-03-10T20:25:34Z

So I am still +1 to remove at least for 1.0.2 . Can always add back in later if there is a more comprehensive solution found

jorisvandenbossche · 2020-03-11T13:28:13Z

Yes, I would also just remove the type checking for 1.0.2, we can discuss further changes related to _constructor for 1.x (that needs some more investigation of the implications, I think), and then possibly re-introduce the type checking later

TomAugspurger · 2020-03-11T14:21:41Z

Can someone more familiar with mypy (@simonjayhawkins?) remove the ensure_type stuff? I'm getting mypy errors with the following diff from master

diff --git a/pandas/core/base.py b/pandas/core/base.py
index 478b83f538..40ff0640a5 100644
--- a/pandas/core/base.py
+++ b/pandas/core/base.py
@@ -9,7 +9,6 @@ from typing import Dict, FrozenSet, List, Optional, Union
 import numpy as np
 
 import pandas._libs.lib as lib
-from pandas._typing import T
 from pandas.compat import PYPY
 from pandas.compat.numpy import function as nv
 from pandas.errors import AbstractMethodError
@@ -87,15 +86,6 @@ class PandasObject(DirNamesMixin):
         # no memory_usage attribute, so fall back to object's 'sizeof'
         return super().__sizeof__()
 
-    def _ensure_type(self: T, obj) -> T:
-        """
-        Ensure that an object has same type as self.
-
-        Used by type checkers.
-        """
-        assert isinstance(obj, type(self)), type(obj)
-        return obj
-
 
 class NoNewAttributesMixin:
     """
diff --git a/pandas/core/frame.py b/pandas/core/frame.py
index 30abcafb56..d391f1be2c 100644
--- a/pandas/core/frame.py
+++ b/pandas/core/frame.py
@@ -3637,7 +3637,7 @@ class DataFrame(NDFrame):
         # Pop these, since the values are in `kwargs` under different names
         kwargs.pop("axis", None)
         kwargs.pop("labels", None)
-        return self._ensure_type(super().reindex(**kwargs))
+        return super().reindex(**kwargs)
 
     def drop(
         self,
@@ -3955,8 +3955,8 @@ class DataFrame(NDFrame):
 
     @Appender(_shared_docs["shift"] % _shared_doc_kwargs)
     def shift(self, periods=1, freq=None, axis=0, fill_value=None) -> "DataFrame":
-        return self._ensure_type(
-            super().shift(periods=periods, freq=freq, axis=axis, fill_value=fill_value)
+        return super().shift(
+            periods=periods, freq=freq, axis=axis, fill_value=fill_value
         )
 
     def set_index(
@@ -8409,14 +8409,12 @@ Wild         185.0
             from pandas.core.reshape.concat import concat
 
             values = collections.defaultdict(list, values)
-            return self._ensure_type(
-                concat(
-                    (
-                        self.iloc[:, [i]].isin(values[col])
-                        for i, col in enumerate(self.columns)
-                    ),
-                    axis=1,
-                )
+            return concat(
+                (
+                    self.iloc[:, [i]].isin(values[col])
+                    for i, col in enumerate(self.columns)
+                ),
+                axis=1,
             )
         elif isinstance(values, Series):
             if not values.index.is_unique:
diff --git a/pandas/core/generic.py b/pandas/core/generic.py
index f531351747..388a0c0b46 100644
--- a/pandas/core/generic.py
+++ b/pandas/core/generic.py
@@ -8442,9 +8442,7 @@ class NDFrame(PandasObject, SelectionMixin, indexing.IndexingMixin):
         )
 
         if method is not None:
-            left = self._ensure_type(
-                left.fillna(method=method, axis=fill_axis, limit=limit)
-            )
+            left = left.fillna(method=method, axis=fill_axis, limit=limit)
             right = right.fillna(method=method, axis=fill_axis, limit=limit)
 
         # if DatetimeIndex have different tz, convert to UTC
@@ -9977,9 +9975,7 @@ class NDFrame(PandasObject, SelectionMixin, indexing.IndexingMixin):
         if fill_method is None:
             data = self
         else:
-            data = self._ensure_type(
-                self.fillna(method=fill_method, axis=axis, limit=limit)
-            )
+            data = self.fillna(method=fill_method, axis=axis, limit=limit)
 
         rs = data.div(data.shift(periods=periods, freq=freq, axis=axis, **kwargs)) - 1
         if freq is not None:
diff --git a/pandas/core/reshape/pivot.py b/pandas/core/reshape/pivot.py
index 61aa34f724..0f3b8737cc 100644
--- a/pandas/core/reshape/pivot.py
+++ b/pandas/core/reshape/pivot.py
@@ -150,7 +150,7 @@ def pivot_table(
         table = table.sort_index(axis=1)
 
     if fill_value is not None:
-        table = table._ensure_type(table.fillna(fill_value, downcast="infer"))
+        table = table.fillna(fill_value, downcast="infer")
 
     if margins:
         if dropna:
diff --git a/pandas/io/formats/format.py b/pandas/io/formats/format.py
index c879eaeda6..f011293273 100644
--- a/pandas/io/formats/format.py
+++ b/pandas/io/formats/format.py
@@ -283,9 +283,7 @@ class SeriesFormatter:
                 series = series.iloc[:max_rows]
             else:
                 row_num = max_rows // 2
-                series = series._ensure_type(
-                    concat((series.iloc[:row_num], series.iloc[-row_num:]))
-                )
+                series = concat((series.iloc[:row_num], series.iloc[-row_num:]))
             self.tr_row_num = row_num
         else:
             self.tr_row_num = None

pandas/core/generic.py:8445: error: Incompatible types in assignment (expression has type "Optional[NDFrame]", variable has type "NDFrame")
pandas/core/generic.py:9978: error: Incompatible types in assignment (expression has type "Optional[FrameOrSeries]", variable has type "FrameOrSeries")
pandas/core/reshape/pivot.py:153: error: Incompatible types in assignment (expression has type "Optional[DataFrame]", variable has type "DataFrame")

TomAugspurger · 2020-03-11T22:12:23Z

@ericchansen #32633 removed _ensure_type. Is there anything else to do here?

ericchansen · 2020-03-13T17:37:26Z

@TomAugspurger No, nothing else to do. This will fix the issue I was experiencing.

I can't speak to why _ensure_type was originally implemented and whether or not it's a good idea.

ericchansen added 2 commits March 3, 2020 13:08

TST: Added tests for pandas object ensure type (pandas-dev#31925)

5ddbea9

simonjayhawkins reviewed Mar 3, 2020

View reviewed changes

jbrockmendel mentioned this pull request Mar 4, 2020

RLS: 1.0.2 #32415

Closed

jorisvandenbossche added this to the 1.0.2 milestone Mar 11, 2020

TomAugspurger removed this from the 1.0.2 milestone Mar 11, 2020

ericchansen closed this Mar 13, 2020

tgibson11 mentioned this pull request Jun 7, 2020

print raising exception in process_adjusted_prices_single_instrument robcarver17/pysystemtrade#175

Closed

	assert issubclass(type(obj), type(self)), type(obj)
	assert obj is not None

Uh oh!

_ensure_type should use issubclass #32416

_ensure_type should use issubclass #32416

Uh oh!

Conversation

ericchansen commented Mar 3, 2020

Uh oh!

TomAugspurger commented Mar 3, 2020

Uh oh!

simonjayhawkins commented Mar 3, 2020

Uh oh!

ericchansen commented Mar 3, 2020

Uh oh!

simonjayhawkins commented Mar 3, 2020

Uh oh!

WillAyd commented Mar 3, 2020

Uh oh!

simonjayhawkins commented Mar 3, 2020

Uh oh!

simonjayhawkins Mar 3, 2020

Choose a reason for hiding this comment

Uh oh!

simonjayhawkins Mar 3, 2020

Choose a reason for hiding this comment

Uh oh!

topper-123 Mar 3, 2020

Choose a reason for hiding this comment

Uh oh!

WillAyd commented Mar 3, 2020

Uh oh!

ericchansen commented Mar 3, 2020

Uh oh!

simonjayhawkins commented Mar 3, 2020

Uh oh!

simonjayhawkins commented Mar 3, 2020 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

WillAyd commented Mar 6, 2020 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

topper-123 commented Mar 6, 2020

Uh oh!

TomAugspurger commented Mar 6, 2020

Uh oh!

TomAugspurger commented Mar 10, 2020

Uh oh!

WillAyd commented Mar 10, 2020

Uh oh!

jorisvandenbossche commented Mar 11, 2020 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

TomAugspurger commented Mar 11, 2020

Uh oh!

TomAugspurger commented Mar 11, 2020

Uh oh!

ericchansen commented Mar 13, 2020

Uh oh!

Uh oh!

`_ensure_type` should use `issubclass` #32416

`_ensure_type` should use `issubclass` #32416

simonjayhawkins commented Mar 3, 2020 •

edited

Loading

WillAyd commented Mar 6, 2020 •

edited

Loading

jorisvandenbossche commented Mar 11, 2020 •

edited

Loading