Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

DataFrame does not play well with classes extending it #2859

Closed
lukauskas opened this issue Feb 12, 2013 · 7 comments
Closed

DataFrame does not play well with classes extending it #2859

lukauskas opened this issue Feb 12, 2013 · 7 comments

Comments

@lukauskas
Copy link

Hi,

I am running on the git cloned version of pandas, and there seems to be quite a few issues with user defined classes extending DataFrame.

It seems that DataFrame class constructor is hardcoded in a lot of places, where self.__class__ or cls constructors should be used instead. This causes some weird behaviour.

Allow me to illustrate, let's import pandas and define some class that would extend DataFrame

In[2]: import pandas as pd
In [3]: pd.__version__
Out[3]: '0.10.1'
In [4]: class ClassExtendingDataFrame(pd.DataFrame):
   ...:     pass
   ...: 

Note that ClassExtendingDataFrame does not override anything and is essentially the same DataFrame, just renamed.

Now one would expect a new instance of ClassExtendingDataFrame to be created by the following code:

In [5]: dict = {'a' : [1,2,3], 'b': [2,3,4]}
In [6]: x = ClassExtendingDataFrame.from_dict(dict)

Unfortunately:

In [10]: assert(isinstance(x, ClassExtendingDataFrame))
---------------------------------------------------------------------------
AssertionError                            Traceback (most recent call last)
<ipython-input-10-3f1ceeb1b90f> in <module>()
----> 1 assert(isinstance(x, ClassExtendingDataFrame))

AssertionError: 

In [11]: type(x)
Out[11]: pandas.core.frame.DataFrame

This is due to DataFrame being hardcoded in from_dict: https://github.com/pydata/pandas/blob/master/pandas/core/frame.py#L905 .
cls variable should be used here.

Note that ClassExtendingDataFrame is initialised using constructor, rather than from_dict method, correct object is created:

In [12]: a = ClassExtendingDataFrame(dict)
In [13]: isinstance(a, ClassExtendingDataFrame)
Out[13]: True

However, operations as simple as slicing break this:

In [14]: isinstance(a[:5], ClassExtendingDataFrame)
Out[14]: False

These are just the two examples I have noticed myself, but I am sure there could be more.
A thorough review of Hardcoded DataFrame constructors is needed to check if they could be replaced by self.__class__ or cls instead.

@jreback
Copy link
Contributor

jreback commented Feb 12, 2013

see #60, #2485, #2695
also #2407, #2242 (for multi-dim)

this is non-trivial and almost always not necessary

lukauskas added a commit to lukauskas/pandas that referenced this issue Feb 13, 2013

Verified

This commit was signed with the committer’s verified signature.
thomcc Thom Chiovoloni
Changed all occurrences of `return DataFrame` to `return
self._constructor` and changed the latter to return `self.__class__`
rather than `DataFrame` by default.

Also made static methods like `DataFrame.from_dict` work on the
`cls` passed to them rather than on hard-coded `DataFrame`.

This is a minimal quick win on the cases described in Issue pandas-dev#2859.
A formal approach to make classes friendly to subclassing might still be
needed, see discussion in the case.
@lukauskas
Copy link
Author

In the cases described here it looks trivial,
I have submitted a pull request that should fix these cases and similar cases.
test_fast.sh seems to work fine

Though I agree that a major refactoring might be needed to do this properly.

@jreback
Copy link
Contributor

jreback commented Feb 13, 2013

this 'works' but will not change your ability to subclass

is there a specific case that sub classing would solve your problem rather than composition?
also MetaDataFrame might be useful

@lukauskas
Copy link
Author

Well, I do not think there is a case where subclassing could not be exchanged by composition.
One can always emulate inheritance with composition as in make both classes have the same API and one call another, yet it is tedious, and may be inefficient computationally.

In my case I subclass DataFrame in order to provide additional functionality that is only relevant with the data I am dealing with. That is, I want to keep the benefits of having a DataFrame, but add an additional function to operate on the values as well as have a bit of validation when the values do not make any sense any more.

I could easily change the function to be external to the class and take DataFrame as a parameter,
but doing some_data.foo() is a bit more convenient than foo(some_data).

I also could use something like MetaDataFrame to get the same some_data.foo(), but I am not too keen on this solution as it is kind of a hacky way of doing inheritance. This would make all functions that rely on the functionality of DataFrame be aware of the fact that MetaDataFrame is essentially the same thing.
That is, the following code would need to be changed:

def foo(data_frame):
    if isinstance(data_frame, DataFrame):
        return data_frame.mean()
    elif isinstance(data_frame, SomethingElse):
        return data_frame.foo()
    else:
       raise ValueError("foo() expects its input to be either DataFrame or SomethingElse. {0} given".format(type(data_frame))

That is one example where subclassing might be necessary.
Subclassing is necessary to be able to call protected methods too, in case people have to do so.
Of course, there are workarounds in both of these cases too.

I think it all boils down to whether the subclass is trying to be a DataFrame, or is just using a DataFrame.

But I do not think that this discussion changes the fact that the code should not be hardcoding the explicit constructor, and using things like cls where appropriate. At some point there might be a need to refactor
the whole class, move these bits somewhere else (maybe to subclasses or mixins) and it might be painful to do so with hardcoded things in it.

I have changed all the occurrences of DataFrame constructs in the class, and would think that maybe unifying calls to Series constructors might be useful as well.

@jreback
Copy link
Contributor

jreback commented Feb 13, 2013

I encourage to read the discussions I pointed to above
all of your points have been addressed

it is quite straightforward to add a function to do 'x' which makes it look native
and has the same call syntax

in python there are no 'protected' methods, u can call whatever you want

@hayd
Copy link
Contributor

hayd commented May 29, 2014

duplicate of those mentioned above.

@hayd hayd closed this as completed May 29, 2014
@vi3k6i5
Copy link

vi3k6i5 commented Mar 3, 2017

@lukauskas would you consider this as a possible solution to your question? http://stackoverflow.com/a/42580274/3027854

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants