Easier subclassing #60
Since I've recently refactored DataFrame and DataMatrix into a single class, this should be much easier and more consistent to do now. I'm curious how and why you're creating a subclass. Is it to add functionality that's not there? It might be easier in that case to monkey-patch in methods like:
If you give me some use cases, it would be helpful! |
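The monkey-patching suggestion above can be sketched as follows. This is only an illustration under stated assumptions: `Table` and `column_totals` are hypothetical names, and `Table` is a toy stand-in rather than a pandas class. With pandas, the same kind of assignment would target `DataFrame` instead.

```python
class Table:
    """Toy stand-in for a DataFrame-like class (hypothetical, not pandas)."""

    def __init__(self, columns):
        # columns: dict mapping column name -> list of numbers
        self.columns = columns


def column_totals(self):
    """Return the sum of each column as a plain dict."""
    return {name: sum(values) for name, values in self.columns.items()}


# Monkey-patch: attach the function to the existing class as a method.
# Every instance, including ones created earlier, now has column_totals().
Table.column_totals = column_totals

survey = Table({"lions": [2, 0, 1], "zebras": [10, 4, 7]})
totals = survey.column_totals()
```

This adds behaviour without a subclass, but it cannot change how instances are initialised, which is one of the objections raised later in the thread.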
Thanks for the quick reply. Glad to hear this is easier in the latest version. I'm subclassing since I have a particular kind of dataset I want to work with. Namely, I'm creating an ecological survey class, inheriting from DataMatrix, where rows correspond to samples and columns to organisms (e.g. how many lions, zebras and giraffes were observed at 5 different water holes).
Hope this is clearer now. |
I pushed some changes today that should make subclassing easier, maybe update to the git HEAD and give it a shot. I don't have a good understanding of your point 2). Could you provide a bit more detail? Not immediately clear to me why you can't just store that data as columns in the DataFrame (unless you need to do row-oriented computations on the data-- though some of this is possible even if you have heterogeneously-dtyped columns). A lot of people do that. I just want to be sure that your need to subclass to solve point 2) is not caused by some deficiency in the data structure. Further, I would likely be interested in adding methods like you're describing that are sensitive to metadata-- if it's generic enough could be a nice addition. You can already do group by with a column containing an indicator of sorts, e.g.:
etc. |
Thanks, I'll update and give it a try. Everything below refers to v. 0.3.0. My motivation for adding metadata attributes, rather than appending additional cols/rows, is the following: a) With mixed dtypes, non-numeric cols get dropped when an arithmetic operation is performed.
This is easy to fix, or at least give the option to set the behavior to just ignore non-numeric data, rather than omitting them. b) I often wish to treat metadata differently than the 'main' data, be it numeric or not. For example, consider the following object:
giving the number of dogs and cats in different cities, and the mean annual temperature in these cities. Currently, I'm supporting metadata by adding attributes to the DataFrame/Matrix objects, which would contain metadata objects. These objects would themselves be DataFrame/Matrix objects. |
Cool let me know how it goes. I agree that maybe the sensible default behavior when you do arithmetic with mixed dtypes is to just ignore the non-numeric data and let everything else pass through. There are some issues I haven't thought much about like, what would DataFrame + Series yield if the DataFrame contains mixed-dtype data? In R land this isn't exactly a solved problem and is very much DIY, but crafting some kind of flexible solution would be nice. Like you might want something like:
so you can selectively apply transforms and leave the other columns unaltered. |
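As a rough illustration of the "transform the numeric columns, pass the rest through" behaviour discussed above, here is a sketch on a plain dict of columns. The helper name `apply_numeric` and the sample data are assumptions made for this example; nothing here is pandas API.

```python
from numbers import Number


def apply_numeric(table, func):
    """Apply func to every value of the numeric columns.

    Columns containing any non-numeric value are copied unchanged
    instead of being dropped.
    """
    result = {}
    for name, values in table.items():
        is_numeric = all(
            isinstance(v, Number) and not isinstance(v, bool) for v in values
        )
        if is_numeric:
            result[name] = [func(v) for v in values]
        else:
            result[name] = list(values)
    return result


cities = {
    "dogs": [120, 80],
    "cats": [95, 60],
    "region": ["north", "south"],  # non-numeric, should survive untouched
}
doubled = apply_numeric(cities, lambda x: x * 2)
```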
Currently both Series and DataFrame have a _constructor property. However, it is not used consistently, which is needed for subclassing. The way I understand it:
class MySeries(pandas.Series):
    @property
    def _constructor(self):
        return MySeries
Series:
DataFrame:
Why do I want to subclass?
|
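To make the point about consistency concrete, here is a minimal sketch of a base class whose methods always build their results through `_constructor`. `ToySeries` is a hypothetical stand-in written for this example, not `pandas.Series`; it only shows why every result-producing method has to go through the same hook.

```python
class ToySeries:
    """Toy stand-in for a Series-like class (hypothetical, not pandas)."""

    def __init__(self, values):
        self.values = list(values)

    @property
    def _constructor(self):
        # Subclasses override this to say which class results should have.
        return ToySeries

    def __add__(self, other):
        # Scalar addition; the result is built via the hook, not a fixed name.
        return self._constructor(v + other for v in self.values)

    def head(self, n=5):
        return self._constructor(self.values[:n])


class MySeries(ToySeries):
    @property
    def _constructor(self):
        return MySeries


s = MySeries([1, 2, 3])
shifted = s + 10
first_two = shifted.head(2)
```

If even one method called `ToySeries(...)` directly, chaining through that method would silently fall back to the base class.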
You don't need to subclass to do the first two things on your list--you can just add methods to Series, e.g.:
This goes for modifying describe also. The last item is a bit trickier. It indeed might be nice to add more metadata to DataFrame-- I agree it should be easier to subclass, though. Subclassing DataFrame should be much more straightforward than subclassing Series, it's just a matter of consistency. The only real way is to write a test suite for a subclassed DataFrame and start hammering down all the issues. |
Was there a final decision on this? I see that the issue was closed but I don't see an explanation. I too would like to see DataFrames (and other pandas classes) become more amenable to subclassing. Monkeypatching seems much more hackish, and also doesn't allow for the case where you want custom initialization. The way pandas is now, with the class names hard coded in individual methods, instead of using type(self) or similar, is rather fragile. It would really be nice if it were possible to subclass pandas classes in such a way that everything transparently "worked" with creating the custom subclasses instead of the basic Pandas classes. This would make it possible to create custom DataFrames for different applications. These could, for instance, store extra metadata or computed statistics automatically. For my own case, I was hoping to extend DataFrame to allow a more succinct subsetting syntax, where essentially df._ColName(val) is shorthand for df.ix[df['ColName']==val], and df._ColName(func) is shorthand for df.ix[func(df['ColName'])]. I have a little data-frame library that I wrote myself that uses this approach, and it's very handy for interactive exploration and slicing of datasets. However, I wasn't able to accomplish this in a useful way, because indexing into my subclass returns a pandas DataFrame and not an instance of my subclass. It would be great if pandas allowed this. |
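The subsetting shorthand described above can be sketched with `__getattr__` on a toy row-based table. Everything here is an assumption for illustration: `Table` and `SurveyTable` are hypothetical classes, not pandas, and unlike the `df._ColName(func)` form in the comment, the function here is applied to one value at a time rather than to the whole column.

```python
class Table:
    """Toy row-based table (hypothetical, not pandas)."""

    def __init__(self, rows):
        self.rows = list(rows)  # each row is a dict: column name -> value

    def __getattr__(self, name):
        # Only reached when normal attribute lookup fails.
        # Treat "_ColName" as a selector; leave dunder names alone.
        if len(name) < 2 or not name.startswith("_") or name.startswith("__"):
            raise AttributeError(name)
        column = name[1:]

        def selector(criterion):
            # A callable is used as a per-value test; anything else
            # is compared for equality.
            test = criterion if callable(criterion) else (lambda v: v == criterion)
            # type(self) keeps the result in the caller's (sub)class.
            return type(self)(r for r in self.rows if test(r[column]))

        return selector


class SurveyTable(Table):
    pass


t = SurveyTable([
    {"city": "A", "dogs": 3},
    {"city": "B", "dogs": 7},
    {"city": "A", "dogs": 5},
])
in_a = t._city("A")
many = t._dogs(lambda n: n > 4)
```

Because the selector builds its result with `type(self)`, indexing into the subclass returns the subclass, which is exactly what the commenter could not get from DataFrame at the time.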
My position on this is that I would like Series/DataFrame/etc. to be easier to subclass, but it's not a priority for me and I can't afford to spend any time on it for the foreseeable future. If there were some financial support for it, that would be a different story. That being said, I will happily accept pull requests or otherwise code contributions that make the changes necessary to make DataFrame more amenable to subclassing. |
It seems there are some changes that could be made pretty easily. As lodagro mentioned in an earlier comment, DataFrame has a _constructor method which appears to be set up to parameterize self-instantiation, but it's not used in most cases. There are a couple methods that call |
I started to create a subclass of Series to model discrete probability distributions before coming across this problem. It can definitely be done with composition instead, but I do think not being able to subclass easily is a trap which will surprise users. @wesm I see your post above regarding priority for this; just adding a +1 to show it would be appreciated if someone does it. |
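For comparison, the composition route mentioned above might look like the sketch below. It is an assumption-laden toy: `DiscreteDistribution` wraps a plain dict rather than a pandas Series, and the method names are invented for this example.

```python
class DiscreteDistribution:
    """Holds outcome -> probability data instead of inheriting from it."""

    def __init__(self, weights):
        total = sum(weights.values())
        if total <= 0:
            raise ValueError("weights must sum to a positive number")
        # Normalise so the stored probabilities sum to 1.
        self._probs = {k: w / total for k, w in weights.items()}

    def prob(self, outcome):
        return self._probs.get(outcome, 0.0)

    def condition(self, predicate):
        """Return a new distribution restricted to matching outcomes."""
        kept = {k: p for k, p in self._probs.items() if predicate(k)}
        return type(self)(kept)


die = DiscreteDistribution({face: 1 for face in range(1, 7)})
even = die.condition(lambda face: face % 2 == 0)
```

The wrapper controls exactly which operations exist and what type they return, at the cost of re-exposing by hand any container behaviour it wants to keep.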
+1 vote to make DataFrame easier to subclass ASAP. |
This was really confusing for me. I wasted an hour and half trying to figure out why In my case, I'm modifying the behavior of It would have been much more Pythonic if pandas used |
Converting this to an open issue until someone has a chance to work on it. It's still not a development priority for me |
Has there been any progress on this? If not, I can try, because I need this as well. |
I feel dumb... I can't even manage to build the library:
Any pointers? |
Make sure you update to the current master; a lot of things moved around recently (especially the Cython/C code). |
Arr... that's taking too much time. I was a bit too optimistic I guess, thinking that I could change everything without even knowing the code. For what it's worth, here is what I did: 98e1fadbf9395e88115044db3e5fc8e8f3a46012. There are ~5 errors which I couldn't solve, and ~15 failures in the tests. Sorry for all the fuss. I'll find a hack for my thing for the moment, but I'll follow the progress of this. |
closing...this is pretty easy now |
Hi,
Currently, sub-classing pandas objects is not as easy as it could be.
This is the result of many methods explicitly creating specific classes. E.g. the '_combine_const' method from 'matrix' explicitly returns a DataMatrix object:
def _combine_const(self, other, func):
    if not self:
        return self
Therefore, arithmetic operations (and some other operations) performed on a new class MyDataMatrix, inheriting from DataMatrix, return a DataMatrix object, rather than a MyDataMatrix object.
I can get around the problem using the ugly hack of overriding the problem methods, and forcing the output to the new class. E.g.,
def _combine_const(self, other, func):
    temp = super(MyDataMatrix, self)._combine_const(other, func)
    # Force the result back to the subclass of the calling object.
    temp.__class__ = self.__class__
    return temp
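A self-contained version of this workaround, using toy classes so it can be run without pandas, might look like the following. `ToyDataMatrix` is a hypothetical stand-in with a deliberately hard-coded class name in `_combine_const`; it is not the real DataMatrix implementation.

```python
class ToyDataMatrix:
    """Toy stand-in (hypothetical) that hard-codes its own class name."""

    def __init__(self, values):
        self.values = list(values)

    def _combine_const(self, other, func):
        # The problem: the result class is fixed, whatever type self is.
        return ToyDataMatrix(func(v, other) for v in self.values)

    def __add__(self, other):
        return self._combine_const(other, lambda a, b: a + b)


class MyDataMatrix(ToyDataMatrix):
    def _combine_const(self, other, func):
        temp = super()._combine_const(other, func)
        # The hack: relabel the base-class result as the subclass.
        temp.__class__ = self.__class__
        return temp


plain = ToyDataMatrix([1, 2]) + 1
mine = MyDataMatrix([1, 2]) + 1
```

Note that reassigning `__class__` skips the subclass's `__init__`, so any extra attributes the subclass normally sets up would be missing on the relabelled object.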
However, I feel it would be easier to just change the original methods to return the class of the calling object, rather than a fixed class. E.g.,
def _combine_const(self, other, func):
    if not self:
        return self
Hope this makes sense, and I'm not missing something.
Thanks for a very useful package!
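The change proposed in this post, returning the class of the calling object instead of a fixed class, can be sketched on a toy class as below. `ToyMatrix`, `SurveyMatrix` and `richness` are hypothetical names for this example, and the code makes no claim about how pandas implements or later fixed `_combine_const`.

```python
import operator


class ToyMatrix:
    """Toy stand-in (hypothetical, not pandas)."""

    def __init__(self, values):
        self.values = list(values)

    def _combine_const(self, other, func):
        if not self.values:
            return self
        # type(self) resolves to the subclass when called on a subclass.
        return type(self)(func(v, other) for v in self.values)

    def __mul__(self, other):
        return self._combine_const(other, operator.mul)


class SurveyMatrix(ToyMatrix):
    def richness(self):
        """Number of non-zero counts (subclass-only method)."""
        return sum(1 for v in self.values if v)


counts = SurveyMatrix([3, 0, 5])
scaled = counts * 2
```

With this pattern the subclass needs no overrides at all: arithmetic results keep the subclass type, so subclass-only methods such as `richness` remain available after an operation.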