Subclassing Dataset and DataArray #706

Closed
rafa-guedes opened this issue Jan 5, 2016 · 8 comments

@rafa-guedes
Contributor

Hi guys,

I have started writing a SpecArray class which inherits from DataArray and defines some methods useful for dealing with wave spectra, such as calculating spectral wave statistics (significant wave height, peak wave period, etc.), interpolating, splitting, and performing some other tasks. I'd like to ask:

  • Is this something you would be interested in adding to your library?
  • Is there a simple way to ensure the methods I am defining are preserved when creating a Dataset out of this SpecArray object? Currently I can create / add to a Dataset using this new object, but all the new methods get lost in doing so.

Thanks,
Rafael

@rabernat
Contributor

rabernat commented Jan 5, 2016

Hi Rafael,

I do lots of multidimensional spectral analysis on geophysical data (mostly ocean satellite fields; this paper, http://journals.ametsoc.org/doi/abs/10.1175/JPO-D-14-0160.1, for example), and I have recently started passing some of these calculations through xray. An example is in this notebook, https://gist.github.com/rabernat/be4526e157eb1fc69f50, where I define a function to compute an isotropic power spectrum over specified dimensions.

One huge source of confusion for students starting out with such calculations is the question of what spectral coordinates come out of the FFT (e.g. is the output "shifted"? is there a 2π factor in the units?). Because of xray's data model, these difficulties can be bypassed entirely by including verbose descriptions of the dimensions and coordinates.

My view is that spectral analysis is out of scope for xray. However, I think there is a need for a domain-specific spectral analysis package focused on geophysical data, which would naturally be built on xray. (As a comparison, consider the nitime package, http://nipy.org/nitime/, for neuroimaging time-series analysis.) This is something that I, and probably many others, would be interested in collaborating on. Some features I would like to see are:

  • wrapping of numpy's fft to work on xray DataArrays, including proper handling of coordinates (pretty easy; a rough sketch follows this list)
  • support for different windowing / multitaper methods
  • proper treatment of errors
  • built-in plotting
  • parallelization for out-of-core data (this is a hard one with FFTs but would be very useful)
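
To make the first bullet concrete, here is a rough sketch of what a coordinate-aware FFT wrapper could look like. The labeled_fft name, the freq_ prefix, and the cycles-per-unit convention are just illustrative choices, not anything that exists in xray:

import numpy as np
import xray

def labeled_fft(da, dim):
    """FFT of `da` along `dim`, returned with an explicit frequency
    coordinate in cycles per unit of `dim` (assumes even spacing)."""
    axis = da.get_axis_num(dim)
    n = da.shape[axis]
    spacing = float(da[dim][1] - da[dim][0])
    # fftshift puts zero frequency in the middle, so the new coordinate is
    # monotonically increasing (no guessing about whether it is "shifted")
    freq = np.fft.fftshift(np.fft.fftfreq(n, d=spacing))
    values = np.fft.fftshift(np.fft.fft(da.values, axis=axis), axes=axis)
    new_dims = ['freq_' + d if d == dim else d for d in da.dims]
    coords = {d: da.coords[d] for d in da.dims if d != dim and d in da.coords}
    coords['freq_' + dim] = freq
    return xray.DataArray(values, dims=new_dims, coords=coords)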

I think such a package would really take off in popularity and would help to displace MATLAB for this very common type of analysis. The question is whether there really is enough common interest among different scientists to justify a new package, as opposed to everyone just "rolling their own" solution. Based on your email, it sounds like you might be interested in such an effort.

Cheers,
Ryan Abernathey

@shoyer
Member

shoyer commented Jan 5, 2016

Back when I was doing spectroscopy in grad school, I wrote some routines to keep track of the units in Fourier transforms. I put this up on GitHub last year: https://github.com/shoyer/fourier-transform. I'm sure I'm not the only person to have written this code, but it still might be a useful point of departure.

As for xray, I agree that the full extent of what you're describing is probably out of scope for xarray itself. However, a basic labeled FFT does seem like it would be a useful addition to the core library.

Nevertheless, I am very interested in supporting external packages like this, either via subclassing or a similar mechanism.

One possibility would be a mechanism for registering "namespace" packages that define additional methods (as I have mentioned previously). You could write something like:

# this code exists in your library "specarray"
class SpecArray(object):
    def __init__(self, xray_obj):
        self.obj = xray_obj

    def fft(self):
        ...
        return freq, transformed_obj

xray.register_accessor('spec', SpecArray)

# this is what user code looks like
import specarray
import xray
ds = xray.DataArray(...)
ds.spec.fft()  # calls the SpecArray.fft method

This might be easier than maintaining a full subclass, which tends to require a lot of work and presents backwards compatibility issues when we update internal methods.
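
To give a sense of how this could be wired up internally, here is a very rough sketch; none of this exists in xray yet, and _CachedAccessor and the register_accessor signature are just one possible design:

import xray

class _CachedAccessor(object):
    """Descriptor that builds the accessor lazily and caches it per object."""
    def __init__(self, name, accessor_cls):
        self._name = name
        self._accessor_cls = accessor_cls

    def __get__(self, obj, cls):
        if obj is None:
            # accessed on the class itself, e.g. xray.DataArray.spec
            return self._accessor_cls
        accessor = self._accessor_cls(obj)
        # cache on the instance so the accessor is only constructed once
        obj.__dict__[self._name] = accessor
        return accessor

def register_accessor(name, accessor_cls):
    """Make accessor_cls available as attribute `name` on DataArray and Dataset."""
    for target in (xray.DataArray, xray.Dataset):
        setattr(target, name, _CachedAccessor(name, accessor_cls))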

@rafa-guedes
Contributor Author

Cool, thanks @shoyer. Yes @rabernat, I totally agree with you and would be very keen to collaborate on a library like that; I think it would be useful for many people.

@shoyer shoyer changed the title New SpecArray object that inherits from DataArray Subclassing Dataset and DataArray Mar 2, 2016
@fmaussion
Member

I find @shoyer's suggestion about custom accessor attributes very interesting!

The simplest of my use cases would be quite easy to implement:

# MyLib
class MyLibGis(object):
    def __init__(self, xray_obj):
        self.obj = xray_obj
        self.georef = read_georef(xray_obj)

    def subset(self, shapefile=None, roi=None):
        """Return a subset of DataSet (or DataArray)"""
        # compute regions of interests
        slicex, slicey = self.georef(stuff...)
        # return a sel of DataSet
        return self.obj.sel(x=slicex, y=slicey)

xray.register_accessor('gis', MyLibGis)

# user code
import mylib
import xray
ds = xray.DataArray(...)
ds = ds.gis.subset(shapefile='/path/to/shape') 

This would already be quite cool! But would the mechanism allow passing arguments to the MyLibGis class at construction time? This might also be wordy, maybe something like

ds = xray.DataArray(data, gis={'arg1':42})?

I guess that with these two mechanisms, I would be able to do almost everything I want to do with my netCDF files.

However, one other very important use case for me would be to add lazy "diagnostic" variables to a netCDF dataset. For example, if an atmospheric model output file contains the variables P and PB, then the dataset automatically provides a new variable TP, which is the sum of P and PB. From the user's perspective, this variable is no different from a variable on file. Of course, the data should be computed only on demand. It doesn't seem possible to do this without subclassing, but maybe I missed something?

@shoyer
Member

shoyer commented Mar 4, 2016

This would already be quite cool! But would the mechanism allow passing arguments to the MyLibGis class at construction time? This might also be wordy, maybe something like ds = xray.DataArray(data, gis={'arg1':42})?

My suggested approach here would be to simply write functions instead, e.g.,

import xarray as xr

def make_gis_array(data, gis=None):
    data = xr.DataArray(data)
    data.attrs['gis'] = gis  # or whatever
    return data

This is similar to how I would suggest inserting lazy variables, i.e., write your own functions using dask.array:

def add_lazy_vars(data):
    if 'P' in data and 'PB' in data:
        data['TP'] = data['P'].chunk() + data['PB'].chunk()
    return data
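
Used on a file-backed dataset, that would look something like this (the file name is just a placeholder):

import xray

# 'model_output.nc' is a placeholder for your model output file
ds = add_lazy_vars(xray.open_dataset('model_output.nc'))
tp = ds['TP']           # still lazy (dask-backed), nothing computed yet
tp_values = tp.values   # the sum P + PB is evaluated here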

@fmaussion
Member

Thanks, this looks very good. Any timeline for the xarray.register_accessor() functionality? ;)

@lesommer

@shoyer: the approach you propose for registering additional methods on Datasets or DataArrays would certainly open up very nice applications for xarray. This is, for instance, something that would be very useful for the library we have discussed here (see e.g. this issue about oocgcm). Is there a way I could contribute to making this register functionality available in xarray?

@rabernat: your idea of a spectral analysis package on top of xarray is interesting. I am happy to contribute to this (probably within the framework of the library mentioned above?). Like many others, I guess, I have my own script for this (here), but having a more robust and shared code base is certainly a good way to go.

Julien

@lesommer

@shoyer oops, just found that the new functionality has already been merged. Thanks.
