Basic multiIndex support and stack/unstack methods #702

Merged
merged 13 commits into from
Jan 18, 2016

Conversation

shoyer
Member

@shoyer shoyer commented Jan 4, 2016

Fixes #164, #700

Example usage:

In [3]: df = pd.DataFrame({'foo': range(3),
   ...:                    'x': ['a', 'b', 'b'],
   ...:                    'y': [0, 0, 1]})
   ...: 

In [4]: s = df.set_index(['x', 'y'])['foo']

In [5]: arr = xray.DataArray(s, dims='z')

In [6]: arr
Out[6]: 
<xray.DataArray 'foo' (z: 3)>
array([0, 1, 2])
Coordinates:
  * z        (z) object ('a', 0) ('b', 0) ('b', 1)

In [7]: arr.indexes['z']
Out[7]: 
MultiIndex(levels=[[u'a', u'b'], [0, 1]],
           labels=[[0, 1, 1], [0, 0, 1]],
           names=[u'x', u'y'])

In [8]: arr.unstack('z')
Out[8]: 
<xray.DataArray 'foo' (x: 2, y: 2)>
array([[  0.,  nan],
       [  1.,   2.]])
Coordinates:
  * x        (x) object 'a' 'b'
  * y        (y) int64 0 1

In [9]: arr.unstack('z').stack(z=('x', 'y'))
Out[9]: 
<xray.DataArray 'foo' (z: 4)>
array([  0.,  nan,   1.,   2.])
Coordinates:
  * z        (z) object ('a', 0) ('a', 1) ('b', 0) ('b', 1)

TODO (maybe not necessary yet, but eventually):

  • Multi-index support working with .loc and .sel()
  • Multi-dimensional stack/unstack
  • Serialization to NetCDF
  • Better repr, showing level names/dtypes?
  • Make levels accessible as coordinate variables (e.g., ds['time'] can pull out the 'time' level of a multi-index)
  • Make isel_points/sel_points return objects with a MultiIndex? (probably after the previous TODO, so we can preserve basic backwards compatibility)
  • Add set_index/reset_index/swaplevel to make it easier to create and manipulate multi-indexes

It would be nice to eventually build a full example showing how stack can be combined with lazy loading / dask to do out-of-core PCA on a large geophysical dataset (e.g., identify El Nino).

cc @MaximilianR @jreback @jhamman

@shoyer
Member Author

shoyer commented Jan 8, 2016

I'd like to get this into the next release in something close to its current state. It's not as full featured as I would eventually like (see checklist above), but it's enough to be useful, and I'd like to get v0.7 (with the new name) out next week.

@rabernat
Contributor

rabernat commented Jan 9, 2016

Big 👍 from me. This seems like a feature with enormous potential.

It would be nice to eventually build a full example showing how stack can be combined with lazy loading / dask to do out-of-core PCA on a large geophysical dataset (e.g., identify El Nino)

I have an example notebook for doing svd on a sea-surface-temperature field which should be pretty easy to adapt to these new methods. (Currently I just switch over to numpy for the actual svd.)

@shoyer
Member Author

shoyer commented Jan 11, 2016

Any opinions, even on the API here? I'd like to merge this this week...

@rabernat
Contributor

I think the api is great. Stack / unstack is a nice way to describe the operation of aggregating coordinates.

@jreback

jreback commented Jan 13, 2016

couple of comments:

  • I think the repr, though technically accurate, is a bit misleading. lists of tuples is really only useful as a MI, so why not actually indicate that
  • stack/unstack (as in [9]) is not idempotent, as you are reconstituting the full cartesian product of levels. This seems a bit odd though (pandas can do this because it is separately tracking what is actually in the index, via the labels), I don't think you have this though?
  • these ops are really analogs of set_index/reset_index, rather than stack/unstack, so might be a bit confusing (though I think I get why you are doing it this way), it makes more sense esp for multi-dim. Maybe explain this in the pandas guide?

@shoyer
Member Author

shoyer commented Jan 13, 2016

@jreback thanks for the comments!

I think the repr, though technically accurate, is a bit misleading. lists of tuples is really only useful as a MI, so why not actually indicate that

Agreed -- this is part of my "better repr" TODO.

stack/unstack (as in [9]) is not idempotent, as you are reconstituting the full cartesian product of levels. This seems a bit odd though (pandas can do this because it is separately tracking what is actually in the index, via the labels), I don't think you have this though?

This is true, and definitely worth noting as a compatibility break. But I do think we have a good reason for this: pandas's stack uses dropna (effectively) to drop unused levels, but this operation cannot be done lazily with dask.array. I am happy to force users to do a non-lazy dropna explicitly.
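To make the trade-off concrete, here is a minimal sketch in plain pandas (not the xarray implementation in this PR). It emulates the unstack/stack round trip from [9] by reindexing onto the full cartesian product of the levels, then applies the explicit, eager dropna. The variable names `full_index`, `round_trip`, and `restored` are made up for illustration.

```python
import pandas as pd

# Same toy data as the example at the top of the thread.
df = pd.DataFrame({"foo": range(3), "x": ["a", "b", "b"], "y": [0, 0, 1]})
s = df.set_index(["x", "y"])["foo"]  # 3 entries; the pair ('a', 1) is absent

# Emulate the unstack -> stack round trip: expand to the full cartesian
# product of the levels, filling pairs that were never observed with NaN.
full_index = pd.MultiIndex.from_product(s.index.levels, names=s.index.names)
round_trip = s.reindex(full_index)  # 4 entries, one of them NaN

# The explicit, non-lazy step that recovers only the observed pairs.
restored = round_trip.dropna()

print(round_trip)
print(restored)
```

The values come back as floats because NaN forces an upcast, which matches the output shown in [8] and [9].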

these ops are really analogs of set_index/reset_index, rather than stack/unstack, so might be a bit confusing (though I think I get why you are doing it this way), it makes more sense esp for multi-dim. Maybe explain this in the pandas guide?

I'm not quite sure what you mean here -- set_index/reset_index seem independent of these to me (though they would definitely also be worth adding!). The difference I see:

  • set_index: make 1d variables part of a (multi)-index along their existing axis
  • stack: combine orthogonal indexes (along different axes) into a multi-index
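The distinction can be sketched with pandas as an analogy (this is not xarray code, and the numbers in `table` are invented for the example): set_index keeps the data on its existing axis and only changes how it is labelled, while stack collapses two different axes into one.

```python
import pandas as pd

# set_index: columns that already run along the row axis become index levels.
df = pd.DataFrame({"foo": range(3), "x": ["a", "b", "b"], "y": [0, 0, 1]})
indexed = df.set_index(["x", "y"])["foo"]  # still 3 entries, same axis

# stack: two orthogonal axes (rows 'x', columns 'y') collapse into one.
table = pd.DataFrame(
    [[0, 10], [1, 2]],  # illustrative values only
    index=pd.Index(["a", "b"], name="x"),
    columns=pd.Index([0, 1], name="y"),
)
stacked = table.stack()  # 2 x 2 table -> 4 entries along a single axis

print(indexed.index)
print(stacked.index)
```

Both results carry a two-level MultiIndex, but only the second one changed the number of dimensions of the data.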

@jreback

jreback commented Jan 13, 2016

hmm, is dask.array dropna not implemented? I don't see why it couldn't conceptually be done (though a bit unfamiliar with the impl)

  • set_index takes 'data' and makes it an 'index', so that is orthogonal. It would make a new Coordinate. reset_index would do the converse.
  • stack/unstack effectively take existing Coordinates and transform between them.

ok makes sense.

@shoyer
Member Author

shoyer commented Jan 13, 2016

hmm, is dask.array dropna not implemented? I don't see why it couldn't conceptually be done (though a bit unfamiliar with the impl)

We have a dropna in xarray. The problem is that for dask arrays, you need to know the shape of the result. With dropna, you don't know the shape until you've actually done the computation, so it can't be done lazily.
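A small NumPy sketch of that reasoning (NumPy stands in for dask.array here, and `stacked_shape` is a hypothetical helper, not a function from either library): the shape after stacking follows from the input shape alone, whereas the shape after dropping NaN depends on the values.

```python
import numpy as np


def stacked_shape(shape):
    """Shape after stacking every axis into one; needs only the input shape."""
    return (int(np.prod(shape)),)


a = np.array([[0.0, np.nan], [1.0, 2.0]])
b = np.array([[0.0, 5.0], [1.0, 2.0]])

# Stacking: both inputs are (2, 2), so both results are (4,) whatever they hold.
flat_a = a.reshape(-1)
flat_b = b.reshape(-1)

# Dropping NaN: identical input shapes, but the output shapes differ,
# and they can only be found by inspecting the data.
dropped_a = flat_a[~np.isnan(flat_a)]
dropped_b = flat_b[~np.isnan(flat_b)]

print(stacked_shape(a.shape), dropped_a.shape, dropped_b.shape)
```

A lazy array library has to record output shapes when it builds the task graph, which is before any values are available.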

@jreback

jreback commented Jan 14, 2016

makes sense about dask.array.dropna

though I think you should dropna if at all possible (or have an option at least)

it IS a bit surprising to get back the full index
not sure how common that will be in practice
esp if u r stacking multiple levels

finally - think about only supporting sequential stacking as it conceptually makes more sense

@shoyer shoyer mentioned this pull request Jan 17, 2016
7 tasks
@jhamman
Member

jhamman commented Jan 17, 2016

@shoyer - this is cool. I just breezed through the code and didn't see anything that jumped out at me. The main API comment I have is about the repr which has already been discussed and identified in #719.

@shoyer
Member Author

shoyer commented Jan 17, 2016

Thanks for taking a look. I'm writing some docs on reshaping today, then will merge this and issue the new release / rename if I have time.


shoyer added a commit that referenced this pull request Jan 18, 2016
Basic multiIndex support and stack/unstack methods
@astrojuanlu

The docs say (http://xarray.pydata.org/en/stable/data-structures.html#creating-a-dataarray)

xarray does not (yet!) support labeling coordinate values with a pandas.MultiIndex (see gh-164)

Is that sentence still accurate given this PR?

@shoyer
Member Author

shoyer commented Jun 1, 2016

No, that should be updated. Thanks for pointing it out!
