Basic multiIndex support and stack/unstack methods #702

shoyer · 2016-01-04T05:48:49Z

Example usage:

In [3]: df = pd.DataFrame({'foo': range(3),
   ...:                    'x': ['a', 'b', 'b'],
   ...:                    'y': [0, 0, 1]})
   ...: 

In [4]: s = df.set_index(['x', 'y'])['foo']

In [5]: arr = xray.DataArray(s, dims='z')

In [6]: arr
Out[6]: 
<xray.DataArray 'foo' (z: 3)>
array([0, 1, 2])
Coordinates:
  * z        (z) object ('a', 0) ('b', 0) ('b', 1)

In [7]: arr.indexes['z']
Out[7]: 
MultiIndex(levels=[[u'a', u'b'], [0, 1]],
           labels=[[0, 1, 1], [0, 0, 1]],
           names=[u'x', u'y'])

In [8]: arr.unstack('z')
Out[8]: 
<xray.DataArray 'foo' (x: 2, y: 2)>
array([[  0.,  nan],
       [  1.,   2.]])
Coordinates:
  * x        (x) object 'a' 'b'
  * y        (y) int64 0 1

In [9]: arr.unstack('z').stack(z=('x', 'y'))
Out[9]: 
<xray.DataArray 'foo' (z: 4)>
array([  0.,  nan,   1.,   2.])
Coordinates:
  * z        (z) object ('a', 0) ('a', 1) ('b', 0) ('b', 1)
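
For readers reproducing the session above on a newer pandas, here is a minimal pandas-only sketch of what the MultiIndex stores. It assumes pandas 0.24 or later, where the `labels` attribute shown in `Out[7]` is exposed as `codes`; the variable names `levels`, `codes` and `entries` are only illustrative.

```python
import pandas as pd

df = pd.DataFrame({"foo": range(3), "x": ["a", "b", "b"], "y": [0, 0, 1]})
s = df.set_index(["x", "y"])["foo"]
index = s.index

# Each level holds the unique values of one index column.
levels = [list(level) for level in index.levels]
# The codes are integer positions into those levels, one array per level.
# (Assumption: pandas >= 0.24; older versions call this attribute `labels`.)
codes = [list(code) for code in index.codes]

# Zipping the codes together recovers the tuples shown in the repr.
entries = [(levels[0][i], levels[1][j]) for i, j in zip(*codes)]
```

Only the three combinations present in the data appear in `entries`, even though the two levels could form four pairs.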

TODO (maybe not necessary yet, but eventually):

Multi-index support working with .loc and .sel()
Multi-dimensional stack/unstack
Serialization to NetCDF
Better repr, showing level names/dtypes?
Make levels accessible as coordinate variables (e.g., ds['time'] can pull out the 'time' level of a multi-index)
Make isel_points/sel_points return objects with a MultiIndex? (probably after the previous TODO, so we can preserve basic backwards compatibility)
Add set_index/reset_index/swaplevel to make it easier to create and manipulate multi-indexes

It would be nice to eventually build a full example showing how stack can be combined with lazy loading / dask to do out-of-core PCA on a large geophysical dataset (e.g., identify El Nino).

cc @MaximilianR @jreback @jhamman

shoyer · 2016-01-08T23:48:31Z

I'd like to get this into the next release in something close to its current state. It's not as full featured as I would eventually like (see checklist above), but it's enough to be useful, and I'd like to get v0.7 (with the new name) out next week.

rabernat · 2016-01-09T14:04:07Z

Big 👍 from me. This seems like a feature with enormous potential.

It would be nice to eventually build a full example showing how stack can be combined with lazy loading / dask to do out-of-core PCA on a large geophysical dataset (e.g., identify El Nino)

I have an example notebook for doing svd on a sea-surface-temperature field which should be pretty easy to adapt to these new methods. (Currently I just switch over to numpy for the actual svd.)
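
A rough sketch of the workflow described here, in plain NumPy. This is not the notebook's code: the field is synthetic random data standing in for sea-surface temperature, and the names `field`, `anomalies` and `leading_pattern` are assumptions made for illustration. The reshape calls play the role that stack and unstack would play on a labeled array.

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic stand-in for a (time, lat, lon) field; not real SST data.
field = rng.standard_normal((24, 4, 5))
ntime, nlat, nlon = field.shape

# "Stack" the two spatial dimensions into one: a (time, space) matrix.
stacked = field.reshape(ntime, nlat * nlon)
# Remove the time mean at every grid point before decomposing.
anomalies = stacked - stacked.mean(axis=0)

# Economy-size SVD: rows of vt are spatial patterns,
# columns of u scaled by s are the matching time series.
u, s, vt = np.linalg.svd(anomalies, full_matrices=False)

# "Unstack" the leading pattern back onto the (lat, lon) grid.
leading_pattern = vt[0].reshape(nlat, nlon)
# Fraction of total variance carried by each mode.
explained = s**2 / np.sum(s**2)
```

With labeled arrays, the stacked dimension would keep its (lat, lon) coordinates, so the final reshape would not need the grid sizes to be tracked by hand.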

shoyer · 2016-01-11T21:11:40Z

Any opinions, even on the API here? I'd like to merge this this week...

rabernat · 2016-01-13T06:47:11Z

I think the api is great. Stack / unstack is a nice way to describe the operation of aggregating coordinates.

jreback · 2016-01-13T13:58:57Z

couple of comments:

I think the repr, though technically accurate, is a bit misleading. lists of tuples is really only useful as a MI, so why not actually indicate that
stack/unstack (as in [9]) is not idempotent, as you are reconstituting the full cartesian product of levels. This seems a bit odd though (pandas can do this because it is separately tracking what is actually in the index, via the labels), I don't think you have this though?
these ops are really analogs of set_index/reset_index, rather than stack/unstack, so might be a bit confusing (though I think I get why you are doing it this way), it makes more sense esp for multi-dim. Maybe explain this in the pandas guide?

shoyer · 2016-01-13T19:31:51Z

@jreback thanks for the comments!

I think the repr, though technically accurate, is a bit misleading. lists of tuples is really only useful as a MI, so why not actually indicate that

Agreed -- this is part of my "better repr" TODO.

stack/unstack (as in [9]) is not idempotent, as you are reconstituting the full cartesian product of levels. This seems a bit odd though (pandas can do this because it is separately tracking what is actually in the index, via the labels), I don't think you have this though?

This is true, and definitely worth noting as a compatibility break. But I do think we have a good reason for this: pandas's stack uses dropna (effectively) to drop unused levels, but this operation cannot be done lazily with dask.array. I am happy to force users to do a non-lazy dropna explicitly.
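
The trade-off can be illustrated with pandas alone. The sketch below is not the code from this PR; it uses an explicit reindex onto the cartesian product of the levels to mimic what the unstack/stack round trip in `[9]` produces, followed by the explicit `dropna` a user would call. The names `full_index`, `full` and `trimmed` are illustrative.

```python
import pandas as pd

df = pd.DataFrame({"foo": range(3), "x": ["a", "b", "b"], "y": [0, 0, 1]})
s = df.set_index(["x", "y"])["foo"]

# The full cartesian product of the two levels has 2 * 2 = 4 entries.
full_index = pd.MultiIndex.from_product(s.index.levels, names=s.index.names)

# Reindexing onto it introduces NaN for the missing ('a', 1) pair,
# which matches the four-element result of the round trip.
full = s.reindex(full_index)

# An explicit, eager dropna recovers the three original entries.
trimmed = full.dropna()
```

The length of `full` is known from the level sizes alone, while the length of `trimmed` depends on the values.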

these ops are really analogs of set_index/reset_index, rather than stack/unstack, so might be a bit confusing (though I think I get why you are doing it this way), it makes more sense esp for multi-dim. Maybe explain this in the pandas guide?

I'm not quite sure what you mean here -- set_index/reset_index seem independent of these to me (though they would definitely also be worth adding!). The difference I see:

set_index: make 1d variables part of a (multi)-index along their existing axis
stack: combine orthogonal indexes (along different axes) into a multi-index
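
A small pandas sketch of the distinction drawn above, under the assumption that a 2-D DataFrame with named row and column indexes is a fair stand-in for two orthogonal dimensions. The names `wide`, `combined` and `stacked` are illustrative, and the stacking is written out by hand rather than calling any library `stack` method.

```python
import numpy as np
import pandas as pd

# set_index: 1-D columns become a MultiIndex along the axis they already share.
df = pd.DataFrame({"foo": range(3), "x": ["a", "b", "b"], "y": [0, 0, 1]})
indexed = df.set_index(["x", "y"])  # still 3 rows

# stack: two independent indexes on different axes are combined into one.
wide = pd.DataFrame(
    np.arange(4).reshape(2, 2),
    index=pd.Index(["a", "b"], name="x"),
    columns=pd.Index([0, 1], name="y"),
)
combined = pd.MultiIndex.from_product([wide.index, wide.columns], names=["x", "y"])
# Row-major flattening matches the order produced by from_product.
stacked = pd.Series(wide.to_numpy().ravel(), index=combined)
```

`indexed` keeps the length of its input, whereas `stacked` has one entry per (x, y) pair.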

jreback · 2016-01-13T20:26:03Z

hmm, is dask.array dropna not implemented? I don't see why it couldn't conceptually be done (though a bit unfamiliar with the impl)

set_index takes 'data' and makes it an 'index', so that is orthogonal. It would make a new Coordinate. reset_index would do the converse.
stack/unstack effectively take existing Coordinates and transform between them.

ok makes sense.

shoyer · 2016-01-13T22:12:06Z

hmm, is dask.array dropna not implemented? I don't see why it couldn't conceptually be done (though a bit unfamiliar with the impl)

We have a dropna in xarray. The problem is that for dask arrays, you need to know the shape of the result. With dropna, you don't know the shape until you've actually done the computation, so it can't be done lazily.
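
A minimal NumPy sketch of why the result shape is data-dependent. It is not the xarray or dask implementation; `dropna_1d` is a hypothetical helper written for this illustration.

```python
import numpy as np


def dropna_1d(values):
    """Drop NaN entries; the output length depends on the values themselves."""
    values = np.asarray(values, dtype=float)
    return values[~np.isnan(values)]


# Two inputs with the same shape and dtype...
a = np.array([0.0, np.nan, 1.0, 2.0])
b = np.array([0.0, 1.0, 2.0, 3.0])

# ...produce outputs whose shapes differ, so the shape cannot be
# worked out from metadata before the values have been computed.
shape_a = dropna_1d(a).shape
shape_b = dropna_1d(b).shape
```

A lazy array needs to record the output shape when the operation is declared, which is the step that cannot be done here.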

jreback · 2016-01-14T02:13:04Z

makes sense about dask.array.dropna

though I think you should dropna if at all possible (or have an option at least)

it IS a bit surprising to get back the full index
not sure how common that will be in practice
esp if u r stacking multiple levels

finally - think about only supporting sequential stacking as it conceptually makes more sense

jhamman · 2016-01-17T07:23:44Z

@shoyer - this is cool. I just breezed through the code and didn't see anything that jumped out at me. The main API comment I have is about the repr which has already been discussed and identified in #719.

shoyer · 2016-01-17T18:41:16Z

Thanks for taking a look. I'm writing some docs on reshaping today, then will merge this and issue the new release / rename if I have time.

astrojuanlu · 2016-06-01T14:15:25Z

The docs say (http://xarray.pydata.org/en/stable/data-structures.html#creating-a-dataarray)

xarray does not (yet!) support labeling coordinate values with a pandas.MultiIndex (see gh-164)

Is that sentence still accurate given this PR?

shoyer · 2016-06-01T16:48:54Z

No, that should be updated. Thanks for pointing it out!

This was referenced Jan 4, 2016

Partial indexing of a Panel pandas-dev/pandas#8906

Closed

Write a new doc page on reshaping / reorganizing data #705

Closed

shoyer mentioned this pull request Jan 13, 2016

Handle non-numpy dtypes without erroring #717

Merged

shoyer added 8 commits January 16, 2016 17:39

Basic support for MultiIndex

f7720ba

Stack xray.Variable dimensions

3564246

Add Variable.unstack

26deef1

Add Dataset.stack and Dataset.unstack

6b6d82e

Add DataArray.stack and .unstack

d8ce68e

add test for lazy stacking with dask

66cb580

Add an example to DataArray.stack

9364449

reindex in .unstack for pandas consistency

da94b4f

shoyer mentioned this pull request Jan 17, 2016

Follow-ups on MultIndex support #719

Closed

7 tasks

shoyer added 2 commits January 16, 2016 17:47

what's new updates

9ad4773

Fix pandas < v0.15.2 and GH700

8e3f188

shoyer added 3 commits January 17, 2016 13:13

Documentation for reshaping data

0b72895

what's new updates for v0.7.0

30400b4

add acknowledgments for v0.7

e034053

shoyer force-pushed the multiindex branch from ea0e7d7 to e034053 Compare January 18, 2016 00:01

shoyer added a commit that referenced this pull request Jan 18, 2016

Merge pull request #702 from shoyer/multiindex

62e74a7

Basic multiIndex support and stack/unstack methods

shoyer merged commit 62e74a7 into pydata:master Jan 18, 2016

shoyer deleted the multiindex branch January 18, 2016 00:11

shoyer mentioned this pull request Feb 18, 2016

MultiIndex and data selection #767

Closed

OXPHOS mentioned this pull request Mar 23, 2016

Pan Deng: Integrating pandas.Panel and xarray Features numfocus/gsoc#127

Merged

4 tasks

lesommer mentioned this pull request May 26, 2016

How to reimplement grid coarsening and statistics-in-boxes in oocgcm ? lesommer/oocgcm#25

Open
