Skip to content

Rework DataArray internals #648

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 1 commit into from
Dec 4, 2015
Merged

Conversation

shoyer
Copy link
Member

@shoyer shoyer commented Nov 9, 2015

Fixes #367
Fixes #634
Fixes #649

The internal data model used by DataArray has been rewritten to fix several outstanding issues (#367, #634 and this stackoverflow report). Namely, if a DataArray has the same name as one of its coordinates, the array and the coordinate no longer share the same data.

This means that creating a DataArray with the same name as one of its dimensions no longer automatically uses that array to label the corresponding coordinate. You will now need to provide coordinate labels explicitly. Here's the old behavior:

In [2]: xray.DataArray([4, 5, 6], dims='x', name='x')
Out[2]:
<xray.DataArray 'x' (x: 3)>
array([4, 5, 6])
Coordinates:
  * x        (x) int64 4 5 6

and the new behavior (compare the values of the x coordinate):

In [2]: xray.DataArray([4, 5, 6], dims='x', name='x')
Out[2]:
<xray.DataArray 'x' (x: 3)>
array([4, 5, 6])
Coordinates:
  * x        (x) int64 0 1 2

It's also no longer possible to convert a DataArray to a Dataset with DataArray.to_dataset if it is unnamed. This will now raise ValueError. If the array is unnamed, you need to supply the name argument.

@jhamman
Copy link
Member

jhamman commented Nov 10, 2015

@shoyer - I read this mainly trying to get a better idea of the internal DataArray data model. The code itself looks great. My main two comments on the refactor are:

  1. The use of the _ThisArray singleton seems a bit clunky to me. This is my first time interacting with this approach so maybe it is the best way, but I don't know much about it.
  2. This all seems to be complicated by the DataArrays's use of the Dataset internally to store coordinates. Not for this PR, but I wonder if there is a way to clear up the lines between the two objects.

All in all, impressive work.

@shoyer
Copy link
Member Author

shoyer commented Nov 10, 2015

Indeed, I wonder if it would make sense to decouple DataArray from Dataset by storing the state on two (protected) attributes:

  • variable: the variable for this array
  • coords: an ordered dict of coordinates

The main downside is that we add a bit more redundant code (e.g., to loop over all variables in .sel). But on the plus side our data model is much more similar to the public API, which would probably make things easier to understand, especially for new contributors -- they don't have to learn Dataset to learn DataArray.

@max-sixty
Copy link
Collaborator

Indeed, I wonder if it would make sense to decouple DataArray from Dataset by storing the state on two (protected) attributes:

As a newbie, 👍. I took some time to figure out why a DataArray contained a Dataset.

The main downside is that we add a bit more redundant code (e.g., to loop over all variables in .sel).

Low confidence, but you could have a common ancestor (XRayContainer, etc) which contained the similar code or a ._stuff attribute that's iterated over

@@ -233,11 +242,13 @@ def subset(dim, label):
del coords[dim]
return Dataset(variables, coords, self.attrs)

def _to_dataset_whole(self, name):
def _to_dataset_whole(self, name=None):
if name is None:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Very minor, but name = name or self.name might be clearer / more pythonic than the repeated if name is None:

@jhamman
Copy link
Member

jhamman commented Nov 10, 2015

@shoyer - do you have a feel for how difficult it would be to go the variable/coords route?

@shoyer
Copy link
Member Author

shoyer commented Nov 10, 2015

@shoyer - do you have a feel for how difficult it would be to go the variable/coords route?

Hmm. Might not be so bad now that I've already gone through the trouble of thinking what these new tests should look like. I'll give it a shot tonight and see how it goes...

@shoyer
Copy link
Member Author

shoyer commented Nov 10, 2015

I realize now that changing the internal representation for DataArray doesn't mean we need to rewrite how every routine works. We can still convert dataarrays to a dataset when convenient -- it just means we'll need to use a method to do so instead of modifying ._dataset. For example, we currently have:

    def copy(self, deep=True):
        ds = self._dataset.copy(deep=deep)
        return self._with_replaced_dataset(ds)

and instead we could simply write:

    def copy(self, deep=True):
        ds = self._to_temp_dataset().copy(deep=deep)
        return self._new_from_temp_dataset(ds)

However, going forward it will give us more flexibility for how to write DataArray methods. For example, it might actually be clearer to write:

    def copy(self, deep=True):
        variable = self.variable.copy(deep=deep)
        coords = OrderedDict((k, v.copy(deep=deep))
                             for k, v in self._coords.items())
        return type(self)(variable, coords, name=name, fastpath=True)

@shoyer
Copy link
Member Author

shoyer commented Nov 11, 2015

OK, latest commit changes DataArray's internals to rely on _variable and _coords instead of _dataset. There's still a bit more _to_temp_dataset() than ideal, but hopefully the data model makes more sense now...

@shoyer shoyer force-pushed the DataArray-internals branch 2 times, most recently from 0aeea33 to edea054 Compare November 12, 2015 06:46
@shoyer shoyer force-pushed the DataArray-internals branch 2 times, most recently from 0e9b656 to 96eeb13 Compare November 30, 2015 08:34
@shoyer
Copy link
Member Author

shoyer commented Nov 30, 2015

This is ready for review if anyone wants to take another look.

dims = OrderedDict(zip(self.dims, self.shape))
return self._dataset._to_dataframe(dims)
unique_name = '__unique_name_identifier_z98xfz98xugfg73ho__'
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what's going on here?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch -- needed explanation. Let me know if the comments I added help.

@jhamman
Copy link
Member

jhamman commented Dec 2, 2015

@shoyer - I don't have any more inline comments. There is one failing test and there are merge conflicts, once those are addressed, I'll take one more brief look.

@shoyer shoyer force-pushed the DataArray-internals branch from 5325ff9 to c00a72b Compare December 4, 2015 01:22
Fixes GH367
Fixes GH634

The internal data model used by :py:class:`~xray.DataArray` has been
rewritten to fix several outstanding issues (:issue:`367`, :issue:`634`,
`this stackoverflow report`_). Internally, ``DataArray`` is now implemented
in terms of ``._variable`` and ``._coords`` attributes instead of holding
variables in a ``Dataset`` object.
@shoyer shoyer force-pushed the DataArray-internals branch from 24b90c3 to f368046 Compare December 4, 2015 01:37
@shoyer
Copy link
Member Author

shoyer commented Dec 4, 2015

Rebased and tests are passing.

@jhamman
Copy link
Member

jhamman commented Dec 4, 2015

lgtm, go ahead and merge.

shoyer added a commit that referenced this pull request Dec 4, 2015
@shoyer shoyer merged commit 2c6a366 into pydata:master Dec 4, 2015
@shoyer shoyer deleted the DataArray-internals branch December 4, 2015 20:40
@max-sixty
Copy link
Collaborator

👏
I'll try and get my PR in this weekend for the pandas wrapping

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

error when using broadcast_arrays with coordinates Unexpected behavior by diff when applied to coordinate DataArray ds['time.time'] is broken
3 participants