Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

WIP: don't create indexes on multidimensional dimensions #2405

Closed

Conversation

rabernat
Copy link
Contributor

@rabernat rabernat commented Sep 6, 2018

This is just a start to the solution proposed in #2368. A surprisingly small number of tests broke in my local environment.

@@ -272,7 +272,8 @@ def summarize_datavar(name, var, col_width):


def summarize_coord(name, var, col_width):
is_index = name in var.dims
from .variable import IndexVariable
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I tried moving this import to the top of formatting.py but, for reasons I can't understand, I got this error

    from .variable import IndexVariable
E   ImportError: cannot import name 'IndexVariable

Why does it work in the function but not at the top?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think there is a circular import going on here, as .variable does import .common which does import .formatting Perhaps adding a self.is_index attribute to the Variable class could be a way to avoid this import?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, we have a circular import here. We could potentially refactor to avoid but I think this is also an OK work-around.

@rabernat
Copy link
Contributor Author

rabernat commented Sep 7, 2018

Seeking suggestions about what else needs to be tested here.

@shoyer
Copy link
Member

shoyer commented Sep 10, 2018

I turned up a few lines that probably need to be fixed, just by grepping for in\ \w+\.dims:

if name not in variable.dims:
# no need to protect IndexVariable objects

# don't include indexes in priority_vars, because we didn't align
# first
priority_vars = OrderedDict(
kv for kv in self.variables.items() if kv[0] not in self.dims)
variables = merge_coords_for_inplace_math(
[self.variables, other.variables], priority_vars=priority_vars)

return [self.coords, {d: self.coords[d] for d in self.dims},

It looks like this change broke align somehow -- possibly we have a bug where some indexes were not being created as IndexVariable objects?

@shoyer
Copy link
Member

shoyer commented Sep 10, 2018

It would be good to add a tests for these variables in:

  • The data_vars argument to the Dataset constructor (these objects should become data_vars, not coords)
  • The DataArray constructor.
  • Explicitly adding these variables:
    • With Dataset.__setitem__, Dataset.coords.__setitem__ and DataArray.coords.__setitem__
    • When there are no existing variables matching the dimension name, and when a 1D index variable matching the dimension name already exists (the existing index should be deleted/cleared)
  • Converting these objects to pandas with to_dataframe, to_series and to_pandas
  • Concatenating along an existing dimension whose name matches existing variables (e.g., concatenate along x when a variable x with dimensions ('x', 'y') exists)
  • Concatenating along a new dimension whose name matches existing variables (e.g., concatenate along x when a variable x with dimensions ('y',) exists)
  • More generally, test variables where the name matches a dimension but the variable does not include that dimension at all, e.g., a variable 'x' with dimensions ('y',) in a Dataset where x is a dimension.

@rabernat
Copy link
Contributor Author

With 68f170c there are basically no failing tests in the original test suite. I will now work on adding more tests along the lines outlined by @shoyer.

@rabernat rabernat force-pushed the no-indexes-from-multidim-coords branch from 68f170c to 40c8d36 Compare January 19, 2019 15:34
@pep8speaks
Copy link

Hello @rabernat! Thanks for updating the PR.

Line 275:45: E225 missing whitespace around operator

Line 55:77: E225 missing whitespace around operator
Line 55:80: E501 line too long (80 > 79 characters)

Line 68:1: E302 expected 2 blank lines, found 1
Line 251:9: E265 block comment should start with '# '

@rabernat
Copy link
Contributor Author

I had some spare time and started working on this again. I'm scared of how much internal refactoring it requires. For example, this function

def calculate_dimensions(variables):
"""Calculate the dimensions corresponding to a set of variables.
Returns dictionary mapping from dimension names to sizes. Raises ValueError
if any of the dimension sizes conflict.
"""
dims = OrderedDict()
last_used = {}
scalar_vars = set(k for k, v in iteritems(variables) if not v.dims)
for k, var in iteritems(variables):
for dim, size in zip(var.dims, var.shape):
if dim in scalar_vars:
raise ValueError('dimension %r already exists as a scalar '
'variable' % dim)
if dim not in dims:
dims[dim] = size
last_used[dim] = k
elif dims[dim] != size:
raise ValueError('conflicting sizes for dimension %r: '
'length %s on %r and length %s on %r' %
(dim, size, k, dims[dim], last_used[dim]))
return dims

just doesn't make sense any more if dimensions are not guaranteed to be one-dimensional.

I've never touched this part of the code base before. I have no idea how many places there are that make such an assumption.

@shoyer
Copy link
Member

shoyer commented Jan 20, 2019

I think this function mostly makes sense, but we would want to drop the stuff for scalar variables.

Unfortunately I don't know a good way to fix this stuff short of auditing a lot of code manually -- this is a hazard of data model changes. On the plus side, I've also started to do some of this for the explicit index refactor.

One thing that might turn up a few bugs is to try adding such a variable to the create_test_data() helper function which gets used in lots of places.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
5 participants