tolerance for alignment #2217
I agree that this would be useful. One option that works currently would be to determine the proper grid (e.g., from one file) and then use the `reindex()` method with `method='nearest'` and a tolerance. To do this systematically in xarray, we would want to update `align()` to support approximate alignment. Ideally, we would do this work upstream in pandas, and utilize it downstream in xarray. Either way, someone will need to figure out and implement the appropriate algorithm to take an approximate union of two sets of points. This could be somewhat tricky when you start to consider sets where some but not all points are within the tolerance of each other.
An alternative approach to fixing this issue would be the long-discussed idea of a "fast path" for open_mfdataset (#1823). In this case, @naomi-henderson knows a priori that the coordinates for these files should be the same, numerical noise notwithstanding. There should be a way to just skip the alignment check completely and override the coordinates with the values from the first file. For example:

```python
xr.align(da1, da2, join='override')
```

This would just check that the shapes of the different coordinates match and then replace the coordinates of the later objects with the coordinates of the first.
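A rough sketch of what such an override could look like in user code (this is not the eventual xarray implementation; `align_override` is a made-up helper):

```python
import xarray as xr

def align_override(*objects):
    """Force every object's indexed coordinates to match the first one."""
    first = objects[0]
    aligned = [first]
    for obj in objects[1:]:
        # only a shape check, no comparison of coordinate values
        for dim in first.indexes:
            if obj.sizes[dim] != first.sizes[dim]:
                raise ValueError("cannot override coordinates: "
                                 "size mismatch on %r" % dim)
        aligned.append(obj.assign_coords(
            **{name: first.indexes[name] for name in first.indexes}))
    return tuple(aligned)
```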
I like this idea! This would certainly be much easier to implement than general-purpose approximate alignment.
I was just pointed to this issue yesterday, and I have an immediate need for this feature in xarray for a work project. I'll take responsibility to implement this feature tomorrow.
To be clear, my use-case would not be solved by
`join='left'` will reindex all arguments to match the coordinates of the first object. In practice, that means that if coordinates differ by floating-point noise, the second object would end up converted to all NaNs.
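A small illustration of that failure mode (values chosen arbitrarily):

```python
import numpy as np
import xarray as xr

da1 = xr.DataArray([1.0, 2.0], coords=[("x", [0.1, 0.2])])
da2 = xr.DataArray([3.0, 4.0], coords=[("x", [0.1 + 1e-12, 0.2])])

a, b = xr.align(da1, da2, join="left")
# b is reindexed onto da1's x; 0.1 + 1e-12 is not an exact match for 0.1,
# so b's first value silently becomes NaN
print(b.values)  # [nan  4.]
```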
I guess another way to do this would be to include
Well, I need this to work for join='outer', so it is gonna happen one way or another... One concept I was toying with today was a distinction between aligning coords (which is what it does now) and aligning bounding boxes.
@shoyer, I am thinking your original intuition was right about needing to improve the Index classes, perhaps to work with an optional epsilon argument to the constructor. How receptive do you think pandas would be to that? And even if they would accept such a feature, we would probably need to implement a bit of it ourselves for situations where older pandas versions are used.
I think a tolerance argument for set-methods like `union()` and `intersection()` would be worth proposing to pandas.
See pandas-dev/pandas#9817 and pandas-dev/pandas#9530 for the relevant pandas issues.
Ok, I see how you implemented it for pandas's reindex. You essentially inserted an inexact filter within `.get_indexer()`. And the `intersection()` and `union()` methods use these, so, in theory, one could pipe a tolerance argument through them (as well as through the other set operations). The work needs to be expanded a bit, though, as `get_indexer_non_unique()` needs the tolerance parameter too, I think.

For xarray, though, I think we can work around backwards compatibility by having Dataset hold specialized subclasses of Index for floating-point data types that would have the needed changes to the Index class. We can have this specialized class have some default tolerance (say `100 * finfo(dtype).resolution`?), and have its methods use the stored tolerance by default, so it should be completely transparent to the end user (hopefully). This way, `xr.open_mfdataset()` would "just work".
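For concreteness, the suggested default would be on this order (a sketch; the factor of 100 is the suggestion above, not an established default):

```python
import numpy as np

def default_tolerance(dtype):
    # resolution is 1e-15 for float64, so this yields ~1e-13
    return 100 * np.finfo(dtype).resolution

print(default_tolerance(np.float64))  # ~1e-13
```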
Again, I think the first big challenge here is writing fast approximate union/intersection algorithms. Then we can figure out how to wire them into the pandas/xarray API :).
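As a starting point, here is a rough sketch of an approximate union over sorted 1-D arrays, assuming the tolerance is below half the smallest spacing (`approx_union` is a hypothetical helper, not from either library):

```python
import numpy as np

def approx_union(a, b, tol):
    """Union of sorted 1-D float arrays, treating values within tol as equal.

    Keeps all of `a`, plus the points of `b` that are farther than `tol`
    from their nearest neighbor in `a`.
    """
    idx = np.searchsorted(a, b)
    left = a[np.clip(idx - 1, 0, len(a) - 1)]
    right = a[np.clip(idx, 0, len(a) - 1)]
    nearest = np.minimum(np.abs(b - left), np.abs(b - right))
    return np.sort(np.concatenate([a, b[nearest > tol]]))

print(approx_union(np.array([0.1, 0.2, 0.3, 0.4]),
                   np.array([0.301, 0.401, 0.501, 0.601]), 0.025))
# [0.1   0.2   0.3   0.4   0.501 0.601]
```

Note that, like the variants discussed below, this keeps the first array's values when two points are within tolerance, so it is not commutative.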
Actually, I disagree. Pandas's set-operation methods are mostly indexer-based. For union and intersection, they have an optimization that dives down into some C code when the Indexes are monotonic, but everywhere else, it all works off of results from `get_indexer()`:

```python
from __future__ import print_function
import warnings

import numpy as np
from pandas import Index
from pandas.indexes.base import is_object_dtype, algos, is_dtype_equal
from pandas.indexes.base import _ensure_index, _concat, _values_from_object, _unsortable_types
from pandas.indexes.numeric import Float64Index


def _choose_tolerance(this, that, tolerance):
    if tolerance is None:
        tolerance = max(this.tolerance, getattr(that, 'tolerance', 0.0))
    return tolerance


class ImpreciseIndex(Float64Index):
    def astype(self, dtype, copy=True):
        return ImpreciseIndex(self.values.astype(dtype=dtype, copy=copy),
                              name=self.name, dtype=dtype)

    @property
    def tolerance(self):
        return self._tolerance

    @tolerance.setter
    def tolerance(self, tolerance):
        self._tolerance = self._convert_tolerance(tolerance)

    def union(self, other, tolerance=None):
        self._assert_can_do_setop(other)
        other = _ensure_index(other)
        if len(other) == 0 or self.equals(other, tolerance=tolerance):
            return self._get_consensus_name(other)
        if len(self) == 0:
            return other._get_consensus_name(self)
        if not is_dtype_equal(self.dtype, other.dtype):
            this = self.astype('O')
            other = other.astype('O')
            return this.union(other, tolerance=tolerance)
        tolerance = _choose_tolerance(self, other, tolerance)
        indexer = self.get_indexer(other, tolerance=tolerance)
        indexer, = (indexer == -1).nonzero()
        if len(indexer) > 0:
            other_diff = algos.take_nd(other._values, indexer,
                                       allow_fill=False)
            result = _concat._concat_compat((self._values, other_diff))
            try:
                self._values[0] < other_diff[0]
            except TypeError as e:
                warnings.warn("%s, sort order is undefined for "
                              "incomparable objects" % e, RuntimeWarning,
                              stacklevel=3)
            else:
                types = frozenset((self.inferred_type,
                                   other.inferred_type))
                if not types & _unsortable_types:
                    result.sort()
        else:
            result = self._values
            try:
                result = np.sort(result)
            except TypeError as e:
                warnings.warn("%s, sort order is undefined for "
                              "incomparable objects" % e, RuntimeWarning,
                              stacklevel=3)
        # for subclasses
        return self._wrap_union_result(other, result)

    def equals(self, other, tolerance=None):
        if self.is_(other):
            return True
        if not isinstance(other, Index):
            return False
        if is_object_dtype(self) and not is_object_dtype(other):
            # if other is not object, use other's logic for coercion
            if isinstance(other, ImpreciseIndex):
                return other.equals(self, tolerance=tolerance)
            else:
                return other.equals(self)
        if len(self) != len(other):
            return False
        tolerance = _choose_tolerance(self, other, tolerance)
        diff = np.abs(_values_from_object(self) -
                      _values_from_object(other))
        return np.all(diff < tolerance)

    def intersection(self, other, tolerance=None):
        self._assert_can_do_setop(other)
        other = _ensure_index(other)
        if self.equals(other, tolerance=tolerance):
            return self._get_consensus_name(other)
        if not is_dtype_equal(self.dtype, other.dtype):
            this = self.astype('O')
            other = other.astype('O')
            return this.intersection(other, tolerance=tolerance)
        tolerance = _choose_tolerance(self, other, tolerance)
        try:
            indexer = self.get_indexer(other._values, tolerance=tolerance)
            indexer = indexer.take((indexer != -1).nonzero()[0])
        except Exception:
            # duplicates
            # FIXME: get_indexer_non_unique() doesn't take a tolerance argument
            indexer = Index(self._values).get_indexer_non_unique(
                other._values)[0].unique()
            indexer = indexer[indexer != -1]
        taken = self.take(indexer)
        if self.name != other.name:
            taken.name = None
        return taken

    # TODO: Do I need to re-implement _get_unique_index()?

    def get_loc(self, key, method=None, tolerance=None):
        if tolerance is None:
            tolerance = self.tolerance
        if tolerance > 0 and method is None:
            method = 'nearest'
        return super(ImpreciseIndex, self).get_loc(key, method, tolerance)

    def get_indexer(self, target, method=None, limit=None, tolerance=None):
        if tolerance is None:
            tolerance = self.tolerance
        if tolerance > 0 and method is None:
            method = 'nearest'
        return super(ImpreciseIndex, self).get_indexer(target, method, limit, tolerance)


if __name__ == '__main__':
    a = ImpreciseIndex([0.1, 0.2, 0.3, 0.4])
    a.tolerance = 0.01
    b = ImpreciseIndex([0.301, 0.401, 0.501, 0.601])
    b.tolerance = 0.025
    print(a, b)
    print("a | b :", a.union(b))
    print("a & b :", a.intersection(b))
    print("a.get_indexer(b):", a.get_indexer(b))
    print("b.get_indexer(a):", b.get_indexer(a))
```

Run this to see the results.
This is mostly lifted from the existing pandas `Index` implementation.
@WeatherGod One problem with your definition of tolerance is that it isn't commutative, even if both indexes have the same tolerance:

```python
a = ImpreciseIndex([0.1, 0.2, 0.3, 0.4])
a.tolerance = 0.1
b = ImpreciseIndex([0.301, 0.401, 0.501, 0.601])
b.tolerance = 0.1
print(a.union(b))  # ImpreciseIndex([0.1, 0.2, 0.3, 0.4, 0.501, 0.601], dtype='float64')
print(b.union(a))  # ImpreciseIndex([0.1, 0.2, 0.301, 0.401, 0.501, 0.601], dtype='float64')
```

If you try a little harder, you could even have cases where the result has a different size, e.g.,

```python
a = ImpreciseIndex([1, 2, 3])
a.tolerance = 0.5
b = ImpreciseIndex([1, 1.9, 2.1, 3])
b.tolerance = 0.5
print(a.union(b))  # ImpreciseIndex([1.0, 2.0, 3.0], dtype='float64')
print(b.union(a))  # ImpreciseIndex([1.0, 1.9, 2.1, 3.0], dtype='float64')
```

Maybe these aren't really problems in practice, but it's at least a little strange/surprising.
I am not concerned about the non-commutativeness of the indexer itself. There is no way around that. At some point, you have to choose values, whether it is done by an indexer or done by some particular set operation. As for the different sizes, that happens when the tolerance is greater than half the smallest delta. I figure a final implementation would enforce such a constraint on the tolerance.
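That constraint is easy to state in code; a sketch of the kind of check a final implementation might enforce (`check_tolerance` is a hypothetical helper):

```python
import numpy as np

def check_tolerance(values, tol):
    # Matches are only unambiguous when tol is less than half the
    # smallest spacing between index values.
    min_delta = np.diff(np.sort(np.asarray(values))).min()
    if tol >= 0.5 * min_delta:
        raise ValueError(
            "tolerance %g must be < half the smallest spacing (%g)"
            % (tol, min_delta))
```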
OK, I think I'm convinced. Now it's probably a good time to go back to the pandas issues (or open a new one) with a proposal to add tolerance to Float64Index.
Do we want to dive straight to that? Or would it make more sense to first submit some PRs piping support for a tolerance kwarg through more of the API? Or perhaps we should propose that a "tolerance" attribute should be an optional attribute that methods like `get_indexer()` and such could always check for? Not being a pandas dev, I am not sure how piecemeal we should approach this.

In addition, we are likely going to have to implement a decent chunk of code ourselves for compatibility's sake, I think.
Both of these sound reasonable to me, but APIs for pandas are really best discussed in a pandas issue. I'm happy to chime in over there, but I haven't been an active pandas dev recently.
I have created a PR for my work-in-progress: pandas-dev/pandas#22043
Any workaround to this issue?
Yes, on xarray>=0.13.0 you can pass `join='override'`.
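For example (file names and the concatenation dimension here are placeholders):

```python
import xarray as xr

# skip the exact coordinate comparison and take the coordinates
# from the first file
ds = xr.open_mfdataset("myfiles_*.nc", combine="nested",
                       concat_dim="time", join="override")
```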
ah wonderful! I will update to 0.13.0
reopening since we have a PR to fix this properly. |
When using open_mfdataset on files which 'should' share a grid, there is often a small mismatch which results in the grid not aligning properly. This happens frequently when trying to read data from large climate models, from multiple files of the same variable on the same lon/lat grid but covering different time intervals. This silent behavior means that I always have to check the sizes of the lon/lat grids whenever I rely on open_mfdataset to concatenate the data in time.
Here is an example in which I create two 1d DataArrays which have slightly different coordinates:
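A minimal sketch of such a pair of arrays (the names and values are illustrative):

```python
import numpy as np
import xarray as xr

x = np.linspace(0, 1, 5)
da1 = xr.DataArray(np.ones(5), coords=[("x", x)], name="v")
# same grid up to floating point noise
da2 = xr.DataArray(2 * np.ones(5), coords=[("x", x + 1e-10)], name="v")
```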
First I save both DataArrays as netcdf files and then use open_mfdataset to load them:
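Presumably along these lines (paths illustrative; the doubled grid is the behavior described in the report):

```python
da1.to_netcdf("da1.nc")
da2.to_netcdf("da2.nc")
ds = xr.open_mfdataset(["da1.nc", "da2.nc"])
print(ds.dims)  # x now has 10 entries instead of 5
```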
So the x grid is now twice the size. This behavior is the same if I just use align with join='outer':
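That is, something like:

```python
a, b = xr.align(da1, da2, join="outer")
print(a.sizes["x"])  # 10: every noisy value becomes its own coordinate
```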
Request / suggestion

What is needed is a user-specified tolerance level, given to open_mfdataset and passed on to align, that would accept these grids as the same.
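In other words, something like this (the `tolerance` keyword is hypothetical; it does not exist in either function):

```python
# hypothetical API, for illustration only
ds = xr.open_mfdataset("files_*.nc", tolerance=1e-6)
a, b = xr.align(da1, da2, join="outer", tolerance=1e-6)
```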
Possibly related to #2215
thanks, Naomi