Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

override units for datetime64/timedelta64 variables to preserve integer dtype #8201

Merged
merged 22 commits into from
Sep 24, 2023

Conversation

kmuehlbauer
Copy link
Contributor

@kmuehlbauer kmuehlbauer commented Sep 18, 2023

@kmuehlbauer
Copy link
Contributor Author

Unfortunately this is more involved. Coming back later.

Copy link
Collaborator

@headtr1ck headtr1ck left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe we want a test for this?

@@ -777,6 +777,11 @@ def encode(self, variable: Variable, name: T_Name = None) -> Variable:
safe_setitem(attrs, "units", units, name=name)
safe_setitem(attrs, "calendar", calendar, name=name)

# remove dtype from encoding to prevent unnecessary casts
# see GH #1064
if "dtype" in encoding:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
if "dtype" in encoding:
encoding.pop("dtype", None)

@headtr1ck
Copy link
Collaborator

Unfortunately this is more involved. Coming back later.

Ah sry, didn't see this. Good luck

@kmuehlbauer kmuehlbauer marked this pull request as ready for review September 19, 2023 08:02
@kmuehlbauer
Copy link
Contributor Author

Unfortunately this is more involved. Coming back later.

Ah sry, didn't see this. Good luck

Thanks, finally, that wasn't that complicated :-)

xarray/tests/test_coding_times.py Show resolved Hide resolved
xarray/coding/times.py Outdated Show resolved Hide resolved
doc/whats-new.rst Outdated Show resolved Hide resolved
kmuehlbauer and others added 4 commits September 19, 2023 15:25
Co-authored-by: Spencer Clark <spencerkclark@gmail.com>
Co-authored-by: Spencer Clark <spencerkclark@gmail.com>
@kmuehlbauer
Copy link
Contributor Author

Thanks a bunch @spencerkclark, I've addressed your suggestions.

@dcherian
Copy link
Contributor

Should we override here or warn the user instead?

xarray/coding/times.py Outdated Show resolved Hide resolved
@kmuehlbauer
Copy link
Contributor Author

Should we override here or warn the user instead?

Quoting from @NotSqrt's comment #1064 (comment):

If the resolution loss can't be fixed automatically, what would be nice in the warning is a link or a summary of what the user has to do to solve the resolution loss!

I'm totally open here. The question is, what takes precedence dtype or units? It doesn't make much sense to give units which contradict dtype (at least for faithful write). So these kind of issues mostly (that's debatable) are not because the user giving problematic encoding parameters, but due to inconsistencies while processing.

@kmuehlbauer
Copy link
Contributor Author

kmuehlbauer commented Sep 20, 2023

To add to my above comment, with #7827 we added a warning when times/timedeltas had to be encoded to float64 because of units not fitting.

In this PR for those cases where times/timedeltas had to be encoded to fit units and encoding["dtype"] != np.float64 and of integer dtype encoding["dtype"] will be dropped to make further cf encoding work.

If needed we could add information to the above mentioned warning that encoding["dtype"] will be overridden or that dtype/units combination does not fit. For that case it might be necessary to move that warning out of encode_cf_datetime.

Which approach should we take?

@kmuehlbauer kmuehlbauer added the plan to merge Final call for comments label Sep 21, 2023
@spencerkclark
Copy link
Member

Yeah, it seems like the two options are:

  • Warn and override (to @dcherian's point, it probably would be useful to include information that the encoding dtype is being overridden in the warning).
  • Raise an informative error if the specified encoding units and dtype are incompatible with each other.

The question is, what takes precedence dtype or units? It doesn't make much sense to give units which contradict dtype (at least for faithful write).

Exactly, I think this is the right way to frame it. I guess the answer is really up to the user. It's hard to know ahead of time what they care about more, so maybe the most conservative approach is to raise? Or do we feel having a default solution (with information on how to address it in other ways) is more helpful?

@NotSqrt
Copy link
Contributor

NotSqrt commented Sep 21, 2023

To give a little context, I initially encountered those problems in this setup:

  • I receive some data (similar to MQTT), and aggregate it in an xarray + serialized as netcdf.
  • the time between each data can be quite random, but it may happen that the first points of the file are separated by a round number of minutes. Which means that the serialization will decide to use 'minutes since REF_TIME'
  • when the subsequent data point is no longer separated by a round number of minutes, but only some seconds, bam, bug, the seconds were lost when aggregated into an existing file.

I don't care about the exact unit, I just don't want to randomly lose some precision on my data ..

@kmuehlbauer
Copy link
Contributor Author

It's hard to know ahead of time what they care about more, so maybe the most conservative approach is to raise? Or do we feel having a default solution (with information on how to address it in other ways) is more helpful?

@spencerkclark @dcherian Yes, it's probably best to raise with a meaningful error message.

@kmuehlbauer kmuehlbauer removed the plan to merge Final call for comments label Sep 21, 2023
@NotSqrt
Copy link
Contributor

NotSqrt commented Sep 21, 2023

It's hard to know ahead of time what they care about more, so maybe the most conservative approach is to raise? Or do we feel having a default solution (with information on how to address it in other ways) is more helpful?

@spencerkclark @dcherian Yes, it's probably best to raise with a meaningful error message.

Having a random exception pop up in production just because the delta between points has changed would have been very surprising.
At the time, I expected the serialization to automatically handle that it had to switch from "minutes since REF_TIME" into "seconds since REF_TIME"

@kmuehlbauer
Copy link
Contributor Author

@NotSqrt Thanks for the additional context.

From your use case you would prefer to automatically change the units and keep int64, right? That seems totally reasonable in this light.

So let's try to find a default solution if dtype and units do not play well with the given times:

  • Case 1. The user does not fiddle with encoding.

    • We would figure that the user works with datasets of similar provenance (dtype, but differing units). If the datasets are created by xarray from scratch, we can figure the same. On encoding we would just use the fitting units, which we've already calculated.
    • Note: We might need to take dtype if given in encoding into account for that, to check if the wanted/needed units can be represented in that dtype (wrt loss of resolution).
  • Case 2. The user set a specific units in encoding, but did not set/change dtype.

    • a. We could preserve the units in any case and use float64 and drop dtype from encoding. This is the current status of this PR.
    • b. We could raise a meaningful error.
  • Case 3. The user set a specific dtype in encoding, but did not set/change units.

    • a. We could change the units if needed, as in case 1.
    • b. We could raise a meaningful error.
  • Case 4. The user set a specific dtype and units in encoding.

    • a. We would raise a meaningful error.

Are there more cases?

@dcherian @spencerkclark Is it possible to check if encoding and which keys were given in .to_netcdf? In those cases I would raise if times can't be encoded with given units in given dtype. In all other cases (no encoding given in .to_netcdf) I would assume case 1.

BTW, this whole mess is another very strong argument for a removal of encoding (making encoding read only).

@dcherian
Copy link
Contributor

There is no way to decide if encoding was set in a file and not modified later.

I guess the answer is really up to the user. It's hard to know ahead of time what they care about more, so maybe the most conservative approach is to raise?

Yes, I agree with this framing.

@kmuehlbauer
Copy link
Contributor Author

@dcherian @spencerkclark I'm a bit biased here because my own workflows have similar issues as @NotSqrt's.

I'm now thinking that my first approach overriding dtype did not work out.

If provided units would require dtype change for faithful serialization why not just fix units, instead of raising? We will warn the user that units will be changed and can keep dtype.

@kmuehlbauer kmuehlbauer changed the title remove dtype from encoding for datetime64/timedelta64 variables override units for datetime64/timedelta64 variables to preserve integer dtype Sep 22, 2023
@kmuehlbauer
Copy link
Contributor Author

The last iteration aligns with my case 1 from above.

This is the flow now:

  1. check if times can be serialized faithfully with given units into int64.
    yes? -> all good, serialize to int64
    no? -> follow 2
  2. check encoding dtype:
    a. None -> units take precedence, just warn, serialize to float64
    b. integer -> warn and override units to preserve integer dtype (and with that nanosecond resolution) -> serialize to int64
    c. else (floating) -> serialize to float64 without warning

We could raise instead of warn in 2b, but I'm not sure that will do any good for automated workflows.

@spencerkclark
Copy link
Member

Thanks @kmuehlbauer. I agree, if we are going to override by default, overriding the units is more palatable than the dtype, since it guarantees round-trip accuracy (in other words if one is staying in the xarray ecosystem, it's fairly clearly the way to go).

While I can come up with hypothetical reasons for why one might want to preserve units encoding over the dtype, I don't think it would be a very common situation. So I'd say I am on board with your current approach, though we should make sure that the warning describes how one can preserve the units instead of the dtype by modifying the encoding if they'd like a different behavior.

@kmuehlbauer
Copy link
Contributor Author

@spencerkclark I've added the needed instructions for achieving different behaviour, plus howto silence these warning. This should be good to go then from my side.

It would be good to get this into the next version alongside the nanosecond fix.

@kmuehlbauer kmuehlbauer added plan to merge Final call for comments and removed needs discussion labels Sep 24, 2023
Copy link
Member

@spencerkclark spencerkclark left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Perfect, yeah, let's get this in. Thanks again @kmuehlbauer!

doc/whats-new.rst Outdated Show resolved Hide resolved
Co-authored-by: Spencer Clark <spencerkclark@gmail.com>
@kmuehlbauer kmuehlbauer enabled auto-merge (squash) September 24, 2023 14:39
@kmuehlbauer kmuehlbauer merged commit a4f80b2 into pydata:main Sep 24, 2023
24 checks passed
@kmuehlbauer kmuehlbauer deleted the drop-time-dtype branch September 24, 2023 15:27
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
plan to merge Final call for comments topic-cftime
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Differences on datetime values appears after writing reindexed variable on netCDF file
5 participants