Default time encoding of nanoseconds is NOT good. #9154
Thanks for opening your first issue here at xarray! Be sure to follow the issue template!
@ChrisBarker-NOAA could you provide a minimal working example? While there are select circumstances where xarray will do this, I do not think this should be the case in general.
@ChrisBarker-NOAA In general xarray tries to encode times at the lowest possible resolution, so if a particular resolution is wanted, that one is used. This also works if no resolution is given: xarray tries to calculate the resolution in which all data can be represented. There might have been changes in these parts of the code, or other changes in behaviour due to dependency updates. I second @spencerkclark here, an MCVE would help to debug.
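To make the "calculate the coarsest resolution" idea concrete, here is a rough sketch of how such unit inference could work. This is a hypothetical illustration, not xarray's actual implementation, and `coarsest_unit` is an invented name:

```python
import numpy as np

# Hypothetical sketch: pick the coarsest datetime64 unit that represents
# every value (given as integer nanoseconds) exactly. Not xarray's code.
def coarsest_unit(ns_values: np.ndarray) -> str:
    units = [
        ("D", 86_400_000_000_000),
        ("h", 3_600_000_000_000),
        ("m", 60_000_000_000),
        ("s", 1_000_000_000),
        ("ms", 1_000_000),
        ("us", 1_000),
    ]
    for unit, factor in units:
        # A unit works only if it divides every value with no remainder.
        if np.all(ns_values % factor == 0):
            return unit
    return "ns"

# Times at 00:00 and 06:00 are exactly representable in hours:
t = np.array([0, 6 * 3_600_000_000_000])
print(coarsest_unit(t))  # -> h
```

The key point is that this requires looking at every value, which is cheap for in-memory arrays but not for lazy/chunked ones.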
To add a little more context, @ChrisBarker-NOAA you are right to point out that my comment (#3942 (comment)) ignores considerations such as downstream (non-xarray or even non-Python) use-cases and (possibly) CF compliance, which I think may have been additional (perhaps more fundamental) reasons that xarray did not support encoding units finer than seconds previously (#4183 (comment)). If I were to write it again I would probably not have been as absolute. As @kmuehlbauer notes, for NumPy arrays of times we still try to use the coarsest possible resolution that enables exact encoding. For chunked arrays—and maybe this is what you are running into—there was a recent change in how the default units are chosen.
@spencerkclark: yup, that's probably it. We've been bitten by this a fair bit lately -- always(?) when loading a Dataset from some kind of remote dataset: OPeNDAP, Zarr (via kerchunked data on AWS), or MFDataset, etc. Then the time encoding gets lost, or a new time array is created for one reason or another, and we end up with the nanosecond-based encoding. So yes, chunked. Another issue is that it's all too common for folks to use a float datatype in the time variable, which means you get essentially arbitrary precision. (Would you even get nanoseconds? Not sure, I haven't done the math.) So what to do?
-CHB
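The float-precision concern above is easy to quantify: a float64 count of "seconds since 1970" has only about tenth-of-a-microsecond resolution for present-day dates, so nanoseconds are silently lost. A minimal numpy check:

```python
import numpy as np

# Seconds since 1970-01-01 for 2000-01-01, stored as float64.
t = 946_684_800.0

# Spacing between adjacent float64 values at this magnitude:
# roughly 1.2e-07 seconds -- far coarser than a nanosecond.
print(np.spacing(t))

# Adding one nanosecond is therefore lost entirely:
print(t + 1e-9 == t)  # True
```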
Hmm -- if this is right (from a random unrelated issue on some other project): "Since 2.0.0 [pandas] actually supports timestamp precisions other than nanoseconds." Looks like it does. So maybe there's the solution :-)
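For reference, this is straightforward to verify: since pandas 2.0 a DatetimeIndex can carry a non-nanosecond unit via `as_unit` (assuming pandas >= 2.0 is installed):

```python
import pandas as pd

# pandas >= 2.0: datetime data can use "s", "ms", "us", or "ns" units.
idx = pd.to_datetime(["2000-01-01", "2000-01-02"]).as_unit("s")
print(idx.dtype)  # datetime64[s]
```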
@ChrisBarker-NOAA One reason to use datetime64[ns] internally is that we can keep the int64 representation in memory (and on disk) even in the presence of NaT. The same holds for the encoded data. Please also see #7493 where the new pandas non-nanosecond datetimes are discussed. I did not look deeper into this new behaviour, but it should be possible to make xarray aware of non-nanosecond datetime resolution. I fear this will be a major undertaking, though.
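The NaT-with-int64 point can be seen directly in numpy: NaT is stored as the int64 minimum used as a sentinel, so the in-memory representation stays a plain integer array:

```python
import numpy as np

times = np.array(["2000-01-01", "NaT"], dtype="datetime64[ns]")

# The underlying storage is plain int64; NaT is the int64-minimum sentinel.
raw = times.view("int64")
print(raw[1] == np.iinfo("int64").min)  # True
```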
Which is a good reason to use datetime64 (and 64-bit integers in general) -- but numpy's datetime64 lets you choose what "unit" you want, so datetime64[ms] or datetime64[s] satisfies these same criteria.
Very good reasons not to convert to float -- so yes, but I'm not suggesting that. Frankly, numpy's datetime64 was not all that thoroughly reviewed before it was implemented, and has been trouble ever since :-( -- but that's water under the bridge. As for nanoseconds: I suspect that was chosen by pandas originally because "why not" -- i.e. lots of extra bits to use. But whether xarray continues to use ns internally is another question -- has ANYONE ever asked for ns? But a breaking change is a breaking change, which is not good.
An easier option would probably be to go to datetime64[ms] instead, though somewhere out there is someone actually using nanoseconds that might complain :-( But it may not be THAT major an undertaking -- grepping the code, there are indeed a lot of references, so maybe a lot to change, but mostly boilerplate :-( The first question is -- would this be a welcome change? If so, maybe I can take a look ... Option (1): Change [ns] to [ms] as the static xarray datetime.
So: Option (2) is probably the way to go -- let it be settable -- and then we can decide where to change the default -- e.g. I would suggest using [s], or maybe [ms], for default netcdf decoding, for instance. How good do we think the test coverage is? Finally, option (3): just change the default behavior for encoding netcdf (or other similar formats, e.g. zarr) -- essentially any time you are doing a "units since datetime" encoding, use milliseconds. That would be relatively easy, and I suspect break very little -- and anyone that does need nanoseconds could specify that.
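For concreteness, the "units since datetime" encoding in option (3) is just integer arithmetic against an epoch; with milliseconds it looks like this (a numpy sketch of the CF-style idea, not xarray internals):

```python
import numpy as np

times = np.array(["2000-01-01T00:00:00", "2000-01-01T12:00:00"],
                 dtype="datetime64[ms]")
epoch = np.datetime64("1970-01-01T00:00:00", "ms")

# "milliseconds since 1970-01-01" -- the integers a CF-style encoder
# would store on disk.
encoded = (times - epoch).astype("int64")
print(encoded)  # [946684800000 946728000000]
```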
Thanks @ChrisBarker-NOAA, that's good to know. For this issue, my first thought was something along the lines of your option (3) for chunked arrays. The idea would be to switch to coarser default units for chunked arrays. Indeed, as @kmuehlbauer notes, we are aware of the new flexibility in pandas. I agree that it would be a nontrivial amount of work to enable that flexibility in xarray. There probably is a lot of boilerplate, but I think there are a few things we'll need to think carefully about too. It is likely best to separate that from this issue.
Should we keep this open or move the pandas impl to a new issue?
I'd like to keep it open: There are two (maybe separate) issues here:
Granted, if (1) is done, then (2) could be whatever the user specified (though there would still be a default to decide)... But it will take a while to get (1) done -- so I still advocate for changing the default for the "CF style" time encoding. @spencerkclark: thanks for the link to the pandas discussion -- very enlightening. A key take-away for me: "realize that pandas's plurality use case (if we were truly omniscient) nowadays is in the analysis of data generated by business processes occurring since the advent of computers." That was true then, and may still be true now for pandas -- but it's likely not true for xarray, and in particular not true for folks using the CF-style encoding. Those users would rather have a cftime-compatible encoding, and don't need nanoseconds. (We know they don't, because it's not actually supported anyway!) So I still think that xarray should change its default CF-style encoding to microseconds sooner rather than later, regardless of how (1) proceeds.
For sure, @ChrisBarker-NOAA, I am in agreement that we should switch away from specifying nanosecond units by default. Note that in the meantime xarray uses microseconds by default for chunked arrays.
I'm a bit confused about dask vs chunked vs netcdf or ???. I would assume that you can use dask arrays themselves with the native numpy datetime64, yes? Anyway, my use case is saving an xarray Dataset out to netcdf4 when a time variable does not have units defined -- it has been saving out as (nanoseconds since ...), which then breaks code that tries to read the resulting file (e.g. the cftime python lib). But: Lines 834 to 836 in 20cde77
seems to indicate that it would use microseconds -- did that code change recently? I'll go test now ....
Agreed -- silent data loss is bad. However, that would require actually checking all the values -- which may not be available, yes? Anyway, if it can be checked, then Error or Warning? Warning seems right to me, though people do tend to ignore them, particularly if it's running on some service, rather than interactively. But Error is a problem because then code that has worked fine with all the test data, etc, might then raise when passed some new data. (and if time was converted (or computed) from, e.g. floats, might have spurious sub-microseconds in it.) So: if there's going to be an Error, then it would be good to have a flag that you can set to turn it off:
or some such -- not sure where that flag would go though. Thanks, |
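Checking for exact representability is cheap when the values are in memory: round-trip through the target unit and compare. A hypothetical sketch of the check plus a flag to downgrade the error to a warning (`check_exact` and `allow_truncate` are invented names, not xarray API):

```python
import warnings

import numpy as np

def check_exact(values: np.ndarray, unit: str = "us",
                allow_truncate: bool = True) -> None:
    """Warn (or raise) if encoding `values` at `unit` would lose precision.

    Hypothetical sketch -- `allow_truncate` is an invented flag.
    """
    # Truncate to the target unit, then convert back and compare.
    roundtripped = values.astype(f"datetime64[{unit}]").astype(values.dtype)
    if not np.array_equal(roundtripped, values):
        msg = f"times are not exactly representable in '{unit}'"
        if allow_truncate:
            warnings.warn(msg)
        else:
            raise ValueError(msg)

t = np.array(["2000-01-01T00:00:00.000000001"], dtype="datetime64[ns]")
check_exact(t, "us")  # warns: sub-microsecond values would be truncated
```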
OK -- did some more experimenting, and I think I now get why this is about chunked data -- if not chunked, xarray seems to examine the values and choose a reasonable encoding -- nice! Thanks to whoever wrote that code! But I presume with chunked arrays you can't examine all the values at once, so can't follow that precedent.
I'm trying to remember where we had this problem, because that was not our experience. I'll have to poke about some more, but I think it may have been:
My proposal: Define a flag (somewhere -- not sure where) along the lines of: when True, microseconds would be the minimum unit used; when False, use the current behavior. I would prefer True be the default, but if we think we'll break folks' code, then False is OK. At least we'll have an easy fix when this comes up.
While we indeed cannot infer the appropriate units to choose without examining the entire variable, we should be able to determine whether the a priori chosen units allow for exact encoding of each chunk independently and warn or raise accordingly within each subtask.
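That per-chunk check doesn't need the whole array at once; each subtask can validate its own chunk against the a priori units independently. A sketch with a plain list of numpy arrays standing in for dask blocks (`chunk_fits_unit` is an invented name):

```python
import numpy as np

def chunk_fits_unit(chunk: np.ndarray, unit: str = "us") -> bool:
    # True if every value in this chunk round-trips through `unit` exactly.
    roundtripped = chunk.astype(f"datetime64[{unit}]").astype(chunk.dtype)
    return np.array_equal(roundtripped, chunk)

# Two "chunks": one exactly representable in microseconds, one not.
chunks = [
    np.array(["2000-01-01T00:00:00"], dtype="datetime64[ns]"),
    np.array(["2000-01-01T00:00:00.000000001"], dtype="datetime64[ns]"),
]
print([chunk_fits_unit(c) for c in chunks])  # [True, False]
```

Each chunk can warn or raise independently, without ever gathering the full array.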
I see—with this workflow I believe it would still be possible to ensure an exact encoding. The other workaround of course is to load / compute all the time-like variables before saving to disk—it is not that common that time arrays in climate datasets require a lot of memory or are computationally expensive to produce, so this generally is not so bad (for many years xarray happened to do this automatically—i.e. it did not lazily encode chunked time-like arrays). Up to you though whether these implicit workarounds are more intuitive than the explicit workaround of manually specifying the time encoding units, which is generally what I do if I know a downstream application is sensitive to the choice, even if the times are in memory :).
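That explicit workaround means setting the time variable's `"units"` encoding before writing. A minimal sketch (assuming xarray and pandas are installed; the dataset here is a toy stand-in):

```python
import pandas as pd
import xarray as xr

ds = xr.Dataset(coords={"time": pd.date_range("2000-01-01", periods=4)})

# Request CF-style seconds rather than whatever default xarray would pick.
ds["time"].encoding["units"] = "seconds since 1970-01-01"
# ds.to_netcdf("out.nc")  # the encoding is applied at write time
```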
OK -- that should be a warning, rather than an error -- it would be very bad for a workflow that has worked for years to suddenly fail due to one wonky value in a dataset :-) Unless there's an "allow truncate to milliseconds" flag that's True by default.
Well, no -- never did that. Maybe there's a way to pass the encoding through, but anyway, we probably could find a better workflow (though setting the encoding isn't too bad, once we know we need to do it) -- but I'd still like to see better defaults. What this conversation has revealed to me is why this hasn't bitten a lot of people much of the time :-) Maybe this? Have a flag: if True by default, microseconds become the minimum encoding unit; if False by default, the current behavior is kept and users can opt in. We could get a bit more complicated and have it be configurable beyond a simple boolean.
The goals here are:
|
What happened?
When you call to_netcdf() on a Dataset without an encoding for the time variable set, you get: "nanoseconds since ...". This is not good, as nanoseconds aren't supported by a lot of downstream software, e.g. cftime, etc. (I'm not sure if it's CF compliant.)
I see that this was changed somewhere after #3942
It is an improvement over integer days, that's for sure, but not the best option.
I understand that the internal pandas datetime64 type is ns -- and this preserves any precision that may be there, but:
"practicality beats purity"
And this is not practical. Frankly, I wonder why pandas chose nanoseconds as the default, but for the common xarray (and netcdf) use cases, ns is completely unneeded precision.
I'd suggest integer milliseconds, or, frankly even seconds.
Alternatively, use, say, floating point hours, which would buy you a lot of precision for small values -- I can't imagine who is using nanoseconds and also cares about years of timespan.
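The floating-point-hours idea can be quantified: float64 hours retain sub-microsecond resolution even over decades, which a quick numpy calculation shows (the 50-year span is just an illustrative choice):

```python
import numpy as np

# Roughly 50 years expressed in hours.
hours_in_50_years = 50 * 365.25 * 24  # 438300.0

# Spacing between adjacent float64 values at that magnitude, in seconds:
resolution_s = np.spacing(hours_in_50_years) * 3600
print(resolution_s)  # ~2.1e-07 s: sub-microsecond, but not nanosecond
```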
What did you expect to happen?
No response
Minimal Complete Verifiable Example
No response
MVCE confirmation
Relevant log output
No response
Anything else we need to know?
No response
Environment