
Datasets consistency #336

Merged 11 commits into Cloud-Drift:main on Dec 18, 2023

Conversation

@milancurcic (Member) commented on Dec 6, 2023

This PR:

Adapters are not affected. If time needs to be converted at the adapter level, please use clouddrift.datasets._to_seconds_since_epoch().
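For reference, a minimal sketch of what such a conversion could look like; the actual signature and behavior of `_to_seconds_since_epoch()` in clouddrift may differ:

```python
import numpy as np

def to_seconds_since_epoch(times: np.ndarray) -> np.ndarray:
    """Sketch of a datetime64 -> seconds-since-epoch conversion."""
    # Casting to datetime64[ns] and then to int64 yields nanoseconds
    # since 1970-01-01; dividing by 1e9 gives seconds as floats.
    return times.astype("datetime64[ns]").astype("int64") / 1e9
```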

@milancurcic added the bug (Something isn't working) label on Dec 6, 2023
@milancurcic requested a review from selipot on December 6, 2023, 20:18
@milancurcic self-assigned this on Dec 6, 2023
@philippemiron (Contributor) commented on Dec 7, 2023

Couldn't we instead add a decode_times option and pass it through to xarray?

Personally, I can't think of a reason why I would rather have seconds over an array of datetime objects.

@milancurcic (Member, Author)

Because different datasets encode their time differently. For example, GLAD time is encoded as integer minutes since some time in 2012 (perhaps the start of deployment).

It's possible to do it with decode_times as well but it does need more code in a few different places.

After working some time with np.datetime64, I also slightly (60/40) prefer it over seconds since epoch. Seconds are more computationally efficient and make arithmetic easier with some other types, but they're not human readable, which for me is the main downside.
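To illustrate the trade-off (a hedged example, not code from this PR):

```python
import numpy as np

t = np.datetime64("2012-07-20T12:00:00")       # human readable
s = t.astype("datetime64[s]").astype("int64")  # 1342785600 (seconds since epoch)

# Arithmetic works either way, but only datetime64 prints legibly:
print(t + np.timedelta64(90, "m"))  # 2012-07-20T13:30:00
print(s + 90 * 60)                  # 1342791000
```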

@philippemiron (Contributor)

I see the issue.

One thing to also weigh in: if someone is used to working with one of those datasets that encodes time in days, I'm not sure it's ideal for us to ingest it and then convert the time to seconds.

My preferences would be:

  • we add standard attributes (x units since y) to the time variable;
  • we support decode_times=True/False, which would return either datetime objects or the original units of the dataset (see the sketch after this list);
  • we have a single function for datetime -> seconds that we can use if needed.
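A minimal sketch of that interface, assuming a hypothetical GLAD accessor and file path (neither is the merged implementation):

```python
import xarray as xr

def glad(decode_times: bool = True) -> xr.Dataset:
    # With decode_times=True, xarray uses the CF "units" attribute
    # (e.g. "minutes since 2012-06-01 00:00:00") to build datetime64 values;
    # with decode_times=False, time stays in the original numeric units.
    return xr.open_dataset("glad.nc", decode_times=decode_times)
```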

@milancurcic (Member, Author)

I tend to agree. What do you think about this approach:

  • Each dataset function gets a decode_times=True keyword parameter; we default to times as np.datetime64[ns].
  • decode_times=False converts the time to seconds since epoch (behavior introduced in this PR).

I'm only unsure about the keyword parameter name. decode_times is nice because it's consistent with Xarray, and we return an Xarray Dataset. However, decode_times=False implies that the time should be in whatever the original units and reference of the dataset are (e.g. minutes since X in GLAD, seconds since epoch in GDP). But I think we should be consistent across datasets in what time units we return here, i.e. always in seconds since epoch. If we do that, then this parameter shouldn't be called decode_times, but perhaps seconds_since_epoch with a default value of False.
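To make the naming question concrete, here are hedged sketches of the two candidate signatures (function names, paths, and bodies are illustrative only, not the merged API):

```python
import xarray as xr

# Option A: mirror xarray's keyword; undecoded times keep per-dataset units.
def gdp1h(decode_times: bool = True) -> xr.Dataset:
    return xr.open_dataset("gdp1h.nc", decode_times=decode_times)

# Option B: always normalize undecoded times to seconds since epoch.
def gdp1h_alt(seconds_since_epoch: bool = False) -> xr.Dataset:
    ds = xr.open_dataset("gdp1h.nc")  # times decoded to datetime64[ns]
    if seconds_since_epoch:
        ds["time"] = ds["time"].astype("datetime64[ns]").astype("int64") / 1e9
    return ds
```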

@milancurcic (Member, Author)

I think the questions boil down to:

  1. Default time as np.datetime64 objects or in physical units?
  2. If in physical units, should they be consistent across datasets in seconds since epoch, or should they follow the source encoding of the dataset?

The source encoding of the dataset may be somewhat arbitrary and not necessarily reflect how the data was stored upstream, but rather how it was processed in the adapter. E.g. in the case of GLAD, it's entirely due to how the datetime stamps in text files were ingested into a pandas DataFrame and then converted to an Xarray Dataset.
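For illustration, a toy version of that ingestion path (column names made up): parsing text timestamps with pandas yields datetime64[ns], and that dtype carries through to the Xarray Dataset regardless of the text file's original time encoding.

```python
import pandas as pd

df = pd.DataFrame(
    {"time": pd.to_datetime(["2012-07-20 12:00", "2012-07-20 12:15"]),
     "lat": [28.9, 29.0]}
)
ds = df.to_xarray()
print(ds["time"].dtype)  # datetime64[ns]
```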

@selipot chime in please on the 2 points above.

@selipot (Member) commented on Dec 10, 2023


  1. I prefer physical units such as seconds. I dislike time objects in Python because I never remember how to use them (though GitHub Copilot helps now).
  2. Being consistent across all datasets seems cumbersome and may turn out to be impractical. I am thinking of tectonic plate drift datasets with timescales of millions of years.

For point 1 above, I like the idea of having decode_times=True return time objects (if I understand this to be an option for the functions of the datasets module).

@selipot (Member) commented on Dec 10, 2023

There are merge conflicts due to the version numbers. I probably created them; my apologies.

@milancurcic (Member, Author)

OK, so it sounds like we'll go with

  1. Decode times to np.datetime64[ns] by default, and allow the user to choose to not decode.
  2. If not decoded, let datasets keep their upstream units as they are written in the file.

I think that's a fine approach to move forward with here.
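In usage terms, that decision would look something like this (dataset accessor name assumed for illustration):

```python
import clouddrift

ds = clouddrift.datasets.glad()                        # time as np.datetime64[ns]
ds_raw = clouddrift.datasets.glad(decode_times=False)  # time in upstream units
```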

@selipot (Member) commented on Dec 15, 2023

I think the software version in pyproject.toml is outdated for this one?

@milancurcic (Member, Author)

The version in the main branch is incorrect. This PR bumps from the current 0.28.0 to 0.29.0.

@milancurcic merged commit 6f7d23e into Cloud-Drift:main on Dec 18, 2023; 15 checks passed.
@milancurcic deleted the datasets-consistency branch on December 18, 2023, 17:07.