
Gdp-update #282

Merged
merged 14 commits into Cloud-Drift:main from gdp-update on Oct 4, 2023

Conversation

@selipot (Member) commented Sep 26, 2023

Modification of the GDP adapters function to reflect the hourly database update.

To merge once the DAC completely updates the 2.01/ FTP folder.

Should the global variable be moved from gdp.py to gdp1h.py since it applies only to that product?

@selipot selipot marked this pull request as ready for review September 26, 2023 01:12
@selipot (Member, Author) commented Sep 26, 2023

I'm not sure why this PR contains old commits.

@selipot selipot requested a review from milancurcic September 26, 2023 01:13
@selipot selipot marked this pull request as draft September 26, 2023 01:13
@selipot selipot self-assigned this Sep 26, 2023
@selipot selipot added the arhicved-label-data-adapters Adapters for custom datasets into CloudDrift label Sep 26, 2023
@milancurcic (Member)

> Should the global variable be moved from gdp.py to gdp1h.py since it applies only to that product?

Yes. So the 6-hourly GDP is a completely different version track, i.e. we don't expect the 6-hourly GDP to follow the hourly one in terms of dataset updates/versions?

The 2.01 upstream data URL seems to contain NetCDF files. Is that still a generation in progress, or can we actually test and (if no problems) merge?

@selipot (Member, Author) commented Sep 26, 2023

>> Should the global variable be moved from gdp.py to gdp1h.py since it applies only to that product?
>
> Yes. So the 6-hourly GDP is a completely different version track, i.e. we don't expect the 6-hourly GDP to follow the hourly one in terms of dataset updates/versions?

Correct, 6-hourly is on a different version track.

> The 2.01 upstream data URL seems to contain NetCDF files. Is that still a generation in progress, or can we actually test and (if no problems) merge?

I sent an email to @RickLumpkin with @milancurcic copied. He is working on organizing the 2.01/ folder just like the 2.00/ folder, so hold on.

@selipot (Member, Author) commented Sep 26, 2023

The 2.01/ FTP folder has been organized like 2.00/, so can we proceed?

@milancurcic (Member)

Do you think it's worth it to test processing it just in case? (I won't be able to do it before Monday) Otherwise, looks good to me!

@milancurcic (Member)

We should also soon discuss how we want to treat different dataset versions from the clouddrift.datasets point of view.

@selipot (Member, Author) commented Sep 26, 2023

> Do you think it's worth it to test processing it just in case? (I won't be able to do it before Monday) Otherwise, looks good to me!

Yes, we need to test.

@selipot (Member, Author) commented Sep 26, 2023

> We should also soon discuss how we want to treat different dataset versions from the clouddrift.datasets point of view.

My proposal is to supply the latest version by default but optionally allow access to previous ones.
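
A minimal sketch of what that could look like on the accessor side (the signature, version strings, and URLs below are hypothetical, not the actual clouddrift.datasets API):

import xarray as xr

# Hypothetical sketch: latest version by default, older versions on request.
def gdp1h(version: str = "2.01") -> xr.Dataset:
    urls = {
        "2.00": "<url-of-2.00-store>",  # placeholder URLs
        "2.01": "<url-of-2.01-store>",
    }
    return xr.open_dataset(urls[version], engine="zarr")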

@selipot (Member, Author) commented Sep 26, 2023

For clouddrift.datasets.gdp1h(), should there be an option to open the dataset from either the Zarr store or the NetCDF file?

  • NetCDF dependencies can be problematic in my experience, which favors Zarr
  • the Zarr store has shown issues over low-bandwidth connections, which favors NetCDF

Thoughts?
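
For concreteness, the two access paths being weighed look roughly like this with xarray (the paths below are placeholders, not the actual endpoints):

import xarray as xr

# Zarr: cloud-native chunked access; needs the zarr package but no NetCDF-C.
ds_zarr = xr.open_dataset("<url-of-gdp1h-zarr-store>", engine="zarr")

# NetCDF: a single-file download; relies on the netCDF4/NetCDF-C stack.
ds_nc = xr.open_dataset("gdp1h.nc")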

@milancurcic (Member)

Remote nc (NetCDF) access was not yet supported by the NetCDF-C version that shipped with netCDF4 on PyPI, so Zarr was the only option when I introduced the datasets module.

@philippemiron (Contributor) left a comment

I've tested it and this fails:

import clouddrift as cd
ra = cd.adapters.gdp1h.to_raggedarray(n_random_id=10)

Adding the experimental URL works.

import clouddrift as cd
from clouddrift.adapters.gdp1h import GDP_DATA_URL_EXPERIMENTAL
ra = cd.adapters.gdp1h.to_raggedarray(n_random_id=10, url=GDP_DATA_URL_EXPERIMENTAL)

Should they update the metadata and remove the experimental URL if we are going to switch to v2.01?

@selipot (Member, Author) commented Sep 27, 2023

Do you understand why it fails?

@RickLumpkin is now populating the experimental folder with the files for the 2.02 update. So, moving forward, the final files for each update/version will be in 2.xx/, and the files for the ongoing update will be in experimental/.

The idea is that the clouddrift library will update its code to reflect the data updates.

@philippemiron (Contributor) commented Sep 27, 2023

It's failing (I think) because of the filename_pattern that's in download() and to_raggedarray().

    if url == GDP_DATA_URL:
        pattern = "drifter_hourly_[0-9]*.nc"
        filename_pattern = "drifter_{id}.nc"
    elif url == GDP_DATA_URL_EXPERIMENTAL:
        pattern = "drifter_hourly_[0-9]*.nc"
        filename_pattern = "drifter_hourly_{id}.nc"

I don't have time to check now, but I guess it's just a matter of changing "drifter_{id}.nc" to "drifter_hourly_{id}.nc". And actually, maybe we don't need that condition anymore, since for both v2.01 and v2.02 the patterns are the same.
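
A sketch of the collapsed assignment, assuming both feeds really do keep the same file layout (an assumption from this thread, not verified here):

# Hypothetical simplification: one pattern pair for both the standard and
# experimental URLs, since v2.01 and v2.02 name files identically.
pattern = "drifter_hourly_[0-9]*.nc"
filename_pattern = "drifter_hourly_{id}.nc"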

@selipot (Member, Author) commented Sep 27, 2023

> It's failing (I think) because of the filename_pattern that's in download() and to_raggedarray().
>
>     if url == GDP_DATA_URL:
>         pattern = "drifter_hourly_[0-9]*.nc"
>         filename_pattern = "drifter_{id}.nc"
>     elif url == GDP_DATA_URL_EXPERIMENTAL:
>         pattern = "drifter_hourly_[0-9]*.nc"
>         filename_pattern = "drifter_hourly_{id}.nc"
>
> I don't have time to check now, but I guess it's just a matter of changing "drifter_{id}.nc" to "drifter_hourly_{id}.nc". And actually, maybe we don't need that condition anymore, since for both v2.01 and v2.02 the pattern is the same.

Yes, my last commit fixes that. But further down the line there are more issues that I do not understand. I have no time to pursue this today either, and will look at it when I can.

@philippemiron (Contributor) commented Sep 27, 2023

I fixed the fix. You missed two more patterns. It's working for 2.01 and 2.02 now.

@@ -547,7 +548,7 @@ def to_raggedarray(
 --------

 Invoke `to_raggedarray` without any arguments to download all drifter data
-from the 2.00 GDP feed:
+from the 2.01 GDP feed:
@selipot (Member, Author) commented on the diff:

Could we make the docstring automatically include the GDP_VERSION variable here?

@philippemiron (Contributor) replied:

I don't think that's possible. The docs are also generated from the source code, and it would have to be interpreted to get the value.
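
For reference, a runtime-templating trick sometimes used elsewhere is sketched below; whether it survives this project's doc generation is exactly the concern raised above, so treat it as hypothetical:

GDP_VERSION = "2.01"

def to_raggedarray():
    """Download all drifter data from the {version} GDP feed."""

# Fill the placeholder at import time; doc tools that import the module
# (rather than parsing the source text) would see the substituted value.
to_raggedarray.__doc__ = to_raggedarray.__doc__.format(version=GDP_VERSION)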

@@ -79,8 +80,8 @@ def download(
     os.makedirs(tmp_path, exist_ok=True)

     if url == GDP_DATA_URL:
@selipot (Member, Author) commented on the diff:

This is not useful anymore, but should we keep it to raise awareness of possible upstream changes?

@philippemiron (Contributor) replied:

I kept it in case there is eventually a difference between the two datasets. If we are sure it is not going to change, I would remove it.

@philippemiron (Contributor) commented Sep 28, 2023

PS: there might be a more elegant way of doing it, but this should do it.

@selipot (Member, Author) commented Sep 30, 2023

I have now tested the generation of ragged arrays for the entire contents of the standard folder (2.01/) and of the experimental folder for this PR. I have also tested the creation of ragged arrays for only 100 random IDs in both cases, and it seems that creating two different download folders for these cases is the way to go.
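
A sketch of the per-source cache split being described (the names and paths are hypothetical; see also the linked issue on tmp_path at the bottom of this page):

import os
import tempfile

# Hypothetical: keep standard and experimental downloads in separate
# directories so files from one feed never shadow the other.
GDP_TMP_PATH = os.path.join(tempfile.gettempdir(), "clouddrift", "gdp")
GDP_TMP_PATH_EXPERIMENTAL = os.path.join(
    tempfile.gettempdir(), "clouddrift", "gdp_experimental"
)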

The one thing I have noticed is that the time coordinate is no longer a datetime object as before, but a float variable holding the number of seconds since 1970-01-01. Is this intentional? I'm not sure which I prefer.

@philippemiron (Contributor)

Hmm. It might not be converted at the time of ragged array creation. But if you save it to NetCDF and reload it, does it work?

@philippemiron (Contributor) commented Oct 2, 2023

So I think there is no bug.

import clouddrift as cd
import xarray as xr
ds = cd.adapters.gdp1h.to_raggedarray(n_random_id=10).to_xarray()

If you do this, the time values are returned as floats. But if you just do the following:

ds.to_netcdf("test.nc")
ds2 = xr.open_dataset("test.nc") # this automatically converts to datetime
ds2.time

the time values are now datetime objects.

You can use this:

ds = xr.decode_cf(ds, decode_times=True)

to convert the time variables after creating the ragged arrays.

One possibility would be to add the decode_cf call in our .to_xarray() method.
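
A minimal sketch of that suggestion (the real RaggedArray.to_xarray() signature and internals may differ; _to_xarray_raw is a placeholder for the existing conversion logic):

import xarray as xr

def to_xarray(self, decode_times: bool = False) -> xr.Dataset:
    # Build the dataset as before (placeholder for the existing logic).
    ds = self._to_xarray_raw()
    if decode_times:
        # Convert CF-encoded float seconds to datetime64 on request.
        ds = xr.decode_cf(ds, decode_times=True)
    return ds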

@milancurcic (Member)

As @philippemiron hinted, the difference between float and datetime64 times for GDP has to do with data representation after the dataset is loaded. In both scenarios the data in the file is stored as float seconds.

So the question is really what the default should be (decode or don't decode) for RaggedArray.to_xarray() and for datasets.gdp*h() (it's decoded there by default).

I personally like float seconds because they're simple, but I know many people prefer np.datetime64[ns].

Ideally, we should be consistent between adapters and datasets modules.

An important gotcha in the context of clouddrift is that if you pass time in float seconds to velocity_from_position, you get velocities in m/s. If you pass time as np.datetime64[ns], you get velocities in m/ns. There won't be a warning because both are valid and expected, but the user may be surprised by the result (I was).
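
A standalone numpy illustration of that unit mismatch (this does not call the clouddrift function itself):

import numpy as np

x = np.array([0.0, 1000.0])      # positions in meters, 1 km apart
t_s = np.array([0.0, 3600.0])    # times as float seconds, 1 hour apart
t_ns = np.array(
    ["2023-01-01T00:00:00", "2023-01-01T01:00:00"], dtype="datetime64[ns]"
)

v_s = np.diff(x) / np.diff(t_s)                  # ~0.278, i.e. m/s
dt_ns = np.diff(t_ns) / np.timedelta64(1, "ns")  # elapsed nanoseconds as float
v_ns = np.diff(x) / dt_ns                        # ~2.78e-10, i.e. m/ns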

@selipot (Member, Author) commented Oct 2, 2023

> As @philippemiron hinted, the difference between float and datetime64 times for GDP has to do with data representation after the dataset is loaded. In both scenarios the data in the file is stored as float seconds.
>
> So the question is really what the default should be (decode or don't decode) for RaggedArray.to_xarray() and for datasets.gdp*h() (it's decoded there by default).
>
> I personally like float seconds because they're simple, but I know many people prefer np.datetime64[ns].
>
> Ideally, we should be consistent between adapters and datasets modules.

OK, I confirm that I can also reload the ragged array I created, with or without decoding the times!

I am not a fan of datetime objects, so my suggestion is to keep seconds as floats when creating the ragged arrays, if I understand correctly, and to make it consistent across the data adapters and the data accessors if possible.

> An important gotcha in the context of clouddrift is that if you pass time in float seconds to velocity_from_position, you get velocities in m/s. If you pass time as np.datetime64[ns], you get velocities in m/ns. There won't be a warning because both are valid and expected, but the user may be surprised by the result (I was).

I am fine with this. I think it is reasonable to expect the user to 1) read the docs and 2) know what they pass as arguments.

@philippemiron (Contributor)

If you don't like the datetime objects, you can always do xr.open_dataset("file.nc", decode_times=False) to get the original float values.

@philippemiron (Contributor)

What's holding this up?

@selipot selipot marked this pull request as ready for review October 4, 2023 15:46
@selipot selipot merged commit af51004 into Cloud-Drift:main Oct 4, 2023
12 checks passed
@selipot selipot deleted the gdp-update branch October 4, 2023 15:46
philippemiron pushed a commit to philippemiron/clouddrift that referenced this pull request Nov 16, 2023
* update datasets.gdp1h()

* lint

* move GDP_VERSION

* file pattern change

* fix the fix

* typo

* adjust path with experimental url

* actually I prefer this

* forgot the default value

---------

Co-authored-by: Philippe Miron <philippe.miron@dtn.com>
Labels
arhicved-label-data-adapters Adapters for custom datasets into CloudDrift
Development

Successfully merging this pull request may close these issues.

different tmp_path in adapters.gdp1h for regular and experimental sources
3 participants