
Gdp-update #282

Merged
merged 14 commits into Cloud-Drift:main from gdp-update on Oct 4, 2023

Conversation

@selipot (Member) commented Sep 26, 2023

Modification of the GDP adapters function to reflect the hourly database update.

To merge once the DAC completely updates the 2.01/ FTP folder.

Should the global variable be moved from gdp.py to gdp1h.py since it applies only to that product?

@selipot selipot marked this pull request as ready for review September 26, 2023 01:12
@selipot (Member, Author) commented Sep 26, 2023

I'm not sure why this PR contains old commits.

@selipot selipot requested a review from milancurcic September 26, 2023 01:13
@selipot selipot marked this pull request as draft September 26, 2023 01:13
@selipot selipot self-assigned this Sep 26, 2023
@selipot selipot added the arhicved-label-data-adapters Adapters for custom datasets into CloudDrift label Sep 26, 2023
@milancurcic (Member)

> Should the global variable be moved from gdp.py to gdp1h.py since it applies only to that product?

Yes. So the 6-hourly GDP is a completely different version track, i.e. we don't expect the 6-hourly GDP to follow the hourly one in terms of dataset updates/versions?

The 2.01 upstream data URL seems to contain NetCDF files. Is that still a generation in progress, or can we actually test and (if no problems) merge?

@selipot (Member, Author) commented Sep 26, 2023

>> Should the global variable be moved from gdp.py to gdp1h.py since it applies only to that product?
>
> Yes. So the 6-hourly GDP is a completely different version track, i.e. we don't expect the 6-hourly GDP to follow the hourly one in terms of dataset updates/versions?

Correct, 6-hourly is on a different version track.

> The 2.01 upstream data URL seems to contain NetCDF files. Is that still a generation in progress, or can we actually test and (if no problems) merge?

I sent an email to @RickLumpkin with @milancurcic copied. He is working on organizing the 2.01/ folder just like the 2.00/ folder, so hold on.

@selipot (Member, Author) commented Sep 26, 2023

The 2.01/ FTP folder has been organized like 2.00/, so can we proceed?

@milancurcic (Member)

Do you think it's worth it to test processing it just in case? (I won't be able to do it before Monday) Otherwise, looks good to me!

@milancurcic (Member)

We should also soon discuss how we want to treat different dataset versions from the clouddrift.datasets point of view.

@selipot (Member, Author) commented Sep 26, 2023

> Do you think it's worth it to test processing it just in case? (I won't be able to do it before Monday) Otherwise, looks good to me!

Yes, we need to test.

@selipot (Member, Author) commented Sep 26, 2023

> We should also soon discuss how we want to treat different dataset versions from the clouddrift.datasets point of view.

My proposal is to supply the latest version by default but optionally allow access to previous ones.
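
A minimal sketch of what that could look like on the accessor side (the signature, version strings, and URLs below are hypothetical, not the actual clouddrift.datasets API):

import xarray as xr

# Hypothetical sketch: latest version by default, older versions on request.
def gdp1h(version: str = "2.01") -> xr.Dataset:
    urls = {
        "2.00": "<url-of-2.00-store>",  # placeholder URLs
        "2.01": "<url-of-2.01-store>",
    }
    return xr.open_dataset(urls[version], engine="zarr")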

@selipot (Member, Author) commented Sep 26, 2023

For clouddrift.datasets.gdp1h(), should there be an option to open the dataset from either the Zarr store or the NetCDF file?

  • NetCDF dependencies can be problematic in my experience, which favors Zarr
  • the Zarr store has shown issues over low-bandwidth connections, which favors NetCDF

Thoughts?
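
For concreteness, the two access paths being weighed look roughly like this with xarray (the paths below are placeholders, not the actual endpoints):

import xarray as xr

# Zarr: cloud-native chunked access; needs the zarr package but no NetCDF-C.
ds_zarr = xr.open_dataset("<url-of-gdp1h-zarr-store>", engine="zarr")

# NetCDF: a single-file download; relies on the netCDF4/NetCDF-C stack.
ds_nc = xr.open_dataset("gdp1h.nc")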

@milancurcic (Member)

Remote nc (NetCDF) access was not yet supported by the NetCDF-C version that shipped with netCDF4 on PyPI, so Zarr was the only option when I introduced the datasets module.

@philippemiron (Contributor) left a comment

I've tested it and this fails:

import clouddrift as cd
ra = cd.adapters.gdp1h.to_raggedarray(n_random_id=10)

Adding the experimental URL works.

import clouddrift as cd
from clouddrift.adapters.gdp1h import GDP_DATA_URL_EXPERIMENTAL
ra = cd.adapters.gdp1h.to_raggedarray(n_random_id=10, url=GDP_DATA_URL_EXPERIMENTAL)

Should they update the metadata and remove the experimental URL if we are going to switch to v2.01?

@selipot (Member, Author) commented Sep 27, 2023

Do you understand why it fails?

@RickLumpkin is now populating the experimental folder with the files for the 2.02 update. So, moving forward, the final files for each update/version will be in 2.xx/, and the files for the ongoing update will be in experimental/.

The idea is that the clouddrift library will update its code to reflect the data updates.

@philippemiron (Contributor) commented Sep 27, 2023

It's failing (I think) because of the filename_pattern that's in download() and to_raggedarray().

    if url == GDP_DATA_URL:
        pattern = "drifter_hourly_[0-9]*.nc"
        filename_pattern = "drifter_{id}.nc"
    elif url == GDP_DATA_URL_EXPERIMENTAL:
        pattern = "drifter_hourly_[0-9]*.nc"
        filename_pattern = "drifter_hourly_{id}.nc"

I don't have time to check now, but I guess it's just a matter of changing "drifter_{id}.nc" to "drifter_hourly_{id}.nc". And actually, maybe we don't need that condition anymore, since for both v2.01 and v2.02 the patterns are the same.
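
A sketch of the collapsed assignment, assuming both feeds really do keep the same file layout (an assumption from this thread, not verified here):

# Hypothetical simplification: one pattern pair for both the standard and
# experimental URLs, since v2.01 and v2.02 name files identically.
pattern = "drifter_hourly_[0-9]*.nc"
filename_pattern = "drifter_hourly_{id}.nc"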

@selipot (Member, Author) commented Sep 27, 2023

> It's failing (I think) because of the filename_pattern that's in download() and to_raggedarray().
>
>     if url == GDP_DATA_URL:
>         pattern = "drifter_hourly_[0-9]*.nc"
>         filename_pattern = "drifter_{id}.nc"
>     elif url == GDP_DATA_URL_EXPERIMENTAL:
>         pattern = "drifter_hourly_[0-9]*.nc"
>         filename_pattern = "drifter_hourly_{id}.nc"
>
> I don't have time to check now, but I guess it's just a matter of changing "drifter_{id}.nc" to "drifter_hourly_{id}.nc". And actually, maybe we don't need that condition anymore, since for both v2.01 and v2.02 the pattern is the same.

Yes, my last commit fixes that. But further down the line there are more issues that I do not understand. I have no time to pursue this today either, and will look at it when I can.

@philippemiron (Contributor) commented Sep 27, 2023

I fixed the fix. You missed two more patterns. It's working for 2.01 and 2.02 now.

@@ -547,7 +548,7 @@ def to_raggedarray(
 --------

 Invoke `to_raggedarray` without any arguments to download all drifter data
-from the 2.00 GDP feed:
+from the 2.01 GDP feed:
@selipot (Member, Author) commented on the diff:

Could we make the docstring automatically include the GDP_VERSION variable here?

@philippemiron (Contributor) replied:

I don't think that's possible. The docs are also generated from the source code, and it would have to be interpreted to get the value.
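
For reference, a runtime-templating trick sometimes used elsewhere is sketched below; whether it survives this project's doc generation is exactly the concern raised above, so treat it as hypothetical:

GDP_VERSION = "2.01"

def to_raggedarray():
    """Download all drifter data from the {version} GDP feed."""

# Fill the placeholder at import time; doc tools that import the module
# (rather than parsing the source text) would see the substituted value.
to_raggedarray.__doc__ = to_raggedarray.__doc__.format(version=GDP_VERSION)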

@@ -79,8 +80,8 @@ def download(
     os.makedirs(tmp_path, exist_ok=True)

     if url == GDP_DATA_URL:
@selipot (Member, Author) commented on the diff:

This is not useful anymore, but should we keep it to raise awareness of possible upstream changes?

@philippemiron (Contributor) replied:

I kept it in case there is eventually a difference between the two datasets. If we are sure it is not going to change, I would remove it.

@philippemiron (Contributor) commented Sep 28, 2023

PS: there might be a more elegant way of doing it, but this should do it.

@selipot (Member, Author) commented Sep 30, 2023

I have now tested the generation of ragged arrays for the entire contents of the standard folder (2.01/) and of the experimental folder for this PR. I have also tested the creation of ragged arrays for only 100 random IDs in both cases, and it seems that creating two different download folders for these cases is the way to go.
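
A sketch of the per-source cache split being described (the names and paths are hypothetical; see also the linked issue on tmp_path at the bottom of this page):

import os
import tempfile

# Hypothetical: keep standard and experimental downloads in separate
# directories so files from one feed never shadow the other.
GDP_TMP_PATH = os.path.join(tempfile.gettempdir(), "clouddrift", "gdp")
GDP_TMP_PATH_EXPERIMENTAL = os.path.join(
    tempfile.gettempdir(), "clouddrift", "gdp_experimental"
)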

The one thing I have noticed is that the time coordinate is no longer a datetime object as before, but a float variable holding the number of seconds since 1970-01-01. Is this intentional? I'm not sure which I prefer.

@philippemiron (Contributor)

Hmm. It might not be converted at the time of ragged array creation. But if you save it to NetCDF and reload it, does it work?

@philippemiron (Contributor) commented Oct 2, 2023

So I think there is no bug.

import clouddrift as cd
import xarray as xr
ds = cd.adapters.gdp1h.to_raggedarray(n_random_id=10).to_xarray()

If you do this, the time values are returned as floats. But if you just do the following:

ds.to_netcdf("test.nc")
ds2 = xr.open_dataset("test.nc") # this automatically converts to datetime
ds2.time

the time values are now datetime objects.

You can use this:

ds = xr.decode_cf(ds, decode_times=True)

to convert the time variables after creating the ragged arrays.

One possibility would be to add the decode_cf call in our .to_xarray() method.
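
A minimal sketch of that suggestion (the real RaggedArray.to_xarray() signature and internals may differ; _to_xarray_raw is a placeholder for the existing conversion logic):

import xarray as xr

def to_xarray(self, decode_times: bool = False) -> xr.Dataset:
    # Build the dataset as before (placeholder for the existing logic).
    ds = self._to_xarray_raw()
    if decode_times:
        # Convert CF-encoded float seconds to datetime64 on request.
        ds = xr.decode_cf(ds, decode_times=True)
    return ds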

@milancurcic (Member)

As @philippemiron hinted, the difference between float and datetime64 times for GDP has to do with data representation after the dataset is loaded. In both scenarios the data in the file is stored as float seconds.

So the question is really what the default should be (decode or don't decode) for RaggedArray.to_xarray() and for datasets.gdp*h() (it's decoded there by default).

I personally like float seconds because they're simple, but I know many people prefer np.datetime64[ns].

Ideally, we should be consistent between adapters and datasets modules.

An important gotcha in the context of clouddrift is that if you pass time in float seconds to velocity_from_position, you get velocities in m/s. If you pass time as np.datetime64[ns], you get velocities in m/ns. There won't be a warning because both are valid and expected, but the user may be surprised by the result (I was).
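
A standalone numpy illustration of that unit mismatch (this does not call the clouddrift function itself):

import numpy as np

x = np.array([0.0, 1000.0])      # positions in meters, 1 km apart
t_s = np.array([0.0, 3600.0])    # times as float seconds, 1 hour apart
t_ns = np.array(
    ["2023-01-01T00:00:00", "2023-01-01T01:00:00"], dtype="datetime64[ns]"
)

v_s = np.diff(x) / np.diff(t_s)                  # ~0.278, i.e. m/s
dt_ns = np.diff(t_ns) / np.timedelta64(1, "ns")  # elapsed nanoseconds as float
v_ns = np.diff(x) / dt_ns                        # ~2.78e-10, i.e. m/ns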

@selipot (Member, Author) commented Oct 2, 2023

> As @philippemiron hinted, the difference between float and datetime64 times for GDP has to do with data representation after the dataset is loaded. In both scenarios the data in the file is stored as float seconds.
>
> So the question is really what the default should be (decode or don't decode) for RaggedArray.to_xarray() and for datasets.gdp*h() (it's decoded there by default).
>
> I personally like float seconds because they're simple, but I know many people prefer np.datetime64[ns].
>
> Ideally, we should be consistent between adapters and datasets modules.

OK, I confirm that I can also reload the ragged array I created, with or without decoding the times!

I am not a fan of datetime objects, so my suggestion is to keep seconds as floats when creating the ragged arrays, if I understand correctly, and to make it consistent across the data adapters and the data accessors if possible.

> An important gotcha in the context of clouddrift is that if you pass time in float seconds to velocity_from_position, you get velocities in m/s. If you pass time as np.datetime64[ns], you get velocities in m/ns. There won't be a warning because both are valid and expected, but the user may be surprised by the result (I was).

I am fine with this. I think it is reasonable to expect the user to 1) read the docs and 2) know what they pass as arguments.

@philippemiron (Contributor)

If you don't like the datetime objects, you can always do xr.open_dataset("file.nc", decode_times=False) to get the original float values.

@philippemiron (Contributor)

What's holding this up?

@selipot selipot marked this pull request as ready for review October 4, 2023 15:46
@selipot selipot merged commit af51004 into Cloud-Drift:main Oct 4, 2023
12 checks passed
@selipot selipot deleted the gdp-update branch October 4, 2023 15:46
philippemiron pushed a commit to philippemiron/clouddrift that referenced this pull request Nov 16, 2023
* update datasets.gdp1h()

* lint

* move GDP_VERSION

* file pattern change

* fix the fix

* typo

* adjust path with experimental url

* actually I prefer this

* forgot the default value

---------

Co-authored-by: Philippe Miron <philippe.miron@dtn.com>
Labels
arhicved-label-data-adapters Adapters for custom datasets into CloudDrift
Development

Successfully merging this pull request may close these issues.

different tmp_path in adapters.gdp1h for regular and experimental sources
3 participants