Gdp-update #282
Conversation
Not sure why this PR contains old commits?
Yes. So the 6-hourly GDP is a completely different version track, i.e. we don't expect the 6-hourly GDP to follow the hourly one in terms of dataset updates/versions? The 2.01 upstream data URL seems to contain NetCDF files. Is that still a generation in progress, or can we actually test and (if no problems) merge?
Correct, 6-hourly is on a different version track.
I sent an email to @RickLumpkin with @milancurcic copied. He is working on organizing the 2.01/ folder just like the 2.00/ folder, so hold on.
The 2.01/ FTP folder has been organized like 2.00/, so can we proceed?
Do you think it's worth testing the processing just in case? (I won't be able to do it before Monday.) Otherwise, looks good to me!
We should also soon discuss how we want to treat different dataset versions from the
Yes, we need to test.
My proposal is to supply the latest version by default, but optionally allow previous ones as well. Thoughts?
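One possible shape for this proposal, sketched with hypothetical names and placeholder URLs (this is not clouddrift's actual API; `GDP_URLS` and `gdp1h_url` are invented here for illustration):

```python
# Hypothetical sketch of a version-selectable accessor; the mapping,
# default, and URLs below are placeholders, not clouddrift's real values.
GDP_URLS = {
    "2.00": "https://example.org/gdp/2.00/",
    "2.01": "https://example.org/gdp/2.01/",
}
GDP_VERSION = "2.01"  # latest version, served by default


def gdp1h_url(version: str = GDP_VERSION) -> str:
    """Return the data URL for the requested GDP version."""
    try:
        return GDP_URLS[version]
    except KeyError:
        raise ValueError(
            f"Unknown GDP version {version!r}; choose from {sorted(GDP_URLS)}"
        )
```

Callers would then get the latest feed with `gdp1h_url()` and an older one with `gdp1h_url("2.00")`, with a clear error for versions that were never published.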
nc access was not yet supported by the NetCDF-C version that shipped with netCDF4 on PyPI, so Zarr was the only option when I introduced the datasets module.
I've tested it and this fails:

```python
import clouddrift as cd

ra = cd.adapters.gdp1h.to_raggedarray(n_random_id=10)
```

Adding the experimental URL works:

```python
import clouddrift as cd
from clouddrift.adapters.gdp1h import GDP_DATA_URL_EXPERIMENTAL

ra = cd.adapters.gdp1h.to_raggedarray(n_random_id=10, url=GDP_DATA_URL_EXPERIMENTAL)
```
Should they update the metadata and remove the experimental URL if we are going to switch to v2.01?
Do you understand why it fails? @RickLumpkin is now populating the experimental folder with the files for the 2.02 update. So moving forward, the final files for each update/version will be in 2.xx/, and the files for the ongoing update will be in experimental/. The idea is that the clouddrift library will update its code to reflect the data updates.
It's failing because of the
I don't have time to check now, but I guess it's just a matter of
Yes, my last commit fixes that. But down the line there are more issues which I do not understand. I have no time either to pursue this today and will look into it when I can.
I fixed the fix. You missed two more patterns. It works for 2.01 and 2.02 now.
```diff
@@ -547,7 +548,7 @@ def to_raggedarray(
     --------
     Invoke `to_raggedarray` without any arguments to download all drifter data
-    from the 2.00 GDP feed:
+    from the 2.01 GDP feed:
```
Could we make the docstring automatically include the GDP_VERSION variable here?
I don't think that's possible. The doc is also generated from the source code, and it would have to be interpreted to get the value.
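For what it's worth, if the documentation is built by importing the module (as Sphinx autodoc does), one possible workaround is to substitute the constant into `__doc__` at import time. This is a hedged sketch, not the project's actual code; `GDP_VERSION` stands in for the module-level constant discussed above:

```python
GDP_VERSION = "2.01"  # assumed module-level constant


def to_raggedarray():
    """Download drifter data from the {version} GDP feed."""


# Substitute the version once, at import time; doc generators that import
# the module (e.g. Sphinx autodoc) then see the rendered string.
to_raggedarray.__doc__ = to_raggedarray.__doc__.format(version=GDP_VERSION)
```

The trade-off is that every literal brace in the docstring must then be doubled (`{{` / `}}`), which is easy to forget.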
```diff
@@ -79,8 +80,8 @@ def download(
     os.makedirs(tmp_path, exist_ok=True)

     if url == GDP_DATA_URL:
```
This is not useful anymore, but should we keep it to stay aware of possible upstream changes?
I kept it in case the two datasets eventually differ. If we are sure it is not going to change, I would remove it.
PS: there might be a more elegant way of doing it, but this should do it.
I have now tested the generation of ragged arrays for the entire content of the standard folder (2.01/) and of the experimental folder for this PR. I have also tested the creation of ragged arrays for only 100 random IDs in both cases, and it seems that creating two different download folders for these cases is the way to go. The one thing I have noticed is that the coordinate
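The two-download-folders idea could be sketched like this (a hypothetical helper; the folder names and layout are assumptions, not what the adapter actually does):

```python
import os
import tempfile


def gdp_tmp_path(experimental: bool = False) -> str:
    # Hypothetical helper: keep separate cache folders so files from the
    # standard (2.01/) and experimental feeds never overwrite each other.
    name = "gdp1h-experimental" if experimental else "gdp1h"
    path = os.path.join(tempfile.gettempdir(), "clouddrift", name)
    os.makedirs(path, exist_ok=True)
    return path
```

With distinct paths, switching between the standard and experimental URLs never silently reuses files downloaded from the other feed.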
Hmm. It might not be converted at the time of the ragged array creation. Although, if you save it to netCDF and reload it, does it work?
So I think there is no bug.
If you do this, the values are returned as floats. But if you just do the following:
the time values are now in datetime64. You can use this
to convert the time variables after creating the ragged arrays. One possibility would be to add the
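The decoded-vs-raw behavior being discussed can be reproduced with a minimal xarray dataset; the variable names and values here are illustrative, not clouddrift's:

```python
import numpy as np
import xarray as xr

# Build a tiny dataset whose time variable is stored as CF float seconds.
ds = xr.Dataset({"time": ("obs", np.array([0.0, 3600.0, 7200.0]))})
ds["time"].attrs["units"] = "seconds since 1970-01-01 00:00:00"

# decode_cf returns a new dataset with time decoded to datetime64;
# the original dataset keeps plain float seconds.
decoded = xr.decode_cf(ds)

print(ds["time"].dtype)       # float64
print(decoded["time"].dtype)  # datetime64[ns]
```

The same switch is exposed by `xr.open_dataset(..., decode_times=False)` when reloading a saved ragged array, which is presumably what produces the float values described above.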
As @philippemiron hinted, the difference between float and datetime64 times for GDP has to do with data representation after the dataset is loaded. In both scenarios the data in the file is stored as float seconds. So the question is really what the default should be: decode or don't decode. I personally like float seconds because they're simple, but I know many people who prefer datetime objects. Ideally, we should be consistent between the adapters and datasets modules. An important gotcha in the context of clouddrift is that if you pass time in float seconds to
OK, I confirm that I can also reload the ragged array I created, with or without decoding the times! I am not a fan of datetime objects, so my suggestion, if I understand correctly, is to keep time as float seconds when creating the ragged arrays, and to make it consistent across the data adapters and the data accessors if possible.
I am fine with this. I think it is reasonable to expect the user to 1) read the doc and 2) know what they pass as arguments.
If you don't like the
What's holding this up?
* update datasets.gdp1h()
* lint
* move GDP_VERSION
* file pattern change
* fix the fix
* typo
* adjust path with experimental url
* actually I prefer this
* forgot the default value

Co-authored-by: Philippe Miron <philippe.miron@dtn.com>
Modification of the GDP adapter functions to reflect the hourly database update.
To merge once the DAC completely updates the 2.01/ FTP folder.
Should the global variable be moved from gdp.py to gdp1h.py since it applies only to that product?