load_cdf helper function #537

bourque · 2024-04-25T20:37:11Z

Change Summary

Overview

Motivated by the comment made by @greglucas here, this PR adds a load_cdf function to cdf.utils so that end users don't necessarily need cdflib to open CDF files. The function is currently pretty bare-bones. Any suggestions on how to expand it or add functionality is welcomed!

New Files

imap_processing/tests/cdf/data/imap_codice_l1a_hskp-20100101_v001.cdf
- A CDF file used for testing

Updated Files

imap_processing/cdf/utils.py
- Added load_cdf function
imap_processing/tests/cdf/test_utils.py
- Added test for load_cdf function. Also added some missing docstrings.

vmartinez-cu

Looks great!

greglucas

I think it is a good idea to have something like this.

It would be nice to see it used when we add it too though, rather than unused code. Are you able to use this in your tests or other instrument's tests so it also gets some use?

imap_processing/tests/cdf/test_utils.py

…within the testing framework, instead of using an "external" CDF file. Also fixed small docstring formatting issue

bourque · 2024-04-26T22:57:25Z

I think it is a good idea to have something like this.

It would be nice to see it used when we add it too though, rather than unused code. Are you able to use this in your tests or other instrument's tests so it also gets some use?

Sorry, I read past your comment on initial lookthrough. I can think of at least one instance under the CoDICE umbrella where I can use the new function. I will take a look around and see if I can spot any others and update this PR accordingly.

…processing into load-cdf-helper-function

bourque · 2024-04-29T18:08:51Z

I think this is ready to go pending final approval from @greglucas

greglucas

This is great! I really appreciate that you started using this everywhere too. I think it is nice to put all of cdflib in one location and then we can update things in just one spot in the future.

imap_processing/cdf/utils.py

greglucas · 2024-05-01T20:59:26Z

imap_processing/tests/cdf/test_utils.py

-    assert calc_start_time(0) == launch_time
-    assert calc_start_time(1) == launch_time + np.timedelta64(1, "s")
+@pytest.fixture()
+def test_dataset():


Nice! I like the re-use of this :)

I'm not sure how feasible/hard this would be, but can you do any assertion on round-tripping? You are testing loading/writing separately, but can you say anything about test_dataset == load_cdf(write_cdf(test_dataset)) Maybe testing it has the same variables or something like that if == doesn't work.

I think I follow what you are saying here but just want to be clear. Are you talking about adding a test like this?

def test_written_and_loaded_dataset(test_dataset): assert test_dataset == load_cdf(write_cdf(test_dataset))

to make sure that nothing i getting changed in the xarray object when passed through the write_cdf() and load_cdf() function?

Yep, exactly! Making sure it is a reversible operation and we are getting the same data in and out.

Gotcha. Unfortunately it is indeed breaking, so good that we checked!

Uh oh! Hopefully not too big of an issue? Does it need to be fixed in cdflib or on our end... there is hardly anything we are doing here in these functions.

I think I have identified that the transformation is happening within cdflib:

In [1]: import cdflib ...: from cdflib import xarray ...: import xarray as xr ...: import numpy as np ...: from imap_processing.cdf.global_attrs import ConstantCoordinates ...: from imap_processing.swe.swe_cdf_attrs import swe_l1a_global_attrs In [2]: filename = 'imap_swe_l1_sci_20100101_v001.cdf' In [3]: dataset = xr.Dataset( ...: { ...: "epoch": ( ...: "epoch", ...: [ ...: np.datetime64("2010-01-01T00:01:01", "ns"), ...: np.datetime64("2010-01-01T00:01:02", "ns"), ...: np.datetime64("2010-01-01T00:01:03", "ns"), ...: ], ...: ), ...: }, ...: attrs=swe_l1a_global_attrs.output() ...: | {"Logical_source": "imap_swe_l1_sci", "Data_version": "001"}, ...: ) ...: dataset["epoch"].attrs = ConstantCoordinates.EPOCH In [4]: dataset Out[4]: <xarray.Dataset> Size: 24B Dimensions: (epoch: 3) Coordinates: * epoch (epoch) datetime64[ns] 24B 2010-01-01T00:01:01 ... 2010-01-01T00... Data variables: *empty* Attributes: (12/15) Project: STP>Solar-Terrestrial Physics Source_name: IMAP>Interstellar Mapping and Acceleration P... Discipline: Solar Physics>Heliospheric Physics Mission_group: IMAP>Interstellar Mapping and Acceleration P... PI_name: Dr. David J. McComas PI_affiliation: ('Princeton Plasma Physics Laboratory', '100... ... ... TEXT: The Solar Wind Electron (SWE) instrument mea... Instrument_type: Particles (space) Logical_file_id: ['FILL ME IN AT FILE CREATION'] Data_type: L1A_SCI>Level-1A Science Data Logical_source: imap_swe_l1_sci Logical_source_description: IMAP Mission SWE Instrument Level-1A Data In [5]: cdflib.xarray.xarray_to_cdf(dataset, filename) In [6]: new_dataset = cdflib.xarray.cdf_to_xarray(filename) In [7]: new_dataset Out[7]: <xarray.Dataset> Size: 24B Dimensions: (record0: 3) Dimensions without coordinates: record0 Data variables: epoch (record0) int64 24B 1262304061000000000 ... 1262304063000000000 Attributes: (12/15) Project: ['STP>Solar-Terrestrial Physics'] Source_name: ['IMAP>Interstellar Mapping and Acceleration... Discipline: ['Solar Physics>Heliospheric Physics'] Mission_group: ['IMAP>Interstellar Mapping and Acceleration... PI_name: ['Dr. David J. McComas'] PI_affiliation: ['Princeton Plasma Physics Laboratory', '100... ... ... TEXT: ['The Solar Wind Electron (SWE) instrument m... Instrument_type: ['Particles (space)'] Logical_file_id: ['FILL ME IN AT FILE CREATION'] Data_type: ['L1A_SCI>Level-1A Science Data'] Logical_source: ['imap_swe_l1_sci'] Logical_source_description: ['IMAP Mission SWE Instrument Level-1A Data']

ping @bryan-harter I think this is a bug within cdflib, but maybe it isn't so much a bug as a limitation of CDF? Is there any way to keep the coordinates when loading the cdf from disk into an xarray dataset?

So the way cdf_to_xarray tells what the dimensions are is only through what the other variables in the file are pointing at (i.e. with a DEPEND_0/1/2/3/etc).

If there are no other variables in the file, then cdf_to_xarray doesn't consider "epoch" a dimension. So its more of a limitation of CDF I suppose. One way we can get around this is to add a DEPEND_0 to the epoch variable that just points to itself. So if you add the following, the code will identify the dimension correctly:

dataset["epoch"].attrs['DEPEND_0'] = 'epoch'

Alternatively you can just add some dummy variable, like:

dataset = xr.Dataset( { "epoch": ( "epoch", [ np.datetime64("2010-01-01T00:01:01", "ns"), np.datetime64("2010-01-01T00:01:02", "ns"), np.datetime64("2010-01-01T00:01:03", "ns"), ], ), "hello": ( "hello", [ 1, 2, 3, ], ), }, attrs=swe_l1a_global_attrs.output() | {"Logical_source": "imap_swe_l1_sci", "Data_version": "001"}, ) dataset["epoch"].attrs = ConstantCoordinates.EPOCH dataset["hello"].attrs = swe_metadata_attrs.output()

In this case, since the variable "hello" points to "epoch", then cdf_to_xarray correctly identifies "epoch" as a dimension

I guess on the cdflib side, I can try to make changes so that 1D "support_variables" are always their own dimension? I can try taking a look at that and see if that fundamentally breaks anything. But it handles time variables slightly differently than the other DEPENDs.

@bryan-harter Thanks for the info on that. Your simple fix of adding the DEPEND_0 appears to be working as intended!

One other thing that seems to be changing is that cdflib is writing and/or loading the CDF attributes into lists, i.e.:

Attributes: (12/15) Project: STP>Solar-Terrestrial Physics Source_name: IMAP>Interstellar Mapping and Acceleration P... Discipline: Solar Physics>Heliospheric Physics Mission_group: IMAP>Interstellar Mapping and Acceleration P... PI_name: Dr. David J. McComas PI_affiliation: ('Princeton Plasma Physics Laboratory', '100... ... ... TEXT: The Solar Wind Electron (SWE) instrument mea... Instrument_type: Particles (space) Logical_file_id: ['FILL ME IN AT FILE CREATION'] Data_type: L1A_SCI>Level-1A Science Data Logical_source: imap_swe_l1_sci Logical_source_description: IMAP Mission SWE Instrument Level-1A Data

gets turned into

Attributes: (12/15) Project: ['STP>Solar-Terrestrial Physics'] Source_name: ['IMAP>Interstellar Mapping and Acceleration... Discipline: ['Solar Physics>Heliospheric Physics'] Mission_group: ['IMAP>Interstellar Mapping and Acceleration... PI_name: ['Dr. David J. McComas'] PI_affiliation: ['Princeton Plasma Physics Laboratory', '100... ... ... TEXT: ['The Solar Wind Electron (SWE) instrument m... Instrument_type: ['Particles (space)'] Logical_file_id: ['FILL ME IN AT FILE CREATION'] Data_type: ['L1A_SCI>Level-1A Science Data'] Logical_source: ['imap_swe_l1_sci'] Logical_source_description: ['IMAP Mission SWE Instrument Level-1A Data']

Do you know if this is expected / purposeful?

tech3371 · 2024-05-03T22:09:17Z

imap_processing/cdf/utils.py

+    xr.Dataset
+        The ``xarray`` dataset for the CDF file
+    """
+    return cdf_to_xarray(file_path, kwargs)


I may have found a bug with this cdf_to_xarray. When we read cdf file using this, it returns global attrs as dictionary but all the values are returned as list which is ok if we are only reading the cdf and not using it for subsequent data level. But I want to carry l1a attrs for l1b with minor updates. I am sure others may want to use similar manner. If so, I wonder if we can extend this function to convert global attrs values back into strings instead of list here or update on cdflib. @bryan-harter may know which way is better. This is a comment. We don’t have to add this in this PR. We can extend this in upcoming PR.

Glad you added this common functions!

Yeah, I also noticed this, which also makes it a bit difficult to test. Good idea to put the fix into this function, though I wonder if the fix should be made 'upstream' in cdflib? Interested to hear @bryan-harter thoughts on this.

So by default the cdf reader function just assumes all attributes are lists because they technically are all lists from the CDF's point of view. When writing an attribute to a CDF, you always specify how long the attribute list is you are about to write. So there is technically no way for the CDF to keep information about if an attribute is simply a single item v.s. a list with a single item in it.

One thing I could do is have cdflib assume all lists with only 1 item are returned as not lists. But just be aware that means that if you DO feed in the attribute as a list with one item, this code change would cause the opposite problem and get rid of the list.

I'm open to ideas!

Thanks @bryan-harter!

My gut says that it makes sense to deal with the single-item vs list logic within write_cdf()/load_cdf() and keep cdflib to be "closer" to the CDF requirements / more generalized so that issues doesn't arise for other developers 🤷🏻

Thank you guys for already tackling this in this PR!

…licable

bourque · 2024-05-06T20:36:06Z

Ok, I got this working by adding these lines of code to load_cdf():

    # cdf_to_xarray converts single-value attributes to lists
    # convert these back to single values where applicable
    for attribute in dataset.attrs:
        value = dataset.attrs[attribute]
        if isinstance(value, list) and len(value) == 1:
            dataset.attrs[attribute] = value[0]

bourque added 3 commits April 25, 2024 14:21

Added function to load CDF file

3703611

Added missing docstrings

d8cb401

Added CDF file for testing

64b85f1

bourque requested review from sdhoyt, greglucas, subagonsouth, tech3371, vmartinez-cu, bryan-harter, laspsandoval and maxinelasp April 25, 2024 20:37

bourque self-assigned this Apr 25, 2024

bourque added the CDF Related to CDF files label Apr 25, 2024

Fixed issue with path to test data

80c6fb8

vmartinez-cu approved these changes Apr 25, 2024

View reviewed changes

tech3371 approved these changes Apr 25, 2024

View reviewed changes

greglucas requested changes Apr 26, 2024

View reviewed changes

imap_processing/tests/cdf/test_utils.py Outdated Show resolved Hide resolved

Now using pytest fixture to produce a dataset and writing a CDF file …

bc39c14

…within the testing framework, instead of using an "external" CDF file. Also fixed small docstring formatting issue

bourque requested a review from greglucas April 26, 2024 22:52

bourque added 3 commits April 29, 2024 11:02

Merge branch 'dev' of github.com:IMAP-Science-Operations-Center/imap_…

e713122

…processing into load-cdf-helper-function

Replaced references to cdf_to_xarray with new load_cdf function

513a576

Added kwargs support for load_cdf

c71ecf0

greglucas approved these changes May 1, 2024

View reviewed changes

Updated docstrings

2c331c0

tech3371 reviewed May 3, 2024

View reviewed changes

Added code to convert attribute lists back to single values where app…

6d57ec5

…licable

bourque merged commit de2b79f into IMAP-Science-Operations-Center:dev May 6, 2024
15 checks passed

bourque deleted the load-cdf-helper-function branch May 6, 2024 20:40

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

load_cdf helper function #537

load_cdf helper function #537

bourque commented Apr 25, 2024

vmartinez-cu left a comment

greglucas left a comment

bourque commented Apr 26, 2024

bourque commented Apr 29, 2024

greglucas left a comment

greglucas May 1, 2024

bourque May 1, 2024

greglucas May 1, 2024

bourque May 1, 2024

greglucas May 1, 2024

bourque May 2, 2024 •

edited

Loading

greglucas May 2, 2024

bryan-harter May 2, 2024 •

edited

Loading

bryan-harter May 2, 2024

bourque May 2, 2024

tech3371 May 3, 2024 •

edited

Loading

tech3371 May 3, 2024

bourque May 6, 2024

bryan-harter May 6, 2024 •

edited

Loading

bourque May 6, 2024

tech3371 May 9, 2024

bourque commented May 6, 2024

load_cdf helper function #537

load_cdf helper function #537

Conversation

bourque commented Apr 25, 2024

Change Summary

Overview

New Files

Updated Files

vmartinez-cu left a comment

Choose a reason for hiding this comment

greglucas left a comment

Choose a reason for hiding this comment

bourque commented Apr 26, 2024

bourque commented Apr 29, 2024

greglucas left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

bourque May 2, 2024 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

bryan-harter May 2, 2024 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

tech3371 May 3, 2024 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

bryan-harter May 6, 2024 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

bourque commented May 6, 2024

bourque May 2, 2024 •

edited

Loading

bryan-harter May 2, 2024 •

edited

Loading

tech3371 May 3, 2024 •

edited

Loading

bryan-harter May 6, 2024 •

edited

Loading