Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

load_cdf helper function #537

Merged

Conversation

bourque
Copy link
Collaborator

@bourque bourque commented Apr 25, 2024

Change Summary

Overview

Motivated by the comment made by @greglucas here, this PR adds a load_cdf function to cdf.utils so that end users don't necessarily need cdflib to open CDF files. The function is currently pretty bare-bones. Any suggestions on how to expand it or add functionality is welcomed!

New Files

  • imap_processing/tests/cdf/data/imap_codice_l1a_hskp-20100101_v001.cdf
    • A CDF file used for testing

Updated Files

  • imap_processing/cdf/utils.py
    • Added load_cdf function
  • imap_processing/tests/cdf/test_utils.py
    • Added test for load_cdf function. Also added some missing docstrings.

@bourque bourque self-assigned this Apr 25, 2024
@bourque bourque added the CDF Related to CDF files label Apr 25, 2024
Copy link
Contributor

@vmartinez-cu vmartinez-cu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks great!

Copy link
Collaborator

@greglucas greglucas left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it is a good idea to have something like this.

It would be nice to see it used when we add it too though, rather than unused code. Are you able to use this in your tests or other instrument's tests so it also gets some use?

imap_processing/tests/cdf/test_utils.py Outdated Show resolved Hide resolved
…within the testing framework, instead of using an "external" CDF file. Also fixed small docstring formatting issue
@bourque bourque requested a review from greglucas April 26, 2024 22:52
@bourque
Copy link
Collaborator Author

bourque commented Apr 26, 2024

I think it is a good idea to have something like this.

It would be nice to see it used when we add it too though, rather than unused code. Are you able to use this in your tests or other instrument's tests so it also gets some use?

Sorry, I read past your comment on initial lookthrough. I can think of at least one instance under the CoDICE umbrella where I can use the new function. I will take a look around and see if I can spot any others and update this PR accordingly.

@bourque
Copy link
Collaborator Author

bourque commented Apr 29, 2024

I think this is ready to go pending final approval from @greglucas

Copy link
Collaborator

@greglucas greglucas left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is great! I really appreciate that you started using this everywhere too. I think it is nice to put all of cdflib in one location and then we can update things in just one spot in the future.

imap_processing/cdf/utils.py Outdated Show resolved Hide resolved
imap_processing/cdf/utils.py Show resolved Hide resolved
assert calc_start_time(0) == launch_time
assert calc_start_time(1) == launch_time + np.timedelta64(1, "s")
@pytest.fixture()
def test_dataset():
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice! I like the re-use of this :)

I'm not sure how feasible/hard this would be, but can you do any assertion on round-tripping? You are testing loading/writing separately, but can you say anything about test_dataset == load_cdf(write_cdf(test_dataset)) Maybe testing it has the same variables or something like that if == doesn't work.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think I follow what you are saying here but just want to be clear. Are you talking about adding a test like this?

def test_written_and_loaded_dataset(test_dataset):

    assert test_dataset == load_cdf(write_cdf(test_dataset))

to make sure that nothing i getting changed in the xarray object when passed through the write_cdf() and load_cdf() function?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yep, exactly! Making sure it is a reversible operation and we are getting the same data in and out.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Gotcha. Unfortunately it is indeed breaking, so good that we checked!

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Uh oh! Hopefully not too big of an issue? Does it need to be fixed in cdflib or on our end... there is hardly anything we are doing here in these functions.

Copy link
Collaborator Author

@bourque bourque May 2, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think I have identified that the transformation is happening within cdflib:

In [1]: import cdflib
   ...: from cdflib import xarray
   ...: import xarray as xr
   ...: import numpy as np
   ...: from imap_processing.cdf.global_attrs import ConstantCoordinates
   ...: from imap_processing.swe.swe_cdf_attrs import swe_l1a_global_attrs

In [2]: filename = 'imap_swe_l1_sci_20100101_v001.cdf'

In [3]:     dataset = xr.Dataset(
   ...:         {
   ...:             "epoch": (
   ...:                 "epoch",
   ...:                 [
   ...:                     np.datetime64("2010-01-01T00:01:01", "ns"),
   ...:                     np.datetime64("2010-01-01T00:01:02", "ns"),
   ...:                     np.datetime64("2010-01-01T00:01:03", "ns"),
   ...:                 ],
   ...:             ),
   ...:         },
   ...:         attrs=swe_l1a_global_attrs.output()
   ...:         | {"Logical_source": "imap_swe_l1_sci", "Data_version": "001"},
   ...:     )
   ...:     dataset["epoch"].attrs = ConstantCoordinates.EPOCH

In [4]: dataset
Out[4]: 
<xarray.Dataset> Size: 24B
Dimensions:  (epoch: 3)
Coordinates:
  * epoch    (epoch) datetime64[ns] 24B 2010-01-01T00:01:01 ... 2010-01-01T00...
Data variables:
    *empty*
Attributes: (12/15)
    Project:                     STP>Solar-Terrestrial Physics
    Source_name:                 IMAP>Interstellar Mapping and Acceleration P...
    Discipline:                  Solar Physics>Heliospheric Physics
    Mission_group:               IMAP>Interstellar Mapping and Acceleration P...
    PI_name:                     Dr. David J. McComas
    PI_affiliation:              ('Princeton Plasma Physics Laboratory', '100...
    ...                          ...
    TEXT:                        The Solar Wind Electron (SWE) instrument mea...
    Instrument_type:             Particles (space)
    Logical_file_id:             ['FILL ME IN AT FILE CREATION']
    Data_type:                   L1A_SCI>Level-1A Science Data
    Logical_source:              imap_swe_l1_sci
    Logical_source_description:  IMAP Mission SWE Instrument Level-1A Data

In [5]: cdflib.xarray.xarray_to_cdf(dataset, filename)

In [6]: new_dataset = cdflib.xarray.cdf_to_xarray(filename)

In [7]: new_dataset
Out[7]: 
<xarray.Dataset> Size: 24B
Dimensions:  (record0: 3)
Dimensions without coordinates: record0
Data variables:
    epoch    (record0) int64 24B 1262304061000000000 ... 1262304063000000000
Attributes: (12/15)
    Project:                     ['STP>Solar-Terrestrial Physics']
    Source_name:                 ['IMAP>Interstellar Mapping and Acceleration...
    Discipline:                  ['Solar Physics>Heliospheric Physics']
    Mission_group:               ['IMAP>Interstellar Mapping and Acceleration...
    PI_name:                     ['Dr. David J. McComas']
    PI_affiliation:              ['Princeton Plasma Physics Laboratory', '100...
    ...                          ...
    TEXT:                        ['The Solar Wind Electron (SWE) instrument m...
    Instrument_type:             ['Particles (space)']
    Logical_file_id:             ['FILL ME IN AT FILE CREATION']
    Data_type:                   ['L1A_SCI>Level-1A Science Data']
    Logical_source:              ['imap_swe_l1_sci']
    Logical_source_description:  ['IMAP Mission SWE Instrument Level-1A Data']

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ping @bryan-harter I think this is a bug within cdflib, but maybe it isn't so much a bug as a limitation of CDF? Is there any way to keep the coordinates when loading the cdf from disk into an xarray dataset?

Copy link
Contributor

@bryan-harter bryan-harter May 2, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So the way cdf_to_xarray tells what the dimensions are is only through what the other variables in the file are pointing at (i.e. with a DEPEND_0/1/2/3/etc).

If there are no other variables in the file, then cdf_to_xarray doesn't consider "epoch" a dimension. So its more of a limitation of CDF I suppose. One way we can get around this is to add a DEPEND_0 to the epoch variable that just points to itself. So if you add the following, the code will identify the dimension correctly:

dataset["epoch"].attrs['DEPEND_0'] = 'epoch'

Alternatively you can just add some dummy variable, like:

dataset = xr.Dataset(
         {
             "epoch": (
                 "epoch",
                 [
                     np.datetime64("2010-01-01T00:01:01", "ns"),
                     np.datetime64("2010-01-01T00:01:02", "ns"),
                     np.datetime64("2010-01-01T00:01:03", "ns"),
                 ],
             ),
             "hello": (
                 "hello",
                 [
                     1,
                     2,
                     3,
                 ],
             ),
         },
         attrs=swe_l1a_global_attrs.output()
         | {"Logical_source": "imap_swe_l1_sci", "Data_version": "001"},
     )
dataset["epoch"].attrs = ConstantCoordinates.EPOCH
dataset["hello"].attrs = swe_metadata_attrs.output()

In this case, since the variable "hello" points to "epoch", then cdf_to_xarray correctly identifies "epoch" as a dimension

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess on the cdflib side, I can try to make changes so that 1D "support_variables" are always their own dimension? I can try taking a look at that and see if that fundamentally breaks anything. But it handles time variables slightly differently than the other DEPENDs.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@bryan-harter Thanks for the info on that. Your simple fix of adding the DEPEND_0 appears to be working as intended!

One other thing that seems to be changing is that cdflib is writing and/or loading the CDF attributes into lists, i.e.:

Attributes: (12/15)
    Project:                     STP>Solar-Terrestrial Physics
    Source_name:                 IMAP>Interstellar Mapping and Acceleration P...
    Discipline:                  Solar Physics>Heliospheric Physics
    Mission_group:               IMAP>Interstellar Mapping and Acceleration P...
    PI_name:                     Dr. David J. McComas
    PI_affiliation:              ('Princeton Plasma Physics Laboratory', '100...
    ...                          ...
    TEXT:                        The Solar Wind Electron (SWE) instrument mea...
    Instrument_type:             Particles (space)
    Logical_file_id:             ['FILL ME IN AT FILE CREATION']
    Data_type:                   L1A_SCI>Level-1A Science Data
    Logical_source:              imap_swe_l1_sci
    Logical_source_description:  IMAP Mission SWE Instrument Level-1A Data

gets turned into

Attributes: (12/15)
    Project:                     ['STP>Solar-Terrestrial Physics']
    Source_name:                 ['IMAP>Interstellar Mapping and Acceleration...
    Discipline:                  ['Solar Physics>Heliospheric Physics']
    Mission_group:               ['IMAP>Interstellar Mapping and Acceleration...
    PI_name:                     ['Dr. David J. McComas']
    PI_affiliation:              ['Princeton Plasma Physics Laboratory', '100...
    ...                          ...
    TEXT:                        ['The Solar Wind Electron (SWE) instrument m...
    Instrument_type:             ['Particles (space)']
    Logical_file_id:             ['FILL ME IN AT FILE CREATION']
    Data_type:                   ['L1A_SCI>Level-1A Science Data']
    Logical_source:              ['imap_swe_l1_sci']
    Logical_source_description:  ['IMAP Mission SWE Instrument Level-1A Data']

Do you know if this is expected / purposeful?

xr.Dataset
The ``xarray`` dataset for the CDF file
"""
return cdf_to_xarray(file_path, kwargs)
Copy link
Contributor

@tech3371 tech3371 May 3, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I may have found a bug with this cdf_to_xarray. When we read cdf file using this, it returns global attrs as dictionary but all the values are returned as list which is ok if we are only reading the cdf and not using it for subsequent data level. But I want to carry l1a attrs for l1b with minor updates. I am sure others may want to use similar manner. If so, I wonder if we can extend this function to convert global attrs values back into strings instead of list here or update on cdflib. @bryan-harter may know which way is better. This is a comment. We don’t have to add this in this PR. We can extend this in upcoming PR.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Glad you added this common functions!

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I also noticed this, which also makes it a bit difficult to test. Good idea to put the fix into this function, though I wonder if the fix should be made 'upstream' in cdflib? Interested to hear @bryan-harter thoughts on this.

Copy link
Contributor

@bryan-harter bryan-harter May 6, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So by default the cdf reader function just assumes all attributes are lists because they technically are all lists from the CDF's point of view. When writing an attribute to a CDF, you always specify how long the attribute list is you are about to write. So there is technically no way for the CDF to keep information about if an attribute is simply a single item v.s. a list with a single item in it.

One thing I could do is have cdflib assume all lists with only 1 item are returned as not lists. But just be aware that means that if you DO feed in the attribute as a list with one item, this code change would cause the opposite problem and get rid of the list.

I'm open to ideas!

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @bryan-harter!

My gut says that it makes sense to deal with the single-item vs list logic within write_cdf()/load_cdf() and keep cdflib to be "closer" to the CDF requirements / more generalized so that issues doesn't arise for other developers 🤷🏻

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you guys for already tackling this in this PR!

@bourque
Copy link
Collaborator Author

bourque commented May 6, 2024

Ok, I got this working by adding these lines of code to load_cdf():

    # cdf_to_xarray converts single-value attributes to lists
    # convert these back to single values where applicable
    for attribute in dataset.attrs:
        value = dataset.attrs[attribute]
        if isinstance(value, list) and len(value) == 1:
            dataset.attrs[attribute] = value[0]

@bourque bourque merged commit de2b79f into IMAP-Science-Operations-Center:dev May 6, 2024
15 checks passed
@bourque bourque deleted the load-cdf-helper-function branch May 6, 2024 20:40
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
CDF Related to CDF files
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants