-
Notifications
You must be signed in to change notification settings - Fork 16
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
load_cdf helper function #537
load_cdf helper function #537
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks great!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it is a good idea to have something like this.
It would be nice to see it used when we add it too though, rather than unused code. Are you able to use this in your tests or other instrument's tests so it also gets some use?
…within the testing framework, instead of using an "external" CDF file. Also fixed small docstring formatting issue
Sorry, I read past your comment on initial lookthrough. I can think of at least one instance under the CoDICE umbrella where I can use the new function. I will take a look around and see if I can spot any others and update this PR accordingly. |
…processing into load-cdf-helper-function
I think this is ready to go pending final approval from @greglucas |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is great! I really appreciate that you started using this everywhere too. I think it is nice to put all of cdflib in one location and then we can update things in just one spot in the future.
assert calc_start_time(0) == launch_time | ||
assert calc_start_time(1) == launch_time + np.timedelta64(1, "s") | ||
@pytest.fixture() | ||
def test_dataset(): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice! I like the re-use of this :)
I'm not sure how feasible/hard this would be, but can you do any assertion on round-tripping? You are testing loading/writing separately, but can you say anything about test_dataset == load_cdf(write_cdf(test_dataset))
Maybe testing it has the same variables or something like that if ==
doesn't work.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think I follow what you are saying here but just want to be clear. Are you talking about adding a test like this?
def test_written_and_loaded_dataset(test_dataset):
assert test_dataset == load_cdf(write_cdf(test_dataset))
to make sure that nothing i getting changed in the xarray
object when passed through the write_cdf()
and load_cdf()
function?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yep, exactly! Making sure it is a reversible operation and we are getting the same data in and out.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Gotcha. Unfortunately it is indeed breaking, so good that we checked!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Uh oh! Hopefully not too big of an issue? Does it need to be fixed in cdflib or on our end... there is hardly anything we are doing here in these functions.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think I have identified that the transformation is happening within cdflib
:
In [1]: import cdflib
...: from cdflib import xarray
...: import xarray as xr
...: import numpy as np
...: from imap_processing.cdf.global_attrs import ConstantCoordinates
...: from imap_processing.swe.swe_cdf_attrs import swe_l1a_global_attrs
In [2]: filename = 'imap_swe_l1_sci_20100101_v001.cdf'
In [3]: dataset = xr.Dataset(
...: {
...: "epoch": (
...: "epoch",
...: [
...: np.datetime64("2010-01-01T00:01:01", "ns"),
...: np.datetime64("2010-01-01T00:01:02", "ns"),
...: np.datetime64("2010-01-01T00:01:03", "ns"),
...: ],
...: ),
...: },
...: attrs=swe_l1a_global_attrs.output()
...: | {"Logical_source": "imap_swe_l1_sci", "Data_version": "001"},
...: )
...: dataset["epoch"].attrs = ConstantCoordinates.EPOCH
In [4]: dataset
Out[4]:
<xarray.Dataset> Size: 24B
Dimensions: (epoch: 3)
Coordinates:
* epoch (epoch) datetime64[ns] 24B 2010-01-01T00:01:01 ... 2010-01-01T00...
Data variables:
*empty*
Attributes: (12/15)
Project: STP>Solar-Terrestrial Physics
Source_name: IMAP>Interstellar Mapping and Acceleration P...
Discipline: Solar Physics>Heliospheric Physics
Mission_group: IMAP>Interstellar Mapping and Acceleration P...
PI_name: Dr. David J. McComas
PI_affiliation: ('Princeton Plasma Physics Laboratory', '100...
... ...
TEXT: The Solar Wind Electron (SWE) instrument mea...
Instrument_type: Particles (space)
Logical_file_id: ['FILL ME IN AT FILE CREATION']
Data_type: L1A_SCI>Level-1A Science Data
Logical_source: imap_swe_l1_sci
Logical_source_description: IMAP Mission SWE Instrument Level-1A Data
In [5]: cdflib.xarray.xarray_to_cdf(dataset, filename)
In [6]: new_dataset = cdflib.xarray.cdf_to_xarray(filename)
In [7]: new_dataset
Out[7]:
<xarray.Dataset> Size: 24B
Dimensions: (record0: 3)
Dimensions without coordinates: record0
Data variables:
epoch (record0) int64 24B 1262304061000000000 ... 1262304063000000000
Attributes: (12/15)
Project: ['STP>Solar-Terrestrial Physics']
Source_name: ['IMAP>Interstellar Mapping and Acceleration...
Discipline: ['Solar Physics>Heliospheric Physics']
Mission_group: ['IMAP>Interstellar Mapping and Acceleration...
PI_name: ['Dr. David J. McComas']
PI_affiliation: ['Princeton Plasma Physics Laboratory', '100...
... ...
TEXT: ['The Solar Wind Electron (SWE) instrument m...
Instrument_type: ['Particles (space)']
Logical_file_id: ['FILL ME IN AT FILE CREATION']
Data_type: ['L1A_SCI>Level-1A Science Data']
Logical_source: ['imap_swe_l1_sci']
Logical_source_description: ['IMAP Mission SWE Instrument Level-1A Data']
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ping @bryan-harter I think this is a bug within cdflib, but maybe it isn't so much a bug as a limitation of CDF? Is there any way to keep the coordinates when loading the cdf from disk into an xarray dataset?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So the way cdf_to_xarray tells what the dimensions are is only through what the other variables in the file are pointing at (i.e. with a DEPEND_0/1/2/3/etc).
If there are no other variables in the file, then cdf_to_xarray doesn't consider "epoch" a dimension. So its more of a limitation of CDF I suppose. One way we can get around this is to add a DEPEND_0 to the epoch variable that just points to itself. So if you add the following, the code will identify the dimension correctly:
dataset["epoch"].attrs['DEPEND_0'] = 'epoch'
Alternatively you can just add some dummy variable, like:
dataset = xr.Dataset(
{
"epoch": (
"epoch",
[
np.datetime64("2010-01-01T00:01:01", "ns"),
np.datetime64("2010-01-01T00:01:02", "ns"),
np.datetime64("2010-01-01T00:01:03", "ns"),
],
),
"hello": (
"hello",
[
1,
2,
3,
],
),
},
attrs=swe_l1a_global_attrs.output()
| {"Logical_source": "imap_swe_l1_sci", "Data_version": "001"},
)
dataset["epoch"].attrs = ConstantCoordinates.EPOCH
dataset["hello"].attrs = swe_metadata_attrs.output()
In this case, since the variable "hello" points to "epoch", then cdf_to_xarray correctly identifies "epoch" as a dimension
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I guess on the cdflib side, I can try to make changes so that 1D "support_variables" are always their own dimension? I can try taking a look at that and see if that fundamentally breaks anything. But it handles time variables slightly differently than the other DEPENDs.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@bryan-harter Thanks for the info on that. Your simple fix of adding the DEPEND_0
appears to be working as intended!
One other thing that seems to be changing is that cdflib
is writing and/or loading the CDF attributes into lists, i.e.:
Attributes: (12/15)
Project: STP>Solar-Terrestrial Physics
Source_name: IMAP>Interstellar Mapping and Acceleration P...
Discipline: Solar Physics>Heliospheric Physics
Mission_group: IMAP>Interstellar Mapping and Acceleration P...
PI_name: Dr. David J. McComas
PI_affiliation: ('Princeton Plasma Physics Laboratory', '100...
... ...
TEXT: The Solar Wind Electron (SWE) instrument mea...
Instrument_type: Particles (space)
Logical_file_id: ['FILL ME IN AT FILE CREATION']
Data_type: L1A_SCI>Level-1A Science Data
Logical_source: imap_swe_l1_sci
Logical_source_description: IMAP Mission SWE Instrument Level-1A Data
gets turned into
Attributes: (12/15)
Project: ['STP>Solar-Terrestrial Physics']
Source_name: ['IMAP>Interstellar Mapping and Acceleration...
Discipline: ['Solar Physics>Heliospheric Physics']
Mission_group: ['IMAP>Interstellar Mapping and Acceleration...
PI_name: ['Dr. David J. McComas']
PI_affiliation: ['Princeton Plasma Physics Laboratory', '100...
... ...
TEXT: ['The Solar Wind Electron (SWE) instrument m...
Instrument_type: ['Particles (space)']
Logical_file_id: ['FILL ME IN AT FILE CREATION']
Data_type: ['L1A_SCI>Level-1A Science Data']
Logical_source: ['imap_swe_l1_sci']
Logical_source_description: ['IMAP Mission SWE Instrument Level-1A Data']
Do you know if this is expected / purposeful?
imap_processing/cdf/utils.py
Outdated
xr.Dataset | ||
The ``xarray`` dataset for the CDF file | ||
""" | ||
return cdf_to_xarray(file_path, kwargs) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I may have found a bug with this cdf_to_xarray
. When we read cdf file using this, it returns global attrs as dictionary but all the values are returned as list which is ok if we are only reading the cdf and not using it for subsequent data level. But I want to carry l1a attrs for l1b with minor updates. I am sure others may want to use similar manner. If so, I wonder if we can extend this function to convert global attrs values back into strings instead of list here or update on cdflib. @bryan-harter may know which way is better. This is a comment. We don’t have to add this in this PR. We can extend this in upcoming PR.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Glad you added this common functions!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, I also noticed this, which also makes it a bit difficult to test. Good idea to put the fix into this function, though I wonder if the fix should be made 'upstream' in cdflib
? Interested to hear @bryan-harter thoughts on this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So by default the cdf reader function just assumes all attributes are lists because they technically are all lists from the CDF's point of view. When writing an attribute to a CDF, you always specify how long the attribute list is you are about to write. So there is technically no way for the CDF to keep information about if an attribute is simply a single item v.s. a list with a single item in it.
One thing I could do is have cdflib assume all lists with only 1 item are returned as not lists. But just be aware that means that if you DO feed in the attribute as a list with one item, this code change would cause the opposite problem and get rid of the list.
I'm open to ideas!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @bryan-harter!
My gut says that it makes sense to deal with the single-item vs list logic within write_cdf()
/load_cdf()
and keep cdflib
to be "closer" to the CDF requirements / more generalized so that issues doesn't arise for other developers 🤷🏻
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you guys for already tackling this in this PR!
Ok, I got this working by adding these lines of code to # cdf_to_xarray converts single-value attributes to lists
# convert these back to single values where applicable
for attribute in dataset.attrs:
value = dataset.attrs[attribute]
if isinstance(value, list) and len(value) == 1:
dataset.attrs[attribute] = value[0] |
Change Summary
Overview
Motivated by the comment made by @greglucas here, this PR adds a
load_cdf
function tocdf.utils
so that end users don't necessarily needcdflib
to open CDF files. The function is currently pretty bare-bones. Any suggestions on how to expand it or add functionality is welcomed!New Files
imap_processing/tests/cdf/data/imap_codice_l1a_hskp-20100101_v001.cdf
Updated Files
imap_processing/cdf/utils.py
load_cdf
functionimap_processing/tests/cdf/test_utils.py
load_cdf
function. Also added some missing docstrings.