ENH/MNT: Add a common initial dataset creation routine #687
Conversation
Overall, I like this implementation, but I think some features may need to be added before it can replace all usages of the `create_dataset` function. I am also curious to see what it looks like to update variable attributes using the `ImapCdfAttributes` class.
```diff
@@ -254,3 +255,86 @@ def update_epoch_to_datetime(dataset: xr.Dataset) -> xr.Dataset:
     dataset = dataset.assign_coords(epoch=epoch)
     return dataset


 def packet_file_to_datasets(
```
I see quite a bit of overlap with the `create_dataset` function. Some differences are:

- This function creates a dataset for each apid in the input file.
- `create_dataset` has an option for including the packet header or not.
- `create_dataset` brings in variable attributes from a source external to the XTCE. Right now that source has not been updated to use the `ImapCdfAttributes` class.
- `create_dataset` allows for the option to skip certain fields in the packet.
I agree, this is heavily based on that function. Would you want all of these capabilities? My thought is that too many options is not great either, and standardizing on a fixed layout can be useful.

I'd actually rather not "skip" specific fields and complicate this function, but instead have a user call `dataset.drop("variable")` to remove whatever they want afterwards. So I view this as giving the user everything; the user can then winnow down to what they need, or transform it, from there.
If it would be useful to add some ignore keywords I am open to it though.
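The drop-it-afterwards approach can be sketched with a small stand-in dataset (the variable names here are invented, and xarray's current spelling of the method is `drop_vars`):

```python
import numpy as np
import xarray as xr

# Stand-in for one per-apid dataset returned by packet_file_to_datasets
ds = xr.Dataset(
    data_vars={
        "sci_counts": ("epoch", np.arange(5)),
        "spare_field": ("epoch", np.zeros(5)),  # field this user does not need
    },
    coords={"epoch": np.arange(5)},
)

# Winnow down after the fact instead of passing skip-options to the creator
trimmed = ds.drop_vars("spare_field")
```

`drop_vars` returns a new dataset, so the full decommed dataset is still available if a later processing step needs the field after all.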
I was really just making an observation about the overlap. I am in agreement that keeping this function simple makes sense.
I agree. I think it's going to save a lot of repeated steps that we are all doing separately. This function seems much better: it decoms the file, creates a dataset per apid, and returns it. Every processing step after that can decide what to keep and what to drop in terms of the actual data. And it's already in the format we need to write to CDF, except for minor additions like the attrs.
It may be useful to optionally skip an apid. The motivation for this is packets with variable length binary data fields.
This handles variable length binary data!
I can see the argument here as well, but again I think that it might be easier for a user to just not access that apid in the dictionary lookup?
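Since the function returns a dictionary keyed by apid, skipping an apid is just a matter of never looking it up. A minimal sketch (the apid numbers and dataset placeholders are invented):

```python
# Hypothetical return value of packet_file_to_datasets: one dataset per apid
datasets = {
    1184: "<housekeeping dataset>",
    1188: "<variable-length science dataset>",
}

# Only pull out the apid you care about; other apids are simply never touched
housekeeping = datasets[1184]

# Or select a known subset of apids
wanted_apids = [1184]
selected = {apid: datasets[apid] for apid in wanted_apids if apid in datasets}
```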
@subagonsouth thanks for the comments. I updated
- Remove some attributes from XTCE to avoid duplication with the yaml
- Add `use_derived_value` as a boolean option for whether to use the derived value or keep the raw bits
This updates the Hi utilities to use the dataset creator. There is still some work that can be done to improve and numpy broadcast some of the routines, but this at least shows the basics.
This removes the change to SWAPI to use the derived values. It also abstracts the Enums out of the tests so future updates if we want to use the derived values will be easier.
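The `use_derived_value` toggle mentioned in the commits above can be sketched like this (the packet-item class and values are invented stand-ins; in the real code the derived value comes from the XTCE definition):

```python
class PacketItem:
    """Stand-in for a parsed packet field carrying raw and derived forms."""

    def __init__(self, raw_value, derived_value=None):
        self.raw_value = raw_value
        self.derived_value = derived_value


def value_of(item, use_derived_value=True):
    # Prefer the derived (calibrated/enum) value when requested and present,
    # otherwise fall back to the raw bits
    if use_derived_value and item.derived_value is not None:
        return item.derived_value
    return item.raw_value


# e.g. a mode field whose raw bits 3 map to the enum label "HVENG"
item = PacketItem(raw_value=3, derived_value="HVENG")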
Force-pushed from 3a56b68 to 9d0c5c6
Thank you for doing this! Looks great!
@subagonsouth, as an FYI, I did update some of Hi's code in this as well and left some TODO comments here. Let me know if you want me to remove/change any of the Hi updates.
```python
# unpack the packets data into the Dataset
# (npackets, 24 * 90 * 12)
# TODO: Look into avoiding the for-loops below
```
Yes... Good recommendation!
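One way to act on that TODO would be to collect the per-packet values into a list during decom and convert them with a single vectorized assignment afterwards (a sketch; the seconds-to-nanoseconds multiply is a stand-in for the real `met_to_j2000ns` conversion):

```python
import numpy as np

# Stand-in MET values gathered while iterating packets
met_values = [100, 200, 300]

# Loop version (what the TODO wants to avoid):
#     for i, met in enumerate(met_values):
#         dataset.ccsds_met[i] = met

# Vectorized version: one conversion for the whole array at once
ccsds_met = np.asarray(met_values, dtype=np.int64)
epoch_ns = ccsds_met * 1_000_000_000  # stand-in for met_to_j2000ns(ccsds_met)
```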
```python
dataset.epoch.data[i_epoch] = met_to_j2000ns(packet.data["CCSDS_MET"].raw_value)
dataset.ccsds_met[i_epoch] = packet.data["CCSDS_MET"].raw_value
dataset.esa_stepping_num[i_epoch] = packet.data["ESA_STEP"].raw_value
# TODO: Move into the allocate dataset function?
```
These changes look fine for now. The `allocate_histogram_dataset` function was really just my way of avoiding an intermediate data store; I wanted to go directly from the packet into the `xr.Dataset` arrays. With your new function, this is no longer achieving that goal. I will write a ticket to look at this Hi code and address the TODOs.
Merged 1e5692b into IMAP-Science-Operations-Center:dev
Change Summary
Overview
This adds a common interface for returning xarray datasets from a packet file. There is one dataset per apid, sorted by time, with derived values automatically converted.
Each dataset is essentially an L0.5 product, but my hope is that this will be easier for everyone to work with (especially for housekeeping dataset creation), rather than everyone implementing something fairly similar, but slightly different, inside their own packet iteration routine.

I updated SWAPI to use this new function, which simplified things IMO.
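The before/after shape of the change can be illustrated roughly as follows (`packet_file_to_datasets` is the function this PR adds, but everything else here, including the packet fields and apid variable, is an invented stand-in):

```python
# Before: every instrument rolled its own packet-iteration loop and then
# hand-built the collected lists into an xarray Dataset
def instrument_specific_decom(packets):
    data = {"counts": []}
    for packet in packets:
        data["counts"].append(packet["COUNTS"])
    return data


# After: one shared call returning a sorted, derived-value dataset per apid:
#     datasets = packet_file_to_datasets(packet_file, xtce_definition)
#     science = datasets[science_apid]

packets = [{"COUNTS": 1}, {"COUNTS": 2}]
result = instrument_specific_decom(packets)
```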