ENH/MNT: Add a common initial dataset creation routine #687
Conversation
Overall, I like this implementation, but I think some features may need to be added before it can replace all usages of the `create_dataset` function. I am also curious to see what it looks like to update variable attributes using the `ImapCdfAttributes` class.
```diff
@@ -254,3 +255,86 @@ def update_epoch_to_datetime(dataset: xr.Dataset) -> xr.Dataset:
     dataset = dataset.assign_coords(epoch=epoch)
     return dataset


 def packet_file_to_datasets(
```
I see quite a bit of overlap with the `create_dataset` function. Some differences are:

- This function creates a dataset for each apid in the input file.
- `create_dataset` has an option for including the packet header or not.
- `create_dataset` brings in variable attributes from a source external to the XTCE. Right now that source has not been updated to use the `ImapCdfAttributes` class.
- `create_dataset` allows for the option to skip certain fields in the packet.
I agree, this is heavily based on that function. Would you want all of these capabilities? My thought is that too many options is not great either, and standardizing on a fixed layout can be useful.

I'd actually rather not "skip" specific fields and complicate this function, but instead have a user call `dataset.drop("variable")` to remove whatever they want afterwards. So I view this as giving the user everything; the user can then winnow down to what they need, or transform it, from there.
If it would be useful to add some ignore keywords I am open to it though.
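The drop-it-afterwards approach can be sketched with a small stand-in dataset (the variable names here are invented, and xarray's current spelling of the method is `drop_vars`):

```python
import numpy as np
import xarray as xr

# Stand-in for one per-apid dataset returned by packet_file_to_datasets
ds = xr.Dataset(
    data_vars={
        "sci_counts": ("epoch", np.arange(5)),
        "spare_field": ("epoch", np.zeros(5)),  # field this user does not need
    },
    coords={"epoch": np.arange(5)},
)

# Winnow down after the fact instead of passing skip-options to the creator
trimmed = ds.drop_vars("spare_field")
```

`drop_vars` returns a new dataset, so the full decommed dataset is still available if a later processing step needs the field after all.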
I was really just making an observation about the overlap. I am in agreement that keeping this function simple makes sense.
I agree. I think it's going to save a lot of repeated steps that we are all doing separately. This function seems much better: it decoms the file, creates a dataset per apid, and returns it. Every processing step after that can decide what to keep and what to drop in terms of the actual data. And it's already in the format we need to write to CDF, except for minor additions like the attrs.
It may be useful to optionally skip an apid. The motivation for this is packets with variable length binary data fields.
This handles variable length binary data!
I can see the argument here as well, but again I think that it might be easier for a user to just not access that apid in the dictionary lookup?
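Since the function returns a dictionary keyed by apid, skipping an apid is just a matter of never looking it up. A minimal sketch (the apid numbers and dataset placeholders are invented):

```python
# Hypothetical return value of packet_file_to_datasets: one dataset per apid
datasets = {
    1184: "<housekeeping dataset>",
    1188: "<variable-length science dataset>",
}

# Only pull out the apid you care about; other apids are simply never touched
housekeeping = datasets[1184]

# Or select a known subset of apids
wanted_apids = [1184]
selected = {apid: datasets[apid] for apid in wanted_apids if apid in datasets}
```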
@subagonsouth thanks for the comments. I updated
- Remove some attributes from XTCE to avoid duplication with the yaml
- Add `use_derived_value` as a boolean option for whether to use the derived value or keep the raw bits
This updates the Hi utilities to use the dataset creator. There is still some work that can be done to improve and numpy broadcast some of the routines, but this at least shows the basics.
This removes the change to SWAPI to use the derived values. It also abstracts the Enums out of the tests so future updates if we want to use the derived values will be easier.
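The `use_derived_value` toggle mentioned in the commits above can be sketched like this (the packet-item class and values are invented stand-ins; in the real code the derived value comes from the XTCE definition):

```python
class PacketItem:
    """Stand-in for a parsed packet field carrying raw and derived forms."""

    def __init__(self, raw_value, derived_value=None):
        self.raw_value = raw_value
        self.derived_value = derived_value


def value_of(item, use_derived_value=True):
    # Prefer the derived (calibrated/enum) value when requested and present,
    # otherwise fall back to the raw bits
    if use_derived_value and item.derived_value is not None:
        return item.derived_value
    return item.raw_value


# e.g. a mode field whose raw bits 3 map to the enum label "HVENG"
item = PacketItem(raw_value=3, derived_value="HVENG")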
Force-pushed from 3a56b68 to 9d0c5c6
Thank you for doing this! Looks great!
@subagonsouth, as an FYI, I did update some of Hi's code in this as well and left some TODO comments here. Let me know if you want me to remove/change any of the Hi updates.
```python
# unpack the packets data into the Dataset
# (npackets, 24 * 90 * 12)
# TODO: Look into avoiding the for-loops below
```
Yes... Good recommendation!
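One way to act on that TODO would be to collect the per-packet values into a list during decom and convert them with a single vectorized assignment afterwards (a sketch; the seconds-to-nanoseconds multiply is a stand-in for the real `met_to_j2000ns` conversion):

```python
import numpy as np

# Stand-in MET values gathered while iterating packets
met_values = [100, 200, 300]

# Loop version (what the TODO wants to avoid):
#     for i, met in enumerate(met_values):
#         dataset.ccsds_met[i] = met

# Vectorized version: one conversion for the whole array at once
ccsds_met = np.asarray(met_values, dtype=np.int64)
epoch_ns = ccsds_met * 1_000_000_000  # stand-in for met_to_j2000ns(ccsds_met)
```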
```python
dataset.epoch.data[i_epoch] = met_to_j2000ns(packet.data["CCSDS_MET"].raw_value)
dataset.ccsds_met[i_epoch] = packet.data["CCSDS_MET"].raw_value
dataset.esa_stepping_num[i_epoch] = packet.data["ESA_STEP"].raw_value
# TODO: Move into the allocate dataset function?
```
These changes look fine for now. The `allocate_histogram_dataset` function was really just my way of avoiding an intermediate data store; I wanted to go directly from the packet into the `xr.Dataset` arrays. With your new function, this is no longer achieving that goal. I will write a ticket to look at this Hi code and address the TODOs.
Merged 1e5692b into IMAP-Science-Operations-Center:dev
Change Summary
Overview
This adds a common interface for returning xarray datasets from a packet file. There is one dataset per apid, sorted by time, with derived values automatically converted.
Each dataset is essentially an L0.5 product, but my hope is that this will be easier for everyone to work with (especially for housekeeping dataset creation), rather than everyone implementing something fairly similar, but slightly different, inside their own packet iteration routine.

I updated SWAPI to use this new function, which simplified things IMO.
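The before/after shape of the change can be illustrated roughly as follows (`packet_file_to_datasets` is the function this PR adds, but everything else here, including the packet fields and apid variable, is an invented stand-in):

```python
# Before: every instrument rolled its own packet-iteration loop and then
# hand-built the collected lists into an xarray Dataset
def instrument_specific_decom(packets):
    data = {"counts": []}
    for packet in packets:
        data["counts"].append(packet["COUNTS"])
    return data


# After: one shared call returning a sorted, derived-value dataset per apid:
#     datasets = packet_file_to_datasets(packet_file, xtce_definition)
#     science = datasets[science_apid]

packets = [{"COUNTS": 1}, {"COUNTS": 2}]
result = instrument_specific_decom(packets)
```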