Add flexibility to model groups #64

pochedls · 2024-09-05T17:59:15Z

Is your feature request related to a problem? Please describe.
Model groups currently group by source_id, member_id, and grid_label by default. Users might want different groupings. For example, if you wanted to analyze both tas and tos (e.g., to create a blended 2m temperature and SST dataset), these variables are often placed into a gn and gr group (resulting in two separate groupings for these variables). This would make it hard to apply built-in tools (e.g., .remove_incomplete) to check if a model realization has both tas and tos. A user might also want to group by other sets (e.g., source_id, member_id, and experiment_id).

Describe the solution you'd like
One possible solution would be to allow the user to define groups of interest when calling model_groups, e.g., cat.model_groups(groupby=['source_id', 'member_id']).
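To illustrate what a user-defined grouping could look like, here is a minimal pandas sketch. It does not use the intake_esgf API: the DataFrame is a hypothetical stand-in for search results (one row per dataset), the helper name count_variables is made up for illustration, and the assignment of tas to gr and tos to gn is an assumption.

```python
import pandas as pd

# Hypothetical stand-in for search results: one row per dataset.
# Which variable sits on which grid is an assumption for illustration.
df = pd.DataFrame(
    {
        "source_id": ["CIESM", "CIESM"],
        "member_id": ["r1i1p1f1", "r1i1p1f1"],
        "grid_label": ["gr", "gn"],
        "variable_id": ["tas", "tos"],
    }
)


def count_variables(df, groupby):
    """Count distinct variables per group for a user-chosen list of facets."""
    return df.groupby(groupby)["variable_id"].nunique()


# Grouping that includes grid_label splits tas and tos into separate groups.
default_groups = count_variables(df, ["source_id", "member_id", "grid_label"])

# Dropping grid_label from the grouping keeps both variables together.
custom_groups = count_variables(df, ["source_id", "member_id"])

print(default_groups)
print(custom_groups)
```

With grid_label in the grouping there are two groups of one variable each; without it there is a single group holding both variables.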

Describe alternatives you've considered
This issue arose from a preview of intake_esgf so I am not yet a user and do not know what workarounds might be available.

Additional context
Another related issue is what happens if you have multiple datasets of the same variable in a group (e.g., ta for a given model in the AERmonZ table and the CMIP table [39 versus 19 vertical levels, I think]). Can you drop datasets within a group? Or is this de-duplicated on the initial search? If you can't drop duplicate datasets within a group, an alternate approach might be to take two catalog searches, merge them, and then group them by user-defined groupings. I'm not sure if this makes technical sense, but I foresee a general challenge in trying to get groupings to work across some facets.
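The "merge two searches, then group" idea could be sketched roughly as follows with plain pandas. This is only an illustration under assumptions: the DataFrames are hypothetical stand-ins for two search results, MODEL-A is a made-up model name, and the table preference order is an arbitrary choice, not something intake_esgf does.

```python
import pandas as pd

# Hypothetical results of two separate searches (one row per dataset).
search_a = pd.DataFrame(
    {
        "source_id": ["MODEL-A", "MODEL-A"],
        "member_id": ["r1i1p1f1", "r1i1p1f1"],
        "table_id": ["AERmonZ", "Amon"],
        "variable_id": ["ta", "ta"],
    }
)
search_b = pd.DataFrame(
    {
        "source_id": ["MODEL-A"],
        "member_id": ["r1i1p1f1"],
        "table_id": ["Amon"],
        "variable_id": ["tas"],
    }
)

# Merge the two searches into a single table.
merged = pd.concat([search_a, search_b], ignore_index=True)

# Assumed preference order used to pick one table when a variable is duplicated.
table_preference = ["Amon", "AERmonZ"]
rank = {table: i for i, table in enumerate(table_preference)}
merged["table_rank"] = merged["table_id"].map(rank)

# Keep the most preferred table for each (source_id, member_id, variable_id).
deduplicated = (
    merged.sort_values("table_rank")
    .drop_duplicates(subset=["source_id", "member_id", "variable_id"], keep="first")
    .drop(columns="table_rank")
    .reset_index(drop=True)
)

# Group by user-defined facets after de-duplication.
groups = deduplicated.groupby(["source_id", "member_id"])["variable_id"].nunique()
print(groups)
```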

nocollier · 2024-09-06T13:43:01Z

Thanks for these suggestions. I wanted to make sure I understand the problem you are seeing and am not as familiar with atmosphere data. The issue is that sometimes you search for a variable and it is replicated (why oh why did we do that?) in several tables? For example, I see hfls in 3 different tables in the search below:

In [3]: cat.search(experiment_id="historical",variable_id="hfls",frequency="mon",source_id="CESM2",member_id="r1i1p1f1")
   Searching indices: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████|2/2 [    1.92index/s]
Out[3]: 
Summary information for 3 results:
institution_id                      [NCAR]
experiment_id                 [historical]
member_id                       [r1i1p1f1]
mip_era                            [CMIP6]
grid_label                            [gn]
table_id          [ImonGre, Amon, ImonAnt]
activity_drs                        [CMIP]
variable_id                         [hfls]
source_id                          [CESM2]
project                            [CMIP6]

If that isn't the issue, can you put together a search for me which explains the issue more clearly?

pochedls · 2024-09-06T19:53:28Z

One clear example of what I am concerned about:

from intake_esgf import ESGFCatalog
cat = ESGFCatalog()
cat.search(experiment_id="historical", frequency="mon", variable_id=["tos", "tas"], source_id="CIESM", member_id="r1i1p1f1")
cat.model_groups()

The tas and tos are broken into two separate groups, but you might want to ensure that you have both variables for a given simulation. I don't think there is a way (currently) to do that if the variables are broken into two (because of the different grid_label).

source_id  member_id  grid_label
CIESM      r1i1p1f1   gn            1
                      gr            1
Name: variable_id, dtype: int64

The example above points to a related issue: you might want to restrict a search to one table (e.g., Amon) for one variable, but doing so might rule out values for another variable you care about. These are separate issues, but I think flexibility in groupings (or some way to compare across groupings or merge searches based on particular facets) would be helpful for these types of problems.
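As a rough sketch of the check I have in mind (keeping only simulations that have both tas and tos regardless of grid_label), here is a pandas version. It does not call .remove_incomplete or any other intake_esgf method; the DataFrame is a hypothetical stand-in for search results, MODEL-B is a made-up model with only tas, and the grid assignments are assumptions.

```python
import pandas as pd

# Hypothetical search results: CIESM has both variables (on different grids),
# while the made-up MODEL-B only provides tas.
df = pd.DataFrame(
    {
        "source_id": ["CIESM", "CIESM", "MODEL-B"],
        "member_id": ["r1i1p1f1", "r1i1p1f1", "r1i1p1f1"],
        "grid_label": ["gr", "gn", "gn"],
        "variable_id": ["tas", "tos", "tas"],
    }
)

required = {"tas", "tos"}


def has_all_variables(group):
    """Return True if the group contains every required variable."""
    return required.issubset(set(group["variable_id"]))


# Group without grid_label so tas and tos land in the same group,
# then drop any simulation that is missing a required variable.
complete = df.groupby(["source_id", "member_id"]).filter(has_all_variables)
print(complete)
```

Only the CIESM rows survive the filter, because the grouping ignores grid_label and MODEL-B lacks tos.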

nocollier · 2024-09-06T20:36:31Z

Perfect! Thanks, having these clear stories helps immensely. I designed the tooling for the analysis that I was used to but it needs to address everyone's problems. I will give this some thought and ping you again when I have a better idea.
