Add flexibility to model groups #64

Open
pochedls opened this issue Sep 5, 2024 · 3 comments

Comments

@pochedls

pochedls commented Sep 5, 2024

Is your feature request related to a problem? Please describe.
Model groups currently group by source_id, member_id, and grid_label by default. Users might want different groupings. For example, if you wanted to analyze both tas and tos (e.g., to create a blended 2m temperature and SST dataset), these variables often have different grid_label values (gn and gr), so they land in two separate groups. This makes it hard to apply built-in tools (e.g., .remove_incomplete) to check whether a model realization has both tas and tos. A user might also want to group by other sets of facets (e.g., source_id, member_id, and experiment_id).

Describe the solution you'd like
One possible solution would be to allow the user to define groups of interest when calling model_groups, e.g., cat.model_groups(groupby=['source_id', 'member_id']).
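To make the request concrete, here is a minimal sketch of the grouping behavior I have in mind, written in plain Python rather than against intake_esgf itself. The records, the "ModelA" name, the grid assignments, and the group_variables helper are all made up for illustration; they are not part of the package.

```python
from collections import defaultdict

# Made-up records standing in for search results (one dict per dataset).
# The grid_label assignments are illustrative only.
records = [
    {"source_id": "ModelA", "member_id": "r1i1p1f1",
     "grid_label": "gr", "variable_id": "tas"},
    {"source_id": "ModelA", "member_id": "r1i1p1f1",
     "grid_label": "gn", "variable_id": "tos"},
]


def group_variables(records, groupby):
    """Map each group key (tuple of facet values) to its set of variables."""
    groups = defaultdict(set)
    for rec in records:
        key = tuple(rec[facet] for facet in groupby)
        groups[key].add(rec["variable_id"])
    return dict(groups)


# Grouping that includes grid_label: tas and tos end up in separate groups.
default_groups = group_variables(
    records, ["source_id", "member_id", "grid_label"]
)

# User-defined grouping without grid_label: one group holds both variables.
custom_groups = group_variables(records, ["source_id", "member_id"])

print(len(default_groups))
print(custom_groups)
```

Dropping grid_label from the facet list is the only change between the two calls, which is roughly what a groupby argument would expose.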

Describe alternatives you've considered
This issue arose from a preview of intake_esgf so I am not yet a user and do not know what workarounds might be available.

Additional context
Another related issue is what happens if you have multiple datasets of the same variable in a group (e.g., ta for a given model in the AERmonZ table and the CMIP table [39 versus 19 vertical levels, I think]). Can you drop datasets within a group? Or is this de-duplicated in the initial search? If you can't drop duplicate datasets within a group, an alternative approach might be to run two catalog searches, merge them, and then group the results by user-defined groupings. I'm not sure if this makes technical sense, but I foresee a general challenge in trying to get groupings to work across some facets.
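A rough sketch of the merge-then-de-duplicate idea, again in plain Python and not tied to any existing intake_esgf call. The merge_and_dedupe helper, the "ModelA" records, and the use of Amon as the second table are assumptions for illustration.

```python
def merge_and_dedupe(searches, key_facets, table_preference):
    """Merge record lists, keeping one dataset per key.

    When several datasets share a key, the one whose table_id appears
    earliest in table_preference is kept.
    """
    rank = {table: i for i, table in enumerate(table_preference)}
    worst = len(rank)  # rank used for tables missing from the preference list
    best = {}
    for records in searches:
        for rec in records:
            key = tuple(rec[facet] for facet in key_facets)
            kept = best.get(key)
            if kept is None or (
                rank.get(rec["table_id"], worst)
                < rank.get(kept["table_id"], worst)
            ):
                best[key] = rec
    return list(best.values())


# Two made-up searches that both return ta, from different tables.
search_a = [
    {"source_id": "ModelA", "member_id": "r1i1p1f1",
     "variable_id": "ta", "table_id": "AERmonZ"},
]
search_b = [
    {"source_id": "ModelA", "member_id": "r1i1p1f1",
     "variable_id": "ta", "table_id": "Amon"},
    {"source_id": "ModelA", "member_id": "r1i1p1f1",
     "variable_id": "tas", "table_id": "Amon"},
]

merged = merge_and_dedupe(
    [search_a, search_b],
    key_facets=["source_id", "member_id", "variable_id"],
    table_preference=["Amon", "AERmonZ"],
)
print(merged)
```

Leaving the table preference to the user avoids hard-coding which duplicate is the "right" one, since that depends on the analysis.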

@nocollier
Member

Thanks for these suggestions. I want to make sure I understand the problem you are seeing, since I am not as familiar with atmosphere data. Is the issue that sometimes you search for a variable and it is replicated (why oh why did we do that?) in several tables? For example, I see hfls in 3 different tables in the search below:

In [3]: cat.search(experiment_id="historical",variable_id="hfls",frequency="mon",source_id="CESM2",member_id="r1i1p1f1")
   Searching indices: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████|2/2 [    1.92index/s]
Out[3]: 
Summary information for 3 results:
institution_id                      [NCAR]
experiment_id                 [historical]
member_id                       [r1i1p1f1]
mip_era                            [CMIP6]
grid_label                            [gn]
table_id          [ImonGre, Amon, ImonAnt]
activity_drs                        [CMIP]
variable_id                         [hfls]
source_id                          [CESM2]
project                            [CMIP6]

If that isn't the issue, can you put together a search for me which explains the issue more clearly?

@pochedls
Author

pochedls commented Sep 6, 2024

One clear example of what I am concerned about:

from intake_esgf import ESGFCatalog
cat = ESGFCatalog()
cat.search(experiment_id="historical", frequency="mon", variable_id=["tos", "tas"], source_id="CIESM", member_id="r1i1p1f1")
cat.model_groups()

tas and tos are split into two separate groups, but you might want to ensure that you have both variables for a given simulation. I don't think there is a way (currently) to do that when the variables are split across groups (because of the different grid_label).

source_id  member_id  grid_label
CIESM      r1i1p1f1   gn            1
                      gr            1
Name: variable_id, dtype: int64

In the example above you might also run into another issue. You might want to search for data within one table (e.g., Amon) for one variable, but this might rule out data for another variable you care about. These are separate issues, but I think flexibility in groupings (or some way to compare across groupings or merge searches based on particular facets) would be helpful for these types of problems.
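For what it's worth, the completeness check I would like to run looks roughly like the sketch below. It is plain Python over made-up records ("ModelA" and "ModelB" are placeholders, as is the complete_groups helper), not a description of how .remove_incomplete works internally.

```python
from collections import defaultdict


def complete_groups(records, groupby, required):
    """Return the group keys whose datasets cover every required variable."""
    seen = defaultdict(set)
    for rec in records:
        key = tuple(rec[facet] for facet in groupby)
        seen[key].add(rec["variable_id"])
    needed = set(required)
    return sorted(key for key, variables in seen.items() if needed <= variables)


# Made-up records: ModelA has tas and tos on different grids, ModelB only tas.
records = [
    {"source_id": "ModelA", "member_id": "r1i1p1f1",
     "grid_label": "gr", "variable_id": "tas"},
    {"source_id": "ModelA", "member_id": "r1i1p1f1",
     "grid_label": "gn", "variable_id": "tos"},
    {"source_id": "ModelB", "member_id": "r1i1p1f1",
     "grid_label": "gn", "variable_id": "tas"},
]

# With grid_label in the grouping, no group contains both variables.
with_grid = complete_groups(
    records, ["source_id", "member_id", "grid_label"], ["tas", "tos"]
)

# Without grid_label, ModelA is recognized as complete.
without_grid = complete_groups(
    records, ["source_id", "member_id"], ["tas", "tos"]
)

print(with_grid)
print(without_grid)
```

With the grid_label facet included, the check rejects every group even though ModelA does have both variables, which is the behavior I am trying to avoid.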

@nocollier
Member

Perfect! Thanks, having these clear stories helps immensely. I designed the tooling for the analysis that I was used to but it needs to address everyone's problems. I will give this some thought and ping you again when I have a better idea.
