Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Baseline test for deploy-action and CMIP6 recipe #4

Merged
merged 50 commits into from
Aug 7, 2023

Conversation

jbusecke
Copy link
Collaborator

@jbusecke jbusecke commented Aug 2, 2023

This is an attempt to systematically debug the CMIP6 recipe deployment on dataflow with the latest deploy action.
I have hardcoded the urls/iids to establish a baseline. I want one dataset to go through without any of the new stuff I have developed:

  • Dynamic chunking (EDIT: This now very much includes the dynamic chunking)
  • Url parsing/reading to/from BigQuery

I ran into a TON of trouble trying to develop this locally, which I still think is an important option that people should have (more to be discussed).

cc @cisaacstern

@jbusecke
Copy link
Collaborator Author

jbusecke commented Aug 2, 2023

@cisaacstern I took pangeo-forge-esgf entirely out of the picture, but now I am running into a different issue:

"/srv/conda/envs/notebook/lib/python3.9/site-packages/apache_beam/internal/cloudpickle_pickler.py", line 95, in loads unpickled = cloudpickle.loads(s) ModuleNotFoundError: No module named 'pangeo_forge_recipes.types'

Is this the same issue of a wrong pgf-recipes version?

@jbusecke
Copy link
Collaborator Author

jbusecke commented Aug 2, 2023

My local laptop run failed with a server disconnect, possibly due to my terrible internet connection 😁. I should probably try to run this on the LEAP hub with an specified environment file?

@jbusecke
Copy link
Collaborator Author

jbusecke commented Aug 2, 2023

Tried to add #egg=pangeo_forge_recipes to the requirements.txt

@jbusecke
Copy link
Collaborator Author

jbusecke commented Aug 3, 2023

Hmmm when I change the aspect ratio i see failures like this:

 line 48, in _store_data raise ValueError( ValueError: Region (slice(0, 1, None), slice(0, 990, None), slice(None, None, None), slice(None, None, None)) does not align with Zarr chunks (990, 180, 288). [while running 'Create|OpenURLWithFSSpec|OpenWithXarray|KeepOnlyVariableId|StoreToZarr/StoreToZarr/StoreDatasetFragments/Map(store_dataset_fragment)-ptransform-32'] Worker ID: cmip6-template-618127503--08031046-3zjq-harness-h1h3

I am wondering if the dynamic chunking is failing silently? I am not seeing any of the print statements that might give me a clue here. Any way I could inspect these on dataflow @cisaacstern?

@cisaacstern
Copy link
Contributor

Awesome progress, @jbusecke. Lmk if there's something specific I can do to help. A few responses below.

I am not seeing any of the print statements that might give me a clue here.

Changing these to logger.info calls should surface them into the dataflow logs. See https://github.com/pangeo-forge/pangeo-forge-recipes/pull/548/files for an example.

Does the rechunking/combining fail on datasets with a non-coordinate dim? @cisaacstern I think we should investigate this more systematically on the pgf-recipes side?

If you can create an MRE or test for this on pangeo-forge-recipes that would be awesome. Happy to take a look at it over there. Probably unrelated but here's an old issue related to dimensions without coordinates: pangeo-forge/pangeo-forge-recipes#214

@jbusecke
Copy link
Collaborator Author

jbusecke commented Aug 5, 2023

If you can create an MRE or test for this on pangeo-forge-recipes that would be awesome. Happy to take a look at it over there. Probably unrelated but here's an old issue related to dimensions without coordinates: pangeo-forge/pangeo-forge-recipes#214

I think I was starting down that road here but that turned out to add a lot of complexity to the end-to-end tests. Ill look into that further and try to come up with a MRE.

FYI, I have the suspicion that this might actually not be unrelated!

@jbusecke
Copy link
Collaborator Author

jbusecke commented Aug 5, 2023

Ok I have a preliminary conclusion here:

I was able to reproduce the failure above for both iids used when I set the target_chunks={..., 'bnds':1} (tested with multiple values for the other dims). If I set target_chunks={..., 'bnds':2} I have so far not observed a single failure. I have traced back most failures with the dynamic chunking to this pattern too, and thus conclude this is not an issue with the dynamic chunking, but instead reveals some issues further upstream. My hint from above about the dimension having coordinates/no coordinates seemed to have been a wrong lead too. There is something is still going wrong with the rechunking if the dataset has dimensions on coordinates which are not present on the data_variable.

The good news here is that the dynamic chunking works pretty well, and once the upstream bug is fixed should work as intended here.

I think I can split these issues up into several different parallel tasks:

  • Make a MRE for this bug and submit an issue about this. Tracked here
  • Reduce the scope of this PR for now, go back to only chunking in time (and leaving all other dims unchunked) for now. With the changes implemented here I think Ill actually be able to do time, x, y chunking after all. Ill just make sure to never specify the 'bnds' dimension in target_chunks_aspect_ratio.

@jbusecke jbusecke mentioned this pull request Aug 6, 2023
1 task
@jbusecke
Copy link
Collaborator Author

jbusecke commented Aug 6, 2023

@cisaacstern I think the most feasible way to package these iids in batches would be to have dictionary object like we did here a while back.
I naively tried to implement this with two iids above, but the submission failed. I am not sure if this workflow is even compatible with the deploy action? Ideally I want to submit one dataflow job per dict key-value pair.

@jbusecke
Copy link
Collaborator Author

jbusecke commented Aug 6, 2023

Hmm I might have spoken too soon. [These changes](modify aspect ratio) still reproduce something similar to above. It must be more complex than just whether to chunk the bounds or not. I wonder if it has to do with the fact if the spatial coordinates are chunked or not. Trying now with pure time chunking alone....
EDIT: That seems to work fine for now. I think that while we ultimately want chunking in space to work, it is absolutely fine to move ahead here with time chunking. After all that preserves some consistency with the already uploaded datasets.

@jbusecke
Copy link
Collaborator Author

jbusecke commented Aug 7, 2023

I now have two recipes running with the deployment action and will merge this, since I think the purpose of a baseline example has been achieved. I will work in seperate PRs to scale this effort up to a first request batch (probably keeping dynamic chunking in time only), and resolving the upstream issues with certain chunking inputs on the PGF-recipe side.

@jbusecke jbusecke merged commit 919ded9 into main Aug 7, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants