Change mksurfdata_map/mksurfdata_esmf Makefile to build single-point datasets using subset_data #1674

Closed
ekluzek opened this issue Mar 3, 2022 · 31 comments · Fixed by #1812
Labels
enhancement new capability or improved behavior of existing capability
Milestone

Comments

@ekluzek
Collaborator

ekluzek commented Mar 3, 2022

Currently mksurfdata.pl is used to create the single point surface datasets. We are moving that capability over to the new subset_data tool. So we need to change the Makefile in the mksurfdata_map/mksurfdata_esmf tool directory to use subset_data to build the single point datasets.

Relates to:

#1664

Blockers for this are: #1665 #1673

@ekluzek ekluzek added the enhancement new capability or improved behavior of existing capability label Mar 3, 2022
@ekluzek
Collaborator Author

ekluzek commented Mar 3, 2022

One question is which global surface dataset these should start from: which resolution, and should it use the surface dataset just created by mksurfdata_map or the current one in the XML file? To do the former, a dependency on the global dataset would need to be added to the Makefile. The latter would be independent of other mksurfdata_map files, but also wouldn't be using the most up-to-date dataset.

@mvertens

mvertens commented Mar 3, 2022

@ekluzek - if the move to mksurfdata_esmf is coming soon, why are we still putting changes in mksurfdata_map?

@ekluzek
Collaborator Author

ekluzek commented Mar 3, 2022

@mvertens this is something that applies to both. It's really completely independent of the mksurfdata_map/mksurfdata_esmf code. This is something that could either go onto the ctsm5.2 branch or onto master. And I'm actually not sure right now which one it should go in. So I called it mksurfdata_map so that it would apply to either one.

@billsacks
Member

I'd like to understand something here: With mksurfdata_esmf, will it still be possible to make a single-point surface dataset without overrides? For example, could it be used to directly create the numaIA surface dataset, which I believe doesn't do any overrides?

The reason I ask is: if this is still possible, then it seems like the most straightforward thing to do could be to create a single point dataset using mksurfdata_map pointing to a single-point mesh file for the output, then using modify_fsurdat to do the appropriate modifications. That sidesteps the issue of needing to choose some arbitrary global resolution to then subset for these out-of-the-box single-point datasets.

Or is the capability to directly create a single point surface dataset going away?

@mvertens

mvertens commented Mar 3, 2022

@billsacks - it is really inefficient to create a single point dataset with mksurfdata_esmf. As we already discovered yesterday, getting the mappings for very low resolution output grids can be very costly. Sam and I already discussed this and we feel the most straightforward way is to create global datasets and then use the subset capability to extract a single point. I'm happy to discuss this offline if that would be helpful. The bottom line is that this capability is not going away, but it would be expensive to use.

@ekluzek
Collaborator Author

ekluzek commented Mar 3, 2022

If you have a mesh file for your single point site you'll be able to use the new mksurfdata_esmf to create a single point dataset. So you could do that for numaIA for example (and 1x1 brazil). I don't see that going away -- just the ability to override after you've done that inside of mksurfdata. But, for most of the single point sites we wanted to eliminate having to create a mesh file for them. So the standard procedure we are now recommending is to use subset_data to create single point surface datasets.

One of the goals for single point sites is to have a mechanism that's fairly standard. So it works for NEON and plumber sites, and isn't that much different for a user defined tower site as well.

@ekluzek
Collaborator Author

ekluzek commented Mar 3, 2022

@billsacks I do appreciate your comment about the roles of fsurdat_modifier vs. subset_data; it's a valid question. That is what I hope to work out in our subgroup meeting, so that we have a recommendation for how everything relates to each other. Let me know if you would like to be added to that subgroup meeting. I think we'll still have some discussion of the recommendations in CTSM software, but that's the heart of that subgroup discussion.

@ekluzek ekluzek changed the title Change mksurfdata_map Makefile to build single-point datasets using subset_data Change mksurfdata_map/mksurfdata_esmf Makefile to build single-point datasets using subset_data Mar 3, 2022
@mvertens

mvertens commented Mar 3, 2022

@ekluzek - it is very costly to create the route handle from a 1 km high resolution data set (such as for elevation) to a very low resolution dataset (such as 10x15). That is because you cannot scale out the output mesh to many processors and the input mesh has millions of points. It is very fast on the other hand to create a very high resolution surface data set (I can generate a 7.5km surface dataset in under 7 minutes) and use that high resolution surface dataset to extract single points. I see that as a much better way to move forwards. Again - I am happy to meet to talk about this.

@mvertens

mvertens commented Mar 3, 2022

@ekluzek - I would like to be a member of any subgroup that is formed to discuss these issues.

@ekluzek
Collaborator Author

ekluzek commented Mar 3, 2022

OK, I just talked to @mvertens and as she is having trouble with 10x15, we think single point will be even worse. So we should add some logic to say "don't use mksurfdata_esmf for low grid count grids -- use subset data". This is already something we had decided in going away from PTCLM and moving towards subset_data. So you would never use mksurfdata_esmf for a single point site, you'd always use subset_data. This will mean some of our sites will have a slight change in answers, so we might want to bring this in before ctsm5.2 comes in actually.

@billsacks
Member

Okay, sounds fine. This feels weird that we can now make a high-resolution dataset with no problem but can't make a coarse-resolution or single-point dataset – naively, it feels like it should be possible to use a different decomposition / parallelization strategy that would enable parallelization across the source domain (instead of the destination domain) for the generation of the mesh & route handle in these cases to provide good performance – but I can see how this isn't a use case that is worth optimizing for.

@wwieder
Contributor

wwieder commented Mar 4, 2022

I'm fine with the decision to use subset data for regional and single point cases, I thought that's why we were making this tool.

I agree with Bill that it's odd we can't make a coarse resolution grid easily. Is that just because creating the mapping file is too memory intensive?

@ekluzek regarding your suggestion about answer-changing single point runs, and "we might want to bring this in before ctsm5.2 comes in actually": how critical is this, especially if we're about to upend a bunch of the underlying datasets used for surface data? Are we wanting to maintain backwards compatibility for PLUMBER2 simulations? Is it critical to understand how modifications to our mksurfdata workflow are changing answers at sites where Gordon and @olyson are regularly running single point cases? I kind of assume these will be larger changes than the particulars.

@wwieder
Contributor

wwieder commented Mar 4, 2022

One more thing I'd like to weigh in on here is that I don't think it's critical to carry around high resolution surface datasets for the purposes of single point simulations. The current workflow of using subset_data on a 1 degree surface dataset and overwriting with site specific information if necessary seems fine. That said, subset_data should also work on those higher resolution (7.5 km) datasets, but I don't think it needs to be a standard or default way we use this tool.

@mvertens

mvertens commented Mar 4, 2022

@wwieder - the current culprit for the coarse dataset creation (the only real problem is 10x15 - nothing else) - is trying to generate the route handle (i.e. online mapping file) to map a 1km dataset to a grid that only has 400 points. I believe the issue is that there are simply not enough degrees of freedom for a 10x15 to scale this out. I'm reaching out to Bob Oehmke today to verify my assumption. My assumption is that you can't scale things out if you don't have a big enough target grid.
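
As a rough illustration of the scale mismatch described above, the sketch below counts cells in a nominal 10x15 degree global grid and in a source grid at 30 arc-second spacing. Both are assumptions made for illustration: the thread only says "1km" and "400 points", the 30 arc-second spacing is a stand-in for 1 km, and the real model grid may include extra pole rows. `global_cell_count` is a hypothetical helper, not part of mksurfdata_esmf.

```python
# Rough grid-size arithmetic under illustrative assumptions
# (not values read from any CTSM tool or dataset).

def global_cell_count(dlat_deg: float, dlon_deg: float) -> int:
    """Number of cells in a regular global lat/lon grid with the given spacing."""
    nlat = round(180.0 / dlat_deg)
    nlon = round(360.0 / dlon_deg)
    return nlat * nlon

# Nominal 10x15 grid: 10 degrees by 15 degrees, ignoring any pole rows.
coarse_cells = global_cell_count(10.0, 15.0)

# Assumption: a "1 km" source dataset is approximated by a 30 arc-second grid.
ARCSEC_30 = 30.0 / 3600.0
source_cells = global_cell_count(ARCSEC_30, ARCSEC_30)

print(f"10x15 target cells: {coarse_cells}")
print(f"1 km source cells:  {source_cells}")
print(f"source cells per target cell: {source_cells // coarse_cells}")
```

Under these assumptions each target cell would receive on the order of two million source cells, which fits the concern that a target grid this small leaves little work to distribute across processors.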

@ekluzek
Collaborator Author

ekluzek commented Mar 4, 2022

@wwieder yes, let's talk about this more at our next CTSM software meeting, especially the bit about changing surface datasets. For that I'm just talking about changing our testing datasets: 1x1_brazil, 5x5_amazon, 1x1_numaIA, 1x1_vancouverCAN, 1x1_mexicocityMEX, and 1x1_urbanc_alpha. One of the reasons I want to do that now is to show that we can get this to work before we do the big change where all surface datasets will change. I want to separate any possible problems with this part of the change from the general change in all surface datasets. Otherwise, the general ctsm5.2 change might hide problems in moving from mksurfdata to subset_data.

@wwieder
Contributor

wwieder commented Mar 4, 2022

This makes sense for testing, @ekluzek, and thanks for clarifying, @mvertens.

@mvertens

mvertens commented Mar 4, 2022

@ekluzek @wwieder - After talking to Bob Oehmke this afternoon, it appears that regional and single point should not have the same problems as the coarse global 10x15. I will try to verify this shortly.

@ekluzek ekluzek added the next this should get some attention in the next week or two. Normally each Thursday SE meeting. label Mar 9, 2022
@billsacks billsacks removed the next this should get some attention in the next week or two. Normally each Thursday SE meeting. label Mar 10, 2022
@ekluzek
Collaborator Author

ekluzek commented Mar 14, 2022

We talked about this issue in our CTSM software meeting. Notes from that discussion (also on the wiki):

For single-point cases, our recommended / supported workflow is to use subset_data.

For regional, it is case-by-case: in some cases it makes sense to subset, and in other cases it makes sense to make a surface dataset directly at your resolution with mksurfdata.

The reason why single-point generally / always will use subset_data is because you generally will override the important properties like vegetation type anyway.

Erik asks if it makes sense to switch over the process for creating our out-of-the-box single-point surface datasets now. General feeling is yes.

So I'm going to move forward with using subset data to create the single point datasets as the standard for them.
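
The recommendations in these notes can be summarized as a small lookup. This is only a sketch of the decision, with hypothetical names (`Tool`, `recommended_tool`, and the `domain_kind` strings); it is not code from CTSM.

```python
from enum import Enum


class Tool(Enum):
    """Hypothetical labels for the outcomes described in the meeting notes."""
    SUBSET_DATA = "subset_data"
    CASE_BY_CASE = "subset_data or mksurfdata, decided case by case"


def recommended_tool(domain_kind: str) -> Tool:
    """Map a kind of domain to the workflow recommended in the notes."""
    if domain_kind == "single_point":
        # Single-point cases: the supported workflow is subset_data.
        return Tool.SUBSET_DATA
    if domain_kind == "regional":
        # Regional cases: subset, or build directly with mksurfdata.
        return Tool.CASE_BY_CASE
    raise ValueError(f"no recommendation recorded for {domain_kind!r}")


print(recommended_tool("single_point").value)
print(recommended_tool("regional").value)
```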

@ekluzek
Collaborator Author

ekluzek commented Mar 14, 2022

For this task to be complete #1673 needs to be dealt with first. I can still make progress on it though, and when the other is finished this can be finalized.

@ekluzek
Collaborator Author

ekluzek commented Jan 19, 2023

I have this working for mexicocity and the kind of differences I see are as follows:

$CPRDIR/cprnc.cheyenne -m surfdata_1x1_mexicocityMEX_hist_16pfts_Irrig_CMIP6_simyr2000_c230118.nc $CSMDATA/lnd/clm2/surfdata_map/release-clm5.0.18/surfdata_1x1_mexicocityMEX_hist_16pfts_Irrig_CMIP6_simyr2000_c190214.nc | grep RMS
 RMS lsmlon                           2.5950E+02            NORMALIZED  1.9847E+00
 RMS lsmlat                           1.8500E+01            NORMALIZED  1.8049E+00
 RMS ORGANIC                          1.9963E+00            NORMALIZED  7.6737E-02
 RMS FMAX                             3.7520E-01            NORMALIZED  2.0000E+00
 RMS LANDFRAC_PFT                     2.2204E-16            NORMALIZED  2.2204E-16
 RMS AREA                             2.0903E+03            NORMALIZED  1.6458E-01
 RMS EF1_BTR                          1.9631E+03            NORMALIZED  6.6536E-02
 RMS EF1_FET                          5.6026E+02            NORMALIZED  1.0230E+00
 RMS EF1_FDT                          4.7036E+01            NORMALIZED  3.0915E-02
 RMS EF1_SHR                          6.4047E+02            NORMALIZED  6.8183E-02
 RMS EF1_GRS                          1.4003E+00            NORMALIZED  3.6822E-03
 RMS EF1_CRP                          1.7764E-15            NORMALIZED  1.4803E-16
 RMS zbedrock                         4.7641E+00            NORMALIZED  7.9643E-01
 RMS gdp                              1.4103E-06            NORMALIZED  3.7086E-07
 RMS SLOPE                            1.2564E+00            NORMALIZED  3.3904E-01
 RMS STD_ELEV                         3.1316E+02            NORMALIZED  1.7735E+00
 RMS LAKEDEPTH                        1.2434E-14            NORMALIZED  1.2434E-15
 RMS CONST_GRAZING                    1.8190E-12            NORMALIZED  1.8192E-16
 RMS MONTHLY_LAI                      1.3681E-01            NORMALIZED  1.9061E-01
 RMS MONTHLY_SAI                      5.4830E-02            NORMALIZED  2.1746E-01
 RMS MONTHLY_HEIGHT_TOP               1.2759E+00            NORMALIZED  1.3290E-01
 RMS MONTHLY_HEIGHT_BOT               6.3558E-01            NORMALIZED  2.1629E-01

Almost none of the above differences in the surface dataset will matter, with 100% urban coverage. I thought perhaps some soil-related fields might matter for pervious road: zbedrock, FMAX, ORGANIC?
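
For longer comparisons, the `RMS` lines from output like the above can be filtered programmatically. The following is a minimal sketch that assumes the line layout shown in this comment (`RMS <field> <value> NORMALIZED <value>`); `parse_rms`, `significant`, and the `tol` cutoff are hypothetical and chosen only for illustration.

```python
import re

# Matches lines of the form shown above, e.g.
#  RMS FMAX      3.7520E-01      NORMALIZED  2.0000E+00
RMS_LINE = re.compile(r"^\s*RMS\s+(\S+)\s+(\S+)\s+NORMALIZED\s+(\S+)\s*$")


def parse_rms(text):
    """Return {field: (rms, normalized_rms)} for each 'RMS ...' line in text."""
    result = {}
    for line in text.splitlines():
        match = RMS_LINE.match(line)
        if match:
            result[match.group(1)] = (float(match.group(2)), float(match.group(3)))
    return result


def significant(diffs, tol=1e-10):
    """Sorted field names whose normalized RMS exceeds tol (an arbitrary cutoff)."""
    return sorted(name for name, (_, norm) in diffs.items() if norm > tol)


# A few lines copied from the output above.
sample = """\
 RMS ORGANIC                          1.9963E+00            NORMALIZED  7.6737E-02
 RMS FMAX                             3.7520E-01            NORMALIZED  2.0000E+00
 RMS LANDFRAC_PFT                     2.2204E-16            NORMALIZED  2.2204E-16
 RMS zbedrock                         4.7641E+00            NORMALIZED  7.9643E-01
"""

diffs = parse_rms(sample)
print(significant(diffs))
```

With this cutoff, roundoff-level entries such as LANDFRAC_PFT drop out, leaving the fields that differ materially.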

@olyson
Contributor

olyson commented Jan 19, 2023

Yes, differences in zbedrock, FMAX, and ORGANIC will matter for pervious road.

@ekluzek
Collaborator Author

ekluzek commented Jan 19, 2023

@olyson, OK, good to know. Then in your opinion is it OK to use the 1-degree grid cell averages for these? Or should we adjust them to the site? We could use the values used previously, which came out of mksurfdata for a site smaller than a 1-degree grid cell and so would be "more" accurate. But unless we have local site data it's still not going to be that accurate. If you have a source of these data for the sites, we could use that. So which of these sounds like the way to go to you?

  • Use the 1-degree grid-cell values (easiest)
  • Use the current values from the previous files
  • Use specific data from the sites (if there is a source for it)

@olyson
Contributor

olyson commented Jan 19, 2023

Let's use the easiest (1-deg grid-cell values). I don't have site-specific data for those variables and it's probably not worth introducing additional complexity into the process to get the values from the previous file. mexicocity is mostly used in the test suite and by myself occasionally to assess differences due to model changes.
Although, I'm surprised the lat/lon is so different. Is that because we had site-specific lat/lon in the original file?
Could you point me to the location of the new file you created? I'd like to look more closely at the differences if you don't mind.

@ekluzek
Collaborator Author

ekluzek commented Jan 19, 2023

The file is here: /glade/work/erik/ctsm_worktrees/answer_changes/tools/mksurfdata_map

lsmlat and lsmlon are different because they are integer indices (both 1) in the original and floats holding the actual latitude/longitude values in the new one. LATIXY and LONGXY are identical, which is the important thing (along with the lat and lon variables). lsmlat and lsmlon aren't actually used.

@olyson
Contributor

olyson commented Jan 19, 2023

Got it, thanks. I see that FMAX is exactly zero in the new dataset. Is that perhaps a new feature of single-point datasets (I seem to recall Sean arguing for something like that at some point) or just a coincidence?
Otherwise, I'm fine with this if you are.

@ekluzek
Collaborator Author

ekluzek commented Jan 20, 2023

@olyson this is because I ran subset_data with "--cap-saturation" (which sets FMAX==0). That is how I set up all the single-point datasets. It is something I could remove though. I also used "--uniform-snowpack" for all the single point datasets (sets STD_ELEV==20).

I think this is the way we should do things though, so I'll leave it like that.
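
The effect of the two options described above can be sketched on a plain dictionary of surface fields. This is an illustration of the stated behavior only (FMAX set to 0, STD_ELEV set to 20), not the subset_data implementation; `apply_single_point_options` and the input values are hypothetical.

```python
def apply_single_point_options(fields, cap_saturation=False, uniform_snowpack=False):
    """Return a copy of fields with the overrides described in the comment above."""
    out = dict(fields)
    if cap_saturation:
        # --cap-saturation: sets FMAX to zero.
        out["FMAX"] = 0.0
    if uniform_snowpack:
        # --uniform-snowpack: sets STD_ELEV to 20.
        out["STD_ELEV"] = 20.0
    return out


# Placeholder input values, not taken from any real surface dataset.
grid_cell = {"FMAX": 0.4, "STD_ELEV": 150.0, "SLOPE": 2.5}

site = apply_single_point_options(grid_cell, cap_saturation=True, uniform_snowpack=True)
print(site)
```

Fields not named by either option, such as SLOPE here, pass through unchanged.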

@olyson
Contributor

olyson commented Jan 20, 2023

Sounds good, thanks.

@ekluzek
Collaborator Author

ekluzek commented Jan 23, 2023

The answer changing part of this is that the non-urban single point sites (smallvilleIA, numaIA, brazil) change from half degree grid-cells from mksurfdata to 1-degree for the first two, and then 2-degree to 1-degree for brazil. I think this is OK though, the sites are still similar in their characteristics, and don't drastically change from the original ones. The brazil site can't just use the f19 fsurdat file to get the same results either, as the previous case was close to the f19 gridcell, but rounded off and made to be an exact 2 degree by 2 degree grid cell.

@ekluzek
Collaborator Author

ekluzek commented Jan 23, 2023

The answer changing part as mentioned above is that we are using --cap-saturation and --uniform-snowpack for the single point sites.

@ekluzek
Collaborator Author

ekluzek commented Jan 23, 2023

We just discussed this in the standup, but we figure the urban datasets should use the 78pft version rather than the 16pft version because it won't matter for the urban datasets and we want to move to always using the 78pft versions rather than having to have both.

@ekluzek
Collaborator Author

ekluzek commented Jan 23, 2023

To make sure things are working as expected, I copied the earlier values of the following variables to the new dataset (for vancouverCAN) and showed that I get identical answers to ctsm5.1.dev115: FMAX,STD_ELEV,zbedrock,ORGANIC, and SLOPE. This means things are working as we think they are, which is good to know. For urban there are other fields that are different, but they don't matter for a 100% urban case.
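
The check described above amounts to copying a fixed list of fields from the earlier dataset into the new one and then rerunning. A minimal sketch of that copy step follows, using plain dictionaries in place of the real netCDF files; `restore_fields` and all values are hypothetical placeholders.

```python
# Fields named in the comment above.
FIELDS_TO_RESTORE = ("FMAX", "STD_ELEV", "zbedrock", "ORGANIC", "SLOPE")


def restore_fields(new, old, names=FIELDS_TO_RESTORE):
    """Return a copy of new with the named fields replaced by the values in old."""
    missing = [name for name in names if name not in old]
    if missing:
        raise KeyError(f"fields missing from the earlier dataset: {missing}")
    out = dict(new)
    for name in names:
        out[name] = old[name]
    return out


# Placeholder values only; real surface datasets hold arrays, not scalars.
old_data = {"FMAX": 0.3, "STD_ELEV": 80.0, "zbedrock": 5.0, "ORGANIC": 1.0, "SLOPE": 1.2}
new_data = {"FMAX": 0.0, "STD_ELEV": 20.0, "zbedrock": 9.0, "ORGANIC": 3.0, "SLOPE": 2.4,
            "AREA": 100.0}

patched = restore_fields(new_data, old_data)
print(patched)
```

Fields outside the list, such as AREA here, keep their new values, which matches the observation that the remaining differences don't matter for a 100% urban case.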
