Change mksurfdata_map/mksurfdata_esmf Makefile to build single-point datasets using subset_data #1674

ekluzek · 2022-03-03T21:06:13Z

Currently mksurfdata.pl is used to create the single point surface datasets. We are moving that capability over to the new subset_data tool. So we need to change the Makefile in the mksurfdata_map/mksurfdata_esmf tool directory to use subset_data to build the single point datasets.

Relates to:

#1664

Blockers for this are: #1665 #1673

ekluzek · 2022-03-03T21:08:55Z

A question for this is which global surface dataset these should start from: both which resolution, and whether it should use the surface dataset just created by mksurfdata_map or the current one in the XML file. To do the former, the dependency on the global dataset would need to be added to the Makefile. The latter would be independent of other mksurfdata_map files, but also wouldn't be using the most up-to-date dataset.

mvertens · 2022-03-03T21:24:15Z

@ekluzek - if the move to mksurfdata_esmf is coming soon, why are we still putting changes in mksurfdata_map?

ekluzek · 2022-03-03T22:03:14Z

@mvertens this is something that applies to both. It's really completely independent of the mksurfdata_map/mksurfdata_esmf code. This is something that could either go onto the ctsm5.2 branch or onto master. And I'm actually not sure right now which one it should go in. So I called it mksurfdata_map so that it would apply to either one.

billsacks · 2022-03-03T22:54:23Z

I'd like to understand something here: With mksurfdata_esmf, will it still be possible to make a single-point surface dataset without overrides? For example, could it be used to directly create the numaIA surface dataset, which I believe doesn't do any overrides?

The reason I ask is: if this is still possible, then it seems like the most straightforward thing to do could be to create a single point dataset using mksurfdata_map pointing to a single-point mesh file for the output, then using modify_fsurdat to do the appropriate modifications. That sidesteps the issue of needing to choose some arbitrary global resolution to then subset for these out-of-the-box single-point datasets.

Or is the capability to directly create a single point surface dataset going away?

mvertens · 2022-03-03T23:05:52Z

@billsacks - it is really inefficient to create a single point dataset with mksurfdata_esmf. As we already discovered yesterday, getting the mappings for very low resolution output grids can be very costly. Sam and I already discussed this and we feel the most straightforward way is to create global datasets and then use the subset capability to extract a single point. I'm happy to discuss this offline if that would be helpful. The bottom line is that this capability is not going away, but it would be expensive to use.

ekluzek · 2022-03-03T23:06:22Z

If you have a mesh file for your single point site you'll be able to use the new mksurfdata_esmf to create a single point dataset. So you could do that for numaIA for example (and 1x1 brazil). I don't see that going away -- just the ability to override after you've done that inside of mksurfdata. But, for most of the single point sites we wanted to eliminate having to create a mesh file for them. So the standard procedure we are now recommending is to use subset_data to create single point surface datasets.

One of the goals for single point sites is to have a mechanism that's fairly standard. So it works for NEON and plumber sites, and isn't that much different for a user defined tower site as well.

ekluzek · 2022-03-03T23:09:44Z

@billsacks I do appreciate your comment about the roles of fsurdat_modifier vs. subset_data; it's a valid question. That is what I hope to work out in our subgroup meeting, along with a recommendation of how everything relates to each other. Let me know if you would like to be added to that subgroup meeting. I think we'll still have some discussion of the recommendations in CTSM software, but that's the heart of that subgroup discussion.

mvertens · 2022-03-03T23:11:07Z

@ekluzek - it is very costly to create the route handle from a 1 km high resolution data set (such as for elevation) to a very low resolution dataset (such as 10x15). That is because you cannot scale out the output mesh to many processors and the input mesh has millions of points. It is very fast on the other hand to create a very high resolution surface data set (I can generate a 7.5km surface dataset in under 7 minutes) and use that high resolution surface dataset to extract single points. I see that as a much better way to move forwards. Again - I am happy to meet to talk about this.

mvertens · 2022-03-03T23:11:56Z

@ekluzek - I would like to be a member of any subgroup that is formed to discuss these issues.

ekluzek · 2022-03-03T23:34:00Z

OK, I just talked to @mvertens and as she is having trouble with 10x15, we think single point will be even worse. So we should add some logic to say "don't use mksurfdata_esmf for low grid count grids -- use subset data". This is already something we had decided in going away from PTCLM and moving towards subset_data. So you would never use mksurfdata_esmf for a single point site, you'd always use subset_data. This will mean some of our sites will have a slight change in answers, so we might want to bring this in before ctsm5.2 comes in actually.
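The "don't use mksurfdata_esmf for low grid count grids" logic proposed in this comment could be sketched roughly as below. This is only an illustration: the function name and the cutoff value are hypothetical and do not come from the CTSM code.

```python
# Hypothetical sketch: pick a tool based on the size of the target grid.
# MIN_POINTS_FOR_ESMF is an assumed cutoff for illustration, not a CTSM value.
MIN_POINTS_FOR_ESMF = 1000


def choose_tool(n_grid_points: int) -> str:
    """Return the name of the tool suggested for a grid with n_grid_points cells."""
    if n_grid_points < 1:
        raise ValueError("grid must have at least one point")
    if n_grid_points < MIN_POINTS_FOR_ESMF:
        # Too few target cells to scale out the mapping step,
        # so extract the point(s) from an existing global dataset instead.
        return "subset_data"
    return "mksurfdata_esmf"


print(choose_tool(1))  # a single-point site
```

Where exactly such a cutoff would sit, and whether regional grids fall on one side or the other, is the kind of question the later comments in this thread address.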

billsacks · 2022-03-04T03:52:41Z

Okay, sounds fine. This feels weird that we can now make a high-resolution dataset with no problem but can't make a coarse-resolution or single-point dataset – naively, it feels like it should be possible to use a different decomposition / parallelization strategy that would enable parallelization across the source domain (instead of the destination domain) for the generation of the mesh & route handle in these cases to provide good performance – but I can see how this isn't a use case that is worth optimizing for.

wwieder · 2022-03-04T12:39:54Z

I'm fine with the decision to use subset data for regional and single point cases, I thought that's why we were making this tool.

I agree with Bill that it's odd we can't make a coarse resolution grid easily. Is that just because creating the mapping file is too memory intensive?

@ekluzek to your suggestion about answer changing single point runs, and "we might want to bring this in before ctsm5.2 comes in actually": how critical is this, especially if we're about to upend a bunch of the underlying datasets used for surface data? Are we wanting to maintain backwards compatibility for PLUMBER2 simulations? Is it critical to understand how modifications to our mksurfdata workflow are changing answers at sites where Gordon and @olyson are regularly running single point cases? I kind of assume these will be larger changes than the particulars

wwieder · 2022-03-04T12:47:30Z

One more thing I'd like to weigh in on here is that I don't think it's critical to carry around high resolution surface datasets for the purposes of single point simulations. The current workflow of using subset data on a 1 degree surface dataset and overwriting with site specific information if necessary seems fine. That said, subset_data should also work on those higher resolution (7.5 km) datasets, but I don't think it needs to be a standard or default way we use this tool.

mvertens · 2022-03-04T16:11:16Z

@wwieder - the current culprit for the coarse dataset creation (the only real problem is 10x15 - nothing else) - is trying to generate the route handle (i.e. online mapping file) to map a 1km dataset to a grid that only has 400 points. I believe the issue is that there are simply not enough degrees of freedom for a 10x15 to scale this out. I'm reaching out to Bob Oehmke today to verify my assumption. My assumption is that you can't scale things out if you don't have a big enough target grid.

ekluzek · 2022-03-04T16:53:06Z

@wwieder yes, let's talk about this more at our next CTSM software meeting. Especially the bit about changing surface datasets. For that I'm just talking about changing our testing datasets: 1x1_brazil, 5x5_amazon, 1x1_numaIA, 1x1_vancouverCAN, 1x1_mexicocityMEX, and 1x1_urbanc_alpha. One of the reasons I want to do that now is just to show that we can get this to work before we do the big change where all surface datasets will change. I want to separate any possible problems with this part of the change from the general change in all surface datasets. Otherwise, the general ctsm5.2 change might hide problems in moving from mksurfdata to subset_data.

wwieder · 2022-03-04T19:22:27Z

This makes sense for testing, @ekluzek, and thanks for clarifying, @mvertens.

mvertens · 2022-03-04T22:46:02Z

@ekluzek @wwieder - After talking to Bob Oehmke this afternoon, it appears that regional and single point should not have the same problems as the coarse global 10x15. I will try to verify this shortly.

ekluzek · 2022-03-14T23:20:49Z

We talked about this issue in our CTSM software meeting. Notes from that discussion (also on the wiki) are:

For single-point cases, our recommended / supported workflow is to use subset_data.

For regional, it is case-by-case: in some cases it makes sense to subset, and in other cases it makes sense to make a surface dataset directly at your resolution with mksurfdata.

The reason why single-point generally / always will use subset_data is because you generally will override the important properties like vegetation type anyway.

Erik asks if it makes sense to switch over the process for creating our out-of-the-box single-point surface datasets now. General feeling is yes.

So I'm going to move forward with using subset data to create the single point datasets as the standard for them.

ekluzek · 2022-03-14T23:25:22Z

For this task to be complete #1673 needs to be dealt with first. I can still make progress on it though, and when the other is finished this can be finalized.

ekluzek · 2023-01-19T20:22:31Z

I have this working for mexicocity and the kind of differences I see are as follows:

$CPRDIR/cprnc.cheyenne -m surfdata_1x1_mexicocityMEX_hist_16pfts_Irrig_CMIP6_simyr2000_c230118.nc $CSMDATA/lnd/clm2/surfdata_map/release-clm5.0.18/surfdata_1x1_mexicocityMEX_hist_16pfts_Irrig_CMIP6_simyr2000_c190214.nc | grep RMS
 RMS lsmlon                           2.5950E+02            NORMALIZED  1.9847E+00
 RMS lsmlat                           1.8500E+01            NORMALIZED  1.8049E+00
 RMS ORGANIC                          1.9963E+00            NORMALIZED  7.6737E-02
 RMS FMAX                             3.7520E-01            NORMALIZED  2.0000E+00
 RMS LANDFRAC_PFT                     2.2204E-16            NORMALIZED  2.2204E-16
 RMS AREA                             2.0903E+03            NORMALIZED  1.6458E-01
 RMS EF1_BTR                          1.9631E+03            NORMALIZED  6.6536E-02
 RMS EF1_FET                          5.6026E+02            NORMALIZED  1.0230E+00
 RMS EF1_FDT                          4.7036E+01            NORMALIZED  3.0915E-02
 RMS EF1_SHR                          6.4047E+02            NORMALIZED  6.8183E-02
 RMS EF1_GRS                          1.4003E+00            NORMALIZED  3.6822E-03
 RMS EF1_CRP                          1.7764E-15            NORMALIZED  1.4803E-16
 RMS zbedrock                         4.7641E+00            NORMALIZED  7.9643E-01
 RMS gdp                              1.4103E-06            NORMALIZED  3.7086E-07
 RMS SLOPE                            1.2564E+00            NORMALIZED  3.3904E-01
 RMS STD_ELEV                         3.1316E+02            NORMALIZED  1.7735E+00
 RMS LAKEDEPTH                        1.2434E-14            NORMALIZED  1.2434E-15
 RMS CONST_GRAZING                    1.8190E-12            NORMALIZED  1.8192E-16
 RMS MONTHLY_LAI                      1.3681E-01            NORMALIZED  1.9061E-01
 RMS MONTHLY_SAI                      5.4830E-02            NORMALIZED  2.1746E-01
 RMS MONTHLY_HEIGHT_TOP               1.2759E+00            NORMALIZED  1.3290E-01
 RMS MONTHLY_HEIGHT_BOT               6.3558E-01            NORMALIZED  2.1629E-01

Almost none of the above differences in the surface dataset will matter with 100% urban coverage. I thought perhaps some soil related things might matter for pervious road: zbedrock, FMAX, ORGANIC?
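When scanning a list like the one above, it can help to separate roundoff-level differences from real ones. The following is a minimal sketch that parses the `RMS ... NORMALIZED ...` lines shown above and keeps the fields whose normalized RMS exceeds a tolerance. The function names and the tolerance of 1e-10 are assumptions for illustration, and only the line layout shown in this comment is handled.

```python
def parse_rms_lines(text):
    """Parse 'RMS <name> <rms> NORMALIZED <normalized>' lines into a dict."""
    results = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 5 and parts[0] == "RMS" and parts[3] == "NORMALIZED":
            results[parts[1]] = (float(parts[2]), float(parts[4]))
    return results


def significant_fields(results, tol=1.0e-10):
    """Return field names whose normalized RMS is larger than tol."""
    return [name for name, (_, norm) in results.items() if norm > tol]


# A few lines copied from the comparison above.
sample = """\
 RMS ORGANIC                          1.9963E+00            NORMALIZED  7.6737E-02
 RMS FMAX                             3.7520E-01            NORMALIZED  2.0000E+00
 RMS LANDFRAC_PFT                     2.2204E-16            NORMALIZED  2.2204E-16
 RMS zbedrock                         4.7641E+00            NORMALIZED  7.9643E-01
"""

print(significant_fields(parse_rms_lines(sample)))
```

With this tolerance, LANDFRAC_PFT is treated as roundoff while ORGANIC, FMAX, and zbedrock are kept.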

olyson · 2023-01-19T21:00:26Z

Yes, differences in zbedrock, FMAX, and ORGANIC will matter for pervious road.

ekluzek · 2023-01-19T21:37:09Z

@olyson, OK, good to know. Then in your opinion is it OK to use the 1-degree grid cell averages for these? Or should we adjust them to the site? We could use the values used previously, which came out of mksurfdata for a site smaller than a 1-degree grid cell and so would be "more" accurate. But unless we have local site data it's still not going to be that accurate. If you have a source of these data for the sites, we could use that. So which sounds like the way to go to you?

Use the 1-degree grid-cell values (easiest)
Use the current values from the previous files
Use specific data from the sites (if there is a source for it)

olyson · 2023-01-19T21:46:16Z

Let's use the easiest (1-deg grid-cell values). I don't have site-specific data for those variables and it's probably not worth introducing additional complexity into the process to get the values from the previous file. mexicocity is mostly used in the test suite and by myself occasionally to assess differences due to model changes.
Although, I'm surprised the lat/lon is so different. Is that because we had site-specific lat/lon in the original file?
Could you point me to the location of the new file you created? I'd like to look more closely at the differences if you don't mind.

ekluzek · 2023-01-19T21:56:12Z

The file is here: /glade/work/erik/ctsm_worktrees/answer_changes/tools/mksurfdata_map

lsmlat and lsmlon are different because they are integer indices (both 1) in the original and floats holding the actual latitude/longitude values in the new one. LATIXY and LONGXY are identical, which is the important thing (along with the lat and lon variables). lsmlat and lsmlon aren't actually used.

olyson · 2023-01-19T22:18:20Z

Got it, thanks. I see that FMAX is exactly zero in the new dataset. Is that perhaps a new feature of single-point datasets (I seem to recall Sean arguing for something like that at some point) or just a coincidence?
Otherwise, I'm fine with this if you are.

ekluzek · 2023-01-20T20:09:01Z

@olyson this is because I ran subset_data with "--cap-saturation" (which sets FMAX==0). That is how I set up all the single-point datasets. It is something I could remove though. I also used "--uniform-snowpack" for all the single point datasets (sets STD_ELEV==20).

I think this is the way we should do things though, so I'll leave it like that.
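The effect of the two options described above can be illustrated on a plain dictionary of field values. This is a sketch only: the function name and the input values are made up for illustration, and it does not reproduce how subset_data itself is implemented. It only encodes the two overrides stated in the comment (FMAX set to 0, STD_ELEV set to 20).

```python
def apply_single_point_options(fields, cap_saturation=False, uniform_snowpack=False):
    """Return a copy of fields with the overrides described in the thread applied."""
    out = dict(fields)
    if cap_saturation:
        out["FMAX"] = 0.0       # --cap-saturation sets FMAX == 0
    if uniform_snowpack:
        out["STD_ELEV"] = 20.0  # --uniform-snowpack sets STD_ELEV == 20
    return out


# Placeholder values, not taken from any real surface dataset.
grid_cell = {"FMAX": 0.4, "STD_ELEV": 150.0, "SLOPE": 1.2}
site = apply_single_point_options(grid_cell, cap_saturation=True, uniform_snowpack=True)
print(site)
```

Fields that neither option touches, such as SLOPE here, keep the value from the grid cell they were subset from.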

olyson · 2023-01-20T22:39:30Z

Sounds good, thanks.

ekluzek · 2023-01-23T16:43:11Z

The answer changing part of this is that the non-urban single point sites (smallvilleIA, numaIA, brazil) change from half degree grid-cells from mksurfdata to 1-degree for the first two, and then 2-degree to 1-degree for brazil. I think this is OK though, the sites are still similar in their characteristics, and don't drastically change from the original ones. The brazil site can't just use the f19 fsurdat file to get the same results either, as the previous case was close to the f19 gridcell, but rounded off and made to be an exact 2 degree by 2 degree grid cell.

ekluzek · 2023-01-23T17:00:25Z

The answer changing part as mentioned above is that we are using --cap-saturation and --uniform-snowpack for the single point sites.

ekluzek · 2023-01-23T18:54:25Z

We just discussed this in the standup, but we figure the urban datasets should use the 78pft version rather than the 16pft version because it won't matter for the urban datasets and we want to move to always using the 78pft versions rather than having to have both.

ekluzek · 2023-01-23T19:28:13Z

To make sure things are working as expected, I copied the earlier values of the following variables to the new dataset (for vancouverCAN) and showed that I get identical answers to ctsm5.1.dev115: FMAX,STD_ELEV,zbedrock,ORGANIC, and SLOPE. This means things are working as we think they are, which is good to know. For urban there are other fields that are different, but they don't matter for a 100% urban case.
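The check described above amounts to overwriting a fixed set of variables in the new dataset with their earlier values and leaving everything else alone. A minimal sketch of that idea follows, using dictionaries in place of the actual NetCDF files. The function name and all numeric values are placeholders; the list of variable names is the one given in the comment.

```python
# Variables named in the comment above.
RESTORED = ("FMAX", "STD_ELEV", "zbedrock", "ORGANIC", "SLOPE")


def restore_fields(new, old, names=RESTORED):
    """Return a copy of new in which the listed variables take their values from old."""
    out = dict(new)
    for name in names:
        out[name] = old[name]
    return out


# Placeholder values for illustration only.
old = {"FMAX": 0.3, "STD_ELEV": 300.0, "zbedrock": 5.0, "ORGANIC": 2.0, "SLOPE": 1.0, "AREA": 100.0}
new = {"FMAX": 0.0, "STD_ELEV": 20.0, "zbedrock": 9.0, "ORGANIC": 4.0, "SLOPE": 2.0, "AREA": 12000.0}

patched = restore_fields(new, old)
print(patched)
```

Variables outside the list (AREA in this toy example) are left at their new values, which mirrors the remark that the remaining differences don't matter for a 100% urban case.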

ekluzek added the "enhancement" label (new capability or improved behavior of existing capability) Mar 3, 2022

ekluzek changed the title ~~Change mksurfdata_map Makefile to build single-point datasets using subset_data~~ Change mksurfdata_map/mksurfdata_esmf Makefile to build single-point datasets using subset_data Mar 3, 2022

ekluzek added the "next" label (this should get some attention in the next week or two; normally each Thursday SE meeting) Mar 9, 2022

billsacks removed the "next" label (this should get some attention in the next week or two; normally each Thursday SE meeting) Mar 10, 2022

ekluzek added this to the ctsm5.2.0 milestone Apr 27, 2022

This was referenced Jul 14, 2022

Add ability to name a different default config file for subset_data #1809

Closed

Get single point surface datasets from subset_data rather than mksurfdata #1812

Merged

ekluzek mentioned this issue Oct 11, 2022

Bring in Makefile for ctsm5.2 branch to build all datasets with multiple batch submissions at the same time #1869

Closed

ekluzek closed this as completed in #1812 Jan 26, 2023

ekluzek added this to Updated datasets for CTSM5.2 (ctsm5.2.mksurfdata branch and tags) Aug 1, 2024

ekluzek moved this to Done in Updated datasets for CTSM5.2 (ctsm5.2.mksurfdata branch and tags) Aug 1, 2024
