Transient single-point dataset capability for subset_data #1673

Open
2 of 4 tasks
ekluzek opened this issue Mar 3, 2022 · 35 comments
Assignees
Labels: enhancement (new capability or improved behavior of existing capability)

Comments

@ekluzek
Collaborator

ekluzek commented Mar 3, 2022

In moving from mksurfdata.pl to using subset_data for single-point datasets, one capability we removed is the ability to create transient single-point datasets. We have this in place for smallville testing of dynamic landunits, and we have one test for a tower site with transient land-use changes.

This relates to:

#1664

Definition of done:

  • Implement a simple bash script to do this on the ctsm5.2 branch
  • Add to the Makefile on the ctsm5.2 branch
  • Add this as a capability into subset_data
  • Investigate: Smallville hist tests in the ctsm5.2 branch (ctsm5.2.0 -- ctsm5.2.mksurfdata #2372) indicate that pct_nat_pft for 1850 is inconsistent between the fsurdat and landuse files generated by subset_data
@wwieder
Contributor

wwieder commented Mar 4, 2022

This is potentially a useful feature to maintain, although I suspect it's not used often.

Can a user still run a generic single-point case with global DATM inputs and the global land use time series? If so, that may provide enough functionality for the majority of cases.

Would being able to subset the land use time series file for a single gridcell give users additional flexibility to configure their own specific land use time series?

Are there urban configurations, especially with transient urban now enabled, for which this would be helpful, @olyson?

@wwieder
Contributor

wwieder commented Mar 4, 2022

One more note: didn't @swensosc 's old subset_data script have a simple way of doing this that could be brought over into @negin513 's new script?

if create_landuse:

Again, this may not be high priority, but we can discuss at our CLM meeting next week in conjunction with a broader datasets conversation.

@ekluzek
Collaborator Author

ekluzek commented Mar 4, 2022

You can use subset_data to subset from your landuse.timeseries file, but that's only going to work if you aren't overriding the PFTs for your data (same with the original subset_surfdata you show above). We have two sites we do that for: 1x1_numaIA and 1x1_brazil (so we can keep those two working). The issue is that in a normal tower-site case you do override the PFTs, so the current transient capability is only going to work if you happen to find a point from a global dataset that has the right PFTs to begin with. Possibly using a higher-resolution global grid will help, but you are still likely to get a mix of PFTs even at our finest resolutions.

For the smallville site we constructed specific landuse changes that happen over a few years. We don't have any capacity to construct transient changes like that now. PTCLM also had the ability to do transient for the US-Ha1 tower site: what it had was a harvest in 1946, and adding that one harvest is all its transient timeseries file does. But that is a useful feature. Again, that's the kind of thing you have to construct rather than take from the global landuse.timeseries file.
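
For illustration, here is a minimal sketch of the kind of constructed change described above, done with xarray on an already-subset single-point landuse.timeseries file. The harvest variable name (HARVEST_VH1) and the harvest fraction are assumptions for the example, not values taken from PTCLM:

import numpy as np
import xarray as xr

def add_single_harvest(fluse_in, fluse_out, harvest_year=1946, harvest_frac=0.5):
    """Keep the PFTs fixed but add one wood-harvest event in harvest_year.
    HARVEST_VH1 (primary-forest wood harvest) is an assumed variable name;
    check the actual landuse.timeseries file for what it contains."""
    ds = xr.open_dataset(fluse_in).load()
    years = np.asarray(ds["YEAR"]).squeeze().astype(int)
    harvest = ds["HARVEST_VH1"].values
    harvest[:] = 0.0                                      # no harvest in any year...
    harvest[years == harvest_year, ...] = harvest_frac    # ...except the one 1946 event
    ds.to_netcdf(fluse_out)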

@olyson
Contributor

olyson commented Mar 4, 2022

For urban and in general, single-point transient capability would be useful for troubleshooting problems encountered in global or regional simulations. I can't think of any specific urban use cases other than that.

@negin513
Contributor

negin513 commented Mar 4, 2022

One more note: didn't @swensosc 's old subset_data script have a simple way of doing this that could be brought over into @negin513 's new script?

if create_landuse:

Again, this may not be high priority, but we can discuss at our CLM meeting next week in conjunction with a broader datasets conversation.

Hello!
I think this capability already exists in the subset_data script via the --create-landuse option.
For example, if you run this, it will create the landuse file:

./subset_data point --create-landuse --include-nonveg --verbose

The code corresponding to it is here:

# -- Create CTSM transient landuse data file
if single_point.create_landuse:
    single_point.create_landuse_at_point(
        file_dict["fluse_dir"], file_dict["fluse_in"], args.user_mods_dir
    )
# -- Create single point atmospheric forcing data

and here:

def create_landuse_at_point(self, indir, file, user_mods_dir):
    """
    Create landuse file at a single point.
    """
    logger.info(
        "----------------------------------------------------------------------"
    )
    logger.info(
        "Creating land use file at %s, %s.",
        self.plon.__str__(),
        self.plat.__str__(),
    )
    # specify files
    fluse_in = os.path.join(indir, file)
    fluse_out = add_tag_to_filename(fluse_in, self.tag)
    logger.info("fluse_in: %s", fluse_in)
    logger.info("fluse_out: %s", os.path.join(self.out_dir, fluse_out))
    # create 1d coordinate variables to enable sel() method
    f_in = self.create_1d_coord(fluse_in, "LONGXY", "LATIXY", "lsmlon", "lsmlat")
    # extract gridcell closest to plon/plat
    f_out = f_in.sel(lsmlon=self.plon, lsmlat=self.plat, method="nearest")
    # expand dimensions
    f_out = f_out.expand_dims(["lsmlat", "lsmlon"])
    # specify dimension order
    f_out = f_out.transpose("time", "cft", "natpft", "lsmlat", "lsmlon")
    # revert expand dimensions of YEAR
    year = np.squeeze(np.asarray(f_out["YEAR"]))
    temp_xr = xr.DataArray(
        year, coords={"time": f_out["time"]}, dims="time", name="YEAR"
    )
    temp_xr.attrs["units"] = "unitless"
    temp_xr.attrs["long_name"] = "Year of PFT data"
    f_out["YEAR"] = temp_xr
    # update attributes
    self.update_metadata(f_out)
    f_out.attrs["Created_from"] = fluse_in
    wfile = os.path.join(self.out_dir, fluse_out)
    self.write_to_netcdf(f_out, wfile)
    logger.info("Successfully created file (fluse_out), %s", wfile)
    f_in.close()
    f_out.close()
    # write to user_nl_clm data if specified
    if self.create_user_mods:
        with open(os.path.join(user_mods_dir, "user_nl_clm"), "a") as nl_clm:
            line = "flanduse_timeseries = '${}'".format(
                os.path.join(USRDAT_DIR, fluse_out)
            )
            self.write_to_file(line, nl_clm)

@ekluzek
Collaborator Author

ekluzek commented Mar 4, 2022

@negin513 yes, as I say above, the ability to subset landuse.timeseries files exists. But it's not going to function in a usable way if you are overriding the PFTs for the site. The landuse timeseries file from the global dataset is going to have a different PFT distribution that won't line up with what you want to override it with. To both override the PFTs and allow a transient change in time, there needs to be a mechanism to not only override the PFTs but also specify how they change in time. And you might want to specify the harvest for each year as well. I say this elsewhere -- we can use this capability for some specific sites: 1x1_numaIA and 1x1_brazil (since we don't override the PFTs there). We can't use it for constructed transient changes like we do for 1x1_smallvilleIA and 1x1_US-Ha1. To catch the joke, Smallville IA is the place where Superman was raised, so it's not a real place; the PFTs and transient changes are completely made up. :-)

@negin513
Contributor

negin513 commented Mar 4, 2022

Oh! I see what you mean here. Thanks for clarifying it.

But it's not going to function in a usable way if you are overriding the PFTs for the site. The landuse timeseries file from the global dataset is going to have a different PFT distribution that won't line up with what you want to override it with.

  1. So possibly we should print a warning or error if the user specifies --dompft and --create-landuse at the same time? (A sketch of such a check is below.)
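
A minimal sketch of what that check could look like in subset_data's argument handling (the attribute names args.dom_pft and args.create_landuse are assumptions; the real option handling may differ):

import argparse
import logging

logger = logging.getLogger(__name__)

def check_landuse_args(args: argparse.Namespace):
    """Warn when --dompft and --create-landuse are combined, since the subset
    landuse.timeseries would not match the overridden PFT distribution."""
    if args.create_landuse and args.dom_pft is not None:
        logger.warning(
            "--dompft overrides the PFTs, so the subset landuse.timeseries "
            "file will not be consistent with the requested PFT distribution."
        )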

To both override the PFTs and allow a transient change in time, there needs to be a mechanism to not only override the PFTs but also specify how they change in time. And you might want to specify the harvest for each year as well.

  1. I understand now. That is an interesting idea. We probably want to think about the best way to implement this feature if possible.

To catch the joke, Smallville IA is the place where Superman was raised, so it's not a real place; the PFTs and transient changes are completely made up. :-)

Haha! I did not know about Smallville. I kept thinking it is a real place. 😄

@ekluzek ekluzek added the 'next' label (this should get some attention in the next week or two; normally each Thursday SE meeting) Mar 9, 2022
@negin513
Contributor

In the ctsm SE meeting, we have decided on the following format for now:

year                  pft                   pft_weight
1950                  1, 5                  0.5, 0.5
1951                  18, 22                0.3, 0.7
1952                  18, 25                0.3, 0.7

An additional feature would be filling in years that don't exist using the previous line.

@billsacks
Member

Thanks, @negin513 - that looks like a very good format. One minor detail is that I'd probably get rid of the spaces within a given area (like the 1, 5) and allow any mix of whitespace (spaces or tabs) in between the areas. That would let you do an initial split on whitespace, followed by a split on commas.
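
A rough sketch of a reader for that format along these lines (hypothetical file layout: a header line, whitespace between the three columns, commas with no spaces inside each column, and missing years filled from the previous line):

def read_transient_pft_file(path):
    """Parse 'year  pft  pft_weight' lines: split on whitespace first,
    then split the pft and pft_weight fields on commas."""
    entries = {}
    with open(path) as fin:
        next(fin)  # skip the header line
        for line in fin:
            if not line.strip():
                continue
            year, pfts, weights = line.split()
            entries[int(year)] = (
                [int(p) for p in pfts.split(",")],
                [float(w) for w in weights.split(",")],
            )
    # fill in missing years by repeating the previous line
    filled = {}
    previous = None
    for year in range(min(entries), max(entries) + 1):
        previous = entries.get(year, previous)
        filled[year] = previous
    return filled

For a file written in that style from the table above, this would return something like {1950: ([1, 5], [0.5, 0.5]), 1951: ([18, 22], [0.3, 0.7]), 1952: ([18, 25], [0.3, 0.7])}.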

@mvertens

@erik - I'm not sure what you mean by having the mksurfdata_esmf Makefile build single point datasets.
Currently CMake is being used - and a temporary Makefile exists just to build the executable. This Makefile will disappear once the next ESMF release takes place and we can point to a stable ESMF library. I'd like to understand this requirement in more detail - maybe in a meeting with @slevis and @ekluzek .

@ekluzek
Collaborator Author

ekluzek commented Mar 15, 2022

This is not a requirement for mksurfdata_esmf; it's a requirement for the subset_data tool.

@mvertens

@ekluzek - thanks for clarifying. That makes sense.

@billsacks billsacks added the 'enhancement' label (new capability or improved behavior of existing capability) and removed the 'type: -discussion' and 'next' labels Mar 17, 2022
@ekluzek ekluzek changed the title Transient single-point dataset capability? Transient single-point dataset capability Apr 27, 2022
@ekluzek ekluzek added this to the ctsm5.2.0 milestone Apr 27, 2022
@ekluzek
Collaborator Author

ekluzek commented Aug 4, 2022

We talked about this some at the CTSM software meeting this morning as this is needed to create single-point transient datasets. @negin513 and I are meeting on this tomorrow.

@ekluzek ekluzek changed the title Transient single-point dataset capability Transient single-point dataset capability for subset_data Aug 4, 2022
@ekluzek
Collaborator Author

ekluzek commented Aug 5, 2022

@negin513 and I met on this, and she has more comments coming. We worked out the UI for how this should work. She will also do the work needed for this. There is only one file that we need for this for surface-dataset generation, so we can wait on it until later.

@ekluzek
Collaborator Author

ekluzek commented Jan 26, 2023

@wwieder this was something that Negin was going to do, but obviously can't now. This is important for CTSM5.2 in that there is one test dataset that needs this capability. Keeping this testing is important long term, but maybe we don't need to hold CTSM5.2 for it. I haven't looked into how long this would take to accomplish. But do you have thoughts on whether we should make it a requirement for CTSM5.2 or wait until post-CTSM5.2?

@billsacks
Member

My feeling is that we want to have the capability long-term, but if it isn't in place for CTSM5.2 we can probably pretty easily put together the needed transient dataset(s) through a manual / one-off process.

@wwieder
Contributor

wwieder commented Jan 30, 2023

I agree with @billsacks here, this is something we want long term, but that doesn't need to hold up the CTSM5.2 development (or release). We can create the dataset needed for testing, with the understanding that at some point users will request this functionality with a modern code base.

Should we close this issue with a 'won't fix' label (for now) or leave it open?

@ekluzek
Collaborator Author

ekluzek commented Jan 30, 2023

@wwieder let's leave it open, although I will mark it low priority for now. I needed to know the plan for it to know whether CTSM5.2 should be held up for it. I'll adjust the CTSM5.2 project board as well.

@ekluzek
Collaborator Author

ekluzek commented Aug 24, 2023

As a short-term way to do this, I'm going to initially implement it with a simple bash script using NCO and the older file:

See

#1869 (comment)

@ekluzek
Collaborator Author

ekluzek commented Nov 14, 2023

Our current plan is for @slevis-lmwg to do the first step of doing this in a bash script.

@ekluzek ekluzek removed the 'priority: low' label (background task that doesn't need to be done right away) Nov 14, 2023
@slevis-lmwg
Contributor

...sorry for my confusion about the card associated with this issue. I put it back where I found it.

@wwieder
Contributor

wwieder commented Dec 6, 2023

After talking to @slevis-lmwg we wanted to clarify the scope for this issue.

  • Scientifically, I'd imagine that users can point to a global land use timeseries for a single-point simulation and it runs OK (we're already pointing to global lightning stream data). This makes me think the focus of this issue is more related to
  • Software testing (e.g., making sure we have smallville and tower tests that are working with CLM5 surface data).

@ekluzek can you help clarify whether this assessment is accurate? If so, how critical is this capability before we bring in the CTSM5.2 tag?

@ekluzek
Collaborator Author

ekluzek commented Dec 6, 2023

@wwieder unfortunately I think this is important for our testing, and so critical to do. If it were just regular transient time series it probably wouldn't be a big deal. But this is how we test both transient lake and transient urban. See the test directories that use smallville...

smallville_dynlakes_monthly/user_nl_clm:flanduse_timeseries = '$DIN_LOC_ROOT/lnd/clm2/surfdata_map/landuse.timeseries_1x1_smallvilleIA_hist_78pfts_simyr1850-1855_dynLakes_c200928.nc'
smallville_dynurban_monthly/user_nl_clm:! The flanduse_timeseries file was created with the following NCL script (a copy of this script is in cime_config/testdefs/testmods_dirs/clm/smallville_dynurban_monthly):
smallville_dynurban_monthly/user_nl_clm:!flanduse_timeseries = '$DIN_LOC_ROOT/lnd/clm2/surfdata_map/landuse.timeseries_1x1_smallvilleIA_hist_78pfts_simyr1850-1855_dynUrban_c220223.nc'
smallville_dynurban_monthly/user_nl_clm:flanduse_timeseries = '$DIN_LOC_ROOT/lnd/clm2/surfdata_map/landuse.timeseries_1x1_smallvilleIA_hist_78pfts_simyr1850-1855_dynUrban_c220223.nc'

So this is important software testing, but also important testing of scientific features we need to keep working.

However, as I write this I realize that CTSM5.2 has transient lake and urban already in. So actually maybe we could remove those two tests (or modify them to do this with global datasets)?

It's probably OK not to have a single-point test of transient flanduse_timeseries files; I'm pretty confident that would be fine, although long term we still want this capability.

So possibly the task is to make sure we have tests that ensure transient lake and urban are working? There could also be tests to make sure you can turn just those features on.

@wwieder and @slevis-lmwg what do you think?

@wwieder
Contributor

wwieder commented Dec 6, 2023

OK, so the issue is really focused on testing. This will help us decide the prioritization for @slevis-lmwg to do this.

For testing purposes, it seems like this can be kind of a one-off; we just need a tool that creates the land use time series for point simulations that exercise lake, urban, (and other) features?

I agree that testing is important and will defer to you, Sam and @olyson about the best way to ensure good testing coverage for transient features we want to support with the CTSM5.2 dataset.

@slevis-lmwg
Contributor

slevis-lmwg commented Dec 11, 2023

  1. In
    /glade/work/slevis/git/mksurfdata_toolchain/cime_config/testdefs/testmods_dirs/clm/smallville_dynlakes_monthly
    follow this order:
    a) subset_data to generate landuse.nc for smallville by picking the correct lat/lon. From Erik's makefile:
    SUBSETDATA_1X1_SMALL := --lat 40.6878 --lon 267.0228 --site 1x1_smallvilleIA
    b) trim output global file to 1850-1855 (can subset_data do that for me?)
    c) ncap2 -s "PCT_LAKE=array(0.0,0.0,PCT_CROP); PCT_LAKE={0.,50.,25.,25.,25.,25.}; HASLAKE=array(1.,1.,AREA); PCT_CROP=array(0.0,0.0,PCT_LAKE); PCT_CROP={0.,25.,12.,12.,12.,12.}" landuse.timeseries_1x1_smallvilleIA_hist_78pfts_simyr1850-1855_cNEW_FILE.nc landuse.timeseries_1x1_smallvilleIA_hist_78pfts_simyr1850-1855_dynLakes_cNEWEST_FILE.nc

  2. Repeat for smallville_dynurban_monthly and likely a 3rd case:
    c) The ncap2 command will differ.

Make this reproducible by placing it in a (bash) script and test by running the smallville tests from the testlists. Do this to test on derecho:

git show ctsm5.1.dev158:Externals.cfg > Externals.cfg
manage_externals/checkout_externals

@slevis-lmwg
Contributor

slevis-lmwg commented Jan 3, 2024

Update

  1. a) In /glade/work/slevis/git/mksurfdata_toolchain/tools/site_and_regional, I executed:
    ./subset_data point --lat 40.6878 --lon 267.0228 --site 1x1_smallvilleIA --create-surface --create-landuse --crop
    and got this file (in subdirectory /subset_data_single_point):
    landuse.timeseries_1x1_smallvilleIA_hist_78_CMIP6_1850-2015_c240103.nc
    b) subset_data cannot trim the file, so I used ncks:
    ncks -d time,0,5 landuse.timeseries_1x1_smallvilleIA_hist_78_CMIP6_1850-2015_c240103.nc landuse.timeseries_1x1_smallvilleIA_hist_78_CMIP6_1850-1855_c240103.nc
    c) ncap2 -s "PCT_LAKE=array(0.,0.,PCT_CROP); PCT_LAKE={0.,50.,25.,25.,25.,25.} ; PCT_LAKE_MAX=array(50.,50.,PCT_CROP_MAX); PCT_CROP=array(0.,0.,PCT_LAKE); PCT_CROP={0.,25.,12.,12.,12.,12.}; PCT_CROP_MAX=array(25.,25.,PCT_LAKE_MAX)" landuse.timeseries_1x1_smallvilleIA_hist_78_CMIP6_1850-1855_c240103.nc landuse.timeseries_1x1_smallvilleIA_hist_78pfts_1850-1855_dynLakes_c240103.nc
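
If this step eventually gets folded into subset_data (the third checkbox at the top), the same trim-and-overwrite could be done directly with xarray instead of ncks/ncap2. A minimal sketch, with values copied from the ncap2 command above, assuming the usual single-point dimensions (time, lsmlat, lsmlon) for PCT_CROP and PCT_LAKE; the function name is made up:

import xarray as xr

def make_dynlakes_landuse(fluse_in, fluse_out):
    # keep the first 6 time samples (1850-1855), equivalent to "ncks -d time,0,5"
    ds = xr.open_dataset(fluse_in).isel(time=slice(0, 6)).load()
    # constructed lake/crop transitions for the single smallville gridcell
    ds["PCT_LAKE"] = xr.zeros_like(ds["PCT_CROP"])
    ds["PCT_LAKE"].values[:, 0, 0] = [0.0, 50.0, 25.0, 25.0, 25.0, 25.0]
    ds["PCT_LAKE_MAX"] = xr.full_like(ds["PCT_CROP_MAX"], 50.0)
    ds["PCT_CROP"].values[:, 0, 0] = [0.0, 25.0, 12.0, 12.0, 12.0, 12.0]
    ds["PCT_CROP_MAX"] = xr.full_like(ds["PCT_LAKE_MAX"], 25.0)
    ds.to_netcdf(fluse_out)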

@slevis-lmwg
Contributor

  1. c) ncap2 -s "PCT_URBAN=array(0.,0.,PCT_URBAN); PCT_URBAN={0.,0.,0.,20.,15.,0.,10.,8.,0.,10.,8.,0.,10.,8.,0.,10.,8.,0.} ; PCT_URBAN_MAX=array(0.,0.,PCT_URBAN_MAX); PCT_URBAN_MAX={20.,15.,0.}; PCT_CROP=array(0.,0.,PCT_LAKE); PCT_CROP={0.,25.,12.,12.,12.,12.}; PCT_CROP_MAX=array(25.,25.,PCT_LAKE_MAX)" landuse.timeseries_1x1_smallvilleIA_hist_78_CMIP6_1850-1855_c240103.nc landuse.timeseries_1x1_smallvilleIA_hist_78pfts_1850-1855_dynUrban_c240103.nc

@slevis-lmwg
Contributor

I have not found a 3rd smallville case to address, if there even is one.
To make the steps reproducible, I created this script:
/glade/work/slevis/git/mksurfdata_toolchain/tools/modify_input_files/modify_smallville_w_dynurban_and_lake.sh

I updated Externals.cfg to dev159, ran ./manage_externals/..., and updated the 2 user_nl_clm files in the smallville testmod directories. The smallville tests PASS:

./create_test ERS_Lm25.1x1_smallvilleIA.IHistClm50BgcCropQianRs.derecho_gnu.clm-smallville_dynlakes_monthly
./create_test ERS_Lm25.1x1_smallvilleIA.IHistClm50BgcCropQianRs.derecho_gnu.clm-smallville_dynurban_monthly

@ekluzek
Collaborator Author

ekluzek commented Jan 4, 2024

@slevis-lmwg it looks like we removed testing for this dataset. I'll look into that some more...

Here is the previous file that was used:

/glade/campaign/cesm/cesmdata/cseg/inputdata/lnd/clm2/surfdata_map/release-clm5.0.18/landuse.timeseries_1x1_smallvilleIA_hist_78pfts_CMIP6_simyr1850-1855_c190214.nc

Note, it says 1850-1855, but it's really a constructed file that exercises specific landuse transitions in those 5 years. So it covers all the types of changes in a short test.

Look into creating that file, and if it's easy enough we could add it back into our testing.

@slevis-lmwg
Contributor

Here is the info from the above landuse file that needs to be replicated. Each line is a year (1850-1855):

 input_pftdata_filename =
  "<pft_f>100</pft_f><pft_i>13</pft_i><harv>0,0,0,0,0</harv><graz>0</graz>",
  "<pft_f>100</pft_f><pft_i>13</pft_i><harv>0,0,0,0,0</harv><graz>0</graz>",
  "<pft_f>1,1,1,1,1,1,1,1,1,91</pft_f><pft_i>15,16,17,18,19,20,21,22,23,24</pft_i><harv>0,0,0,0,0</harv><graz>0</graz>",
  "<pft_f>91,1,1,1,1,1,1,1,1,1</pft_f><pft_i>15,16,17,18,19,20,21,22,23,24</pft_i><harv>0,0,0,0,0</harv><graz>0</graz>",
  "<pft_f>50,1,2,2,3,3,4,4,5,5,21</pft_f><pft_i>13,15,16,17,18,19,20,21,22,23,24</pft_i><harv>0,0,0,0,0</harv><graz>0</graz>",
  "<pft_f>75,1,1,1,1,1,1,1,1,1,16</pft_f><pft_i>13,15,16,17,18,19,20,21,22,23,24</pft_i><harv>0,0,0,0,0</harv><graz>0</graz>" ;

@slevis-lmwg
Contributor

slevis-lmwg commented Jan 8, 2024

I updated this script (originally created above)
/glade/work/slevis/git/mksurfdata_toolchain/tools/modify_input_files/modify_smallville_w_dynurban_and_lake.sh
to generate the third landuse file.

I updated the /cropMonthOutput testmod's user_nl_clm and this test now passes:
ERS_Ly6.1x1_smallvilleIA.IHistClm50BgcCropQianRs.derecho_gnu.clm-cropMonthOutput

@slevis-lmwg
Contributor

The last test passed. I renamed the .sh script to modify_smallville.sh.

@slevis-lmwg
Contributor

slevis-lmwg commented Jan 12, 2024

TODO slevis

  • Make branch, open PR, push the new script to my remote
  • copy the three new files to /glade/campaign/... (do NOT import)
  • complete the 2nd checkbox at the top. In summary:
    Add a new target for smallville transient and add that target to the subset-all target. Follow crop-smallville-historical as a template.

@slevis-lmwg
Contributor

Copied the 3 landuse files and the corresponding fsurdat file:

landuse.timeseries_1x1_smallvilleIA_hist_78pfts_1850-1855_dynLakes_c240103.nc
landuse.timeseries_1x1_smallvilleIA_hist_78pfts_1850-1855_dynPft_c240103.nc
landuse.timeseries_1x1_smallvilleIA_hist_78pfts_1850-1855_dynUrban_c240103.nc
surfdata_1x1_smallvilleIA_hist_78pfts_CMIP6_1850-2015_c240103.nc

to /glade/campaign/cesm/cesmdata/inputdata/lnd/clm2/surfdata_esmf/ctsm5.2.0

slevis-lmwg added a commit that referenced this issue Feb 16, 2024
Workaround for transient Smallville tests #1673 + testing all new datasets
@slevis-lmwg
Contributor

UPDATE
Smallville hist tests in the ctsm5.2 branch #2372 indicate that pct_nat_pft for 1850 is inconsistent between the fsurdat and landuse files generated by subset_data. This needs further investigation later.
