mksurfdat toolchain: Wrapper tool that handles all the steps needed to create a CTSM surface dataset #644
Eventually, I think we want a single python-based tool for this (though we'll probably keep mksurfdata_map in Fortran). In order to break this down into more manageable chunks, I see two possible approaches: (1) Top down: Start by making a small wrapper script that calls the existing tools (a mix of shell scripts and perl, I think). This will involve working out a reasonable user interface for the high-level wrapper. Then we can start converting the individual tools to python one-by-one; as we do, we can call them directly rather than using subprocesses. (2) Bottom up: Start by converting each of the individual tools to python one-by-one; create the wrapper after that is all done. In talking with @mvertens about this, we think that (1) is probably the best approach. |
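A minimal Python sketch of the top-down approach (1), assuming nothing about the final interface: the wrapper holds an ordered list of commands and runs each one as a subprocess. The commands below are placeholders that only print a message, and plan_steps and run_steps are hypothetical names; a real wrapper would list the existing shell and perl tools instead.

```python
import subprocess
import sys
from typing import Callable, List

# Placeholder program: prints its arguments. It stands in for an existing tool.
_ECHO = "import sys; print(' '.join(sys.argv[1:]))"


def plan_steps(resolution: str) -> List[List[str]]:
    """Return the ordered commands the wrapper would run for one resolution."""
    step_names = ["make mapping files", "make surface dataset"]
    return [[sys.executable, "-c", _ECHO, name, resolution] for name in step_names]


def run_steps(steps: List[List[str]], runner: Callable = subprocess.run) -> int:
    """Run each command in order, stopping at the first failure."""
    for command in steps:
        # check=True makes subprocess.run raise CalledProcessError on a
        # non-zero exit status, so later steps never start after a failure.
        runner(command, check=True)
    return len(steps)
```

Calling run_steps(plan_steps("10x15")) runs the two placeholder steps in order. As individual tools are converted to python, an entry in the list could become a direct function call instead of a subprocess.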
The pre-processing step of generating a namelist is discussed here: #86 |
I am transferring the latest proposed approach from the google doc to this discussion. Comments welcome:
A) Using the new wrapper script,
TODO: What script does Ufuk have available for the CESM/CTSM case?
NB. What we have been referring to as namelist here will be a control file (in a format such as namelist, yaml, config-file format, json, xml). There will also be an internal namelist file read by the mksurfdata_map fortran code that will not involve user modification. Critical options in Important options in Critical options for PTCLM, but also for our production of several of our grids for testing. @slevisconsulting adding here that we need to decide how to implement these overrides in the tool-chain. I picture the user deciding to insert these in their generated control file in place of the corresponding mksrf_ files. Or even the mksrf_ file name strings themselves could be used as comma delimited lists when so desired: Options that are probably still useful, but should be looked into and maybe could be done a different way Options from mksurfdata.pl that can be removed:
NOTE: Two requirements. One is that you can still use the Makefile.data makefile to build the standard resolutions needed. Another requirement is that there is testing in place for all of this. There could be unit testing as well as functional and system testing for the entire system. Envisioned changes from TODO2 @slevisconsulting will come up with file naming conventions with @ekluzek and @negin513 and will add the necessary metadata to all raw datasets. TODO3 @negin513 and @slevisconsulting will replace
TODO4 @negin513 and @slevisconsulting will assemble the functions of B) Generating the domain file @mvertens keep in mind when working on this that there are two alternate scripts for step 1 in generating a domain file: |
@ekluzek completed TODO1 above by listing important mksurfdata.pl options and options that may be obsolete. Regarding TODO2 "metadata needed in raw datasets" we decided as follows: srcmeshfile_scrip_w_mask We decided to include four mesh file paths (SCRIP and UNSTRUCT SRC files with and without masks), so as to be prepared for all these options, regardless of whether we complete #823 and #648 . We decided to omit mapping file names from the metadata of the raw datasets. The script will "know" which raw data corresponds to which mapping file based on the SRC grid_name and landmask_name. I will create the new raw dataset files with new date-stamps in the file names. I will add some other clarifications to the proposal text (preceding post). |
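One way the script could match raw datasets to mapping files is sketched below, assuming each raw dataset carries grid_name and landmask_name metadata as described above. Plain dictionaries stand in for metadata read from the files, and the function name, dataset names, and grid names are made up for illustration. Raw datasets that share a (grid_name, landmask_name) pair would share one mapping file.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def group_by_source_grid(
    raw_metadata: Dict[str, Dict[str, str]]
) -> Dict[Tuple[str, str], List[str]]:
    """Group raw dataset names by their (grid_name, landmask_name) pair."""
    groups = defaultdict(list)
    for dataset_name, attrs in raw_metadata.items():
        key = (attrs["grid_name"], attrs["landmask_name"])
        groups[key].append(dataset_name)
    # Sort the names so the result does not depend on input order.
    return {key: sorted(names) for key, names in groups.items()}


# Hypothetical metadata for three raw datasets on two source grids.
example = {
    "raw_soil": {"grid_name": "gridA", "landmask_name": "maskX"},
    "raw_pft": {"grid_name": "gridA", "landmask_name": "maskX"},
    "raw_urban": {"grid_name": "gridB", "landmask_name": "maskY"},
}
needed = group_by_source_grid(example)
```

Here needed has two keys, so only two mapping files would have to be generated for the three raw datasets.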
When you say you're going to list 4 different mesh file paths, do you just mean in the short-term? That seems okay, but I feel like it could add (significantly) more confusion than value in the long-term, so I hope that by the time we release this new method, we'll have gotten this down to a single mesh file name. |
You're right @billsacks , ideally this would be for the short term. Alternatively, I could add a single mesh file path src_mesh_file and modify it later as needed, e.g. once if we go to nomask and once if we go to UNSTRUCT files. The more I think about it, the more I prefer the latter option now. It seemed inefficient, but with my notes the second and third times will be quicker. |
The thing is, each time you change it, you make a new raw data file for each of the datasets needed. So you end up making a ton more datasets. |
I like @slevisconsulting 's suggestion for now. I assume there will be a moderately long period in which this is under development, and these new files are just being used on the development branch, not on master. So they don't need to be added to the inputdata repository. Then we can revisit this question if this is getting close to being ready to come to master and #823 and/or #648 are still unresolved. |
I have added the metadata to the mksrf_ files in a copy that I am keeping here for now: Next is TODO3 that replaces |
I will discuss TODO3 in greater detail in #86 to prevent clutter here. |
I'm thinking about the option "-fast_maps" in current mksurfdata.pl. This is basically covered in issue #450. I think this would be useful to have even in the early versions, because it will make testing so much faster. |
@slevisconsulting , @negin513 , and I had a good discussion about this today. We looked at @negin513 's script that creates the control file, which is then used to create both the mapping files and the namelist file for mksurfdata_map. We looked at the list of current mksurfdata.pl options, and decided -debug can be removed for sure. Along the same lines, I'd say to remove "-inlandwet" and "-merge_gis", as the namelist could be easily modified to turn those things on. We had decided to get rid of most of the single-point options. But, in order to create all the datasets that we currently support, keeping the following options seems prudent... -pft_frc "list of fractions"...Comma delimited list of percentages for veg types. I don't see an easy way to create some of the datasets without supporting the above. I don't think supporting the above is too onerous either. And that does get rid of the four single point soil options. Another good option to keep is -no-surfdata, as that's used in the normal dataset creation. You use that, for example, when you just want to create landuse.timeseries files for all the different SSP options but already have the surface dataset needed, so you don't want to recreate it. |
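A sketch of how the two retained options might look in a python command-line interface, using argparse. The spellings --pft-frc and --no-surfdata, the function names, and the rule that the percentages must sum to 100 are assumptions for illustration, not decisions from this thread.

```python
import argparse
from typing import List


def percent_list(text: str) -> List[float]:
    """Parse a comma delimited list of percentages, e.g. '60,40'."""
    values = [float(item) for item in text.split(",")]
    if any(value < 0 for value in values):
        raise argparse.ArgumentTypeError("percentages must be non-negative")
    # Assumed rule: the vegetation type percentages must add up to 100.
    if abs(sum(values) - 100.0) > 1e-6:
        raise argparse.ArgumentTypeError("percentages must sum to 100")
    return values


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(
        description="Sketch of a surface dataset wrapper interface",
        # This formatter appends each option's default to its help text.
        formatter_class=argparse.ArgumentDefaultsHelpFormatter,
    )
    parser.add_argument(
        "--pft-frc",
        type=percent_list,
        default=None,
        help="comma delimited list of percentages for veg types",
    )
    parser.add_argument(
        "--no-surfdata",
        action="store_true",
        help="skip the surface dataset and only create landuse.timeseries files",
    )
    return parser
```

For example, build_parser().parse_args(["--pft-frc", "60,40", "--no-surfdata"]) yields a pft_frc list of two floats and no_surfdata set to True, while a list that does not sum to 100 is rejected with a usage error.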
Notes from today's meeting with @negin513 and @slevisconsulting based on our notes in the corresponding google doc:
Include default values in help page. Negin, I misspoke when I said that -fast_maps was not needed:
Holding off on single point options for now, but keeping in mind @ekluzek 's comments on this topic.
|
@negin513 presented our progress to date in today's CTSM Software meeting. Thanks everyone for the feedback. I updated the big picture and moving parts of the wrapper script in the schematic according to today's conversation. The group agreed:
All, pls add or correct anything I may have missed. |
Thanks, @slevisconsulting and @negin513 . One question about this schematic that I didn't get a chance to raise today: The way you have drawn this seems to imply that create_surface_data.py would run the whole thing at once, including gen_user_namelist.py. But my understanding from our earlier discussion was that there would be a two-step process: you would first run gen_user_namelist.py, then modify the default namelist as you wish, then have a tool that wraps all of the rest of the steps, given that (modified) namelist as input. Is that still the plan? |
@billsacks I was picturing a wrapper script that runs all the steps at once when generating surface datasets for default resolutions using default raw datasets. The wrapper script would permit the user to stop at step 1 to modify the namelist and/or step 2 to verify that they are satisfied with the mapping files. I am not attached to this view if the group prefers to separate the first step out of the wrapper script. |
I haven't (yet) developed strong feelings on how this should work. I'm just thinking that it sounds like the most common workflow for users (not CTSM maintainers) would be:
1. Run something to generate a default namelist
2. Modify that namelist
3. Run the rest of the tool chain, taking that namelist as input
For me, if I were doing that workflow, I think it would be most intuitive and least error-prone if (1) and (3) were different tools. I think what you're saying (though I may be misunderstanding) is that there would be one tool (create_surface_data.py) that would operate differently and have different command-line usages depending on what steps you want it to do. My gut feeling is that having a single tool that is smart enough to run (or not run) different steps is good when the most common thing is to want to run all of those steps at once, but that if the most common thing is to want to run certain steps separately, then there should be separate tools for those different steps. Of course, others may feel differently than I do on that. |
I'd agree with Bill on this, but I don't have strong feelings. The main point is that the directions for a user are clear and that they can easily follow them to do the more common thing, which is as Bill described. |
I agree with both arguments and I think they are not necessarily exclusive options.
which runs all the steps including namelist generation, since the user does not need to change anything in the namelist. This is similar to what @slevisconsulting is proposing. However, for other cases where the user wants to modify the namelist, we can have it like this:
This is similar to what @dlawrenncar and @billsacks are proposing. This is just a suggestion that incorporates both ideas. But I am not attached to it, and we can go with whatever the group deems appropriate. Overall, the workflow can be changed easily to accommodate either of these opinions. Therefore, I suggest not worrying about it too much for now and getting back to this particular issue during later stages of development. |
Another important point, on which I 100% agree with @dlawrenncar , is that the main goal is to have clear instructions for the users. |
This seems reasonable as long as:
|
@billsacks : I agree with your points. |
New topic... Negin and I thought some more about modifying the fortran to accept the new namelist and realized that it introduces an element of risk by requiring us to reconstitute the mapping filenames in two places:
Particularly, mkmapdata.py and the fortran would be reconstituting names such as Unless there's another way around this or people do not consider it an issue, @negin513 and I propose that we return to @ekluzek 's suggestion of recreating the old namelist under the covers before starting the fortran executable. |
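A minimal sketch of recreating a Fortran namelist under the covers, assuming the python side already holds the final settings (including the full mapping file names, so that they are constructed in one place only). The function name, the namelist group name, and the variable names in the example are hypothetical.

```python
from typing import Dict, Union

Value = Union[bool, int, float, str]


def to_namelist(group: str, settings: Dict[str, Value]) -> str:
    """Render settings as the text of a single Fortran namelist group."""
    lines = [f"&{group}"]
    for key, value in settings.items():
        # bool must be tested before int because bool is a subclass of int.
        if isinstance(value, bool):
            text = ".true." if value else ".false."
        elif isinstance(value, (int, float)):
            text = repr(value)
        else:
            # Fortran escapes a single quote inside a string by doubling it.
            text = "'" + str(value).replace("'", "''") + "'"
        lines.append(f" {key} = {text}")
    lines.append("/")
    return "\n".join(lines) + "\n"


# Hypothetical group and variable names, for illustration only.
example_text = to_namelist(
    "demo_nml",
    {"map_fexample": "/path/to/map_file.nc", "some_flag": False, "some_count": 16},
)
```

With this arrangement the fortran code only reads names that the python tools have already written out, so it never has to rebuild them itself.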
Yes, I see your point. However, in thinking about your latest comment, I'm wondering if there may be a bigger issue here that we've been overlooking - or if I'm thinking about things wrong. The issue I see is: How does mkmapdata.py know which source grid files to use? Your new schematic doesn't seem to address this question, but I think it needs to come from the namelist generated by gen_user_namelist.py, which the user may modify to point to different files. So I think what we really need is for the first step, gen_user_namelist.py, to NOT generate a Fortran namelist, but instead to generate something that can be read by the next step in the python toolchain. I'd suggest a config (cfg) file because they are easy to work with by hand and in python and are part of the python standard library; we also use config files elsewhere in the LILAC toolchain. So I think we might need something like the following:
(1) User invokes gen_user_namelist.py (which should be renamed to not have "namelist" in its name) to generate SOMETHING.cfg. This file contains at least two sections: One section gives the raw data file paths, and one or more sections give other user-modifiable inputs to mksurfdata_map.
(2) The user modifies anything they want in SOMETHING.cfg
(3) User invokes create_surface_data.py. This reads SOMETHING.cfg, and:
So in summary: Yes, I am coming to agree that there needs to be this translation step, but I'm also coming to realize that the user-modifiable file needs to be in a python-friendly format, not a Fortran-friendly format (which itself is a driver for needing a translation step). Does that make sense? Or am I thinking about this the wrong way? |
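A small sketch of the cfg idea using configparser from the python standard library. The section names, keys, and paths are placeholders, not the real contents of SOMETHING.cfg; the point is only that a generated default file can be edited by hand and then read back by the next tool.

```python
import configparser
from typing import Dict, Tuple

# Hypothetical default file, as the first tool might generate it.
DEFAULT_CFG = """\
[raw_data]
mksrf_example_a = /path/to/raw_a.nc
mksrf_example_b = /path/to/raw_b.nc

[options]
no_surfdata = false
"""


def load_settings(text: str) -> Tuple[Dict[str, str], bool]:
    """Read the raw data paths and one boolean option from cfg text."""
    parser = configparser.ConfigParser()
    parser.read_string(text)
    raw_files = dict(parser["raw_data"])
    # getboolean accepts true/false, yes/no, on/off, and 1/0.
    no_surfdata = parser["options"].getboolean("no_surfdata")
    return raw_files, no_surfdata


# A user edit: point one raw dataset at a different file.
edited_cfg = DEFAULT_CFG.replace("/path/to/raw_a.nc", "/my/own/raw_a.nc")
raw_files, no_surfdata = load_settings(edited_cfg)
```

After the edit, raw_files["mksrf_example_a"] holds the user's path while the other entry keeps its default, which is the information the mapping step would need to pick its source grid files.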
@billsacks 's summary makes the most sense to me as well. It's also a nicely done process that allows it to be easily customized or just run out of the box for standard production cases. In one of the discussions I had with @negin513 and @slevisconsulting we also talked about the fact that the SOMETHING.cfg file could be in a different format such as cfg, YAML, or JSON or whatever else you want. I like @billsacks 's suggestion of cfg because it's in LILAC and in the standard python library. |
I have updated the schematic (see slide 5) to reflect these preferences. |
@slevisconsulting can you give us access to your slides? |
This comment reminds me that I need to go back and modify the contents of the raw datasets to point to the nomask SRC files. |
Once I do that, I will still NOT copy new datasets to /inputdata until I hear otherwise. |
Done. The files are in /glade/campaign/cgd/tss/slevis/rawdata as before, but now organized in two subdirectories:
|
With #1663 this becomes obsolete. So closing this as a wontfix. |
Starting here with notes from Mike Barlage that explain how he generates CTSM surface data for WRF domains, because we're using the application of WRF domains as the motivation for redesigning the mksurfdat toolchain.
Mike's notes:
Creating setup based on WRF domain - CTSM Cheyenne/Geyser
create_scrip_file.ncl
creates two files that are complements of each other only in the mask field
script and data reside: /glade/work/barlage/ctsm/nldas_grid/scrip
Modify mkunitymap.ncl with commented lines
+; if ( any(ncb->grid_imask .ne. 1.0d00) )then
+; print( "ERROR: the mask of the second file isn't identically 1!" );
+; print( "(second file should be land grid file)");
+; exit
+; end if
Link scrip files
ln -sf /glade/work/barlage/ctsm/nldas_grid/scrip/wrf2clm_land_noneg.nc .
ln -sf /glade/work/barlage/ctsm/nldas_grid/scrip/wrf2clm_ocean_noneg.nc .
setenv GRIDFILE1 wrf2clm_ocean_noneg.nc
setenv GRIDFILE2 wrf2clm_land_noneg.nc
setenv MAPFILE wrf2clm_mapping_noneg.nc
setenv PRINT TRUE
ncl mkunitymap.ncl
Will throw some git errors if not run in a repo
*** takes a few seconds
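For reference, the setenv lines and the ncl call above could be driven from python roughly as follows. This is only a sketch: the function names are hypothetical, and it assumes ncl and mkunitymap.ncl are available in the working directory exactly as in the notes.

```python
import os
import subprocess
from typing import Callable, Dict, Optional


def unity_map_env(
    ocean_grid: str,
    land_grid: str,
    map_file: str,
    base_env: Optional[Dict[str, str]] = None,
) -> Dict[str, str]:
    """Build the environment that mkunitymap.ncl reads its inputs from."""
    env = dict(os.environ if base_env is None else base_env)
    env.update(
        {
            "GRIDFILE1": ocean_grid,
            "GRIDFILE2": land_grid,
            "MAPFILE": map_file,
            "PRINT": "TRUE",
        }
    )
    return env


def run_mkunitymap(env: Dict[str, str], runner: Callable = subprocess.run):
    """Run 'ncl mkunitymap.ncl' with the given environment."""
    return runner(["ncl", "mkunitymap.ncl"], env=env, check=True)
```

Passing the environment explicitly keeps the settings out of the user's shell, so the same call works whether the wrapper is started from csh or bash.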
qsub regridbatch_barlage.sh
copy stored in ~/src/ctsm/regrid_scripts/
Build:
cd src/
../../../configure --macros-format Makefile --mpilib mpi-serial
(source ./.env_mach_specific.csh ; gmake)
cd ..
./gen_domain -m /glade/work/barlage/ctsm/nldas_grid/scrip/wrf2clm_mapping_noneg.nc -o wrf2clm_ocn_noneg -l wrf2clm_lnd_noneg
creates:
domain.lnd.wrf2clm_lnd_wrf2clm_ocn.180808.nc
domain.ocn.wrf2clm_lnd_wrf2clm_ocn.180808.nc
domain.ocn.wrf2clm_ocn.180808.nc
copy to /glade/work/barlage/ctsm/nldas_grid/gen_domain_files
*** takes a few seconds