Minimize scripts in a job run #1254

Closed
amametjanov opened this issue Feb 3, 2017 · 18 comments

@amametjanov
Member

case.submit already sets up the namelists in the run directory; however, at the start of a job, this is repeated, e.g.:

timing dir is /project/projectdirs/acme
Checking file env_case
Checking file env_mach_pes
Checking file env_build
Running /global/u2/a/azamat/cori/repos/ACME-master/cime/../components/cam/cime_config/buildnml:
     CAM writing dry deposition namelist to drv_flds_in 
Writing ocean component namelist to ./docn_in 
CAM writing namelist to atm_in
    
Running /global/u2/a/azamat/cori/repos/ACME-master/cime/../components/clm/cime_config/buildnml:
     Warning:: running with user defined cppdefs is NOT validated / scientifically supported.
CLM configure done.
CLM adding use_case 2000_control defaults for var 'sim_year' with val '2000'
CLM adding use_case 2000_control defaults for var 'sim_year_range' with val 'constant'
CLM adding use_case 2000_control defaults for var 'use_case_desc' with val 'Conditions to simulate 2000 land-use'
    
Running /global/u2/a/azamat/cori/repos/ACME-master/cime/../components/cice/cime_config/buildnml:
     Setting CESM root directory to /global/u2/a/azamat/cori/repos/ACME-master
Setting CICE configuration script directory to /global/u2/a/azamat/cori/repos/ACME-master/components/cice/bld
Setting CICE build directory to /global/u2/a/azamat/cori/repos/ACME-master/cime/scripts/cases/FC5AV1C-04P2-ne30_ne30-01/Buildconf/ciceconf
The configuration cache file will be created in /global/u2/a/azamat/cori/repos/ACME-master/cime/scripts/cases/FC5AV1C-04P2-ne30_ne30-01/Buildconf/ciceconf/config_cache.xml
Is bc_dep_to_snow_updates active (0-NO; 1-YES)?: 1
Using MCT for comp_intf.
cice : mode         is prescribed 
cice : ncat         is 1 
Horizontal grid specifier: ne30np4.
cice : grid         is ne30np4    
cice : nlon         is 48602    
cice : nlat         is 1    
cice : bsizex       is    
cice : bsizey       is    
cice : mxblcks      is    
cice : decomp type  is   
CPP definitions set by configure: '  -DCCSMCOUPLED -Dcoupled -Dncdf -DNCAT=1 -DNXGLOB=48602 -DNYGLOB=1 -DNTR_AERO=0 -DMODAL_AER'
creating /global/u2/a/azamat/cori/repos/ACME-master/cime/scripts/cases/FC5AV1C-04P2-ne30_ne30-01/Buildconf/ciceconf/Filepath
creating /global/u2/a/azamat/cori/repos/ACME-master/cime/scripts/cases/FC5AV1C-04P2-ne30_ne30-01/Buildconf/ciceconf/CICE_cppdefs
creating /global/u2/a/azamat/cori/repos/ACME-master/cime/scripts/cases/FC5AV1C-04P2-ne30_ne30-01/Buildconf/ciceconf//global/u2/a/azamat/cori/repos/ACME-master/cime/scripts/cases/FC5AV1C-04P2-ne30_ne30-01/Buildconf/ciceconf/config_cache.xml
CICE configure done.
    
Running /global/u2/a/azamat/cori/repos/ACME-master/cime/components/data_comps/docn/cime_config/buildnml:
    
    
Running /global/u2/a/azamat/cori/repos/ACME-master/cime/components/stub_comps/sglc/cime_config/buildnml:
    
    
Running /global/u2/a/azamat/cori/repos/ACME-master/cime/components/stub_comps/swav/cime_config/buildnml:
    
    
Running /global/u2/a/azamat/cori/repos/ACME-master/cime/../components/rtm/cime_config/buildnml:
    
    
Running /global/u2/a/azamat/cori/repos/ACME-master/cime/driver_cpl/cime_config/buildnml:
     infile is /global/u2/a/azamat/cori/repos/ACME-master/cime/scripts/cases/FC5AV1C-04P2-ne30_ne30-01/Buildconf/cplconf/namelist 
 Read fldsin file: /global/u2/a/azamat/cori/repos/ACME-master/cime/scripts/cases/FC5AV1C-04P2-ne30_ne30-01/Buildconf/camconf/drv_flds_in
    
-------------------------------------------------------------------------
 - To prestage required restarts, untar a restart.tar file into /global/cscratch1/sd/azamat/acme_scratch/FC5AV1C-04P2-ne30_ne30-01/run
 - Case input data directory (DIN_LOC_ROOT) is /project/projectdirs/acme/inputdata 
 - Checking for required input datasets in DIN_LOC_ROOT
-------------------------------------------------------------------------
total tasks is: 5400
2017-01-31 12:19:11 MODEL EXECUTION BEGINS HERE
run command is srun  -c 4   --cpu_bind=cores  -n 5400  /global/cscratch1/sd/azamat/acme_scratch/FC5AV1C-04P2-ne30_ne30-01/bld/acme.exe  >> acme.log.170131-121301 2>&1
2017-01-31 12:26:40 MODEL EXECUTION HAS FINISHED
Failed to kill syslog: [Errno 3] No such process
check for resubmit
dout_s False
mach cori-knl
resubmit_num 0
@jgfouca
Member

jgfouca commented Feb 6, 2017

@amametjanov I assume your issue is that buildnml is slow and you don't want to repeat it. Marianna has made changes on the ESMCI side that minimize and optimize namelist building, so this should be resolved once we merge 5.2 into ACME.

@jgfouca
Member

jgfouca commented Mar 15, 2017

@amametjanov this is one of the core problems with CIME-5* and I don't know if it's ever going to be fixed. The best I can tell you for now is that we're aware of it.

@jgfouca jgfouca closed this as completed Mar 15, 2017
@rljacob
Member

rljacob commented Mar 15, 2017

Az, was there another problem caused by calling the script twice?

@amametjanov
Member Author

Mostly speed: login nodes run at 2.3GHz, KNL compute nodes run at 1.4GHz. Also, at the time, the file system was slow.

@rljacob
Member

rljacob commented Mar 23, 2017

Does 5.2 still appear to be doing too much in the run script?

@amametjanov
Member Author

amametjanov commented Mar 23, 2017

Yes. A standard case.submit workflow runs namelist building twice:

$ ./case.submit 
Creating component namelists
   Running cam buildnml
CAM writing dry deposition namelist to drv_flds_in 
CAM writing namelist to atm_in
   Running clm buildnml
Warning:: running with user defined cppdefs is NOT validated / scientifically supported.
CLM configure done.
CLM adding use_case 1850_control defaults for var 'sim_year' with val '1850'
CLM adding use_case 1850_control defaults for var 'sim_year_range' with val 'constant'
CLM adding use_case 1850_control defaults for var 'use_case_desc' with val 'Conditions to simulate 1850 land-use'
   Running mpascice buildnml
CESM ROOT IS: /gpfs/mira-home/azamatm/repos/ACME-next
MPAS-CICE build-namelist: ice_grid is oEC60to30v3 
OK -- found /projects/ccsm/inputdata/ice/mpas-cice/oEC60to30v3/seaice.EC60to30v3.restartFrom_anvil0221.170301.nc
OK -- found /projects/ccsm/inputdata/ice/mpas-cice/oEC60to30v3/mpas-cice.graph.info.161222.part.1360
   Running mpaso buildnml
CESM ROOT IS: /gpfs/mira-home/azamatm/repos/ACME-next
MPAS-O build-namelist: ocn_grid is oEC60to30v3 
OK -- found /projects/ccsm/inputdata/ocn/mpas-o/oEC60to30v3/oEC60to30v3.161222.nc
OK -- found /projects/ccsm/inputdata/ocn/mpas-o/oEC60to30v3/oEC60to30v3.restartFrom_anvil0221.170301.nc
OK -- found /projects/ccsm/inputdata/ocn/mpas-o/oEC60to30v3/mpas-o.graph.info.161222.part.512
   Running mosart buildnml

   Calling sglc buildnml
   Running sglc buildnml

   Calling swav buildnml
   Running swav buildnml

   Calling sesp buildnml
   Running sesp buildnml

   Calling drv buildnml
Finished creating component namelists
Checking that inputdata is available as part of case submission
Loading input file list: 'Buildconf/cam.input_data_list'
Loading input file list: 'Buildconf/clm.input_data_list'
Loading input file list: 'Buildconf/mpas-cice.input_data_list'
Loading input file list: 'Buildconf/mpas-o.input_data_list'
Loading input file list: 'Buildconf/mosart.input_data_list'
Loading input file list: 'Buildconf/cpl.input_data_list'
Check case OK
submit_jobs case.run
job is case.run
Submit job case.run
Submitting job script qsub   --cwd /gpfs/mira-home/azamatm/repos/ACME-next/cime/scripts/cases/A_WCYCL1850S-ne30_oECv3_ICG-128-20170316 -A HiRes_EarthSys_2 -t 00:30:00 -n 128 -q default --mode script case.run 
Submitted job id is 1055281
$

After the job completes, the job output log has:

$ cat 1055281.output 
Creating component namelists
   Running cam buildnml
CAM writing dry deposition namelist to drv_flds_in 
CAM writing namelist to atm_in
   Running clm buildnml
Warning:: running with user defined cppdefs is NOT validated / scientifically supported.
CLM configure done.
CLM adding use_case 1850_control defaults for var 'sim_year' with val '1850'
CLM adding use_case 1850_control defaults for var 'sim_year_range' with val 'constant'
CLM adding use_case 1850_control defaults for var 'use_case_desc' with val 'Conditions to simulate 1850 land-use'
   Running mpascice buildnml
CESM ROOT IS: /gpfs/mira-home/azamatm/repos/ACME-next
MPAS-CICE build-namelist: ice_grid is oEC60to30v3 
OK -- found /projects/ccsm/inputdata/ice/mpas-cice/oEC60to30v3/seaice.EC60to30v3.restartFrom_anvil0221.170301.nc
OK -- found /projects/ccsm/inputdata/ice/mpas-cice/oEC60to30v3/mpas-cice.graph.info.161222.part.1360
   Running mpaso buildnml
CESM ROOT IS: /gpfs/mira-home/azamatm/repos/ACME-next
MPAS-O build-namelist: ocn_grid is oEC60to30v3 
OK -- found /projects/ccsm/inputdata/ocn/mpas-o/oEC60to30v3/oEC60to30v3.161222.nc
OK -- found /projects/ccsm/inputdata/ocn/mpas-o/oEC60to30v3/oEC60to30v3.restartFrom_anvil0221.170301.nc
OK -- found /projects/ccsm/inputdata/ocn/mpas-o/oEC60to30v3/mpas-o.graph.info.161222.part.512
   Running mosart buildnml

   Calling sglc buildnml
   Running sglc buildnml

   Calling swav buildnml
   Running swav buildnml

   Calling sesp buildnml
   Running sesp buildnml

   Calling drv buildnml
Finished creating component namelists
-------------------------------------------------------------------------
 - Prestage required restarts into /projects/HiRes_EarthSys_2/azamatm/A_WCYCL1850S-ne30_oECv3_ICG-128-20170316/run
 - Case input data directory (DIN_LOC_ROOT) is /projects/ccsm/inputdata 
 - Checking for required input datasets in DIN_LOC_ROOT
-------------------------------------------------------------------------
2017-03-17 03:03:58 MODEL EXECUTION BEGINS HERE
run command is /usr/bin/runjob  --block $COBALT_PARTNAME $LOCARGS  --envs OMP_STACKSIZE=16M  --envs XL_BG_SPREADLAYOUT=YES  --envs BG_THREADLAYOUT=1  --label short  --envs OMP_NUM_THREADS=$OMP_NUM_THREADS  --ranks-per-node 16  --np 2048 :  /projects/HiRes_EarthSys_2/azamatm/A_WCYCL1850S-ne30_oECv3_ICG-128-20170316/bld/acme.exe  >> acme.log.$LID 2>&1  
2017-03-17 03:15:54 MODEL EXECUTION HAS FINISHED
$
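
One way to confirm the repetition without reading both outputs by eye is to compare the `Running <component> buildnml` lines in the case.submit output against those in the job output. The following is only a minimal sketch: the function names are hypothetical, and the pattern assumes the CIME 5.2-style log lines shown above (the older `Running /path/to/buildnml:` form from the first post would not match).

```python
import re

# Matches lines such as "   Running cam buildnml" from the logs above.
_RUNNING = re.compile(r"^\s*Running (\S+) buildnml\s*$")


def buildnml_components(log_text):
    """Return component names from 'Running <comp> buildnml' lines, in order."""
    names = []
    for line in log_text.splitlines():
        match = _RUNNING.match(line)
        if match:
            names.append(match.group(1))
    return names


def repeated_in_job(submit_output, job_output):
    """Return components whose buildnml ran at submit time and again in the job."""
    submitted = set(buildnml_components(submit_output))
    return [name for name in buildnml_components(job_output) if name in submitted]
```

Feeding it the two outputs above would list every component from cam through sesp, since the job log repeats the same `Running ... buildnml` lines.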

@rljacob
Member

rljacob commented Mar 25, 2017

Issues have been opened for a couple of solutions:
ESMCI/cime#1278
ESMCI/cime#1276

@tangq
Contributor

tangq commented Apr 12, 2017

@amametjanov and @rljacob,

When I make many changes to the user_nl_xxx files, I usually run preview_namelist on the login node before submitting the job, to confirm that all the changes are correctly implemented and that the run won't fail in preview_namelist when the model is actually executed later on.

Besides preview_namelist, are there other checks in case.submit and case.run before the actual command (e.g., srun) that launches the job? If so, is there a way to invoke those checks manually, similar to preview_namelist? Such a function would save a lot of waiting time in the queue (especially on machines like Mira), since we would know whether the run can pass the checks before submitting it. Thanks.

@amametjanov
Member Author

Also tagging @jonbob to get his suggestions on making sure that all MPAS files are in place prior to case.submit.

@tangq
Contributor

tangq commented Apr 12, 2017

Tagging @golaz as he's also interested in knowing about such functions.

@rljacob rljacob reopened this Apr 12, 2017
@rljacob
Member

rljacob commented Apr 12, 2017

case.submit will also call check_input_data. You can also call that yourself. See "check_input_data --help" for options.

@rljacob
Member

rljacob commented May 4, 2017

This issue is primarily about reducing scripts in case.run and @erichlf now has a PR for that.

@ndkeen
Contributor

ndkeen commented May 5, 2017

I have some data on the length of time spent NOT running the model during a batch job execution on cori-knl. Is this a good place to copy/paste?

@golaz
Contributor

golaz commented May 8, 2017

@ndkeen: you can post it here. I'd be curious to see it.
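
For anyone wanting to gather similar numbers from their own runs, here is a minimal sketch that measures the model execution window from the `MODEL EXECUTION BEGINS HERE` / `MODEL EXECUTION HAS FINISHED` lines that appear in the job logs earlier in this thread. The function name is hypothetical, and the job's total wall time, which would have to be subtracted from to get the time spent not running the model, is assumed to come from the batch system and is not handled here.

```python
import re
from datetime import datetime

# Matches lines such as "2017-03-17 03:03:58 MODEL EXECUTION BEGINS HERE".
_STAMP = re.compile(
    r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) "
    r"MODEL EXECUTION (BEGINS HERE|HAS FINISHED)"
)


def model_execution_seconds(log_text):
    """Return seconds between the BEGINS and FINISHED markers, or None."""
    begin = None
    end = None
    for line in log_text.splitlines():
        match = _STAMP.match(line.strip())
        if not match:
            continue
        stamp = datetime.strptime(match.group(1), "%Y-%m-%d %H:%M:%S")
        if match.group(2) == "BEGINS HERE":
            begin = stamp
        else:
            end = stamp
    if begin is None or end is None:
        return None  # one or both markers missing, e.g. the job was killed
    return (end - begin).total_seconds()
```

For the Mira log above (03:03:58 to 03:15:54) this gives 716 seconds; whatever remains of the job's wall time is script and launch overhead.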

@rljacob
Member

rljacob commented May 18, 2017

An additional option to skip preview-namelist has been implemented in ESMCI/cime#1471 and will be in ACME with the next CIME update (after #1490).

@jonbob
Contributor

jonbob commented May 19, 2017

@rljacob - can this be the default behavior?

@rljacob
Member

rljacob commented May 22, 2017

Eventually. We'll need to add a variable to control the default because some users rely on it and some don't.

@amametjanov
Member Author

The ./case.submit --skip-preview-namelist option is working for test jobs (after the latest CIME merge): namelist generation is not repeated at the start of the job. Thanks.
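
Putting the suggestions in this thread together, a login-node workflow could run the checks by hand and only then submit with the skip option. This is only a sketch under assumptions: the helper names are hypothetical, and the script file names `./preview_namelists` and `./check_input_data` are assumed spellings (the thread refers to them as preview_namelist and check_input_data) that may differ between CIME versions. Only `./case.submit --skip-preview-namelist` is taken verbatim from the comment above.

```python
import subprocess


def presubmit_commands(preview_script="./preview_namelists",
                       check_script="./check_input_data"):
    """Return the commands to run, in order, from the case directory.

    The two script names are assumptions; adjust them to match your checkout.
    """
    return [
        [preview_script],                              # build namelists on the login node
        [check_script],                                # check required input data
        ["./case.submit", "--skip-preview-namelist"],  # submit without rebuilding in the job
    ]


def run_presubmit(case_dir, commands=None, runner=subprocess.run):
    """Run each command in case_dir, stopping at the first failure."""
    for command in (commands or presubmit_commands()):
        # check=True raises CalledProcessError on a non-zero exit status,
        # so a failed check prevents the submit step from running.
        runner(command, cwd=case_dir, check=True)
```

Calling `run_presubmit("/path/to/case")` would then fail on the login node, before any time is spent in the queue, if either check exits with an error.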
