mpi-serial builds on Cori-haswell (for SCM) #1615

Closed
bogensch opened this issue Jul 6, 2017 · 12 comments

@bogensch
Contributor

bogensch commented Jul 6, 2017

@csjack and others are interested in running the single column model (SCM) on Cori-haswell but have been having trouble. I just attempted to build on Cori and got a compile error in the gptl build (bldlog tail attached). As a test, I turned off mpi-serial; the model then compiled fine but did not run, since SCM needs to be built with mpi-serial to run successfully. Any suggestions on how to get up and running with mpi-serial on Cori-haswell?

cori.gptl.bldlog.tail.txt

@PeterCaldwell
Contributor

@ndkeen - since you're the local Cori aficionado, I figured you're more likely than anyone else to be able to help here... any ideas?

@worleyph
Contributor

worleyph commented Jul 6, 2017

The simplest solution is not to build with PAPI. It is not on by default on Titan, and I don't know why it would be on Cori-KNL. You need to remove -DHAVE_PAPI from the gptl compile options. You might check whether it is in Macros.make in your case directory (and remove it if so). Otherwise, someone more CIME-savvy will need to advise. You could also try loading the papi module (in env_mach_specific.xml in your case directory).

@worleyph
Contributor

worleyph commented Jul 6, 2017

Yes, Macros.make has the gptl CPP definitions. From a Titan case:

 GPTL_CPPDEFS:= -DHAVE_NANOTIME -DBIT64 -DHAVE_VPRINTF -DHAVE_BACKTRACE -DHAVE_SLASHPROC -DHAVE_COMM_F2C -DHAVE_TIMES -DHAVE_GETTIMEOFDAY

Would need to determine where these are defined.
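
For anyone who wants to try the Macros.make route by hand, here is a minimal sketch of stripping the define from the file's text. The sample line is hypothetical (the Titan line above with -DHAVE_PAPI added for illustration), and `strip_papi` is just an illustrative helper name, not part of CIME.

```python
import re


def strip_papi(macros_text: str) -> str:
    """Return Macros.make text with the -DHAVE_PAPI define removed."""
    # Remove the flag along with the spaces or tabs in front of it, so the
    # remaining defines stay separated by single spaces. The trailing \b
    # keeps longer names such as -DHAVE_PAPI_FOO untouched.
    return re.sub(r"[ \t]*-DHAVE_PAPI\b", "", macros_text)


# Hypothetical line: a shortened Titan GPTL_CPPDEFS with -DHAVE_PAPI added.
sample = "GPTL_CPPDEFS:= -DHAVE_NANOTIME -DBIT64 -DHAVE_PAPI -DHAVE_TIMES\n"
cleaned = strip_papi(sample)
print(cleaned, end="")
```

To apply it to a real case, read Macros.make from the case directory, pass the text through `strip_papi`, and write it back (keeping a backup copy first would be sensible). Whether a later CIME step regenerates the file is not something this sketch addresses.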

@worleyph
Contributor

worleyph commented Jul 6, 2017

HAVE_PAPI is defined in config_compilers.xml for

 <compiler COMPILER="intel" MACH="edison">
 <compiler COMPILER="intel17" MACH="edison">
 <compiler COMPILER="intel" MACH="cori-haswell">
 <compiler COMPILER="intel" MACH="cori-knl">
 <compiler COMPILER="intel" MACH="eos">

Since we have not been using PAPI in production runs, disabling it by default for these systems and compilers makes sense. However, this is a POC call ( @ndkeen ), and perhaps there is another way that will work just for the mpi-serial case?
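
A list like the one above can be reproduced with a short script. This is only a sketch: the embedded XML is a hypothetical excerpt, and the child tag name (`ADD_CPPDEFS`) and root tag are assumptions about the file layout. The search itself does not depend on the child tag name, since it looks at all text inside each `<compiler>` block.

```python
import xml.etree.ElementTree as ET

# Hypothetical excerpt; the real config_compilers.xml has many more entries,
# and the tag that holds the define is an assumption made for illustration.
SAMPLE = """
<config_compilers>
  <compiler COMPILER="intel" MACH="cori-haswell">
    <ADD_CPPDEFS> -DHAVE_PAPI </ADD_CPPDEFS>
  </compiler>
  <compiler COMPILER="gnu" MACH="cori-haswell">
    <ADD_CPPDEFS> -DHAVE_NANOTIME </ADD_CPPDEFS>
  </compiler>
</config_compilers>
"""


def compilers_defining(root, flag="HAVE_PAPI"):
    """List (COMPILER, MACH) pairs whose <compiler> block text mentions flag."""
    hits = []
    for comp in root.iter("compiler"):
        # itertext() yields the text of the element and all its descendants
        # (element text only, not attribute values).
        text = "".join(comp.itertext())
        if flag in text:
            hits.append((comp.get("COMPILER"), comp.get("MACH")))
    return hits


root = ET.fromstring(SAMPLE.strip())
print(compilers_defining(root))
```

For the real file, `ET.parse(path).getroot()` would replace the `ET.fromstring` call; blocks without a MACH attribute would show up with `None` in the second position.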

@ndkeen
Contributor

ndkeen commented Jul 13, 2017

Hey! Sorry, this came out when I was on vacation and I missed it. I ran into this myself trying to run the SMS_R_Ld5.T42_T42.FSCM5A97 test. I think I have an easy fix: just remove some modules before the module loads required for mpi-serial. Testing now. I'm seeing this on edison/cori-haswell/cori-knl, and the same fix should work on all of them.

@ndkeen ndkeen self-assigned this Jul 13, 2017
@ndkeen
Contributor

ndkeen commented Jul 14, 2017

  1. I'm sure we could make it work with PAPI, but since we weren't using it anyway, I have two PRs to remove it. That will get you around the first issue reported here.
  2. After that, there are some module issues: I found I needed to remove the parallel hdf/netcdf modules just before loading the serial modules. I have a branch for that.
  3. The branch in (2) allows SMS_R_Ld5.T42_T42.FSCM5A97 (the only test in acme-developer that uses mpi-serial) to pass on edison with intel and intel17. It also passes on cori with GNU. However, it still fails with intel.
  4. One issue looks like a module/TCL problem that I can't reproduce on the command line; it seems to behave differently when running in the system. If I comment out the command that removes the cray-hdf5 module, I can get around this. Weird, but it seems harmless. However, the test then fails with:
0:  NetCDF: HDF error
0:  pio_support::pio_die:: myrank=          -1 : ERROR: ionf_mod.F90:         235 :
0:   NetCDF: HDF error
0: Image              PC                Routine            Line        Source             
0: acme.exe           0000000008038B6D  Unknown               Unknown  Unknown
0: acme.exe           0000000007C42EDA  pio_support_mp_pi         120  pio_support.F90
0: acme.exe           0000000007C4151A  pio_utils_mp_chec          74  pio_utils.F90
0: acme.exe           0000000007D971DF  ionf_mod_mp_open_         235  ionf_mod.F90
0: acme.exe           0000000007C3F3B9  piolib_mod_mp_pio        2834  piolib_mod.F90
0: acme.exe           00000000009D0E91  cam_pio_utils_mp_        1106  cam_pio_utils.F90
0: acme.exe           00000000009BCCF5  cam_initfiles_mp_          60  cam_initfiles.F90
0: acme.exe           000000000081A752  cam_comp_mp_cam_i         158  cam_comp.F90
0: acme.exe           00000000007E8144  atm_comp_mct_mp_a         260  atm_comp_mct.F90
0: acme.exe           00000000004591DD  component_mod_mp_         227  component_mod.F90
0: acme.exe           00000000004243FC  cesm_comp_mod_mp_        1173  cesm_comp_mod.F90
0: acme.exe           00000000004500AC  MAIN__                     63  cesm_driver.F90
0: acme.exe           00000000004133DE  Unknown               Unknown  Unknown
0: libc-2.19.so       00002AAAAFA88AC5  __libc_start_main     Unknown  Unknown
0: acme.exe           00000000004132E9  Unknown               Unknown  Unknown
0: MPI_Abort: error code = 1

So I might go ahead with a PR to adjust the module commands as it will improve things, but there may still be some work.

@bogensch
Contributor Author

Hi @ndkeen, thanks for your help on this so far. Others and I are trying to run the SCM on Edison, after the machine update, with the most recent master. The model compiles fine but then dies during initialization with the error you mentioned above:

0: NetCDF: HDF error
0: pio_support::pio_die:: myrank= -1 : ERROR: ionf_mod.F90: 235 :
0: NetCDF: HDF error

Any suggestions on how to overcome this, or should this be a new GitHub issue? Thanks!

@ndkeen
Contributor

ndkeen commented Aug 15, 2017

Yeah, that's what we are seeing. I'm not sure what it means or how best to proceed. I think it is an error from PIO, so I was hoping someone knew what it meant.

I added a little more to this issue: #1633

@rljacob
Member

rljacob commented Aug 15, 2017

HDF? Does that mean it's a parallel build of netcdf? You need a serial build to work with mpi-serial.

@ndkeen
Contributor

ndkeen commented Aug 15, 2017

No, we use the serial versions for mpi-serial. As far as I know, this is the first time mpi-serial has been tested on the NERSC machines.

@ndkeen
Contributor

ndkeen commented Aug 16, 2017

Note that you should be able to use the GNU compiler (--compiler=gnu) until we figure out what's happening.

@rljacob rljacob added the Cori label Oct 4, 2017
@jgfouca jgfouca reopened this Oct 12, 2017
@ndkeen
Contributor

ndkeen commented Feb 1, 2018

This issue was resolved (and documented elsewhere, though I can't find it at the moment) by changing the format of a certain netcdf file from netCDF-4 classic model to classic.
This change was made to our files in inputdata at NERSC, though I'm not certain whether the files have been changed on the SVN server yet.

General discussion of netcdf file types at NERSC here:
#1970
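
For anyone checking their own input files, here is a rough sketch of telling the two on-disk containers apart by their leading bytes. netCDF-4 files (including the netCDF-4 classic model) are HDF5 files and start with the HDF5 signature, while classic-format files start with `CDF` followed by a version byte. The function name is illustrative, and the check assumes the HDF5 signature sits at offset 0, which is the usual case but not guaranteed by the HDF5 format. It cannot distinguish netCDF-4 from netCDF-4 classic model.

```python
import os
import tempfile

# Signature at the start of an HDF5 file (and so of a netCDF-4 file).
HDF5_MAGIC = b"\x89HDF\r\n\x1a\n"


def netcdf_container(path):
    """Classify a file as 'hdf5', 'classic', or 'unknown' from its first bytes."""
    with open(path, "rb") as f:
        head = f.read(8)
    if head.startswith(HDF5_MAGIC):
        return "hdf5"  # netCDF-4 or netCDF-4 classic model
    if head[:3] == b"CDF" and head[3:4] in (b"\x01", b"\x02", b"\x05"):
        return "classic"  # classic, 64-bit offset, or CDF-5
    return "unknown"


# Demo on a fake file that carries only a classic-format header.
with tempfile.TemporaryDirectory() as tmp:
    fake = os.path.join(tmp, "fake_classic.nc")
    with open(fake, "wb") as f:
        f.write(b"CDF\x01" + b"\x00" * 4)
    print(netcdf_container(fake))
```

Where the netCDF utilities are installed, `ncdump -k <file>` reports the same information in more detail, and `nccopy -k classic <in> <out>` is one way to write a classic-format copy.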

@ndkeen ndkeen closed this as completed Feb 1, 2018