SMS.ne30_oECv3.A_BGCEXP_BCRC_CNPRDCTC_1850.bebop_intel fails with unknown file format error #2048

Open
jayeshkrishna opened this issue Jan 23, 2018 · 55 comments

@jayeshkrishna (Contributor)

The test fails with the following error message,

pio_support::pio_die:: myrank=          -1 : ERROR: ionf_mod.F90:         235 :
  NetCDF: Unknown file format

jayeshkrishna self-assigned this Jan 23, 2018

@jayeshkrishna (Contributor, Author) commented Jan 23, 2018

The failure occurs when the test (CAM) opens a file. The stack trace is,

Image              PC                Routine            Line        Source
acme.exe           0000000002FB245D  Unknown               Unknown  Unknown
acme.exe           0000000002CA2FF1  pio_support_mp_pi         120  pio_support.F90
acme.exe           0000000002CA10C5  pio_utils_mp_chec          59  pio_utils.F90
acme.exe           0000000002DA08F0  ionf_mod_mp_open_         235  ionf_mod.F90
acme.exe           0000000002C92881  piolib_mod_mp_pio        2834  piolib_mod.F90
acme.exe           0000000000534FEB  cam_pio_utils_mp_        1106  cam_pio_utils.F90
acme.exe           0000000000DE3558  solar_data_mp_sol         160  solar_data.F90
acme.exe           00000000005F6057  physpkg_mp_phys_i         790  physpkg.F90
acme.exe           00000000004F254E  cam_comp_mp_cam_i         178  cam_comp.F90
acme.exe           00000000004EA4A3  atm_comp_mct_mp_a         260  atm_comp_mct.F90
acme.exe           0000000000433888  component_mod_mp_         231  component_mod.F90
acme.exe           000000000042387B  cime_comp_mod_mp_        1177  cime_comp_mod.F90
acme.exe           0000000000430ABC  MAIN__                     92  cime_driver.F90
acme.exe           0000000000415ADE  Unknown               Unknown  Unknown
libc-2.17.so       00002B5FA25C0C05  __libc_start_main     Unknown  Unknown
acme.exe           00000000004159E9  Unknown               Unknown  Unknown

@ndkeen (Contributor) commented Jan 23, 2018

This is a somewhat common failure mode. For one case, I added write statements before the file open to print the file that was causing the issue, which helped me debug. Does it make sense to write the filename (to the log files) before trying to open it when DEBUG=TRUE?

@jqyin (Contributor) commented Jan 23, 2018

@jayeshkrishna , Thanks for looking into it. The test run passed cam init on Edison but failed at ocn init with a different issue.

@jayeshkrishna (Contributor, Author) commented Jan 23, 2018

The issue could be the format of the following input file,

[jayesh@beboplogin2 ~]$ ncdump -k /home/ccsm-data/inputdata/atm/cam/solar/Solar_1850control_input4MIPS_c20171101.nc 
netCDF-4

I will soon verify that the file that is causing the failure is the one above.
@singhbalwinder : Does the above file need to be NetCDF4 (can it be netcdf classic)?

@rljacob (Member) commented Jan 23, 2018

I had to change PIO_TYPE to netcdf for the atmosphere to read that file. Why was that necessary?

@jayeshkrishna (Contributor, Author)

The default iotype, PnetCDF, does not support the NetCDF4 file type.
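
For reference, the workaround Rob describes amounts to something like this in the case directory (a sketch only; the exact xmlchange syntax and variable names can vary with the CIME version):

  ./xmlchange PIO_TYPENAME=netcdf        # switch all components to serial NetCDF, which can read netCDF-4/HDF5
  ./xmlchange PIO_TYPENAME_ATM=netcdf    # or switch only the atmosphere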

@jayeshkrishna (Contributor, Author)

I just verified that the above file (/home/ccsm-data/inputdata/atm/cam/solar/Solar_1850control_input4MIPS_c20171101) is causing the test to fail with the default iotype (pnetcdf) on bebop.
I would recommend converting this input file to "classic netcdf".
@singhbalwinder / @cameronsmith1 : Can you convert this file to "classic netcdf"?

@ndkeen (Contributor) commented Jan 23, 2018

Does it make sense to test/require that all netcdf files be of a certain set of types? Recall #1970
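
For example, something along these lines could be run against each input file as a check (a rough sketch for a given $file; the allowed set of kinds is only illustrative):

  kind=$(ncdump -k "$file")
  case "$kind" in
      classic|"64-bit offset") ;;   # kinds that PnetCDF can read
      *) echo "ERROR: $file has unsupported format: $kind"; exit 1 ;;
  esac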

@singhbalwinder (Contributor)

I do not know how to convert a netcdf file to "classic netcdf" format. @cameronsmith1 : Do you know how to do that?

@cameronsmith1 (Contributor)

Each of the NCO commands has an option that specifies what format the output netcdf should be in. You can also use nccopy.
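
For example (a hedged sketch; the output file names are placeholders, and the flags should be double-checked against your NCO/netCDF versions):

  nccopy -k classic Solar_1850control_input4MIPS_c20171101.nc Solar_1850control_input4MIPS_classic.nc
  # or, with NCO:
  ncks -3 Solar_1850control_input4MIPS_c20171101.nc Solar_1850control_input4MIPS_classic.nc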

@singhbalwinder (Contributor)

Thanks @cameronsmith1, that was really helpful! nccopy seems easy enough to use for this task.

@jayeshkrishna : nccopy lists the following options for conversion:

[-k output_kind] kind of output netCDF file
              omitted => same as input
              '1' or 'classic' => classic file format
              '2' or '64-bit-offset' => 64-bit offset format
              '3' or 'netCDF-4' =>  netcdf-4 format
              '4' or 'netCDF-4 classic  model' => netCDF-4 classic model

Which option should I choose, option 1 or option 4? Thanks!

@cameronsmith1 (Contributor)

Does anybody know whether changing the NetCDF file type produces BFB results? I am pinging @czender too.

@singhbalwinder (Contributor)

That's a good point. I was assuming that it would stay BFB.

@golaz (Contributor) commented Jan 23, 2018

@jayeshkrishna : is this only an issue for one specific machine, bebop? I have been running with this input file on Edison for a long time now, I assume I'm using pnetcdf there.

@rljacob (Member) commented Jan 23, 2018

I had the same problem on anvil. bebop/anvil/blues all read the same file.

@cameronsmith1 (Contributor)

FYI, all of the files provided by CMIP6 (input4MIPS) are in netcdf-4 format. So if that is the problem, then all of the files will need to be modified, uploaded to the SVN server, and the defaults in the use_case xml file changed on master.

@golaz (Contributor) commented Jan 24, 2018

@cameronsmith1: we could do this, but it would be a lot of work... @rljacob: anything else we could do?

@rljacob (Member) commented Jan 24, 2018

It looks like the cases being run on edison still have PIO_TYPENAME_ATM set to pnetcdf and yet can read that file. @jayeshkrishna it must be something about the version of pnetcdf. edison has 1.6.1.

Can someone confirm that Solar_1850control_input4MIPS_c20171101.nc on edison is still netcdf-4 format? I can't get a version of ncdump that works.

@ndkeen (Contributor) commented Jan 24, 2018

To get a version of ncdump that has the '-k' option, you just need the default cray-netcdf module. So module load cray-netcdf or, if you already have that module loaded, module swap cray-netcdf cray-netcdf should get you the default. (The version we use for our code is NOT yet the default on edison, as our simulations are too fragile to change anything, but it is on cori.)
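
i.e., roughly:

  module load cray-netcdf                 # if the module is not loaded yet
  module swap cray-netcdf cray-netcdf     # if it is loaded, swap back to the default version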

You don't need to be on edison to work with data in /project.

cori11% ncdump -k atm/cam/solar/Solar_1850control_input4MIPS_c20171101.nc
netCDF-4

@jayeshkrishna (Contributor, Author)

@singhbalwinder : Sorry, I missed your question above about which option to use. Please convert the file to "netcdf classic" (option 1).

@rljacob (Member) commented Jan 24, 2018

@jayeshkrishna we first need to figure out why this works on edison: the file is still netcdf-4 but "pnetcdf" is the PIO option.

@ndkeen (Contributor) commented Jan 24, 2018

fwiw, when I try SMS.ne30_oECv3.A_BGCEXP_BCRC_CNPRDCTC_1850 on cori with master, it fails.

cori04% cat SMS.ne30_oECv3.A_BGCEXP_BCRC_CNPRDCTC_1850.cori-knl_intel.q04/TestStatus.log
2018-01-23 18:29:07: CREATE_NEWCASE FAILED for test 'SMS.ne30_oECv3.A_BGCEXP_BCRC_CNPRDCTC_1850.cori-knl_intel'.
Command: /global/cscratch1/sd/ndk/wacmy/ndk_machinefiles_cori-knl-pelayout-adjust03/cime/scripts/create_newcase --case /global/cscratch1/sd/ndk/acme_scratch/cori-knl/mfpeat03/SMS.ne30_oECv3.A_BGCEXP_BCRC_CNPRDCTC_1850.cori-knl_intel.q04 --res ne30_oECv3 --compset A_BGCEXP_BCRC_CNPRDCTC_1850 --test --machine cori-knl --compiler intel --project acme 
Output: Did not find an alias or longname compset match for A_BGCEXP_BCRC_CNPRDCTC_1850
ERROR: No compset alias A_BGCEXP_BCRC_CNPRDCTC_1850 found and this does not appear to be a compset longname.

@cameronsmith1 (Contributor)

Hi @ndkeen , That test looks like a BGC test, and I don't see anything in the error that looks like a netcdf issue. What am I missing?

@ndkeen (Contributor) commented Jan 25, 2018

The title of this issue contains a test name; I tried that test on cori and it failed as noted above (nothing to do with netcdf -- it looks like some config files are missing, or maybe I have the wrong test name?).

The netcdf issue is the one posted in the first comment of this issue. I was trying to help since I've seen that same error before, but I can't recreate it on cori.

@cameronsmith1 (Contributor)

Now I understand. Thanks, @ndkeen . BTW, the error message in your previous message indicates that the version of the code you are using doesn't have an entry in config_compsets.xml for that compset.

I am pinging @susburrows , since she may know what is going on with that compset (A_BGCEXP_BCRC_CNPRDCTC_1850).

@susburrows (Contributor)

Apologies, I would like to help on this but am tied up with phase 2 science planning this week. @jqyin and @acme-y9s have already been looking into this issue.

@jayeshkrishna (Contributor, Author)

We discussed this issue during the performance call.
In my experience NetCDF4P has been slower and less stable than PnetCDF (hangs, etc., with certain combinations of hdf5-parallel and netcdf libraries). It would be great if we could convert the files to "classic netcdf". Read performance can become an issue as we add more files in the NetCDF4 format.
Also note that if we silently switch to NetCDF4P in PIO, we will have some files read using PnetCDF and other files (the NetCDF4 files) read using NetCDF4P.
Some points to consider:

  • Are the input files B4B after converting from NetCDF4 to "classic netcdf" format? (I would assume that they are B4B, but we need to verify it.)
  • What are the steps required to use an input file that is converted from NetCDF4 to "classic netcdf"? Can we automate it? (A rough sketch follows this list.)
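
A rough, untested sketch of what an automated convert/rename/verify step could look like (assuming nccopy and cprnc are on the PATH; the file names are placeholders):

  for f in Solar_1850control_input4MIPS_c20171101.nc; do
      new="${f%.nc}_classic_c$(date +%Y%m%d).nc"   # new data gets a new, date-stamped file name
      nccopy -k classic "$f" "$new"                # convert netCDF-4 -> classic netcdf
      cprnc "$f" "$new"                            # check that the two files are B4B
  done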

@cameronsmith1 (Contributor) commented Jan 30, 2018

For the second question, the conversion is easy. The challenge is that we have a principle that whenever a datafile changes it must have a different filename, so the conversion is only the first step. Hence, the steps to changing a file are:

  1. Convert the file to the new netcdf format.
  2. Give the new file a new name.
  3. Upload the new file to the SVN server.
  4. Edit all E3SM case generation files to use the new name.
  5. Assure ourselves that the results of those cases are BFB.
  6. Issue a PR and update master with those changes.
  7. Communicate to existing users that they need to update to the latest master (or manually patch their version).

Each of these steps is easy, but it adds up if there are many files. It is also problematic when we are also trying to lock down the precise code and data for the big runs.

@jayeshkrishna (Contributor, Author)

Ok, I understand.
If it is a time-consuming process we could do it over a period of time. Meanwhile, we can get parallel I/O support for the NetCDF libraries on Anvil and other machines.

@cameronsmith1 (Contributor)

Do we know how big the performance impact is of using the different netcdf versions?

@rljacob (Member) commented Jan 31, 2018

@jayeshkrishna is going to look into it and that may make the decision for us.

Are we planning high-res CMIP6 runs? Are there high-res versions of the necessary input files?

@cameronsmith1 (Contributor)

Thanks for looking into the performance implications.

Yes, we are planning at least 50 years of ne120 coupled solution (@PeterCaldwell is leading that effort). Some of the input files are at higher resolution, and some will be the low-res version that gets regridded inside E3SM. The short version of the story is that we have all those files, and are pulling the configuration together.

Ironically, we just encountered a problem reading a file, but it isn't clear whether that has anything to do with this thread.

@PeterCaldwell (Contributor)

It turns out that cori-knl also can't read netcdf-4 files now. Because the 1950-control runs use CMIP6 data written in netcdf-4 and we can't really depend on edison for our high-res runs, this netcdf-4 reading issue effectively prevents us from finalizing the high-res production compsets.

I fixed this by writing a python script (https://gist.github.com/PeterCaldwell/070f8e1fd967b59b21db79a5e7a24272) that identifies netcdf-4 files (either from atm_in or by crawling the entire inputdata directory) and makes netcdf-3-classic copies of them, with the timestamp in the file name updated to today's date. There are 17 files I needed to update. I'm in the process of uploading these new files to the svn server.
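
(Not the actual script -- that is in the gist linked above -- but a rough sketch of the kind of crawl it describes, assuming ncdump is available and with a placeholder path:)

  # list every netCDF-4/HDF5 file under the inputdata tree
  find /path/to/inputdata -name '*.nc' | while read -r f; do
      case "$(ncdump -k "$f")" in
          netCDF-4*) echo "$f" ;;
      esac
  done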

I used cprnc to confirm that the netcdf-3 files I created were bfb with the netcdf-4 files they originated from. This check is also part of the gist linked above. My understanding is that cprnc only checks for similarity to within some tolerance rather than checking that all digits are identical, so it could be that the inputdata is "identical" but model runs using my new files wouldn't be bfb. To test this, I ran 2 days of A_WCYCL1950 at ne30 using the netcdf4 and netcdf3 files; cprnc confirmed that the output of these runs is identical. So I think it's safe to switch to these new netcdf-3-classic files. If someone else wants to do more testing, you are very welcome ;-).

I'm happy to run my script on the rest of the inputdata archive or on the atm_in files for the low-res DECK runs if others desire (@golaz , @cameronsmith1 ).

One question I still have is whether netcdf-3-classic is the optimal file type. If something else (e.g. netcdf-3 with 64 bit offset) would be better, please let me know sooner rather than later.

@mt5555 (Contributor) commented Feb 3, 2018

Regarding cprnc: I think it will only call the files identical if all variables that are in both files are BFB. If they only agree to some tolerance, it reports the RMS differences.

For small files, netcdf3 is the best format. The 64-bit offset is only needed if the files are > 2GB.
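
A hedged illustration of that rule of thumb (the file names are placeholders; stat -c is the GNU form):

  size=$(stat -c %s input.nc)                         # file size in bytes
  if [ "$size" -gt $((2 * 1024 * 1024 * 1024)) ]; then
      nccopy -k 64-bit-offset input.nc output.nc      # > 2GB: needs the 64-bit offset format
  else
      nccopy -k classic input.nc output.nc            # small file: plain classic is fine
  fi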

@PeterCaldwell (Contributor)

Thanks Mark. Your expectations regarding cprnc seem to be borne out by model runs.

@rljacob (Member) commented Mar 16, 2018

@PeterCaldwell did you finish replacing all the NetCDF-4 files?

@rljacob (Member) commented Mar 16, 2018

@jayeshkrishna the test at the top of this issue is now passing on bebop (and anvil) but the file is still NetCDF-4. Did you update the netcdf library?

@PeterCaldwell (Contributor)

I did create new netcdf3 files for all netcdf4 files in 1950 compsets and (if I recall correctly) deck configurations. Compsets need to be updated to actually use these files for low-res configurations. I have a branch that does this for 1950 compsets, but I was waiting to fix other problems with land ICs before issuing a PR. I think Chris doesn't want to change the deck compsets right now (even though the change would be bfb) for fear of inducing errors. So - PR coming in the next day or two.

@PeterCaldwell (Contributor)

I haven't updated BGC experiment netcdf4 files.

@rljacob (Member) commented Mar 16, 2018

The file in question is Solar_1850control_input4MIPS_c20171101.nc, which is also used in A_WCYCL1850S_CMIP6.

@rljacob (Member) commented Mar 16, 2018

I see what's happening: the test that's now passing is SMS.ne30_oECv3.BGCEXP_BCRC_CNPRDCTC_1850.bebop_intel.clm-bgcexp. The clm-bgcexp testmod changes the PIO type to netcdf, which avoids the problem.

@rljacob (Member) commented Mar 16, 2018

@PeterCaldwell did you change the datasets for the high-res cases you're doing?

@PeterCaldwell (Contributor)

Before issuing the initial 1950 compset PR I realized netcdf4 was a problem and switched to netcdf3 for high-res. I thought I had also fixed netcdf4 issues for low-res 1950 compsets, but somehow that didn't make its way to master (I probably forgot to push the low-res change). So yes - the high-res uses netcdf3 files.

@PeterCaldwell (Contributor)

@rljacob - do you want me to do anything (other than make a PR for 1950 low-res fixes)?

@rljacob (Member) commented Mar 16, 2018

Yes, go ahead. Someone needs to update the files for the other time periods as well.

rljacob added a commit that referenced this issue Mar 28, 2018
…2174)

Fixes broken A_WCYCL1950S_CMIP6_LR and A_WCYCL1950S_CMIP6_LRtunedHR
compsets by:

  replacing all netcdf4 references with netcdf3 equivalents
  specifying separate clm use case files for LR and HR compsets to avoid
      bug(?) where clm ignores resolution in choosing finidat and fsurdat files

Also cleans up/fixes both HR and LR 1950 compsets by:

   removing all 1950 mentions in namelist_defaults_clm4_5.xml. These
     default values were never used and would just confuse anyone trying to
     edit these compsets.
   removing landuse files from both HR and LR compsets. Using landuse files
     in control compsets is wrong and will give bad answers (though the
     impact is probably small).

Didn't run the test suite because 1950 compsets aren't tested.
Ran 1950 HR, LR, and LRtunedHR compsets for 1 day on cori.

Fixes some issues discussed in #2048
[BFB] except for 1950 compsets; [NML]
rljacob added a commit that referenced this issue Mar 28, 2018
(same commit message as above)
rljacob assigned sarich and unassigned jayeshkrishna Oct 5, 2018