SMS.ne30_oECv3.A_BGCEXP_BCRC_CNPRDCTC_1850.bebop_intel fails with unknown file format error #2048
The failure occurs when the test (CAM) opens a file. The stack trace is:
This is a somewhat common failure mode. For one case, I placed write statements before the file open to see which file was causing the issue, which helped me debug. Does it make sense to write the filename that we are trying to open to the log files (before trying to open it) when DEBUG=TRUE?
@jayeshkrishna, thanks for looking into it. The test run passed CAM init on Edison but failed at ocn init with a different issue.
The issue could be the format of the following input file,
I will soon verify that the file that is causing the failure is the one above.
I had to change PIO_TYPE to netcdf for the atmosphere to read that file. Why was that necessary?
The default iotype, PnetCDF, does not support the NetCDF4 file type.
I just verified that the above file (/home/ccsm-data/inputdata/atm/cam/solar/Solar_1850control_input4MIPS_c20171101) is causing the test to fail with the default iotype (pnetcdf) on bebop.
FYI, the file is here - https://acme-svn2.ornl.gov/acme-repo/acme/inputdata/atm/cam/solar
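A quick way to confirm that file's on-disk format, for anyone without a working `ncdump -k`: the sketch below assumes the python netCDF4 bindings are installed and uses the path quoted above (with the .nc extension mentioned later in the thread). It reports the same information as `ncdump -k`, just with different labels.

```python
# Minimal sketch (assumes the python netCDF4 package is installed);
# prints the on-disk format of the file discussed above.
from netCDF4 import Dataset

path = "/home/ccsm-data/inputdata/atm/cam/solar/Solar_1850control_input4MIPS_c20171101.nc"
with Dataset(path, "r") as nc:
    # data_model is one of 'NETCDF4', 'NETCDF4_CLASSIC', 'NETCDF3_CLASSIC',
    # 'NETCDF3_64BIT_OFFSET', 'NETCDF3_64BIT_DATA'. Only the NETCDF3 variants
    # can be read through the PnetCDF iotype.
    print(nc.data_model)
```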
Does it make sense to test/require that all netcdf files be of a certain set of types? Recall #1970 |
I do not know how to convert a netcdf file to "classic netcdf" format. @cameronsmith1: do you know how to do that?
There are options to each of the NCO commands that specify what format the output netcdf should be in. You can also use nccopy.
Thanks @cameronsmith1, it was really helpful! @jayeshkrishna: there are the following options for conversion:
Which option should I choose, option 1 or option 4? Thanks!
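The numbered options referred to above aren't reproduced here, but for reference, a hedged sketch of one way to do a "classic netcdf" conversion from python, assuming the xarray package is available (the NCO tools and nccopy mentioned above can do the same from the command line). Filenames are placeholders, not the real inputdata names.

```python
# Hedged sketch: rewrite a NetCDF-4 file as "classic netcdf" (NetCDF-3).
# Assumes xarray is installed; filenames are placeholders.
import xarray as xr

ds = xr.open_dataset("input_netcdf4.nc")
# "NETCDF3_CLASSIC" is the classic format; "NETCDF3_64BIT" would be the
# 64-bit-offset variant for files larger than 2 GB.
ds.to_netcdf("output_classic.nc", format="NETCDF3_CLASSIC")
```

Note that the classic format rejects NetCDF-4-only features (groups, compression, some integer types), so a conversion can fail for files that actually use them.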
Does anybody know whether changing the NetCDF file type produces BFB results? I am pinging @czender too.
That's a good point. I was assuming that it would stay BFB.
@jayeshkrishna: is this only an issue for one specific machine, bebop? I have been running with this input file on Edison for a long time now, I assume I'm using pnetcdf there.
I had the same problem on anvil. bebop/anvil/blues all read the same file.
FYI, all of the files provided by CMIP6 (input4MIPS) are in netcdf-4 format. So if that is the problem, then all of the files will need to be modified, uploaded to the SVN server, and the defaults in the use_case xml file changed on master.
@cameronsmith1: we could do this, but it would be a lot of work... @rljacob: anything else we could do?
It looks like the cases being run on edison still have PIO_TYPENAME_ATM set to pnetcdf and yet can read that file. @jayeshkrishna it must be something about the version of pnetcdf. edison has 1.6.1. Can someone confirm that Solar_1850control_input4MIPS_c20171101.nc on edison is still netcdf-4 format? I can't get a version of ncdump that works.
To get a version of ncdump that has the '-k' option, you just need the default netcdf module. You don't need to be on edison to work with data in /project.
@singhbalwinder: Sorry, missed your question on which option to use in the conversation. Please convert the file to "netcdf classic" (option 1).
@jayeshkrishna we first need to figure out why this works on edison: the file is still netcdf-4 but "pnetcdf" is the PIO option.
FWIW, when I try the test on cori I get the following error:
Hi @ndkeen, that test looks like a BGC test, and I don't see anything in the error that looks like a netcdf issue. What am I missing?
The title of this issue has a testname that I tried on cori, and it failed as noted above (nothing to do with netcdf; it looks like I'm missing some config files, or I have the wrong testname?). The netcdf issue is the one posted in the first comment of the issue. I was trying to help as I've seen that same error before, but I can't recreate it on cori.
Now I understand. Thanks, @ndkeen. BTW, the error message in your previous message indicates that the version of the code you are using doesn't have an entry in config_compsets.xml for that compset. I am pinging @susburrows, since she may know what is going on with that compset (A_BGCEXP_BCRC_CNPRDCTC_1850).
We discussed this issue during the performance call.
For the second question, the conversion is easy. The challenge is that we have a principle that whenever a datafile changes it must have a different filename, so the conversion is only the first step. Hence, the steps to changing a file are:
1. convert the file to the new format,
2. give the converted file a new filename (updated datestamp),
3. upload it to the inputdata SVN server,
4. change the defaults in the use_case xml files on master.
Each of these steps is easy, but it adds up if there are many files. It is also problematic when we are trying to lock down the precise code and data for the big runs.
Ok, I understand.
Do we know how big the performance impact is of using the different netcdf versions?
@jayeshkrishna is going to look into it and that may make the decision for us. Are we planning high-res CMIP6 runs? Are there high-res versions of the necessary input files?
Thanks for looking into the performance implications. Yes, we are planning at least 50 years of ne120 coupled solution (@PeterCaldwell is leading that effort). Some of the input files are at higher resolution, and some will be the low-res version that gets regridded inside E3SM. The short version of the story is that we have all those files and are pulling the configuration together. Ironically, we just encountered a problem reading a file, but it isn't clear whether that has anything to do with this thread.
It turns out that cori-knl also can't read netcdf-4 files now. Because the 1950-control runs use CMIP6 data written in netcdf-4 and we can't really depend on edison for our high-res runs, this netcdf-4 reading issue effectively prevents us from finalizing the high-res production compsets.

I fixed this by writing a python script (https://gist.github.com/PeterCaldwell/070f8e1fd967b59b21db79a5e7a24272) that identifies netcdf-4 files (either from the variables in atm_in or by crawling the entire inputdata directory) and makes netcdf-3-classic copies of them with the timestamp in the file name updated to today's date. There are 17 files I needed to update. I'm in the process of uploading these new files to the svn server.

I used cprnc to confirm that the netcdf-3 files I created are bfb with the netcdf-4 files they originated from (this check is also part of the gist linked above). My understanding is that cprnc only checks for similarity to within some tolerance rather than checking that all digits are identical, so it could be that the inputdata is "identical" but model runs using my new files wouldn't be bfb. To test this, I ran 2 days of A_WCYCL1950 at ne30 using the netcdf4 and the netcdf3 files; cprnc confirmed that the output of these runs is identical. So I think it's safe to switch to these new netcdf-3-classic files. If someone else wants to do more testing you are very welcome ;-). I'm happy to run my script on the rest of the inputdata archive or on the atm_in files for the low-res DECK runs if others desire (@golaz, @cameronsmith1).

One question I still have is whether netcdf-3-classic is the optimal file type. If something else (e.g. netcdf-3 with 64 bit offset) would be better, please let me know sooner rather than later.
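For anyone curious, the general shape of that workflow looks roughly like the sketch below (this is only an illustration of the idea, not the gist linked above; the root path and the renaming convention are placeholders, and the real workflow puts a new datestamp in the filename and follows up with a cprnc check as described above).

```python
# Rough sketch only; the linked gist is the real script. Assumes the python
# netCDF4 and xarray packages are available. Paths/naming are placeholders.
import os
from netCDF4 import Dataset
import xarray as xr

root = "/path/to/inputdata"  # placeholder for the inputdata directory

for dirpath, _, filenames in os.walk(root):
    for name in filenames:
        if not name.endswith(".nc"):
            continue
        src = os.path.join(dirpath, name)
        with Dataset(src, "r") as nc:
            fmt = nc.data_model          # e.g. 'NETCDF4' or 'NETCDF3_CLASSIC'
        if not fmt.startswith("NETCDF4"):
            continue                     # already a NetCDF-3 variant, skip it
        # placeholder naming; the real workflow updates the datestamp instead
        dst = src[:-3] + "_nc3.nc"
        xr.open_dataset(src).to_netcdf(dst, format="NETCDF3_CLASSIC")
        print("converted", src, "->", dst)
```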
Regarding cprnc: I think it will only call the files identical if all variables that are in both files are BFB. If they only agree to some tolerance, it returns the RMS differences. For small files, netcdf3 is the best format. The 64 bit offset is only needed if the files are > 2GB.
Thanks Mark. Your expectations regarding cprnc seem to be borne out by model runs.
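To make the "BFB for all common variables" criterion above concrete, here is a rough illustration of that kind of check (this is not how cprnc is implemented; it assumes the python netCDF4 and numpy packages, and the filenames are placeholders).

```python
# Rough illustration (not cprnc itself): report, for every variable present
# in both files, whether the stored values are bit-for-bit identical.
import numpy as np
from netCDF4 import Dataset

with Dataset("file_netcdf4.nc") as a, Dataset("file_netcdf3.nc") as b:
    # compare raw stored values rather than masked/scaled views
    a.set_auto_mask(False)
    b.set_auto_mask(False)
    for name in sorted(set(a.variables) & set(b.variables)):
        same = np.array_equal(a.variables[name][:], b.variables[name][:])
        print(name, "BFB" if same else "DIFFERS")
```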
@PeterCaldwell did you finish replacing all the NetCDF-4 files?
@jayeshkrishna the test at the top of this issue is now passing on bebop (and anvil) but the file is still NetCDF-4. Did you update the netcdf library?
I did create new netcdf3 files for all netcdf4 files in 1950 compsets and (if I recall correctly) deck configurations. Compsets need to be updated to actually use these files for low-res configurations. I have a branch that does this for 1950 compsets, but I was waiting to fix other problems with land ICs before issuing a PR. I think Chris doesn't want to change the deck compsets right now (even though the change would be bfb) for fear of inducing errors. So - PR coming in the next day or two.
I haven't updated BGC experiment netcdf4 files.
The file in question is Solar_1850control_input4MIPS_c20171101.nc which is also used in A_WCYCL1850S_CMIP6
I see what's happening, the test that's now passing is SMS.ne30_oECv3.BGCEXP_BCRC_CNPRDCTC_1850.bebop_intel.clm-bgcexp. The clm-bgcexp testmod changes the PIO type to netcdf which avoids the problem.
@PeterCaldwell did you change the datasets for the high-res cases you're doing?
Before issuing the initial 1950 compset PR I realized netcdf4 was a problem and switched to netcdf3 for high-res. I thought I had also fixed netcdf4 issues for low-res 1950 compsets, but somehow that didn't make its way to master (I probably forgot to push the low-res change). So yes - the high-res uses netcdf3 files.
@rljacob - do you want me to do anything (other than make a PR for 1950 low-res fixes)?
Yes go ahead. Someone needs to update the other time periods' files as well.
…2174) Fixes broken A_WCYCL1950S_CMIP6_LR and A_WCYCL1950S_CMIP6_LRtunedHR compsets by:
- replacing all netcdf4 references with netcdf3 equivalents
- specifying separate clm use case files for LR and HR compsets to avoid bug(?) where clm ignores resolution in choosing finidat and fsurdat files
Also cleans up/fixes both HR and LR 1950 compsets by:
- removing all 1950 mentions in namelist_defaults_clm4_5.xml. These default values were never used and would just confuse anyone trying to edit these compsets.
- removing landuse files from both HR and LR compsets. Using landuse files in control compsets is wrong and will give bad answers (though the impact is probably small).
Didn't run the test suite because 1950 compsets aren't tested. Ran 1950 HR, LR, and LRtunedHR compsets for 1 day on cori. Fixes some issues discussed in #2048 [BFB] except for 1950 compsets; [NML]
The test fails with the following error message: