High Resolution runs only work with pnetcdf/1.5.0 #1279
Comments
@mgduda could you comment on this when you have a moment? This is a fairly high priority ACME need.
@mgduda I just put this on the agenda for the Mon 4/10 telecon.
From @vanroekel: One other thing regarding the PIO errors: if we do switch to cdf5 output, we must require netCDF 4.4.0 or above. This is the default on LANL IC and titan at least, but it does not exist on rhea.
@vanroekel , @mark-petersen Sorry for not replying earlier! We've observed the same behavior regarding pnetCDF versions as well: 1.5.0 seems to allow us to write large variables (as long as any individual MPI task does not write more than 4 GB of the variable), but newer versions of pnetCDF stop us with a format constraint error. My guess as to why this results in a "valid" output file is that the netCDF format never explicitly stores the size of a variable or record, but only the dimensions of that variable; so, as long as one only reads back <4 GB at a time, there may not be problems. At least for stand-alone MPAS models, selecting io_type="pnetcdf,cdf5" for the affected streams in the run-time XML stream definition files avoids the CDF-2 size limit.
If ACME is using the standard XML stream definition files, I suspect that just adding this attribute would work for you all, too. Just for reference, at the end of section 5.2 in the atmosphere users' guide (perhaps in a different section of other users' guides) is a description of the io_type attribute. As was noted on our developers' telecon today, though, post-processing CDF5 / 64BIT_DATA files can be tricky. As a method of last resort, I have some Fortran code that links against the regular netCDF and pnetCDF libraries and could be modified to serve as a converter between CDF5 and CDF2 or HDF5 formats.
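A minimal sketch of where that attribute goes in a run-time stream definition, assuming the standard streams XML format; the stream name, filename template, interval, and variable below are placeholders, and only the io_type attribute is the point:

```xml
<stream name="output"
        type="output"
        io_type="pnetcdf,cdf5"
        filename_template="output.$Y-$M-$D_$h.$m.$s.nc"
        output_interval="00-00-01_00:00:00"
        clobber_mode="truncate">
    <var name="normalVelocity"/>
</stream>
```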
@vanroekel , @mark-petersen , it sounds like in the case of the RRS18to6 mesh, it might be that the only commonly used field that violates the 4GB restriction is normalVelocity. If so, then we might be able to avoid the headaches of postprocessing with CDF5 format files, because the history files could still use CDF2 format (I'm assuming normalVelocity is not of interest for history files). (I.e., just the restart stream could be set to use CDF5)
@matthewhoffman this sounds like a reasonable plan. I think velocityZonal and velocityMeridional are in all other outputs. I will test this in the 18to6.
@vanroekel and @matthewhoffman, if we ever want to compute the Meridional Overturning Circulation with post processing from the RRS18to6 (which we very likely don't), we would need normalVelocity to do so. I'm just mentioning it now because I want to make sure we're removing that possibility with eyes open. @milenaveneziani?
All, as I note above, I have no problems with analysis of CDF5 data as long as the netcdf is up to date (>= 4.4.0). For example, all output for my G-case High Res is CDF5 and I can run analysis with no issues as the netcdf is v4.4.1, but it won't work on rhea, as netcdf is 4.3.3.1. Once machines have netcdf upgraded, I don't foresee problems. But I may be missing some headaches noted by @matthewhoffman and @mgduda. I was told by pnetcdf developers that the newer versions of nco and netcdf fully support CDF5, so I suspect these headaches won't be long term.
If possible, I think it would be a good thing to keep outputting normalVelocity until we test the MOC analysis member at low- and high-resolution, and we make sure things work as expected.
It is fine with me to leave normalVelocity; the only caveat is that the analysis will not work on rhea until netcdf is upgraded. I have put in a request already.
I agree with Milena: we need to keep normalVelocity. In the meantime, it sounds like Michael's code would be nice. My workaround today (for small files) is to ncdump to a CDL file on a 4.4.1 machine (e.g., titan), then ncgen that file back to netCDF using 4.4.0 (e.g., rhea).
A short term workaround might be to stick normalVelocity in its own output file with CDF5 format. That way the primary output will still work with older netCDF libraries, but you'd still have normalVelocity around if needed.
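A sketch of what that dedicated stream could look like in the run-time streams file; the stream name, filename template, and interval are illustrative, and the rest of the output would stay in its existing CDF-2 streams:

```xml
<stream name="normalVelocityOutput"
        type="output"
        io_type="pnetcdf,cdf5"
        filename_template="normalVelocity.$Y-$M-$D_$h.$m.$s.nc"
        output_interval="00-01-00_00:00:00"
        clobber_mode="truncate">
    <var name="xtime"/>
    <var name="normalVelocity"/>
</stream>
```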
The problem with moving normalVelocity is that it would require timeSeriesStatsMonthly to be split as well; that file, and normalVelocity within it, is what MPAS-Analysis uses. Is that correct, @milenaveneziani?
@vanroekel Could you re-run an output test with the RRS18to6 to confirm that io_type="pnetcdf,cdf5" works with pnetcdf >1.5.0, without any code changes?
@mark-petersen I cannot get this to work in ACME; it seems that PIO_TYPENAME supersedes all io_type choices, and pnetcdf,cdf5 is not a valid option for ACME right now. @jonbob has found similar behavior in his tests.
@vanroekel - I found something in the driver that sets a master pio_type, which then controls all pio output from there. I have a test to try tomorrow, so we can chat first thing.
Great! I was just chatting with @mark-petersen about this and we found the same and are going to try as well.
Ah, great minds and all....
@vanroekel and @jonbob I just ran the exact restart test on MPAS-O on wolf, with netcdf 4.4.0, using the flag io_type="pnetcdf,cdf5" on the restart stream, and it passed.
This did not work on grizzly, but only because the netcdf module there is mistakenly an earlier version.
In other words, we should proceed and try this in ACME. I can try on edison.
@mark-petersen - that's great, but do we need to check the netcdf version somehow before changing the setting? I've been mucking about in the scripts and figured out how to do most of what we need, but that would complicate it....
@mark-petersen, in checking, netcdf is highly variable in its versions on IC; only the intel15.0.5 build will work. All others have netcdf C bindings of 4.3.2. So it seems like a version check will be needed, unfortunately.
@vanroekel and @jonbob Good news. I ran 1 day + restart + 1 day on edison, EC60to30v3 G case: once with default settings, and once with the restart stream flag io_type="pnetcdf,cdf5".
The only change needed was setting the restart stream's io_type to CDF5.
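A sketch of that change, assuming it went into the restart stream entry of streams.ocean; the filename template and intervals are placeholders, and only the io_type attribute is the actual change:

```xml
<immutable_stream name="restart"
                  type="input;output"
                  io_type="pnetcdf,cdf5"
                  filename_template="restarts/restart.$Y-$M-$D_$h.$m.$s.nc"
                  filename_interval="output_interval"
                  input_interval="initial_only"
                  output_interval="00-00-01_00:00:00"/>
```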
This is compiled on edison with the intel compiler, with the corresponding netCDF and pnetCDF library modules.
@vanroekel - does this issue get closed with ACME PR #1456? Or should we test first?
@jonbob that PR does fix the issue for ACME, but this issue does remain on the MPAS stand-alone side. I'm not sure how we should address the issue for MPAS-O only simulations on the 18to6 mesh. Any thoughts? I don't think modifying streams.ocean in default inputs is wise; perhaps this is a modification to the testing infrastructure? Pinging @mark-petersen as well.
I think io_type="netcdf" should be the default in Registry.xml in MPAS-O stand-alone for all non-partitioned AM output streams. I can't think of any reason not to - can anyone else? I would leave everything else as io_type="pnetcdf,cdf2" for now because we don't have cdf5 tools everywhere. We could put cdf5 in the testing scripts for the RRS18to6 restart.
I agree with what @mark-petersen proposes for stand-alone, especially as LANL does not yet have cdf5 capabilities.
In running an MPAS-O case with approximately 3.6 million cells (RRS18to6), our log.err file has numerous instances of the error "Bad return value from PIO" and the file written has size 0. If we use pnetcdf/1.5.0, this does not happen; the output looks reasonable and valid (verified with ncdump and by visualizing with ParaView).
After digging through the framework and comparing pnetcdf versions, it appears pnetcdf/1.5.0 works because there was a bug in that version that was remedied in later versions. In MPAS, we use NC_64BIT_OFFSET by default for output. For CDF-2 files, no single variable can exceed 4GB in size. Any variable in my 18to6 run that is dimensioned nEdges by nVertLevels (e.g., normalVelocity) has a size of ~8GB and thus violates this constraint. In pnetcdf/1.5.0 only the variable dimensions were accounted for, with no consideration of the size of an element, which allowed us to pass the size check and proceed to file writes. This was remedied in pnetcdf/1.6.0, and we can no longer write using NC_64BIT_OFFSET. I still do not understand why I get valid output for an array that violates CDF-2 constraints, and am communicating with the pnetcdf developers on this (see the discussion at https://trac.mcs.anl.gov/projects/parallel-netcdf/ticket/29). However, I think the more appropriate solution is to switch the default output to NC_64BIT_DATA (CDF-5), or at least allow easier use of this option. From what I can tell, there is not an easy way to use NC_64BIT_DATA in the framework. If I look at this block from mpas_io.F:
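(The block is paraphrased below rather than quoted verbatim; it is a sketch that captures only the behavior described in this report, and the sentinel constant name is illustrative.)

```fortran
! Sketch of the iotype/format selection logic, not the verbatim mpas_io.F source.
! 64BIT_DATA (CDF-5) is only reachable when a master PIO iotype has been set
! explicitly; otherwise output falls back to parallel-netCDF with 64BIT_OFFSET (CDF-2).
if (master_pio_iotype /= MPAS_MASTER_IOTYPE_UNSET) then   ! sentinel name is illustrative
   pio_iotype = master_pio_iotype
   pio_mode   = PIO_64BIT_DATA
else
   pio_iotype = PIO_iotype_pnetcdf
   pio_mode   = PIO_64BIT_OFFSET
end if
```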
I can only get to the 64BIT_DATA option if master_pio_iotype is set, yet I can't seem to find where this happens; I see no calls to MPAS_io_set_iotype. Am I missing it? Or is 64BIT_OFFSET currently the only option for output? If so, is it possible to change this? 64BIT_DATA seems to work in my tests with a modified framework, but I don't know if I'm missing something else about why output is only written in CDF-2.