
Current issues renaming dimensions and coordinates #597

Open
czender opened this issue Nov 7, 2017 · 74 comments

@czender
Contributor

czender commented Nov 7, 2017

Old issues with renaming dimensions and coordinates in netCDF4 files are beginning to affect CMIP6 processing, so here's a reminder. All recent library versions are affected. See here for a chronology. Create a netCDF4 file with a coordinate:

netcdf bug_rnm {
       dimensions:
            lon = 4 ;
       variables:
           float lon(lon) ;
       data:
          lon = 0, 90, 180, 270 ; 
}

Remember to re-create the file after each of the three tests below. Renaming both dimension and coordinate together works. Yay! This surprises me, given the history recounted above:

	 ncrename -d lon,longitude -v lon,longitude ~/bug_rnm.nc # works

Renaming dimension then coordinate separately fails:

       ncrename -d lon,longitude ~/bug_rnm.nc
       ncrename -v lon,longitude ~/bug_rnm.nc # broken: "HDF error"

Renaming coordinate then dimension separately fails:

       ncrename -v lon,longitude ~/bug_rnm.nc
       ncrename -d lon,longitude ~/bug_rnm.nc # broken: "nc4_reform_coord_var: Assertion `dim_datasetid > 0' failed."
@WardF
Member

WardF commented Nov 7, 2017 via email

@edhartnett
Contributor

edhartnett commented Nov 7, 2017

Howdy @czender !

I found a test written by Quincey that deals with renaming, nc_test4/tst_rename.c.

I have added to the existing tests and reproduced your error in C code. Here's the test:

fprintf(stderr,"*** Test Charlie's test for renaming...");
 {
#define CHARLIE_TEST_FILE "nc_rename_coord_dim.nc"
#define DIM1_NAME "lon"
#define VAR1_NAME "lon"
#define DIM2_NAME "longitude"
#define VAR2_NAME "longitude"
#define DIM1_LEN 4
#define NDIM1 1
    int ncid, dimid, varid;
    float data[DIM1_LEN] = {0, 90.0, 180.0, 270.0};
    
    if (nc_create(CHARLIE_TEST_FILE, NC_NETCDF4, &ncid)) ERR;
    if (nc_def_dim(ncid, DIM1_NAME, DIM1_LEN, &dimid)) ERR;
    if (nc_def_var(ncid, VAR1_NAME, NC_FLOAT, NDIM1, &dimid, &varid)) ERR;
    if (nc_enddef(ncid)) ERR;
    if (nc_put_var_float(ncid, varid, data)) ERR;
    if (nc_close(ncid)) ERR;

    /* Reopen the file to check. */
    if (check_charlies_file(CHARLIE_TEST_FILE, DIM1_NAME, VAR1_NAME)) ERR;

    /* Open the file and rename the dimension. */
    if (nc_open(CHARLIE_TEST_FILE, NC_WRITE, &ncid)) ERR;
    if (nc_redef(ncid)) ERR;
    if (nc_rename_dim(ncid, 0, DIM2_NAME)) ERR;
    if (nc_enddef(ncid)) ERR;
    if (nc_close(ncid)) ERR;

    /* Reopen the file to check. */
    if (check_charlies_file(CHARLIE_TEST_FILE, DIM2_NAME, VAR1_NAME)) ERR;

    /* Open the file and rename the variable. */
    if (nc_open(CHARLIE_TEST_FILE, NC_WRITE, &ncid)) ERR;
    if (nc_redef(ncid)) ERR;
    if (nc_rename_var(ncid, 0, VAR2_NAME)) ERR;
    if (nc_enddef(ncid)) ERR;
    if (nc_close(ncid)) ERR;

    /* Reopen the file to check. */
    if (check_charlies_file(CHARLIE_TEST_FILE, VAR2_NAME, VAR1_NAME)) ERR;

This fails like this:

*** Test Charlie's test for renaming...HDF5-DIAG: Error detected in HDF5 (1.10.1) thread 0:
  #000: H5Gdeprec.c line 459 in H5Gmove(): couldn't move link
    major: Links
    minor: Unable to initialize object
  #001: H5Gdeprec.c line 541 in H5G_move(): unable to move link
    major: Links
    minor: Can't move object
  #002: H5L.c line 2763 in H5L_move(): unable to find link
    major: Symbol table
    minor: Object not found
  #003: H5Gtraverse.c line 867 in H5G_traverse(): internal path traversal failed
    major: Symbol table
    minor: Object not found
  #004: H5Gtraverse.c line 639 in H5G_traverse_real(): traversal operator failed
    major: Symbol table
    minor: Callback failed
  #005: H5L.c line 2623 in H5L_move_cb(): unable to follow symbolic link
    major: Symbol table
    minor: Object not found
  #006: H5Gtraverse.c line 867 in H5G_traverse(): internal path traversal failed
    major: Symbol table
    minor: Object not found
  #007: H5Gtraverse.c line 639 in H5G_traverse_real(): traversal operator failed
    major: Symbol table
    minor: Callback failed
  #008: H5L.c line 2484 in H5L_move_dest_cb(): an object with that name already exists
    major: Symbol table
    minor: Object not found
Sorry! Unexpected result, tst_rename.c, line: 301

I will take a look and see what is going on here...

@edhartnett
Contributor

OK, the problem can be seen from the h5dump:

HDF5 "nc_rename_coord_dim.nc" {
GROUP "/" {
   ATTRIBUTE "_NCProperties" {
      DATATYPE  H5T_STRING {
         STRSIZE 67;
         STRPAD H5T_STR_NULLTERM;
         CSET H5T_CSET_ASCII;
         CTYPE H5T_C_S1;
      }
      DATASPACE  SCALAR
      DATA {
      (0): "version=1|netcdflibversion=4.5.1-development|hdf5libversion=1.10.1"
      }
   }
   DATASET "lon" {
      DATATYPE  H5T_IEEE_F32LE
      DATASPACE  SIMPLE { ( 4 ) / ( 4 ) }
      DATA {
      (0): 0, 90, 180, 270
      }
      ATTRIBUTE "DIMENSION_LIST" {
         DATATYPE  H5T_VLEN { H5T_REFERENCE { H5T_STD_REF_OBJECT }}
         DATASPACE  SIMPLE { ( 1 ) / ( 1 ) }
         DATA {
         (0): (DATASET 659 /longitude )
         }
      }
      ATTRIBUTE "_Netcdf4Dimid" {
         DATATYPE  H5T_STD_I32LE
         DATASPACE  SCALAR
         DATA {
         (0): 0
         }
      }
   }
   DATASET "longitude" {
      DATATYPE  H5T_IEEE_F32BE
      DATASPACE  SIMPLE { ( 4 ) / ( 4 ) }
      DATA {
      (0): 0, 0, 0, 0
      }
      ATTRIBUTE "CLASS" {
         DATATYPE  H5T_STRING {
            STRSIZE 16;
            STRPAD H5T_STR_NULLTERM;
            CSET H5T_CSET_ASCII;
            CTYPE H5T_C_S1;
         }
         DATASPACE  SCALAR
         DATA {
         (0): "DIMENSION_SCALE"
         }
      }
      ATTRIBUTE "NAME" {
         DATATYPE  H5T_STRING {
            STRSIZE 64;
            STRPAD H5T_STR_NULLTERM;
            CSET H5T_CSET_ASCII;
            CTYPE H5T_C_S1;
         }
         DATASPACE  SCALAR
         DATA {
         (0): "This is a netCDF dimension but not a netCDF variable.         4"
         }
      }
      ATTRIBUTE "REFERENCE_LIST" {
         DATATYPE  H5T_COMPOUND {
            H5T_REFERENCE { H5T_STD_REF_OBJECT } "dataset";
            H5T_STD_I32LE "dimension";
         }
         DATASPACE  SIMPLE { ( 1 ) / ( 1 ) }
         DATA {
         (0): {
               DATASET 331 /lon ,
               0
            }
         }
      }
   }
}
}

When the code hits this line in NC4_rename_var() it fails:

  /* Change the HDF5 file, if this var has already been created
      there. */
   if (var->created)
   {
      if (H5Gmove(grp->hdf_grpid, var->name, name) < 0)
         BAIL(NC_EHDFERR);
   }

So the code is trying to rename the dataset with H5Gmove, but can't, because there is already a dataset with the new name. It was created to support the dimension name change - it's one of those dreaded secret datasets that support HDF5 dimension scales.

I guess the way to proceed is to check, before the H5Gmove, whether another var of the desired name secretly exists; if so, delete it, then make the newly renamed dataset into a dimension scale.

I will try to take a stab at that...

@czender
Contributor Author

czender commented Nov 8, 2017

Thanks, @edhartnett. It sounds like an intricate fix, so I hope you #persist.

@cameronsmith1

Thanks, All!

@edhartnett
Contributor

This is happening on this branch: https://github.com/NetCDF-World-Domination-Council/netcdf-c/tree/ejh_zender_coordinate_var_rename

So far I have just added the new test that demonstrates the problem.

@edhartnett
Contributor

@czender I am working on this now as part of my current branch. I have been sneaking up on this by writing tests for lots of simpler code in nc4var.c. Now I have done the easy stuff and am moving on to the rename bugs.

To help us understand the urgency of this fix, if any, can you elaborate on how it is being used in CMIP6? And what CMIP6 is?

Are we talking world-shatteringly important climate research here? Or what?

And why the heck are you renaming vars anyway? Can't you scientists just make up your minds? What is the use-case that is driving this? Can you help me understand the importance of this feature?

Thanks!

PS I am going to fix it anyway, obviously. You would have to physically restrain me to stop me from fixing it. I am just curious why you need it.

@cameronsmith1

CMIP6 is the large international climate model intercomparison for the next IPCC report. The CMIP6 organizers provided data in netcdf files for all of the different modeling groups to use. However, each climate model typically looks for a particular name for each input variable it needs, which varies from model to model (eg, sea_surface_temperature, SST, surfTemp), and there are hundreds of these, especially when we have to reprocess our output to match the CMIP6 required variable names. Hence the need to rename the variables.

The good news is that I have already worked around this bug (it was too important to wait), but it will be good to get it fixed for the future.

@cameronsmith1

I actually use the NCO utilities, so this is a question for @czender .

@DennisHeimbigner
Collaborator

Ed, in retrospect, I sure wish you guys had not been seduced by dimension scales :-)

@edhartnett
Contributor

Mea culpa, mea culpa, mea maxima culpa.

@edhartnett
Contributor

@cameronsmith1 thanks for the explanation.

What workaround are you using?

@cameronsmith1

I ended up using the nco utilities of @czender in a different way, so I am not sure how to express it in native HDF commands.

@czender
Contributor Author

czender commented Dec 24, 2017

I receive a few bug reports a year from people trying to rename coordinates in netCDF4 files with ncrename. CMIP does this on an industrial scale. Most people end up using the multistep workaround(s) documented in the NCO manual here.

@edhartnett
Contributor

OK I have a fix for this problem, as well as several other rename problems. I am still working on another one I found.

There is a bunch of state code in this which is not helpful. For example there are these values in NC_VAR_INFO_T:

   nc_bool_t is_new_var;        /* True if variable is newly created */
   nc_bool_t was_coord_var;     /* True if variable was a coordinate var, but either the dim or var has been renamed */
   nc_bool_t became_coord_var;  /* True if variable _became_ a coordinate var, because either the dim or var has been renamed */

These are set at various times in nc_rename_var() and nc_rename_dim().

Then in the sync that is called during nc_enddef() these values are consulted and various actions taken.

Problem is, it's not hard to come up with combinations of rename calls that confuse the crap out of the code. There are so many possible ways that users can name and rename vars and dims, with enddefs/redefs thrown in or not, that it gets very, very confusing.

What would work better is statelessness. Whenever a rename call completes, the file on disk and the in-memory metadata must be fully adjusted and in a consistent state. There must be nothing left over from the rename process for enddef to do. All work must be done in nc_rename_var() and nc_rename_dim().

If that were true, then the user could call them in any wild order, and it would not matter to us.

I am not attempting to take these out right now. I am patching the existing code so that it works for the test cases that we've identified, including Charlie's. But I suspect there are many more bugs out there with the existing code that will only be eliminated by removing these state variables and dealing with everything in the rename.

@edhartnett
Contributor

OK, these fixes are up as PR #755. It's branch ejh_fill_values on my netcdf clone: git@github.com:NetCDF-World-Domination-Council/netcdf-c.git

@czender if you could grab the branch and give it a try, that would be great. I don't think I've fixed every possible rename problem, but see if your common cases now work OK.

Any problems would be best expressed by adding a test to nc_test4/tst_rename.c. Simply copy an existing test case, then change it to make it fail, and send me the code or post it here.

@cameronsmith1

cameronsmith1 commented Jan 1, 2018

Thanks for working on this. At the web-page that @czender mentioned, there is a long list of problems over the years with renaming variables, which supports your conclusion that the current implementation is fragile against the many possible renaming situations. Interestingly, @czender notes that a robust solution is to convert from netcdf4 to netcdf3, then rename and convert back to netcdf4. Does the implementation of rename for netcdf3 offer a useful template?

@czender
Contributor Author

czender commented Jan 3, 2018

Thanks, @edhartnett. I will try to test this soon and report the results here.

@edhartnett
Contributor

@czender you should actually use branch currently up as PR #763.

This PR contains just the rename fixes from the other PR, now closed without merging.

In tst_rename.c there are some commented-out lines which show an unfixed bug:

      /* Open the file and rename the variable. This will remove
         * the dimscale-only dataset "longitude" and rename the
         * existing dataset "lon" to "longitude". Variable
         * "longitude" will become a coordinate var. */
        /* if (nc_open(CHARLIE_TEST_FILE, NC_WRITE, &ncid)) ERR; */
        /* if (nc_redef(ncid)) ERR; */
        /* if (nc_rename_var(ncid, 0, LONGITUDE)) ERR; */
        /* if (nc_enddef(ncid)) ERR; */
        /* if (nc_close(ncid)) ERR; */

Still working on this one. ;-)

@czender
Contributor Author

czender commented Jan 9, 2018

I got as far as re-building and testing the latest netCDF master four days ago. It produced a total of 7 new and unexpected failures with the NCO regression test, half with ncks and half with ncpdq. All new failures involved netCDF4 files. I have not yet figured out the exact cause. Tracking that down has higher priority than testing the rename fix for now.

@edhartnett
Contributor

edhartnett commented Jan 9, 2018 via email

@cameronsmith1

NCO is extremely widely used in the climate community for working with netcdf files, so including it in your test suite would be wonderful.

@WardF
Member

WardF commented Jan 9, 2018

We already do, as part of our testing. Unfortunately there are two sets of tests: ‘make check’, which will return non-zero when there is a failure, and ‘make test’, which requires manual intervention to catch failures. We run the former but not the latter.

@WardF
Member

WardF commented Jan 9, 2018

@czender are you seeing failures in make check or make test? I am not seeing failures with make check. I will check with make test after my poster session.

@czender
Contributor Author

czender commented Jan 9, 2018

As you suspect it is with "make test", which is always what I mean by "the NCO regression test". For historical reasons, NCO's "make check" is basically just a check that the NCO C++ code library links and runs a single, simple program.

@czender
Contributor Author

czender commented Jan 10, 2018

After a little investigation, it turns out that this was a problem with "make test" relying on a file, hdn.nc, that autoconf apparently does not re-create; it is not a netCDF or NCO code issue. Phew. Now I can move on to testing the rename patch...

@WardF
Member

WardF commented Jan 10, 2018

Excellent! I will hold off, let me know if any regressions pop up. Thank you @czender!

@edhartnett
Contributor

I have a PR (#1294) up with a fix to the most recently found rename bug.

@edhartnett
Contributor

edhartnett commented Jan 26, 2019

Here's another newly discovered rename issue. This code:

   fprintf(stderr,"*** test renaming two coord vars...");
   {
      int ncid, dimid1, dimid2, varid1, varid2;
      char file_name[NC_MAX_NAME + 1];
      char name[NC_MAX_NAME + 1];

      /* Create file with dim and associated coordinate var. */
      sprintf(file_name, "%s_sync.nc", TEST_NAME);
      if (nc_create(file_name, NC_CLOBBER|NC_NETCDF4|NC_CLASSIC_MODEL, &ncid)) ERR;
      if (nc_def_dim(ncid, D1_NAME, DIM1_LEN, &dimid1)) ERR;
      if (nc_def_dim(ncid, D2_NAME, DIM1_LEN, &dimid2)) ERR;
      if (nc_def_var(ncid, D1_NAME, NC_INT, NDIM1, &dimid1, &varid1)) ERR;
      if (nc_def_var(ncid, D2_NAME, NC_INT, NDIM1, &dimid2, &varid2)) ERR;
      if (nc_close(ncid)) ERR;

      /* Open the file and rename the vars. */
      nc_set_log_level(4);
      if (nc_open(file_name, NC_WRITE, &ncid)) ERR;
      if (nc_rename_var(ncid, varid1, TMP_NAME)) ERR;
      if (nc_rename_var(ncid, varid2, D1_NAME)) ERR;
      if (nc_close(ncid)) ERR;

      /* Reopen file and check, */
      if (nc_open(file_name, NC_WRITE, &ncid)) ERR;
      /* if (nc_inq_dimid(ncid, D1_NAME, &dimid)) ERR; */
      if (nc_inq_varid(ncid, TMP_NAME, &varid1)) ERR;
      if (nc_close(ncid)) ERR;
   }
   SUMMARIZE_ERR;


Produces this output:

*** test renaming two coord vars...				log_level changed to 4
	NC4_open: path tst_rename_sync.nc mode 4097 params 0
	HDF5 error messages turned on.
			nc4_open_file: path tst_rename_sync.nc mode 4097
			nc4_grp_list_add: name / 
				nc4_open_file: set HDF raw chunk cache to size 4194304 nelems 1009 preemption 0.750000
			rec_read_metadata: grp->hdr.name /
			found dataset d1
				read_var: obj_name d1
			read_coord_dimids: var->hdr.name d1
			found dataset d2
				read_var: obj_name d2
			read_coord_dimids: var->hdr.name d2
			NC4_read_ncproperties
			NC4_get_provenance: ncid 0x0 propstring version=2,netcdf=4.6.3-development,hdf5=1.10.4
				rec_match_dimscales: grp->hdr.name /
		*** NetCDF-4 Internal Metadata: int_ncid 0x10000 ext_ncid 0x10000
		FILE - path: tst_rename_sync.nc cmode: 0x1109 parallel: 0 redef: 0 fill_mode: 0 no_write: 0 next_nc_grpid: 1
		 GROUP - / nc_grpid: 0 nvars: 2 natts: 0
		 DIMENSION - dimid: 0 name: d1 len: 4 unlimited: 0
		 DIMENSION - dimid: 1 name: d2 len: 4 unlimited: 0
		 VARIABLE - varid: 0 name: d1 ndims: 1 dimscale: 1 dimids: 0
		 VARIABLE - varid: 1 name: d2 ndims: 1 dimscale: 1 dimids: 1
		NC4_rename_var: ncid 0x10000 varid 0 name t1
			nc4_get_var_meta: var d1
			Moving dataset d1 to t1
			var is now t1
			nc4_break_coord_var dim d1 was associated with var t1, but now has different name
			rec_detach_scales: grp->hdr.name /
				nc4_find_dim: dimid 0
		NC4_rename_var: ncid 0x10000 varid 1 name d1
			nc4_get_var_meta: var d2
			Moving dataset d2 to d1
			var is now d1
			nc4_break_coord_var dim d2 was associated with var d1, but now has different name
			rec_detach_scales: grp->hdr.name /
				nc4_find_dim: dimid 1
	NC4_close: ncid 0x10000
			nc4_close_hdf5_file: h5->path tst_rename_sync.nc abort 0
			sync_netcdf4_file
		*** NetCDF-4 Internal Metadata: int_ncid 0x10000 ext_ncid 0x10000
		FILE - path: tst_rename_sync.nc cmode: 0x1109 parallel: 0 redef: 0 fill_mode: 0 no_write: 0 next_nc_grpid: 1
		 GROUP - / nc_grpid: 0 nvars: 2 natts: 0
		 DIMENSION - dimid: 0 name: d1 len: 4 unlimited: 0
		 DIMENSION - dimid: 1 name: d2 len: 4 unlimited: 0
		 VARIABLE - varid: 0 name: t1 ndims: 1 dimscale: 0 dimids: 0
		 VARIABLE - varid: 1 name: d1 ndims: 1 dimscale: 0 dimids: 1
			nc4_rec_write_groups_types: grp->hdr.name /
			nc4_rec_write_metadata: grp->hdr.name /, bad_coord_order 0
				write_dim: creating dim d1
				write_dim: about to H5Dcreate1 a dimscale dataset d1
HDF5-DIAG: Error detected in HDF5 (1.10.4) thread 0:
  #000: H5D.c line 145 in H5Dcreate2(): unable to create dataset
    major: Dataset
    minor: Unable to initialize object
  #001: H5Dint.c line 326 in H5D__create_named(): unable to create and link to dataset
    major: Dataset
    minor: Unable to initialize object
  #002: H5L.c line 1572 in H5L_link_object(): unable to create new link to object
    major: Links
    minor: Unable to initialize object
  #003: H5L.c line 1813 in H5L__create_real(): can't insert link
    major: Links
    minor: Unable to insert object
  #004: H5Gtraverse.c line 851 in H5G_traverse(): internal path traversal failed
    major: Symbol table
    minor: Object not found
  #005: H5Gtraverse.c line 627 in H5G__traverse_real(): traversal operator failed
    major: Symbol table
    minor: Callback failed
  #006: H5L.c line 1608 in H5L__link_cb(): name already exists
    major: Links
    minor: Object already exists
ERROR: file nc4hdf.c, line 1788.
NetCDF: HDF error
HDF5-DIAG: Error detected in HDF5 (1.10.4) thread 0:
  #000: H5D.c line 145 in H5Dcreate2(): unable to create dataset
    major: Dataset
    minor: Unable to initialize object
  #001: H5Dint.c line 326 in H5D__create_named(): unable to create and link to dataset
    major: Dataset
    minor: Unable to initialize object
  #002: H5L.c line 1572 in H5L_link_object(): unable to create new link to object
    major: Links
    minor: Unable to initialize object
  #003: H5L.c line 1813 in H5L__create_real(): can't insert link
    major: Links
    minor: Unable to insert object
  #004: H5Gtraverse.c line 851 in H5G_traverse(): internal path traversal failed
    major: Symbol table
    minor: Object not found
  #005: H5Gtraverse.c line 627 in H5G__traverse_real(): traversal operator failed
    major: Symbol table
    minor: Callback failed
  #006: H5L.c line 1608 in H5L__link_cb(): name already exists
    major: Links
    minor: Object already exists
Sorry! Unexpected result, tst_rename2.c, line: 173

I am working on this one...

The problem is that there is a dim named D1 without a coord var. When the var name is changed to D1, that var becomes the coord var for dim D1, which previously had no coord var and so was represented by a dim-only dataset in the HDF5 file. So we have to detect that this is happening, and delete the dim-only dataset in the file before doing the rename.

This condition should be detected by this bit of code in hdf5var.c:

      if (var->ndims)
      {
         NC_HDF5_DIM_INFO_T *hdf5_d0;
         hdf5_d0 = (NC_HDF5_DIM_INFO_T *)var->dim[0]->format_dim_info;

         /* Is there an existing dimscale-only dataset of this name? If
          * so, it must be deleted. */
         if (hdf5_d0->hdf_dimscaleid)
         {
            if ((retval = delete_dimscale_dataset(grp, var->dim[0]->hdr.id,
                                                  var->dim[0])))
               return retval;
         }
      }

But that's not happening, because in this case hdf5_d0->hdf_dimscaleid is 0. I believe this is the real bug here.

@edhartnett
Contributor

The rename code is pretty dumb - some of my dumbest work on netcdf-4. I just didn't give it much thought; it was an afterthought, really. I have since come to understand that renaming is actually an important operation for a lot of important users.

I have an idea for a comprehensive rename fix.

Consider two cases, easy and difficult renames. Difficult renames involve coordinate variables and dimensions. (And easy renames are easy.)

There are various orders of difficult rename, based on what happens between open and enddef. I can open a file and rename a coordinate var and enddef. That's a first order difficult rename. Or I can open the file, rename the coord var, then rename its dimension, then enddef. That's a second order difficult rename. Each rename I add can change the state of other vars and dims in the file, all of which must be sorted out in the enddef. To handle this we maintain a partial state machine with fields like was_coord_var and became_coord_var in the NC_VAR_INFO_T struct.

The contemplation of higher order difficult renames leads to a form of madness in which I grind my teeth, clench my fists, and curse users like @czender for doing so many crazy rename operations. (See https://www.youtube.com/watch?v=8EnsUeR2MyI)

My idea is to limit the problem to first order difficult renames, which should be tractable. Each time the user does a difficult rename, the library will do an automatic nc_sync(). This will flush all changes to disk (including the rename and all its consequences). The next difficult rename operation will be a first order difficult rename, because the last rename has been flushed to disk, the state machine issues resolved, and everything reset as if the file had just been opened.

Adding an nc_sync() should never have an influence on what the user sees in the file. It simply causes a flush to disk, and also handles any changes made necessary by the difficult rename. So this should be a safe operation. (Although it may cause a slight performance hit for those doing many renames. But correctness is more important than performance.)

However, the fly in the ointment is that adding this sync causes some current rename tests to fail, due to previously hidden first order difficult rename bugs that were masked by compensating 2nd or 3rd order difficult rename bugs that have been added to the code base to support all the crazy rename tests we currently pass. (See https://www.youtube.com/watch?v=BNsrK6P9QvI)

My current course is to fix these first order rename bugs so that I can turn on the syncing after each difficult rename operation. Then I can add more complex tests to demonstrate that higher-order difficult renames will work in tests, due to the auto-syncing that is happening after each one.

This is still a somewhat daunting task, but at least a feasible one. (See https://www.youtube.com/watch?v=GND10sWq0n0).

@czender
Contributor Author

czender commented Feb 2, 2019

Thank you for running the long race on this! Can't wait to see the harmonious finished product (see https://www.youtube.com/watch?v=CSav51fVlKU).

@edwardhartnett
Contributor

Well @czender I am not running the long race here. I've apparently stopped at a country pub along the way. I'm not currently working on this master rename-fix, which involves calling sync after each complex rename, to ensure that we only have to deal with first-order complex rename operations.

How urgent is this? Did we address the most important rename problems in the fixes I did make? Or are there still a lot of rename problems caused by netcdf-4?

@czender
Contributor Author

czender commented Feb 14, 2020

It is of moderate importance. Users have mostly gotten used to these bugs and take the advice to convert their files to netCDF3 when renames fail on their netCDF4 files, and then convert back. In the past year two users have contacted me after discovering this bug and not knowing what to do. That was a typical year, and applying the "rule of 10" to ncrename bug reports, netCDF4 rename bugs probably mildly irritate a few dozen ncrename users each year. The fixes you made last year do not seem to fix the cases where users (like me) retrofit all the dimension and coordinate names in a file in one ncrename command in order to make a dataset directly intercomparable with another. This is an important use case, IMHO, that seems to trigger the higher order rename bugs. The workaround is straightforward (call ncrename multiple times with single renames), though inelegant. And elegance matters. Help us Obi-Wan EdHartnett. You're our only hope.

@DennisHeimbigner
Collaborator

I take your note about ncrename to mean that doing a sequence of renames without closing the file causes problems, but doing a close and reopen between renames works. Correct?

@czender
Contributor Author

czender commented Feb 14, 2020

Yes

@sgpearse

Hi @edhartnett, I wanted to share a use case where this bug is impeding some of our Vapor users.

CM1 is a model for convective systems, particularly hurricanes, tornadoes, and supercells. It has been growing in popularity recently, and our team is seeing an increasing number of requests on how to ingest CM1 data into Vapor.

Vapor only supports NetCDF data that follows the CF Conventions. Unfortunately, CM1 outputs its coordinate variables with different names than their corresponding dimensions. Some users are determined enough to get around the bug, but we also see some give up. If renaming coordinate variables worked as intended, it would reduce our efforts in supporting CM1 users, and also increase our user base. In my opinion, it would also help to produce more compelling science.

Thank you.

@edwardhartnett
Contributor

Thanks for sharing that use case!

Can you not get the CM1 model team to change their output to properly use coordinate vars?

I would really like to rewrite the rename code, and use enddef/redef after each rename. I believe this could eliminate all the bugs associated with renames, which would be great. However, time, time, time...

@sgpearse

@edhartnett

> Can you not get the CM1 model team to change their output to properly use coordinate vars?

Fair point. I'll reach out to them.

> I would really like to rewrite the rename code, and use enddef/redef after each rename. I believe this could eliminate all the bugs associated with renames, which would be great. However, time, time, time...

I understand. We're in the same boat. I just wanted to raise this point in hopes it would raise the priority.

@leighorf

leighorf commented Aug 7, 2020

> @edhartnett
>
> > Can you not get the CM1 model team to change their output to properly use coordinate vars?
>
> Fair point. I'll reach out to them.
>
> > I would really like to rewrite the rename code, and use enddef/redef after each rename. I believe this could eliminate all the bugs associated with renames, which would be great. However, time, time, time...
>
> I understand. We're in the same boat. I just wanted to raise this point in hopes it would raise the priority.

FYI, I have patched the latest version of CM1 to output VAPOR3 compatible files: https://github.com/leighorf/cm1r19.10

@mathomp4

mathomp4 commented Jan 25, 2021

All,

I just want to make sure that a bug a user is seeing with ncrename is related to this (before I bother @czender with an NCO bug report). Namely, the user wanted to rename a variable and dimension lev(lev) to levels(levels) but:

$ ncdump -v lev merra2.airs_aqua.mean3d.201810.nc4
netcdf merra2.airs_aqua.mean3d.201810 {
dimensions:
	lon = 576 ;
	lat = 361 ;
	lev = 117 ;
	time = UNLIMITED ; // (1 currently)
variables:
	double lon(lon) ;
		lon:long_name = "longitude" ;
		lon:units = "degrees_east" ;
	double lat(lat) ;
		lat:long_name = "latitude" ;
		lat:units = "degrees_north" ;
	double lev(lev) ;
		lev:long_name = "vertical level" ;
		lev:units = "hPa" ;
		lev:positive = "down" ;
...
		:LongitudeResolution = "0.625" ;
		:DataResolution = "0.5 x 0.625" ;
		:identifier_product_doi_authority = "http://dx.doi.org" ;
		:identifier_product_doi = "TBD" ;
data:

 lev = 229, 228, 227, 226, 225, 223, 222, 221, 220, 219, 218, 217, 216, 215,
    212, 208, 202, 193, 190, 186, 182, 181, 178, 177, 176, 174, 173, 172,
    171, 170, 169, 168, 167, 166, 130, 129, 128, 126, 125, 124, 123, 122,
    121, 119, 117, 116, 115, 114, 113, 110, 109, 107, 106, 105, 104, 103,
    102, 101, 100, 99, 98, 97, 96, 95, 94, 93, 92, 91, 90, 89, 88, 87, 86,
    85, 84, 83, 81, 80, 78, 77, 76, 75, 73, 72, 70, 68, 67, 66, 65, 64, 63,
    62, 61, 60, 59, 57, 56, 55, 53, 52, 50, 49, 47, 46, 45, 44, 31, 29, 21,
    18, 14, 13, 11, 10, 9, 6, 3 ;
}
$ ncrename --history -v lev,levels -d lev,levels merra2.airs_aqua.mean3d.201810.nc4 foo.nc4
$ ncdump -v levels foo.nc4
netcdf foo {
dimensions:
	lon = 576 ;
	lat = 361 ;
	levels = 117 ;
	time = UNLIMITED ; // (1 currently)
variables:
	double lon(lon) ;
		lon:long_name = "longitude" ;
		lon:units = "degrees_east" ;
...
	double levels(levels) ;
		levels:long_name = "vertical level" ;
		levels:units = "hPa" ;
		levels:positive = "down" ;

// global attributes:
...
		:WorthernmostLatitude = "-180.0" ;
		:EasternmostLatitude = "179.375" ;
		:LatitudeResolution = "0.5" ;
		:LongitudeResolution = "0.625" ;
		:DataResolution = "0.5 x 0.625" ;
		:identifier_product_doi_authority = "http://dx.doi.org" ;
		:identifier_product_doi = "TBD" ;
data:

 levels = _, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _,
    _, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _,
    _, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _,
    _, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _,
    _, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _ ;
}

I mean, it does rename lev(lev) to levels(levels), but the content of levels seems to be...gone. 🤷🏼‍♂️

Obviously there's an issue, but I'm not sure what. It does look like the "usual workaround" of:

$ ncks -h -3 merra2.airs_aqua.mean3d.201810.nc4 tmp.nc
$ ncrename --history -d lev,levels -v lev,levels tmp.nc tmp_rename.nc
$ ncks -h -4 tmp_rename.nc newfile.nc4

does work...though (weirdly?) the compression in the original file is lost, so you have to do:

$ ncks -h -4 -L 2 tmp_rename.nc newfile.nc4

but it does at least seem to preserve what's in levels(levels):

$ ncdump -v levels newfile.nc4  | tail -15
		:LongitudeResolution = "0.625" ;
		:DataResolution = "0.5 x 0.625" ;
		:identifier_product_doi_authority = "http://dx.doi.org" ;
		:identifier_product_doi = "TBD" ;
data:

 levels = 229, 228, 227, 226, 225, 223, 222, 221, 220, 219, 218, 217, 216,
    215, 212, 208, 202, 193, 190, 186, 182, 181, 178, 177, 176, 174, 173,
    172, 171, 170, 169, 168, 167, 166, 130, 129, 128, 126, 125, 124, 123,
    122, 121, 119, 117, 116, 115, 114, 113, 110, 109, 107, 106, 105, 104,
    103, 102, 101, 100, 99, 98, 97, 96, 95, 94, 93, 92, 91, 90, 89, 88, 87,
    86, 85, 84, 83, 81, 80, 78, 77, 76, 75, 73, 72, 70, 68, 67, 66, 65, 64,
    63, 62, 61, 60, 59, 57, 56, 55, 53, 52, 50, 49, 47, 46, 45, 44, 31, 29,
    21, 18, 14, 13, 11, 10, 9, 6, 3 ;
}

@czender
Contributor Author

czender commented Jan 25, 2021

Hi Matt,
Your report has all the hallmarks of being due to this issue, and I think your diagnosis and workaround are correct. netCDF3 files do not support compression, so the loss of compression with your workaround is expected, not weird. A different workaround would be to use ncrename twice in succession, once for the variable name and once for the dimension name. Then you would not need to convert to netCDF3 as an intermediate step. HTH, Charlie

@mathomp4

Hi Matt,
Your report has all the hallmarks of being due to this issue, and I think your diagnosis and workaround are correct. netCDF3 files do not support compression, so the loss of compression with your workaround is expected, not weird. A different workaround would be to use ncrename twice in succession, once for the variable name and once for the dimension name. Then you would not need to convert to netCDF3 as an intermediate step. HTH, Charlie

@czender I tried that but:

$ ncrename --history -v lev,levels merra2.airs_aqua.mean3d.201810.nc4 yaya.var.nc4
$ ncdump -v levels yaya.var.nc4  | tail -n 15
		:LongitudeResolution = "0.625" ;
		:DataResolution = "0.5 x 0.625" ;
		:identifier_product_doi_authority = "http://dx.doi.org" ;
		:identifier_product_doi = "TBD" ;
data:

 levels = 229, 228, 227, 226, 225, 223, 222, 221, 220, 219, 218, 217, 216,
    215, 212, 208, 202, 193, 190, 186, 182, 181, 178, 177, 176, 174, 173,
    172, 171, 170, 169, 168, 167, 166, 130, 129, 128, 126, 125, 124, 123,
    122, 121, 119, 117, 116, 115, 114, 113, 110, 109, 107, 106, 105, 104,
    103, 102, 101, 100, 99, 98, 97, 96, 95, 94, 93, 92, 91, 90, 89, 88, 87,
    86, 85, 84, 83, 81, 80, 78, 77, 76, 75, 73, 72, 70, 68, 67, 66, 65, 64,
    63, 62, 61, 60, 59, 57, 56, 55, 53, 52, 50, 49, 47, 46, 45, 44, 31, 29,
    21, 18, 14, 13, 11, 10, 9, 6, 3 ;
}
$ ncrename --history -d lev,levels  yaya.var.nc4 yaya.var_and_dim.nc4
$ ncdump -v levels yaya.var_and_dim.nc4  | tail -n 15
		:WorthernmostLatitude = "-180.0" ;
		:EasternmostLatitude = "179.375" ;
		:LatitudeResolution = "0.5" ;
		:LongitudeResolution = "0.625" ;
		:DataResolution = "0.5 x 0.625" ;
		:identifier_product_doi_authority = "http://dx.doi.org" ;
		:identifier_product_doi = "TBD" ;
data:

 levels = _, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _,
    _, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _,
    _, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _,
    _, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _,
    _, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _ ;
}

That -d lev,levels call seems to 💥 the levels variable. Is there perhaps an extra option I need to pass?

@czender
Contributor Author

czender commented Jan 25, 2021

Ouch. No, you tried what I intended (because it sometimes works). I'd stick with the original "convert to netCDF3 (with -3, -5, or -6) first" workaround :)
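For anyone landing here later, the convert-rename-convert workaround can be wrapped in a small shell function. This is a sketch, assuming NCO's ncks and ncrename are on the PATH; the function name rename_coord and the temp-file naming are mine, not NCO's, and the -L 2 on the final step re-applies the DEFLATE compression that the netCDF3 intermediate drops:

```shell
# Sketch of the convert-to-netCDF3, rename, convert-back workaround.
# Usage: rename_coord <in.nc4> <out.nc4> <old_name> <new_name>
rename_coord() {
  in=$1; out=$2; old=$3; new=$4
  tmp="${in%.nc4}.tmp.nc"                            # netCDF3 intermediate
  ncks -h -3 "$in" "$tmp" &&                         # convert to classic netCDF3
  ncrename --history -d "$old,$new" -v "$old,$new" "$tmp" && # rename dim + var together
  ncks -h -4 -L 2 "$tmp" "$out"                      # back to netCDF4, re-deflate
  status=$?
  rm -f "$tmp"                                       # clean up the intermediate
  return $status
}

# Example (file name from the thread above):
# rename_coord merra2.airs_aqua.mean3d.201810.nc4 newfile.nc4 lev levels
```

Renaming the dimension and variable in a single ncrename call inside the netCDF3 file avoids the HDF5 code paths where the breakage occurs.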

@veenstrajelmer

veenstrajelmer commented Aug 9, 2024

I might have run into this issue in netcdf4-python, documented here: Unidata/netcdf4-python#1357

Therefore, I was wondering whether this issue is likely to be solved any time soon. I noticed it has been open since 2017, so I guess it might never happen. Just looking for some expectation management.
