Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

netcdf 4.9.2 and 1.14.3 : attach_dimscales: Assertion `dsid > 0' failed when closing file #2962

Closed
Alexander-Barth opened this issue Aug 7, 2024 · 5 comments

Comments

@Alexander-Barth
Copy link
Contributor

Alexander-Barth commented Aug 7, 2024

This is a follow-up of the #1772 (comment).
While the issue #1772 is rather about a current limitation of the NetCDF4/HDF5 implementation, this issue is the about an assertion error than occurs in this context when the file is closed. At first glance, it seems that only recent version of HDF5 trigger this error, while #1772 itself is a long-standing issue.

  • the version of the software with which you are encountering an issue

NetCDF 4.9.2 and HDF5 1.14.3

  • environmental information (i.e. Operating System, compiler info, java version, python version, etc.)

Alpine Linux 3.15 (the build environment of Julia's C packages) with GCC 5.2.0

Other build dependencies (including indirect ones) are:

Blosc v1.21.5
Bzip2 v1.0.8
GnuTLS v3.8.4
HDF5 v1.14.3
Hwloc v2.11.1
Lz4 v1.10.0
OpenSSL v3.0.14
P11Kit v0.24.1
XML2 v2.13.1
XZ v5.4.6
Zstd v1.5.6
libaec v1.1.2
libzip v1.10.1
GMP v6.3.0
LibCURL v7.73.0
LibGit2 v1.2.3
LibSSH2 v1.11.1
MbedTLS v2.24.0
MozillaCACerts v2024.7.2
Zlib v1.3.1

The NetCDF is configured as:

./configure --prefix=/workspace/destdir --build=x86_64-linux-musl --host=x86_64-linux-gnu --enable-shared --disable-static --disable-dap-remote-tests --disable-plugins

These features are enabled:

sandbox:${WORKSPACE}/srcdir/netcdf-c-4.9.2 # nc-config --all

This netCDF 4.9.2 has been built with the following features: 

  --cc            -> cc
  --cflags        -> -I/workspace/destdir/include -I/workspace/destdir/include
  --libs          -> -L/workspace/destdir/lib -lnetcdf
  --static        -> -lhdf5_hl -lhdf5 -lm -lz -ldl -lzip -lsz -lbz2 -lzstd -lblosc -lxml2 -lcurl 
  --has-dap          -> yes
  --has-dap2         -> yes
  --has-dap4         -> yes
  --has-nc2          -> yes
  --has-nc4          -> yes
  --has-hdf5         -> yes
  --has-hdf4         -> no
  --has-logging      -> no
  --has-pnetcdf      -> no
  --has-szlib        -> yes
  --has-cdf5         -> yes
  --has-parallel4    -> yes
  --has-parallel     -> yes
  --has-nczarr       -> yes
  --has-zstd         -> yes
  --has-benchmarks   -> no
  --has-multifilters -> yes
  --has-stdfilters   -> deflate bz2 blosc zstd
  --has-quantize     -> yes

  --prefix           -> /workspace/destdir
  --includedir       -> /workspace/destdir/include
  --libdir           -> /workspace/destdir/lib
  --plugindir        -> 
  --version          -> netCDF 4.9.2
  • a description of the issue with the steps needed to reproduce it

The following C program leads to an abort in nc4hdf.c:

#include <stdio.h>
#include <string.h>
#include <netcdf.h>
int main() {
  int ncid, var_time_id, dim2_id, dim_id, var_myvar_id, status;
  int data[3] = {1,2,3};
  status = nc_create("test.nc", NC_CLOBBER|NC_NETCDF4, &ncid);
  if(status != NC_NOERR) {
    printf("Could not create the file.\n");
  }
  status = nc_def_dim(ncid, "time2", 3, &dim2_id);
  if(status != NC_NOERR) {
    printf("Could not create time2.\n");
  }  
  status = nc_def_var(ncid, "time", NC_INT, 1, &dim2_id, &var_time_id);
  if(status != NC_NOERR) {
    printf("Could not create variable time.\n");
  }
  status = nc_put_var_int(ncid, var_time_id, data);
  if(status != NC_NOERR) {
    printf("Could not put var.\n");
  }
  status = nc_def_dim(ncid, "time", 3, &dim_id);
  if(status != NC_NOERR) {
    printf("Could not create dim time.\n");
  }     
  status = nc_def_var(ncid, "myvar", NC_INT, 1, &dim_id, &var_myvar_id);
  if(status != NC_NOERR) {
    printf("Could not create myvar.\n");
  }
  status = nc_put_var_int(ncid, var_myvar_id, data); /* hitting issue #1772 */
  if(status != NC_NOERR) {
    printf("Could not put myvar.\n"); /* HDF5 error */
  }    
  printf("before calling close\n");
  status = nc_close(ncid);
  printf("error code after close = %d\n", status);
  return 0;
}

The output of this program is:

sandbox:${WORKSPACE}/srcdir/netcdf-c-4.9.2 # gcc -o nc_error nc_error.c $(nc-config --cflags --libs)
sandbox:${WORKSPACE}/srcdir/netcdf-c-4.9.2 # ./nc_error 
Could not put myvar.
before calling close
nc_error: nc4hdf.c:1469: attach_dimscales: Assertion `dsid > 0' failed.
Aborted (core dumped)

Note that this error is not reproducible with NetCDF 4.9.2 compiled with HDF5 1.10.7 (from ubuntu 22.04). So far this issue could only be reproduced with HDF5 1.14.2 and 1.14.3.

Since the error depends on the HDF5 version and since we have an HDF5 error at the line nc_put_var_int(ncid, var_myvar_id, data); I think it is possible that the root cause is not in NetCDF itself.

This issue was reported by @ryofurue in Alexander-Barth/NCDatasets.jl#261 for the julia package NCDatasets.jl. While for a typical C or Fortran program, an HDF5 error would end the program, in an interactive julia,python,R,... session however it is quite common that the user can make an error which should not result in aborting the session (actually, here it is not really a user error, but rather an limitation of NetCDF4 not occurring on NetCDF3 for example as discussed in #1772).

While solving #1772 would be quite desirable, the purpose of this issue, as I see it, is just about the assertion failure.

Thank you for your time and your insights!

@edwardhartnett
Copy link
Contributor

I have few regrets in life, but many of them have to do with HDF5 dimscales.

I will explain in detail, but first suggest you try adding flag NC_NODIMSCALE_ATTACH to your file create mode, to see if that resolves the problem. Let me know if it does.

Unfortunately, HDF5 does not have a dimension, and that leads to problems. What netCDF does is create a HDF5 dataset for each dimension (as well as each variable). When you have a variable and dimension which have the same name, netcdf-4 assumes that is a coordinate dimension (the values of the variable are the values along that dimension, and the variable has one dimension, which is the dimension of the same name).

So, for example, you would have a dimension "longitude" with length 100. Then you would have a variable called "longitude" which has 100 values, the values of the longitude for each value along that dimension. This is a "coordinate variable" in netCDF.

The problem with the above code is that you define a dimension time2, and then a variable, time.

Then, you write some data to time, causing all metadata to be written to the file.

Then, you go back into define mode and define a dimension named time, the name already in use by the variable. So this is bad netCDF practice, because you have a variable and a dimension of the same name, this should be a coordinate dimension, which means the dimension of variable "time" should be dimension "time", not "time2". In other words, once you define a variable, you should not add a dimension of the same name. If you want a coordinate variable, then define the dimension first, then use it as the dimension of the variable.

Is there a good reason for this? Or is it just a corner case that you discovered?

This is part of a larger category of renaming problems, starting with #597, but searching for "rename" on the netcdf-c issue list reveals many more. Fixing these would require a rewrite of a significant section of netcdf-4 code, which handles renames.

You can also build netCDF with CFLAGS=-NDEBUG to turn off all assertions and see what happens.

@Alexander-Barth
Copy link
Contributor Author

Alexander-Barth commented Aug 7, 2024

Thanks a lot for your detailed explanation!

Yes, the assertion failure is indeed not triggered when I use nc_create("test.nc", NC_CLOBBER|NC_NETCDF4 |NC_NODIMSCALE_ATTACH , &ncid);:

sandbox:${WORKSPACE}/srcdir/netcdf-c-4.9.2 # gcc -o nc_error_nodimscale nc_error_nodimscale.c $(nc-config --cflags --libs)
sandbox:${WORKSPACE}/srcdir/netcdf-c-4.9.2 # ./nc_error_nodimscale 
Could not put myvar.
before calling close
error code after close = -101

We have just the NC_EHDFERR (-101, HDF error) as expected.

I fully agree that creating such NetCDF confusing and indeed bad practice. It came about by copying the NetCDF variables along with dimension and names from a pre-existing file. The user was not aware that there were different dimension with inconsistent names. In NCDatasets.jl, one can copy a NetCDF variables from one file to another and all dimensions are created implicitly if necessary. There should have been only one time dimension. But user error happend. I am ok with returning an error message in this case, but an abort is a bit drastic.

You are right, that CFLAGS="-DNDEBUG" disables the assertion:

sandbox:${WORKSPACE}/srcdir/netcdf-c-4.9.2 # gcc -o nc_error nc_error.c $(nc-config --cflags --libs)
sandbox:${WORKSPACE}/srcdir/netcdf-c-4.9.2 # ./nc_error
Could not put myvar.
before calling close
error code after close = -124  #  NC_EDIMSCALE, Problem with HDF5 dimscales

@ryofurue
Copy link

ryofurue commented Aug 7, 2024

Edit: Sorry I missed your comment that it would take a lot of efforts to fix the problem. I understand.

First of all, @Alexander-Barth 's post is about the assertion error. So, the following discussion is not central to the issue.

When you have a variable and dimension which have the same name, netcdf-4 assumes that is a coordinate dimension

I thought that we need two conditions to be met:

  1. There is a dimension named "aaa" and there is a 1D variable named "aaa".
  2. The variable "aaa" has dimension "aaa".

In the submitted example, condition 1 is met but condition 2 isn't. Indeed, if you swap def_dim and def_var, a correct netCDF file is created without error.

@Alexander-Barth created the example specifically to produce the assertion error. I'll present a simpler code to illustrate the HDF problem:

// try_netcdf_error.cc
#include <iostream>
#include "netcdf.h"

int main() {
  int i, ncid, dimid_time, varid_tax, dimid_tax;
  const int tax[] = {1,2,3,4,5,6,7,8,9,10};

  i = nc_create("test.nc", NC_NETCDF4, &ncid);
  i = nc_def_dim(ncid, "TIME", 10, &dimid_time);

  i = nc_def_dim(ncid, "tax", 7, &dimid_tax); //(1) -> fine.
  if (i != NC_NOERR) {std::cerr << nc_strerror(i) << std::endl;}

  const int dims[] = {dimid_time};
  i = nc_def_var(ncid, "tax", NC_INT, 1, dims, &varid_tax);

  //i = nc_def_dim(ncid, "tax", 7, &dimid_tax); //(2) -> HDF error on put_var
  //if (i != NC_NOERR) {std::cerr << nc_strerror(i) << std::endl;}

  i = nc_enddef(ncid);

  i = nc_put_var(ncid, varid_tax, tax);
  if (i != NC_NOERR) {std::cerr << nc_strerror(i) << std::endl;}

  i = nc_close(ncid);
}

This program produces a correct netCDF file:

netcdf test {
dimensions:
	TIME = 10 ;
	tax = 7 ;
variables:
	int tax(TIME) ;
data:

 tax = 1, 2, 3, 4, 5, 6, 7, 8, 9, 10 ;
}

Here there is a variable named "tax" and there is a dimension named "tax". That is, condition 1 is met. But, the variable "tax" has a dimension "TIME" and the netCDF file is happily created.

In my code, swap def_dim and def_var on lines (1) and (2). Then, HDF error occurs when writing into the variable "tax".

So, I suppose this problem is fixable. Am I correct?

Even if it is hard to fix it, it would be nice if a meaningful error is issued at nc_enddef.

@Alexander-Barth
Copy link
Contributor Author

Alexander-Barth commented Aug 7, 2024

What about changing the assert by just returning an error:

--- ./libhdf5/nc4hdf.c.orig
+++ ./libhdf5/nc4hdf.c
@@ -1466,7 +1466,9 @@
                         dsid = ((NC_HDF5_VAR_INFO_T *)(var->dim[d]->coord_var->format_var_info))->hdf_datasetid;
                     else
                         dsid = ((NC_HDF5_DIM_INFO_T *)var->dim[d]->format_dim_info)->hdf_dimscaleid;
-                    assert(dsid > 0);
+
+                    if (dsid <= 0)
+                        return NC_EDIMSCALE;
 
                     /* Attach the scale. */
                     if (H5DSattach_scale(hdf5_var->hdf_datasetid, dsid, d) < 0)

The example code above runs ok with this change.

@edwardhartnett
Copy link
Contributor

If you wish to submit a PR for that, I would have no objection.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants