Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

nccopy bug with user defined types #1956

Closed
WardF opened this issue Mar 2, 2021 · 7 comments · Fixed by #1959
Closed

nccopy bug with user defined types #1956

WardF opened this issue Mar 2, 2021 · 7 comments · Fixed by #1959
Assignees
Milestone

Comments

@WardF
Copy link
Member

WardF commented Mar 2, 2021

The following issue was discovered and documented by our user community. Thanks!

Attn: @DennisHeimbigner


  1. Issue
    Under certain particular circumstances nccopy fails to determine the type of attributes when they have a user defined type.

  2. Informal description
    Here is how nccopy fails in this regard:

  • when attempting to copy the schema of a variable, nccopy will also attempt to copy its attributes
  • for each attribute with a user defined type this type needs to be determined at this step
  • for example, _FillValue attributes for enumerated type variables obviously have a user defined type
  • the current implementation of nccopy (4.7.4) fails to determine the user define type (in some cases) because of the way it searches for the data type, which goes like this:
    o first, the type is searched for in the parent group of the variable
    o if it is found there, then nccopy works just fine
    o else the search continues "recursively" in the subgroups of the variable's parent group; this is where a first issue is as nccopy should look in the ancestor groups, until reaching the root group; the copying of some of the enumerated type variables fails with a "Not a valid data type or _FillValue type mismatch" error because of this
    o for other enumerated type variables we see a Segmentation fault instead
    o this second issue is because the recursive search mentioned above is probably not what the implementer really intended; instead of doing a depth-first search, it gets stuck in an infinite recursive call (from the variable's parent group back to itself); this causes a stack overflow after enough recursive calls; apparently on Linux systems there is no detection for stack overflow, so this eventually results into a segmentation fault (which weíre seeing for all enumerated type variables in a group which does not itself contain the enumerated type definition, but has subgroups in which to try to recurse, and fails to do that because of the reason described above).
  1. How to reproduce the issue
    create_and_copy_all_examples.sh will call:
  • create_enum_variable_with_type_defined_in_ancestor_group_with_no_subgroups.py
  • create_enum_variable_with_type_defined_in_ancestor_group_with_subgroups.py
  • create_enum_variable_with_type_defined_in_same_group.py.

These scripts will create the following minimal example products:

  • type_defined_in_ancestor_group_with_no_subgroups.nc (the type is defined in the root group, while the variable is defined in a separate group, but this group does not have subgroups)
  • type_defined_in_ancestor_group_with_subgroups.nc (the type is defined in the root group, while the variable is defined in a separate group and this group has subgroups)
  • type_defined_in_same_group.nc (the type and the variable are defined in the same group)
    respectively.

create_and_copy_all_examples.sh then attempts to nccopy these three products with the following results:

  • for type_defined_in_ancestor_group_with_no_subgroups.nc nccopy fails with the following error:
    NetCDF: Not a valid data type or _FillValue type mismatch
    Location: file /tmp/build/80754af9/libnetcdf_1582139698333/work/ncdump/nccopy.c; line 1326
    (also producing the partial copy product: type_defined_in_ancestor_group_with_no_subgroups_copy.nc)
  • for type_defined_in_ancestor_group_with_subgroups.nc nccopy fails with the following error:
    Segmentation fault (core dumped)
    (also producing the corrupted product: type_defined_in_ancestor_group_with_subgroups_copy.nc)
  • for type_defined_in_same_group.nc nccopy successfully copies the product (see the copy product: type_defined_in_same_group_copy.nc).
  1. How to trace and maybe solve the issue
    Source code : https://github.com/Unidata/netcdf-c/tree/v4.7.4
    Build like this: https://www.unidata.ucar.edu/software/netcdf/docs/getting_and_building_netcdf.html#netCDF-CMake

Source files involved:

Reversed stack trace:

The search for the type should recurse going up the group hierarchy, not down (or in both directions). In the current implementation, the type is looked for in the subgroups of the variable's parent group. But it should be looked for in the supergroups too (or even excusively in the supergroups).
This is the cause of the "Not a valid data type or _FillValue type mismatch" error when the user defined type is defined in a supergroup of the variable's parent group. We follow this approach when we reuse a type in multiple subgroups (avoiding repeating the same definition).
In order to go up the group hierarchy, the nc_inq_grps calls should be replaced by calls to a function determining the ancestor groups on the path to the root, here https://github.com/Unidata/netcdf-c/blob/v4.7.4/libdispatch/dcopy.c#L200 and here https://github.com/Unidata/netcdf-c/blob/v4.7.4/libdispatch/dcopy.c#L206 and the recursive call to NC_rec_find_nc_type should be removed (i.e. https://github.com/Unidata/netcdf-c/blob/v4.7.4/libdispatch/dcopy.c#L213).
In a more efficient approach, just the parent group could be determined and then the other ancestors could be determined by the recursive call to NC_rec_find_nc_type. This should deal with the issue.

Additionally (independent of the main issue regarding the at least restrictive direction of search in the group hierarchy), in the current implementation the recursive call enters an infinite loop because:

@WardF WardF added this to the 4.8.1 milestone Mar 2, 2021
@WardF WardF self-assigned this Mar 2, 2021
@DennisHeimbigner
Copy link
Collaborator

The issue is not actually with nccopy. It is an issue with the way that the netcdf library
uses a simple user-type name to find the corresponding typeid. The lookup algorithm
has at least two implementations in the library plus one other in ncgen.
Unfortunately, all three algorithms appear to be be different.

Fixing this is likely to break other code so it may take some time to figure our
a reasonable compromise.

DennisHeimbigner added a commit to DennisHeimbigner/netcdf-c that referenced this issue Mar 6, 2021
re: Github issue Unidata#1956

The function NC_compare_nc_types in libdispatch/dcopy.c uses an
incorrect algorithm to search for types. The core of this is the
function NC_rec_find_nc_type in libdispatch/dcopy.c. Currently
it searchs the current group and its subtree.

Additionally, the function NC4_inq_typeid in libsrc4/nc4internal.c
has been extended to handle fully qualified names. It was originally
designed to do this, but for some reason never completed.

The NC_rec_find_nc_type algorithm has been altered to match the
algorithm used by NC4_inq_typeid. It operates as follows.

Given a file F, group G and a type T. It searches file F2, group
G2, for another type T2 that is equivalent to T.

The search order is as follows.
1. Search G2 for a type T2 equivalent to T.
2. Search upwards in the ancestor groups of G2 for a type T2 equivalent to T.
3. Search the complete group tree of F2 in pre-order, breadth-first order to locate T2 equivalent to T.

Also add a test case to validate algorithm: ncdump/test_scope.sh.

Note, this change may cause compatibility problems, though it is
unlikely because two different equivalent type declarations in
one dataset is unlikely.
@DennisHeimbigner
Copy link
Collaborator

I have a fix, I think. See #1959

@cobarzan
Copy link

cobarzan commented Mar 8, 2021

Since they are multiple approaches to look for a user defined type (i.e. up or down the group hierarchy or even both in the same time), why not make them all available via an extra argument to nccopy (and possibly to other applications too)? As default (i.e. when this extra argument is not set explicitly), nccopy could still use the current approach (i.e. looking down the group hierarchy). This way nccopy would stay backwards compatible while adding functionality.
A bit of debate might be needed on how to add this argument and propagate it in code to the place it would actually be used (i.e. somewhere around here: https://github.com/Unidata/netcdf-c/blob/v4.7.4/libdispatch/dcopy.c#L588). But the solution is otherwise pretty straightforward.

@DennisHeimbigner
Copy link
Collaborator

Would you be in favor of doing this for dimension name search as well?

@cobarzan
Copy link

cobarzan commented Mar 8, 2021

I was under the impression it is not even possible to define dimensions down the group hierarchy from where they are used. Personally, I do not see any reason to do so. Neither for enumerations. But I understand for enumerations this sometimes happens and there are important users that could benefit from using the down the group hierarchy search. Maybe the same is true for dimensions.
As I said above, it does not seem to be possible to define dimensions down the group hierarchy from where they are used using the netcdf4-python:

import netCDF4 as nc
dataset = nc.Dataset("test.nc",'w')
group = dataset.createGroup("test_group")
group.createDimension("test_dimension",10)
variable = dataset.createVariable("test_variable", int, ("test_dimension"))
dataset.close()

Running the above code will fail like this:

Traceback (most recent call last):
  File "[...]\Anaconda3\lib\site-packages\netCDF4\utils.py", line 48, in _find_dim
    dim = group.dimensions[dimname]
AttributeError: 'NoneType' object has no attribute 'dimensions'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "[...]\Anaconda3\lib\site-packages\netCDF4\utils.py", line 52, in _find_dim
    group = group.parent
AttributeError: 'NoneType' object has no attribute 'parent'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "dimension_down_the_hierarchy.py", line 8, in <module>
    variable = dataset.createVariable("test_variable", int, ("test_dimension"))
  File "netCDF4\_netCDF4.pyx", line 2807, in netCDF4._netCDF4.Dataset.createVariable
  File "netCDF4\_netCDF4.pyx", line 3818, in netCDF4._netCDF4.Variable.__init__
  File "[...]\Anaconda3\lib\site-packages\netCDF4\utils.py", line 54, in _find_dim
    raise ValueError("cannot find dimension %s in this group or parent groups" % dimname)
ValueError: cannot find dimension test_dimension in this group or parent groups

@DennisHeimbigner
Copy link
Collaborator

I guess I have always assumed that the current ability to specify a fully qualified name
was sufficient to handle all cases where some type was not in the current type search
algorithm,

@cobarzan
Copy link

cobarzan commented Mar 8, 2021

This fails the same way:

import netCDF4 as nc
dataset = nc.Dataset("test.nc",'w')
group = dataset.createGroup("test_group")
group.createDimension("test_dimension",10)
variable = dataset.createVariable("test_variable", int, ("/test_group/test_dimension"))
dataset.close()

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants