Can argument start be NULL in get/put APIs? #231
Comments
Why do you want to do this? If you want to read the entire variable, use the nc_get/put_var() functions. Ed On Sat, Mar 5, 2016 at 9:42 AM, Wei-keng Liao notifications@github.com
|
I am a developer of PnetCDF. One of my purposes is to check if PnetCDF can follow netCDF's convention on error codes returned. The problem I found is an inconsistency within netCDF APIs for handling a NULL argument and unclear specification in the document. |
Well I agree it's a bug if passing NULL causes a segfault. That should But in terms of the error code that you get back, that's a tougher Ed On Sun, Mar 6, 2016 at 11:28 AM, Wei-keng Liao notifications@github.com
|
The NULL check for the start argument appears in the vars APIs only. See line 198 of dvarput.c. Either NC_EINVAL or NC_EINVALCOORDS is fine, as long as the causes are well defined in the documentation. However, I would prefer an error code that gives more information about the problem. NC_EINVALCOORDS is more specific to the start argument. I agree that invalid pointer arguments are hard to check, and a segfault is sometimes unavoidable. However, NULL, when used, has special meanings in many netCDF APIs. In this case, it could be defined as reading/writing an entire variable, ignoring the count, stride, and imap arguments. Maybe a similar approach can be applied to count, stride, and imap. I just hope this can be clarified in the netCDF documentation. |
I think your analysis is correct. We should be checking for null arguments as appropriate, but it probably |
I agree with @DennisHeimbigner; NULL checking and returning the proper error code is the way to go, and doing all of that in libdispatch feels like the natural place. I also agree that interpreting NULL isn't the way we want to go; it might disguise other problems when the NULL is inadvertent. |
Howdy Wei-keng Liao at the parallel-netcdf team! Did you change pnetcdf to exactly imitate netcdf for the 1.7.0 release? Because I just upgraded to that release and now the PIO tests are failing with exactly this sort of problem - a segmentation fault that seems to be caused by nulls as parameters. Thanks, |
Hi, Ed. No, PnetCDF 1.7.0 checks and returns NC_ENULLSTART if the start argument is NULL. A special case is when the variables are scalars and start is ignored. Wei-keng On Mar 21, 2016, at 10:53 AM, Imperial Fleet Commander Tarth wrote:
|
As you can see from the core dump output it is happening in ncmpii_sanity_check which is a pnetcdf function. Also this happened when I updated to pnetcdf 1.7.0. Backtrace for this error: |
However Kate and Jim have apparently been using pnetcdf 1.7.0 for a few weeks with no problems, so likely this is a problem on my machine. Let me work on it for a while and see if I can figure out what is going on. Also if this is still a problem I will move this conversation to the pnetcdf mailing list, where Ward and Dennis will not have to read about it. ;-) |
I created a test program for this issue. It tests whether the expected error codes (NC_EINVALCOORDS or NC_EEDGE) are returned when NULL pointers are passed for the start, count, stride, or imap arguments. When run against the latest master branch, it may cause a core dump because the library tries to dereference NULL pointers. https://trac.mcs.anl.gov/projects/parallel-netcdf/browser/trunk/test/testcases/nc_null_args.c Also, I added a function below in the test program |
OK, I have hit this issue when working on lazy vars for netcdf-4. This is now fixed for all netCDF-4 code. However, instead of returning an error, which I think is not actually correct, the netCDF-4 functions assume starts of 0, counts of full dimension extent, and stride of 1. That is:
behave exactly like:
(Similarly for gets). Unfortunately, the classic code does not do this, it segfaults. ;-( As Dennis suggests, I am looking at putting the solution I am currently using for netCDF-4 files into the dispatch layer code, so that it works for all dispatch layers. This makes sense, since how NULL starts, counts, and strides are treated is part of the netCDF API that all dispatch layers should respect. |
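The defaulting rule described in the comment above (start of 0, count of the full dimension extent, stride of 1) can be sketched roughly as follows. This is only an illustration, not the code from the netCDF-4 fix: the function name `demo_fill_null_args`, its signature, and the caller-provided output arrays are all hypothetical, and the dimension lengths are passed in directly instead of being looked up from a file.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper (not part of the netCDF API): for each dimension,
 * use the caller's value if the array was supplied, otherwise fall back to
 * the default named in the thread: start 0, count = full dimension length,
 * stride 1. Results are written to caller-provided arrays of length ndims. */
void
demo_fill_null_args(size_t ndims, const size_t *dimlens,
                    const size_t *start, const size_t *count,
                    const ptrdiff_t *stride,
                    size_t *start_out, size_t *count_out,
                    ptrdiff_t *stride_out)
{
    /* Dimension lengths are required whenever there is at least one dim. */
    assert(ndims == 0 || dimlens != NULL);

    for (size_t i = 0; i < ndims; i++) {
        start_out[i]  = (start  != NULL) ? start[i]  : 0;
        count_out[i]  = (count  != NULL) ? count[i]  : dimlens[i];
        stride_out[i] = (stride != NULL) ? stride[i] : 1;
    }
}
```

With a 4 x 6 variable and all three arguments NULL, the outputs would describe the whole variable, which is the sense in which such a call would behave like nc_put_var().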
I think either defaulting start to all zeros or reporting an error are equally good
|
Well, the vara functions handle a missing (i.e. NULL) count by using the maximum extent of each dimension. This is done in the dispatch layer. We don't know whether anyone relies on that, but we probably don't want to change it. So for a NULL start we can assume all 0s or return an error, but we still have the problem of a NULL count in the vars functions for classic files: it causes a segfault. I think the cleanest solution is to assume NULL means the same as in nc_get_var(), that is, 0 for start, full extent for count, and 1 for stride. This should be done in the dispatch layer. It's already partially being done (i.e. for count in the vara functions), but it needs to be done for start and stride as well, which will be very easy. |
In PnetCDF, NC_EINVALCOORDS is returned when start is NULL for var1, vara, vars, and varm APIs, and NC_EEDGE is returned when count is NULL for vara, vars, and varm APIs. |
I am perfectly happy to do things the pnetcdf way, but that would break any user code that relies on nc_get/put_vara calls to automatically fill in the count array. Not sure how many users depend on that, probably not many. But if we now return NC_EEDGE for that, it may break user code. So it seems we have three paths to choose from:
1 - Return NC_EINVALCOORDS for NULL start everywhere, return NC_EEDGE for NULL count in vars functions, fill in counts for vara functions. (What error do we return for NULL stride?) (Inconsistent, but nothing breaks.)
2 - Return NC_EINVALCOORDS for NULL start, return NC_EEDGE for NULL count everywhere. (Consistent, but user code that depends on NULL counts for vara calls will break.)
3 - Treat NULL start as all 0s, NULL count as dim full extent, and NULL stride as all 1s. This is consistent and nothing breaks, so I think it is what we should do.
@DennisHeimbigner or @WardF if you want to make a definitive choice, I will code it up and put up a PR. If you don't feel strongly about it, I will go with option 3. |
PnetCDF always checks start and count against NULL since its official release 1.2.0. From the earlier discussion in this issue, both @DennisHeimbigner and @WardF agreed with the approach to check NULL and return error codes. Here is the quote from @WardF.
|
So @wkliao do you suggest option 1 or 2? The vara functions already interpret NULL count as full extent of dimensions. What should we do there? Continue as we have been? Or return an error code? |
If we enforce these rules in libdispatch/dvar{get/put}.c, then none of the underlying |
OK, #2 sounds good to me. I will do this on my lazy var branch and put up a PR in a bit. |
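For comparison, option 2 from the list above (NC_EINVALCOORDS for a NULL start, NC_EEDGE for a NULL count) could be sketched along these lines. This is a hypothetical stand-alone check, not code from netCDF or PnetCDF: the name `demo_check_null_args` and the `needs_count` flag are made up for illustration, the error-code values are copied in so the sketch compiles without netcdf.h, and skipping the checks for scalar variables is an assumption carried over from the PnetCDF comment earlier in the thread.

```c
#include <stddef.h>

/* Error codes copied from netcdf.h so the sketch compiles without the
 * library; in real code, include netcdf.h instead. */
#define NC_NOERR          0
#define NC_EINVALCOORDS (-40)
#define NC_EEDGE        (-57)

/* Hypothetical helper (not part of any real API): reject NULL start/count
 * before anything dereferences them.
 *   ndims       - number of dimensions of the variable (0 for a scalar)
 *   needs_count - nonzero for vara/vars/varm-style calls, zero for
 *                 var1-style calls, which take no count argument */
int
demo_check_null_args(size_t ndims, const size_t *start,
                     const size_t *count, int needs_count)
{
    /* Assumption: for scalar variables start and count are ignored. */
    if (ndims == 0)
        return NC_NOERR;

    if (start == NULL)
        return NC_EINVALCOORDS;

    if (needs_count && count == NULL)
        return NC_EEDGE;

    return NC_NOERR;
}
```

Doing a check like this once in the dispatch layer, before calling into a specific format's implementation, is what would keep the underlying layers from having to repeat it.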
If anyone is interested, here's how the answer looks right now. I have this newly added function that does the work:
This is used in the vara and vars functions like this:
Note that nothing is done unless a NULL argument is passed, to minimally impact performance in the normal case. |
OK, this is ready to go on branch ejh_vars_null_count_issue_2. Turns out, for every put and get we were doing an extra file lookup. I have removed them. |
Minor issue is that there can be memory leaks if errors in Agree that this is minor issue with very low probability of happening and the fix involves uglifying the code. |
@gsjaardema thanks, I have fixed that. |
I tested the var1, vara, vars, and varm APIs using NULL for the start argument.
All resulted in a core dump except vars. (Version 4.4.0 was used.)
The core file shows the following for the varm case.
Program terminated with signal SIGSEGV, Segmentation fault.
#0 0x00000000004043a8 in NCDEFAULT_put_varm (ncid=65536, varid=0, start=0x0, edges=0x7ffee1af7730, stride=0x7ffee1af7710, imapp=0x7ffee1af76f0,
479 mystart[idim] = start[idim];
(gdb) where
#0 0x00000000004043a8 in NCDEFAULT_put_varm (ncid=65536, varid=0, start=0x0, edges=0x7ffee1af7730, stride=0x7ffee1af7710, imapp=0x7ffee1af76f0,
#1 0x00000000004044d2 in NC_put_varm (ncid=65536, varid=0, start=0x0, edges=0x7ffee1af7730, stride=0x7ffee1af7710, map=0x7ffee1af76f0,
#2 0x0000000000405d9c in nc_put_varm_float (ncid=65536, varid=0, startp=0x0, countp=0x7ffee1af7730, stridep=0x7ffee1af7710, imapp=0x7ffee1af76f0,
Other coredump files point to
#0 0x00000000004197bb in NCcoordck (ncp=0x25830f0, varp=0x25872f0, coord=0x0) at putget.c:736
736 if(*coord > X_UINT_MAX) /* rkr: bug fix from previous X_INT_MAX */
The netCDF C reference does not explicitly say that a NULL start argument is illegal, but there is an error code, NC_EINVALCOORDS, that could be used to report it. So, the question is whether NULL is allowed or not. If allowed, then netCDF needs to allocate and use an array of all 1s internally.