
RuntimeWarning: invalid value encountered in equal #391

Closed
timburgess opened this issue Apr 9, 2015 · 22 comments

@timburgess

I've built a new NetCDF4 module from the current master branch library on OS X 10.9.5

Opening an HDF4 file, I see the above RuntimeWarning. This is the first time I've seen it, despite having used NetCDF4 pretty often on files from this dataset. The code that produces the warning is pretty simple:

 from netCDF4 import Dataset

 file = 'sst_geo-polar-blended_night_5km_2015001.hdf'
 ds = Dataset(file)
 sst = ds.variables['sst_analysis'][:]

My first reaction was possibly a corrupted file but tools like hdfls and hdp seem to have no problem investigating the file.

Interestingly, ds.variables['sst_analysis'].set_auto_maskandscale(False) stops the RuntimeWarning but I still get screwy data. As the data is sea surface temps in Celsius, I should get between -2C and 35C but instead I get positive/negative values close to zero.

On building from master, all tests passed and I built with:
HDF5 1.8.14
Netcdf 4.3.3.1
Numpy 1.9.1
Cython 0.22
Clang 6.0 (clang-600.0.57)

I'd appreciate any thoughts/input on this.
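(As it turns out later in the thread, byte order is the culprit. The "values close to zero" symptom can be reproduced with plain numpy; the value below is made up, not taken from the actual file:)

```python
import numpy as np

# A plausible SST value (deg C), stored as big-endian float32 on disk.
stored = np.array([20.0], dtype='>f4')

# Reinterpret the same four bytes as little-endian, as a reader that
# ignores the file's endianness flag would do on an x86 machine:
misread = stored.view('<f4')

# 20.0 is 0x41A00000 big-endian; read backwards it becomes a denormal
# number vanishingly close to zero.
print(misread[0])
```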

@jswhit
Collaborator

jswhit commented Apr 9, 2015

Can you post the file somewhere we can grab it?

@jswhit
Collaborator

jswhit commented Apr 9, 2015

Also, what is the warning message?

@jswhit
Collaborator

jswhit commented Apr 9, 2015

Never mind - I see the warning message is in the subject line.
This warning is coming from numpy. There must be NaNs or Infs in your data; perhaps NaNs are being used to denote missing data? It could also be that the endianness of the data is encoded incorrectly in the file (see issue #346). In any case, you can tell numpy not to print these warnings with numpy.seterr(invalid='ignore'). However, if the data you are reading is incorrect, I would try byte-swapping the numpy array after you read it from the netCDF variable, using the byteswap() method. If that fixes it, then the endian flag in the file is wrong.
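Both suggestions can be sketched in a few lines (np.errstate is the scoped equivalent of numpy.seterr, if you'd rather not silence the warning process-wide):

```python
import numpy as np

data = np.array([np.nan, 1.0, 2.0], dtype=np.float32)

# Silence the 'invalid value' warning only inside this block, rather
# than globally with numpy.seterr(invalid='ignore'):
with np.errstate(invalid='ignore'):
    mask = (data == -999.0)   # NaN compares False (older numpy warned here)

# If the values themselves look wrong, try swapping the byte order:
raw = np.array([256, 512], dtype=np.int16)
fixed = raw.byteswap()        # 0x0100 -> 0x0001, i.e. 256 -> 1
```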

@timburgess
Author

Yes, there are NaNs - they are essentially used to depict missing data (land pixels).

I have put the file on dropbox at https://www.dropbox.com/s/i51d1xzp6ng2bb3/sst_geo-polar-blended_night_5km_2015001.hdf?dl=0 Note that it is large - about 230MB.

I'll try the byte-swapping.

@jswhit
Collaborator

jswhit commented Apr 9, 2015

The data is marked in the file as big endian, but it is apparently little endian. Byteswapping seems to fix it.

@timburgess
Author

Thanks! I've just about come to the same conclusion looking at the byteswapped data and dumping some charts to look over regions.

Is this check a recent change in the HDF4 C lib or netCDF4? I'm curious as to why my code has read these files previously with no problem. But I am rapt that the issue has been identified!

@jswhit
Collaborator

jswhit commented Apr 9, 2015

from ncdump -s sst_geo-polar-blended_night_5km_2015001.hdf (note the endian-ness attribute)

netcdf sst_geo-polar-blended_night_5km_2015001 {
dimensions:
    fakeDim0 = 3600 ;
    fakeDim1 = 7200 ;
    fakeDim2 = 3600 ;
    fakeDim3 = 7200 ;
    fakeDim4 = 3600 ;
    fakeDim5 = 7200 ;
 variables:
    float sst_analysis(fakeDim0, fakeDim1) ;
            sst_analysis:missing_value = -999.f ;
            sst_analysis:fraction_digits = 4 ;
            sst_analysis:scale_factor = 1. ;
            sst_analysis:scale_factor_err = 0. ;
            sst_analysis:add_offset_err = 0. ;
            sst_analysis:add_offset = 0. ;
            sst_analysis:valid_range = -2.f, 35.f ;
            sst_analysis:long_name = "blended daily night-only geo polar sst" ;
            sst_analysis:units = "celsius" ;
            sst_analysis:format = "decimal" ;
            sst_analysis:coordsys = "global lat lon" ;
            sst_analysis:_Storage = "contiguous" ;
            sst_analysis:_Endianness = "big" ;

@timburgess
Author

👏 Thanks again. I'll look into the dataset and contact the authors.

@jswhit
Collaborator

jswhit commented Apr 9, 2015

Up until recently, netcdf4-python was ignoring the endian-ness attribute and just treating data as if it were the native endian-ness of whatever machine you read the data on. Issue #346 fixed this.

@timburgess
Author

I've been doing a bit more investigation on this.

I believe the providers of the file referred to above use IDL and a colleague provided some IDL which I have extended a bit into:

pro createHDF
  f = HDF_SD_START('hdf_test.hdf', /create)
  ds = HDF_SD_CREATE(f, 'data', [5,5], /dfnt_int16)
  arr = INDGEN(5,5)
  HDF_SD_ADDDATA, ds, arr
  HDF_SD_ENDACCESS, ds
  HDF_SD_END, f 
end

Interestingly, these routines seem to map fairly directly onto the HDF C library, and looking at the IDL documentation for HDF_SD_CREATE, there is no option to write little-endian data. I even tried /dfnt_lint16, but it failed. So IDL will always produce big-endian HDF4.

ncdump -s hdf_test.hdf shows:

netcdf hdf_test {
dimensions:
    fakeDim0 = 5 ;
    fakeDim1 = 5 ;
variables:
    short data(fakeDim0, fakeDim1) ;
        data:_Storage = "contiguous" ;
        data:_Endianness = "big" ;
// global attributes:
        :_Format = "netCDF-4" ;
data:
 data =
  0, 1, 2, 3, 4,
  5, 6, 7, 8, 9,
  10, 11, 12, 13, 14,
  15, 16, 17, 18, 19,
  20, 21, 22, 23, 24 ;
}

To read the data in C, I wrote:

#include <stdio.h>
#include "mfhdf.h"
#define DIM1 5
#define DIM0 5
#define FILENAME "hdf_test.hdf"
int main() {
  
    int32 sd_id, sds_id, istat, sd_index;
    int32 start[2], edges[2];
    int16 array_data[DIM0][DIM1];
    intn i, j;

    start[0] = 0;
    start[1] = 0;
    edges[0] = DIM0;
    edges[1] = DIM1;
  
    sd_id = SDstart(FILENAME, DFACC_READ);
    if (sd_id != FAIL)
        printf ("Opened %s\n", FILENAME);
    sd_index = 0;
    sds_id = SDselect(sd_id, sd_index); 
    istat = SDreaddata(sds_id, start, NULL, edges, (VOIDP)array_data);
    printf ("\nData: \n");
    for (j = 0; j < DIM0; j++)
    {
        /* note: i and j are swapped in the index below, so this
           prints the transpose of the stored array */
        for (i = 0; i < DIM1; i++)
            printf ("  %d ", array_data[i][j]);
        printf ("\n");
    }
    printf ("\n");
    /* Terminate access to the array. */
    istat = SDendaccess(sds_id);
    /* Terminate access to the SD interface and close the file. */
    istat = SDend(sd_id);
    if (istat != FAIL)
        printf("closed\n");
  
    return 0;
}

which outputs

Opened hdf_test.hdf
Data: 
  0   5   10   15   20 
  1   6   11   16   21 
  2   7   12   17   22 
  3   8   13   18   23 
  4   9   14   19   24 
closed

I then tried this in python with the following:

import Cython
import numpy as np
import netCDF4
if __name__ == '__main__':
    print 'Cython version ' + Cython.__version__ + ' installed'
    print 'netCDF4 version ' + netCDF4.__version__ + ' installed'
    
    file = 'hdf_test.hdf'
    ds = netCDF4.Dataset(file)
    data = ds.variables['data'][:]
    print 'RAW'
    print data
    print 'Byte-swapped'
    print data.byteswap()
    
    ds.close()

and got:

Cython version 0.22 installed
netCDF4 version 1.1.8 installed
RAW
[[   0  256  512  768 1024]
 [1280 1536 1792 2048 2304]
 [2560 2816 3072 3328 3584]
 [3840 4096 4352 4608 4864]
 [5120 5376 5632 5888 6144]]
Byte-swapped
[[ 0  1  2  3  4]
 [ 5  6  7  8  9]
 [10 11 12 13 14]
 [15 16 17 18 19]
 [20 21 22 23 24]]

So unfortunately this means that the netcdf4-python output is at odds with what the C code and ncdump report.

If anyone wants to further clarify or point out any errors in the above code, I'd be keen to hear.
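The RAW matrix above is just 0-24 with the two bytes of each int16 transposed, which plain numpy confirms without any HDF file involved:

```python
import numpy as np

expected = np.arange(25, dtype=np.int16)
swapped = expected.byteswap()    # e.g. 1 (0x0001) becomes 256 (0x0100)

print(swapped[:5])               # [   0  256  512  768 1024], as in RAW
print((swapped.byteswap() == expected).all())   # swapping twice round-trips
```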

@jswhit jswhit reopened this Apr 16, 2015
@jswhit
Collaborator

jswhit commented Apr 16, 2015

Tim - that's very interesting. Looks like there may be some hidden byteswapping going on somewhere in the python interface.

@jswhit
Collaborator

jswhit commented Apr 16, 2015

I presume you ran that C program on a little endian machine? If so I guess that means that the HDF4 library is doing the byte swapping for you, since the data is big endian in the file. The python interface is assuming the byte-swapping has not been done, and is merrily swapping the byte order again.

I wonder if this is a difference between HDF4 and HDF5 - in HDF5 the data is not automatically byte-swapped if the native type is different from the type in the file, while HDF4 apparently does swap the bytes automatically. I suppose I could add logic in the python code to not byteswap if the underlying data format is HDF4.
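That logic could look something like the sketch below. (maybe_byteswap is a hypothetical helper, not the actual library code; in netCDF4-python the on-disk format is exposed as Dataset.disk_format, which I believe reports 'HDF4' for these files.)

```python
import numpy as np

def maybe_byteswap(arr, disk_format):
    """Hypothetical sketch: undo the on-disk byte order only when the
    underlying C library has NOT already converted it to native order.
    HDF4 converts on read; HDF5/netCDF-4 hands back the raw bytes."""
    if disk_format == 'HDF4':
        return arr                 # already native; swapping would corrupt it
    return arr.byteswap()

raw = np.array([0, 256, 512], dtype=np.int16)   # big-endian bytes, misread
print(maybe_byteswap(raw, 'HDF5').tolist())     # [0, 1, 2]
print(maybe_byteswap(raw, 'HDF4').tolist())     # [0, 256, 512]
```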

@timburgess
Author

Yes - I ran all the code on a Mac laptop with OS X 10.9, so little-endian.

I have put together some further C code that produces both a big-endian and a little-endian HDF4 file. The HDF utils report the same 5x5 array for both, so byte-swapping occurs as the data is written to the big-endian file. This is consistent with what the HDF4 User's Guide states.

Unfortunately, ncdump and the Python library choke on the little-endian HDF4 file. I'll gather all the details and post them in a separate issue.

@jswhit
Collaborator

jswhit commented Apr 16, 2015

I think this (and the new issue you just created) is related to a bug in the netCDF C library. I've created a ticket here:

Unidata/netcdf-c#112

@jswhit
Collaborator

jswhit commented May 31, 2015

A couple of bugs were recently fixed in the netcdf-c lib that should take care of this. Closing now...

@jswhit jswhit closed this as completed May 31, 2015
@timburgess
Author

I think there is still a byte-swap occurring that doesn't need to. Using the latest netcdf-c master and a big-endian HDF4 file, I get

$ ncdump hdf_bigendian.hdf 
netcdf hdf_bigendian {
dimensions:
    fakeDim0 = 5 ;
    fakeDim1 = 5 ;
variables:
    short data(fakeDim0, fakeDim1) ;
data:
 data =
  0, 1, 2, 3, 4,
  5, 6, 7, 8, 9,
  10, 11, 12, 13, 14,
  15, 16, 17, 18, 19,
  20, 21, 22, 23, 24 ;
}

However with the current netcdf-python master, I get:

$ python
Python 2.7.9 (default, Jan  7 2015, 11:50:42) 
[GCC 4.2.1 Compatible Apple LLVM 6.0 (clang-600.0.56)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> from netCDF4 import Dataset
>>> ds = Dataset('hdf_bigendian.hdf')
>>> foo = ds.variables['data'][:]
>>> foo
array([[   0,  256,  512,  768, 1024],
       [1280, 1536, 1792, 2048, 2304],
       [2560, 2816, 3072, 3328, 3584],
       [3840, 4096, 4352, 4608, 4864],
       [5120, 5376, 5632, 5888, 6144]], dtype=int16)

The example hdf4 file I am using is here

@jswhit jswhit reopened this Jun 1, 2015
@jswhit
Collaborator

jswhit commented Jun 1, 2015

The fact that ncdump is printing the byte-swapped data indicates that the byte-swapping is being done on the fly in the hdf4 library. If you create a big-endian variable in a netcdf file, ncdump does not print byte-swapped data, which confirms that the netcdf lib does not automatically do the swapping. I could check to see if the underlying data format is not hdf4 before attempting to byte-swap - but lacking any more information on what we should expect from the C libs I'm not sure that would be the right solution.

jswhit added a commit that referenced this issue Jun 1, 2015
@jswhit
Collaborator

jswhit commented Jun 1, 2015

pull request #426 is a hack to work around this. Basically, it assumes the HDF4 lib always returns the data in the native endian format. Is this always true?

@jswhit
Collaborator

jswhit commented Jun 1, 2015

http://www.hdfgroup.org/training/HDFtraining/UsersGuide/Fundmtls.fm3.html

seems to indicate the byte-swapping to native endian is always done in HDF4.

@timburgess
Author

Yes, that's my understanding: if the data endianness differs from the native OS endianness, the HDF4 lib will byte-swap for the user. HDF5, I believe, never byte-swaps - it returns the raw data.
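For calling code that can't know which library behavior it will get, numpy can normalize either way; a sketch (ensure_native is a hypothetical helper, not library code):

```python
import numpy as np

def ensure_native(arr):
    """Hypothetical helper: return arr in native byte order, converting
    only when numpy reports the dtype as non-native."""
    if arr.dtype.isnative:
        return arr
    return arr.astype(arr.dtype.newbyteorder('='))

big = np.arange(5, dtype='>i2')        # explicitly big-endian array
print(ensure_native(big).tolist())     # [0, 1, 2, 3, 4] on any machine
```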

@jswhit
Collaborator

jswhit commented Jun 1, 2015

OK, will wait a day or two and merge pull request #426 then.

@timburgess
Author

Thanks @jswhit. #426 works for both the big and little endian examples I've tried with the recent netcdf-c changes.

jswhit added a commit that referenced this issue Jun 2, 2015
fix for issue #391 (big endian HDF4 data read incorrectly on little endian machine)
@jswhit jswhit closed this as completed Jun 2, 2015