-
Notifications
You must be signed in to change notification settings - Fork 28
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
scale_factor, add_offset and _FillValue #39
Comments
Yes, this should have been done a while ago. The scale_factor part should be quite simple. The other part would need a good API, maybe a keyword argument would be sufficient like: I hope to find some time during the Christmas break to do this. |
Yes, this would be very useful. nc = NetCDF.open("some_file.nc")
regular_array = nc.vars["variable_name"][:,:]; # as implemented now
instance_of_data_array = nc.data["variable_name"][:,:]; # data array with scaling If this API looks fine, I can try to implement it and make a pull request (if I succeed). |
Just wanted to note that rather than using |
Has there been any progress on this @Alexander-Barth ? Is it something that I could easily submit a PR for, where would you want it to go? @meggart |
I started doing this and then it turned out to be more difficult than I thought to do properly. The main problem is that currently When we have a scale factor and offset, it is common that the return type is not the same as the NetCDF storage type, for example a variable might be NC_UBYTE on disk with a scale_factor of 3.f0, so the data has to be converted to floating-point when reading. This all got very confusing and I was not really happy with the resulting interface, so I stopped working on it until I have some time to think about it. Maybe you have some good ideas how to do this, here are my attempts so far: 1713963 Another question would be how to deprecate the old behaviour. I think a change like this will inevitably break lots of existing code. |
Indeed, it is not easy to implement this efficiently. Maybe one can implement two sets of types (for NetCDF variable (NcVarRaw, NcVarScaled) and NetCDF files (NcFileRaw, NcFileScaled). Indexing a NcFileRaw type would produce a NcVarRaw and indexing a NcVarRaw would return the same type as in the NetCDF files (ignoring all scaling and _FillValues). However, indexing NcFileScaled (or a better name) and NcVarScaled would return always a DataArray/NullableArray of Float64 for numbers (with scaling and masking _FillValues) and a DataArray/NullableArray of chars for characters (ignoring the scaling but masking _FillValues). |
I'm completely new to Julia, but I work with netCDFs a lot so I need this functionality if I'm going to continue. Has there been any more progress on this? Otherwise I'm willing to have an attempt at implementing something. |
In case, you are interested. Here is my attempt: https://github.com/Alexander-Barth/NCDatasets.jl |
@Alexander-Barth I have looked at your new interface, it looks good, and I think I will try it soon. Do you think it would make sense to factor out the netcdf_c.jl and build.jl into its own package (NetCDF-C) so that you could depend on this one? Then it would be easier to experiment with new interfaces for reading and writing NetCDF files, while minimizing code duplication. |
In fact, I started to change netcdf_c.jl quite bit and I am not sure if you are interested in these changes. I think that this file is an automatically generated file (right?). The changes I made are mainly returning e.g. scalar values instead of changing the elements of a one element vector. I also changed some parameter types from Cint to Csize_t if in the corresponding C API size_t is used. |
@Alexander-Barth Interested in trying this out. |
Ok, I think then it is better to leave the auto-generated file as it is. In general, I think it would be good if you keep posting progress on your new interface, especially once you register it as a package, then we could mention it as an alternative in the README. |
Ok, I agree. The other aspect (similar to scaling) that I have implemented is returning time variables (if they have a unit attribute of the form "days/hours/.. since start date) as arrays of In fact, I recently registered NCDatasets. I added the following to its README file: Alexander-Barth/NCDatasets.jl@daeda94 |
@Alexander-Barth thanks for showing NCDatasets, it wasn't on my radar before. We should indeed link to it in our readme as well. Regarding In my opinion we should aim towards either: Having NCDatasets.jl depend on NetCDF.jl, to reduce double effort, and NCDatasets can use the NetCDF low level API to build any more convenient API. This is more or less what is the idea behind GDAL.jl and ArchGDAL.jl, where GDAL.jl is pretty much a 1 to 1 translation of the GDAL API to Julia, on which ArchGDAL.jl can build to make a friendlier API. The advantage is that if someone has a need for a different API, they can relatively easily build a new API on top of GDAL.jl, without having to do the (not very interesting) plumbing. Try to merge the packages together? I understand that starting a separate package can be a great way to quickly work out new ideas, so I'm happy that you did that, and we can learn from each other. That said, I'm not sure there are any fundamental differences in where we want to go. What are your thoughts? @meggart @Alexander-Barth |
Another motivating factor for having one of the packages depend on the other is to have a single binary (and build file) shared across both packages |
Initially, I considered indeed to depend NCDatasets.jl on NetCDF.jl but I did not choose to follow to road:
While I wanted to use simply
About the merging option:
Of course I see also the overall benefit of combining all forces in a single NetCDF package, but I did not see how I could change NetCDF.jl the way I wanted it to be without massively breaking the user's current code. |
Thanks for the clarification and sorry for the long response time. A general comment first: Writing this package was my first attempt at learning Julia in 2012, so there are indeed many ways to improve the code, and maybe a complete re-write as you started would make sense. Regarding the globals, I see the problem when it comes to threading, the main reason they were introduced was to avoid unnecessary memory allocations. I am still not completely convinced of your dependency argument, since the NetCDF-C.jl package would be a very small package without any other Julia dependencies, containing just a single netcdf-c.jl source file. I support the idea of having separate packages for imperative reading and writing data vs a Object-Dataset like approach, and definitely think you should go on and register your package as an alternative. |
In Python and Matlab/Octave, the NetCDF variables scale_factor and add_offset are applied when these attributes are defined, according to this simple rule:
variable2 = scale_factor * variable + add_offset
As far as I can see, this is also the default for the R package RNetCDF. Is it possible to add this behaviour to ncread and to getindex/setindex! functions? (I actually prefer the later API because indexing is much more natural).
In a similar vein, is it also possible to set values to NaN which are equal to _FillValue or maybe use DataArrays?
The text was updated successfully, but these errors were encountered: