Add per Dataset encoding support #655
btw to make it more backwards compatible we instead change the default encoding to
that recursion bug was a doozy, put a warning for future devs
still learning cython :)
cool, looks like this PR is ready for review!
I'm waiting on this to see how the discussion on netcdf-c plays out (Unidata/netcdf-c#402)
What if different NC_STRING variables within the same dataset contain data with different encodings? I suppose this could be handled with the proposed
The whole thing is a mess and any solution I can think of (including this one) seems fragile and kludgy. At least we have a solution now that works as long as you know the encoding and you are only reading data from one Dataset at a time.
ya, we need per-variable encoding fallbacks. Let me know how you'd like to handle that. Some ideas:
I can then code something up for you to look at. Right now I'm using a custom branch, as I need to be able to specify the encoding because I open multiple datasets in parallel.
Is your use case (for specifying an encoding) mainly for attributes, or variable data?
variable data. Based on the conversations in the netcdf-c thread, it sounds like attributes should all be forced to utf-8 (which I can change in this PR too).
The reason I ask is if you are mainly concerned about attributes, we could add a kwarg
Could you post one of the MADIS files that you are dealing with?
found it for
which does not decode to UTF-8. presumably: 'F1LXJ-13 Annœullin FR ' in CP1252 encoding
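To illustrate the failure mode described above, here is a minimal sketch using only the Python standard library. The byte string is an assumption reconstructed from the presumed station name (trailing padding omitted), not data read from the actual MADIS file. In CP1252 the character 'œ' is the single byte 0x9C, which is not a valid start byte in UTF-8.

```python
# Assumed bytes: the presumed station name encoded as CP1252.
# This is an illustration, not the real contents of the MADIS file.
raw = "F1LXJ-13 Annœullin FR".encode("cp1252")

# Decoding as UTF-8 fails because 0x9C cannot start a UTF-8 sequence.
try:
    raw.decode("utf-8")
    utf8_ok = True
except UnicodeDecodeError:
    utf8_ok = False

# Decoding with the encoding the data was written in succeeds.
decoded = raw.decode("cp1252")

print(utf8_ok)
print(decoded)
```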
thanks @thehesiod. I've created an alternate 'solution' in the 'encoding' branch. Instead of setting the encoding as a Dataset init parameter, I look for an _Encoding attribute for character and vlen string variables. For string variables, if _Encoding is not set, 'utf-8' is used. For character arrays, if _Encoding is set, then a numpy array of fixed-length strings is returned by automatically calling chartostring (the rightmost dimension of the variable is assumed to be the length of the strings). If _Encoding is not set, you get the previous behavior (an array of single characters is returned). So, in your case you would have to add
For attributes, I added an 'encoding' kwarg to getncattr.
I know adding a new attribute to all the files is probably not a good solution for you, but I'm still a bit confused about what problem you are trying to solve. With the current master, you can read the variable stationName and convert it to an array of strings using chartostring, but it will use the global module variable 'default_encoding'. Wouldn't simply adding a kwarg 'encoding' to chartostring solve your problem? (This is also done in the 'encoding' branch.)
To be specific, here's what I'm suggesting (using the

    from netCDF4 import Dataset, chartostring
    nc = Dataset('20170201_0000')
    # read the raw character array, then decode it with an explicit encoding
    chararr = nc['stationName'][:]
    strarr = chartostring(chararr, encoding='cp1252')
    print(strarr[77])
    nc.close()

or alternately

    from netCDF4 import Dataset
    nc = Dataset('20170201_0000', 'a')
    # set _Encoding on the variable so slicing returns decoded strings
    nc['stationName']._Encoding = 'cp1252'
    strarr = nc['stationName'][:]
    print(strarr[77])
    nc.close()
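For readers without the MADIS file at hand, the conversion discussed above can be sketched with NumPy alone. The helper `chars_to_strings` below is hypothetical and is not netCDF4's `chartostring`. It only mirrors the behavior described in this thread: the rightmost dimension is treated as the string length, and each row is decoded with the given encoding. The sample data is made up for illustration.

```python
import numpy as np

def chars_to_strings(chararr, encoding="utf-8"):
    """Hypothetical stand-in: join the rightmost dimension of an 'S1'
    array into byte strings and decode them with `encoding`."""
    chararr = np.ascontiguousarray(chararr, dtype="S1")
    n = chararr.shape[-1]
    # Reinterpret each run of n single bytes as one fixed-length byte string.
    joined = chararr.view("S%d" % n).reshape(chararr.shape[:-1])
    # NumPy drops trailing NUL padding when items are extracted.
    decoded = [b.decode(encoding) for b in joined.ravel()]
    return np.array(decoded, dtype="U%d" % n).reshape(joined.shape)

# Made-up sample: two names stored as a (2, 9) array of single characters.
raw = np.array([b"Ann\x9cullin", b"Paris"], dtype="S9")
chararr = raw.view("S1").reshape(2, 9)

strarr = chars_to_strings(chararr, encoding="cp1252")
print(strarr[0])
print(strarr[1])
```

Passing the encoding per call, as in this sketch, keeps the choice local to each variable rather than tied to a Dataset or a module-level default.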
ya, thinking about it more, it doesn't make sense to need an encoding init param for char variable data; not sure why I thought I needed this. Closing this PR