Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

xarray writing mfdataset results in incorrect data when not using manual encoding #239

Closed
veenstrajelmer opened this issue Feb 15, 2023 · 6 comments · Fixed by #268 or #738
Closed

Comments

@veenstrajelmer
Copy link
Collaborator

veenstrajelmer commented Feb 15, 2023

xarray.to_netcdf() of opened mfdataset results in incorrect data when not using manual encoding

import os
import xarray as xr
import matplotlib.pyplot as plt
plt.close('all')

#open data
dir_data = r'p:\11207892-pez-metoceanmc\3D-DCSM-FM\workflow_manual\01_scripts\04_meteo\era5_temp'
file_nc = os.path.join(dir_data,'era5_mslp_*.nc')
data_xr = xr.open_mfdataset(file_nc)

#optional encoding
#data_xr.msl.encoding['dtype'] = 'float32' #TODO: updating dtype in encoding solves the issue. Source data is int, opened data is float, but encoding is still int.
#data_xr.msl.encoding['_FillValue'] = float(data_xr.msl.encoding['_FillValue'])
#data_xr.msl.encoding['missing_value'] = float(data_xr.msl.encoding['missing_value'])
#data_xr.msl.encoding['zlib'] = True #no effect
#data_xr.msl.encoding['scale_factor'] = 0.01
#data_xr.msl.encoding['add_offset'] = 0

#write to netcdf file
file_out = os.path.join('era5_mslp_out.nc')
data_xr.to_netcdf(file_out)

fig,(ax1,ax2) = plt.subplots(1,2,figsize=(11,5))
data_xr.msl.sel(time='2023-01-24 02:00:00').plot(ax=ax1,cmap='jet') #original dataset
with xr.open_dataset(file_out) as data_xr_check:
    data_xr_check.msl.sel(time='2023-01-24 02:00:00').plot(ax=ax2,cmap='jet') #written dataset
fig.tight_layout()

This results in incorrect data in the written file (right):
image

When updating the dtype (from int to float) in the variable encoding, this issues is solved:
image

The encoding in the source dataset:

data_xr.msl.encoding
Out[28]: 
{'source': 'p:\\11207892-pez-metoceanmc\\3D-DCSM-FM\\workflow_manual\\01_scripts\\04_meteo\\era5_temp\\era5_mslp_2022-11.nc',
 'original_shape': (720, 93, 121),
 'dtype': dtype('int16'),
 'missing_value': -32767,
 '_FillValue': -32767,
 'scale_factor': 0.11615998809759968,
 'add_offset': 99924.34817000595}

Possible issue: source data is integers, but opening files with different scaling_factors (from different files) converts it to floats (or maybe this always happens). The dtype in the encoding is still int, so this is how the netcdf is written, but probably something does not fit within the int-bounds.

@veenstrajelmer
Copy link
Collaborator Author

veenstrajelmer commented Mar 7, 2023

This is solved with dfmt.prevent_dtype_int() which is used in dfmt.merge_meteofiles():

import dfm_tools as dfmt    
data_xr = dfmt.prevent_dtype_int(data_xr)

The new examplefile is preprocess_merge_meteofiles.py

@veenstrajelmer
Copy link
Collaborator Author

veenstrajelmer commented Mar 7, 2023

Solved with above, but another workaround is removing the scale_factor instead of the dtype. This keeps the file size small. However, there are slight offsets between the source and destination datasets, but since the value in the range of 0.1 was replaced by the default 1. The scale_factor probably depends per variable so this is not generic. Also, maybe move dfmt.prevent_dtype_int() to dfmt.preprocess_ERA5() since up to now it is specific to ERA5.

import os
import xarray as xr
import matplotlib.pyplot as plt
plt.close('all')
import numpy as np

#open data
dir_data = r'p:\11207892-pez-metoceanmc\3D-DCSM-FM\workflow_manual\01_scripts\04_meteo\era5_temp'
file_nc = os.path.join(dir_data,'era5_mslp_*.nc')
data_xr = xr.open_mfdataset(file_nc)

#optional encoding
#data_xr.msl.encoding.pop('dtype') #difference is 0
data_xr.msl.encoding.pop('scale_factor') #difference is 0.46-0.5
#data_xr.msl.encoding.pop('add_offset') #difference is 131072.5

#write to netcdf file
file_out = os.path.join('era5_mslp_out.nc')
data_xr.to_netcdf(file_out)

data_xr_check = xr.open_dataset(file_out)

absdiff = (data_xr_check - data_xr).apply(np.fabs)
absdiff_max = absdiff.msl.max(dim=['longitude','latitude'])
fig,ax = plt.subplots()
absdiff_max.plot()
fig.tight_layout()

pop dtype:
image

pop scale_factor:
image

@veenstrajelmer
Copy link
Collaborator Author

@veenstrajelmer
Copy link
Collaborator Author

Recompute issue created: #269

@veenstrajelmer
Copy link
Collaborator Author

veenstrajelmer commented Mar 10, 2023

Zipping might be easier and more generic, however, some encoding has to be altered when doing that (for each variable).

ds.msl.encoding.pop('dtype')
ds.msl.encoding.pop('scale_factor')
ds.msl.encoding.pop('add_offset')
ds.msl.encoding['zlib'] = True # icw dropping dtype/scale_factor/add_offset, results in approximately same filesize with float32 as int16

@veenstrajelmer
Copy link
Collaborator Author

veenstrajelmer commented Sep 18, 2023

GTSM fou ncfiles were easily compressed and without performance reduction (compare_components_fouhis.py).

def compress(ds):
    # float64 was 260MB per raster file (with 2 vars, amp+phs)
    # float32 and zlib=True (with complevel=4 is auto) gives 26MB per file
    # this seems to have no performance on file I/O
    for var in ds.data_vars:
        ds[var].encoding['dtype'] = 'float32'
        ds[var].encoding['_FillValue'] = '-999'
        ds[var].encoding['zlib'] = True
    return ds

For era5 it would be something like this:

# optional encoding. int was 50MB, float32 is 99MB, float32 with zlib is 56MB
data_xr.msl.encoding['dtype'] = 'float32'
float32_fillvalue = netCDF4.default_fillvals['f4']
data_xr.msl.encoding['_FillValue'] = float32_fillvalue
drop_encoding_attrs = ["scale_factor", "add_offset", "missing_value"]
for key in drop_encoding_attrs:
    if key in data_xr.msl.encoding.keys():
        data_xr.msl.encoding.pop(key)
data_xr.msl.encoding['zlib'] = True # 50-60% of file size


Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
1 participant