xarray writing mfdataset results in incorrect data when not using manual encoding #239

veenstrajelmer · 2023-02-15T11:14:34Z

xarray.to_netcdf() of opened mfdataset results in incorrect data when not using manual encoding

import os
import xarray as xr
import matplotlib.pyplot as plt
plt.close('all')

#open data
dir_data = r'p:\11207892-pez-metoceanmc\3D-DCSM-FM\workflow_manual\01_scripts\04_meteo\era5_temp'
file_nc = os.path.join(dir_data,'era5_mslp_*.nc')
data_xr = xr.open_mfdataset(file_nc)

#optional encoding
#data_xr.msl.encoding['dtype'] = 'float32' #TODO: updating dtype in encoding solves the issue. Source data is int, opened data is float, but encoding is still int.
#data_xr.msl.encoding['_FillValue'] = float(data_xr.msl.encoding['_FillValue'])
#data_xr.msl.encoding['missing_value'] = float(data_xr.msl.encoding['missing_value'])
#data_xr.msl.encoding['zlib'] = True #no effect
#data_xr.msl.encoding['scale_factor'] = 0.01
#data_xr.msl.encoding['add_offset'] = 0

#write to netcdf file
file_out = os.path.join('era5_mslp_out.nc')
data_xr.to_netcdf(file_out)

fig,(ax1,ax2) = plt.subplots(1,2,figsize=(11,5))
data_xr.msl.sel(time='2023-01-24 02:00:00').plot(ax=ax1,cmap='jet') #original dataset
with xr.open_dataset(file_out) as data_xr_check:
    data_xr_check.msl.sel(time='2023-01-24 02:00:00').plot(ax=ax2,cmap='jet') #written dataset
fig.tight_layout()

This results in incorrect data in the written file (right):

When updating the dtype (from int to float) in the variable encoding, this issue is solved:

The encoding in the source dataset:

data_xr.msl.encoding
Out[28]: 
{'source': 'p:\\11207892-pez-metoceanmc\\3D-DCSM-FM\\workflow_manual\\01_scripts\\04_meteo\\era5_temp\\era5_mslp_2022-11.nc',
 'original_shape': (720, 93, 121),
 'dtype': dtype('int16'),
 'missing_value': -32767,
 '_FillValue': -32767,
 'scale_factor': 0.11615998809759968,
 'add_offset': 99924.34817000595}

Possible cause: the source data is stored as integers (int16), but opening files with different scale_factor values (one per file) converts it to floats (or maybe this always happens). The dtype in the encoding is still int16, so that is how the netCDF file is written, and some values probably do not fit within the int16 bounds for the retained scale_factor and add_offset.
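To make the suspected int16 overflow concrete, here is a minimal sketch of the packing arithmetic, assuming the usual convention packed = round((value - add_offset) / scale_factor). The scale_factor and add_offset are copied from the encoding above; the two example pressures and the helper names are hypothetical and only illustrate the check.

```python
INT16_MIN, INT16_MAX = -32768, 32767

# Encoding values copied from the msl encoding shown above.
SCALE_FACTOR = 0.11615998809759968
ADD_OFFSET = 99924.34817000595


def packed_value(value, scale_factor, add_offset):
    """Integer that packing would store for `value` (hypothetical helper)."""
    return round((value - add_offset) / scale_factor)


def fits_int16(value, scale_factor, add_offset):
    """True if the packed integer lies within the int16 bounds."""
    return INT16_MIN <= packed_value(value, scale_factor, add_offset) <= INT16_MAX


# Range of values that this scale_factor/add_offset pair can represent as int16.
lowest = ADD_OFFSET + SCALE_FACTOR * INT16_MIN
highest = ADD_OFFSET + SCALE_FACTOR * INT16_MAX
print(f"representable range: {lowest:.1f} to {highest:.1f}")

# Hypothetical pressures in Pa: one inside the range, one above it.
for pressure in (100000.0, 104500.0):
    print(pressure, fits_int16(pressure, SCALE_FACTOR, ADD_OFFSET))
```

If another file in the mfdataset contains values outside this range, they cannot be stored correctly with the encoding of the file shown above, which would match the corrupted plot.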

veenstrajelmer · 2023-03-07T17:09:52Z

This is solved with dfmt.prevent_dtype_int() which is used in dfmt.merge_meteofiles():

import dfm_tools as dfmt    
data_xr = dfmt.prevent_dtype_int(data_xr)

The new example file is preprocess_merge_meteofiles.py

veenstrajelmer · 2023-03-07T18:07:08Z

Solved with the above, but another workaround is removing the scale_factor instead of the dtype. This keeps the file size small. However, there are slight differences between the source and destination datasets, since the scale_factor of about 0.1 is replaced by the default of 1. The appropriate scale_factor probably differs per variable, so this is not generic. Also, maybe move dfmt.prevent_dtype_int() to dfmt.preprocess_ERA5(), since up to now it is specific to ERA5.

import os
import xarray as xr
import matplotlib.pyplot as plt
plt.close('all')
import numpy as np

#open data
dir_data = r'p:\11207892-pez-metoceanmc\3D-DCSM-FM\workflow_manual\01_scripts\04_meteo\era5_temp'
file_nc = os.path.join(dir_data,'era5_mslp_*.nc')
data_xr = xr.open_mfdataset(file_nc)

#optional encoding
#data_xr.msl.encoding.pop('dtype') #difference is 0
data_xr.msl.encoding.pop('scale_factor') #difference is 0.46-0.5
#data_xr.msl.encoding.pop('add_offset') #difference is 131072.5

#write to netcdf file
file_out = os.path.join('era5_mslp_out.nc')
data_xr.to_netcdf(file_out)

data_xr_check = xr.open_dataset(file_out)

absdiff = (data_xr_check - data_xr).map(np.fabs) #Dataset.map replaces the deprecated Dataset.apply
absdiff_max = absdiff.msl.max(dim=['longitude','latitude'])
fig,ax = plt.subplots()
absdiff_max.plot()
fig.tight_layout()

pop dtype:

pop scale_factor:
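The 0.46-0.5 difference noted for popping scale_factor is what rounding to whole units would give: with a step of 1, the round-trip error is at most half a step. Below is a minimal sketch of that round trip in plain Python. The add_offset and original scale_factor come from the encoding in the first comment; the sample values and function names are made up for illustration.

```python
ADD_OFFSET = 99924.34817000595
ORIGINAL_SCALE = 0.11615998809759968
DEFAULT_SCALE = 1.0  # used when scale_factor is absent from the encoding


def roundtrip(value, scale_factor, add_offset):
    """Pack to the nearest integer step and unpack again."""
    packed = round((value - add_offset) / scale_factor)
    return packed * scale_factor + add_offset


def max_abs_error(values, scale_factor, add_offset):
    """Largest absolute difference between the values and their round trip."""
    return max(abs(roundtrip(v, scale_factor, add_offset) - v) for v in values)


# Hypothetical pressures in Pa, 0.01 apart, to sample many fractional parts.
samples = [100000.0 + 0.01 * i for i in range(1000)]

err_original = max_abs_error(samples, ORIGINAL_SCALE, ADD_OFFSET)
err_default = max_abs_error(samples, DEFAULT_SCALE, ADD_OFFSET)
print(f"max error with original scale_factor: {err_original:.3f}")
print(f"max error with default scale_factor:  {err_default:.3f}")
```

The error bound is scale_factor / 2 in both cases, so a variable with a much smaller natural range would lose relatively more precision with the default of 1.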

veenstrajelmer · 2023-03-08T10:50:17Z

Alternatively, recompute the scaling/offset as suggested in ArcticSnow/TopoPyScale#60 (comment)

Implementation: https://github.com/ArcticSnow/TopoPyScale/blob/494f4e7ea17830ba3d23627bf22ee200a6c4f082/TopoPyScale/topo_export.py#L21
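For reference, recomputing the packing parameters from the actual data range could look roughly like the sketch below. This is not the TopoPyScale implementation linked above, just a common way to derive scale_factor and add_offset for int16. The function name, the choice to map the data onto the packed range -32766 to 32767 (leaving -32767 free for the fill value), and the example range are all assumptions.

```python
PACKED_MIN, PACKED_MAX = -32766, 32767  # leaves -32768 and -32767 unused


def compute_scale_and_offset(data_min, data_max):
    """Return (scale_factor, add_offset) that map [data_min, data_max]
    onto the packed integer range [PACKED_MIN, PACKED_MAX]."""
    if data_max == data_min:
        # Constant field: any positive scale works, the offset carries the value.
        return 1.0, float(data_min)
    scale_factor = (data_max - data_min) / (PACKED_MAX - PACKED_MIN)
    add_offset = data_min - PACKED_MIN * scale_factor
    return scale_factor, add_offset


def pack(value, scale_factor, add_offset):
    """Integer stored for `value` under these packing parameters."""
    return round((value - add_offset) / scale_factor)


# Hypothetical global min/max of the merged dataset, in Pa.
vmin, vmax = 96000.0, 105000.0
scale_factor, add_offset = compute_scale_and_offset(vmin, vmax)
print(scale_factor, add_offset)
print(pack(vmin, scale_factor, add_offset), pack(vmax, scale_factor, add_offset))
```

With xarray, the resulting values would go into the variable's encoding as 'scale_factor' and 'add_offset' before calling to_netcdf(). Getting the minimum and maximum requires a pass over the full merged dataset.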

veenstrajelmer · 2023-03-08T12:18:41Z

Recompute issue created: #269

veenstrajelmer · 2023-03-10T16:37:35Z

Zlib compression might be easier and more generic. However, some encoding has to be altered for each variable when doing that.

ds.msl.encoding.pop('dtype')
ds.msl.encoding.pop('scale_factor')
ds.msl.encoding.pop('add_offset')
ds.msl.encoding['zlib'] = True # in combination with dropping dtype/scale_factor/add_offset, float32 gives approximately the same file size as int16
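Since the encoding of each variable is a plain dict, the same four steps could be wrapped in a small helper and applied to every variable. A minimal sketch follows; the helper name and the example dict are made up, and the xarray loop is only shown as a comment.

```python
def use_float_zlib(encoding):
    """Drop the integer packing keys from an encoding dict (in place)
    and enable zlib compression instead. Hypothetical helper."""
    for key in ("dtype", "scale_factor", "add_offset"):
        encoding.pop(key, None)  # no error if the key is absent
    encoding["zlib"] = True
    return encoding


# Example with a dict shaped like the msl encoding from the first comment.
example_encoding = {
    "dtype": "int16",
    "_FillValue": -32767,
    "scale_factor": 0.11615998809759968,
    "add_offset": 99924.34817000595,
}
print(use_float_zlib(example_encoding))

# With an xarray Dataset `ds` this would be applied per variable:
# for name in ds.data_vars:
#     use_float_zlib(ds[name].encoding)
```

Using pop(key, None) keeps the helper usable for variables that were never packed as integers.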

veenstrajelmer · 2023-09-18T18:27:54Z

GTSM fou netCDF files were easily compressed, without a reduction in performance (compare_components_fouhis.py).

def compress(ds):
    # float64 was 260MB per raster file (with 2 vars, amp+phs)
    # float32 and zlib=True (complevel=4 is the default) gives 26MB per file
    # this seems to have no impact on file I/O performance
    for var in ds.data_vars:
        ds[var].encoding['dtype'] = 'float32'
        ds[var].encoding['_FillValue'] = -999.0 # numeric fill value, not the string '-999'
        ds[var].encoding['zlib'] = True
    return ds

For era5 it would be something like this:

import netCDF4

# optional encoding. int16 was 50MB, float32 is 99MB, float32 with zlib is 56MB
data_xr.msl.encoding['dtype'] = 'float32'
float32_fillvalue = netCDF4.default_fillvals['f4']
data_xr.msl.encoding['_FillValue'] = float32_fillvalue
# drop the integer packing attributes, if present
drop_encoding_attrs = ["scale_factor", "add_offset", "missing_value"]
for key in drop_encoding_attrs:
    data_xr.msl.encoding.pop(key, None)
data_xr.msl.encoding['zlib'] = True # 50-60% of file size

veenstrajelmer mentioned this issue Feb 16, 2023

Encoding error when saving netcdf pydata/xarray#7039

Open

4 tasks

veenstrajelmer linked a pull request Mar 7, 2023 that will close this issue

239 xarray writing mfdataset results in incorrect data when not using manual encoding #268

Merged

veenstrajelmer closed this as completed in #268 Mar 7, 2023

veenstrajelmer reopened this Mar 7, 2023

veenstrajelmer closed this as completed Mar 8, 2023

veenstrajelmer reopened this Mar 10, 2023

veenstrajelmer mentioned this issue Jul 18, 2023

Recompute scaling/offset int variable VU-IVM/gtsm3-era5-nrt#2

Open

veenstrajelmer mentioned this issue Oct 20, 2023

Prepare 0.16.0 release #600

Closed

13 tasks

veenstrajelmer mentioned this issue Nov 3, 2023

Prepare 0.17.0 release #640

Closed

9 tasks

veenstrajelmer mentioned this issue Nov 17, 2023

Prepare 0.18.0 release #667

Closed

13 tasks

veenstrajelmer mentioned this issue Dec 7, 2023

Prepare 0.19.0 release #699

Closed

19 tasks

veenstrajelmer linked a pull request Jan 23, 2024 that will close this issue

239 xarray writing mfdataset results in incorrect data when not using manual encoding #738

Merged

veenstrajelmer closed this as completed in #738 Jan 23, 2024

This was referenced Aug 9, 2024

Disruptions and update to new cdsapi #739

Closed

fix test_prevent_dtype_int #942

Closed
