Read/Write performance optimizations for netcdf files #6309
Comments
I have to elaborate that this may be even more important for users that READ the data back a lot. Reading with the standard Xarray operations hits other limits, but one limit it definitely hits is that of the HDF5 driver used.
Thanks @hmaarrfk. Something I didn't quite understand: what do the values in this map refer to?
To what extent should this be done in xarray vs the underlying library? I wasn't sure how much context the underlying library needs to execute this well.
It would be nice to add this to our documentation.
@max-sixty unfortunately, I think the way hdf5 is designed, it doesn't try to be too smart about what the best fine-tuning for your particular system would be. In some ways, this is the correct approach. The current constructor pathway doesn't provide the user with a catch-all kwargs; I think adding one would be an acceptable solution. I should say that the performance of the direct driver is terrible without aligned data.
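To illustrate what alignment means here, a rough sketch using h5py directly (the 4096-byte alignment value and file paths are my assumptions; the alignment keywords need a recent h5py, and `driver="direct"` needs an HDF5 build with the direct VFD enabled):

```python
import h5py
import numpy as np

data = np.zeros((1024, 1024, 1024), dtype=np.uint8)  # 1 GiB of zeros, not faulted

# Aligned: every allocation of at least 4096 bytes starts on a 4096-byte
# boundary, which is what O_DIRECT-style I/O wants.
with h5py.File("/tmp/aligned.h5", "w", driver="direct",
               alignment_threshold=4096, alignment_interval=4096) as f:
    f.create_dataset("x", data=data)

# Same driver with default alignment: this is the slow case described above.
with h5py.File("/tmp/unaligned.h5", "w", driver="direct") as f:
    f.create_dataset("x", data=data)
```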
Great, thanks @hmaarrfk — we'd definitely take a PR for the docs or something to pass along |
What happened?
I'm not too sure this is a bug report, but I figured I would share some of the investigation I've done on the topic of writing large datasets to netcdf.
For clarity, the use case I'm considering is writing a large in-memory array to persistent storage on Linux.
The symptoms are twofold: writes run much slower than the SSD can sustain, and they consume RAM that I would rather leave untouched.
It's quite hard to get good performance from storage systems, so I'm going to put a few more constraints on the type of data we are writing: in particular, the data already lives in memory as a single array, and it is allocated but not faulted (e.g. the output of np.zeros, which doesn't fault the memory). I feel like these two conditions are rather easy to arrange, as I'll show in my example.
What did you expect to happen?
I want to be able to write at 3.2 GB/s with my shiny new SSD.
I want to leave my RAM unused when I'm archiving to disk.
Minimal Complete Verifiable Example
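A minimal sketch of the kind of benchmark being described, assuming a ~4 GB zero-filled array and the netcdf4 and h5netcdf engines (the array shape and file paths are placeholders):

```python
import time

import numpy as np
import xarray as xr

# ~4 GiB of zeros; np.zeros does not fault the memory.
data = np.zeros((1024, 1024, 4096), dtype=np.uint8)
ds = xr.Dataset({"image": (("y", "x", "z"), data)})

for engine in ("netcdf4", "h5netcdf"):
    path = f"/tmp/test_{engine}.nc"
    start = time.perf_counter()
    ds.to_netcdf(path, engine=engine)
    elapsed = time.perf_counter() - start
    print(f"{engine}: {elapsed:.2f} s ({data.nbytes / elapsed / 1e9:.2f} GB/s)")
```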
Relevant log output
Both report about 3.5 s, equivalent to just about 1 GB/s.
To get to about 3 GB/s (roughly 1.27 s to write a 4 GB array), one needs to do a few things. For the h5netcdf backend, you would have to pass additional kwargs to the h5netcdf constructor, along the lines of the sketch below.
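A rough sketch of the kind of keyword arguments involved (the specific values are my assumptions: 4096 matches a common filesystem block size, the alignment keywords need a recent h5py, `driver="direct"` needs an HDF5 build with the direct VFD, and h5netcdf is assumed to forward unknown keyword arguments on to h5py.File):

```python
import h5netcdf

# Hypothetical alignment settings; these keywords ultimately belong to
# h5py.File, to which h5netcdf is assumed to pass them through.
with h5netcdf.File(
    "/tmp/aligned.nc",
    "w",
    driver="direct",
    alignment_threshold=4096,
    alignment_interval=4096,
) as f:
    pass  # create dimensions/variables and write the aligned data here
```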
Anything else we need to know?
The main challenge is that while writing aligned data this way is REALLY fast, writing small chunks and unaligned data becomes REALLY slow.
Personally, I think someone might be able to write a new HDF5 driver that does better optimization. I feel like this could help people loading large datasets, which seems to be a large part of the xarray user community.
Environment
h5py includes some additions of mine that allow you to use the DIRECT driver, and I am using a version of HDF5 that is built with the DIRECT driver enabled.