
Low performance on reading netcdf file with many groups #9126

Closed
mraspaud opened this issue Jun 14, 2024 · 1 comment
Labels
needs triage: Issue that has not been reviewed by an xarray team member

Comments

@mraspaud (Contributor)

What is your issue?

Really excited about the ongoing integration of xarray-datatree, I started testing a few things, among them reading files from upcoming satellite missions that we are working with for real-time processing.
Since reading these files was painfully slow, I started investigating and ended up in xarray, looking at how CachingFileManager is used. I have an idea for speeding this up that I would like to discuss with someone who knows xarray better than I do.
But first, a minimal example:

import xarray as xr

filename = "groups.nc"

# open each group of the nc file; chunks="auto" keeps the arrays lazy
for group_num in range(75):
    res = xr.open_dataset(filename, group=f"group{group_num}", chunks="auto")

print("done reading")
Creating the file
import numpy as np
import xarray as xr

filename = "groups.nc"

# create the nc file, one group at a time
for group_num in range(75):
    ds = xr.Dataset(
        {
            "foo": (("x", "y"), np.random.rand(150, 5000)),
            "bar": (("x", "y"), np.random.rand(150, 5000)),
        },
        coords={
            "x": range(150),
            "y": range(5000),
        },
    )

    # write the first group in "w" mode, append the rest
    mode = "w" if group_num == 0 else "a"
    ds.to_netcdf(filename, group=f"group{group_num}", mode=mode)

The reading in that example takes about 15 seconds on my laptop. Note that I'm not actually reading any of the array data: chunks="auto" makes xarray create lazy dask arrays, so only the metadata is touched.
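
For reference, the timing above can be reproduced with something along these lines (assuming dask is installed, since chunks="auto" needs it):

import time

import dask.array as da
import xarray as xr

start = time.perf_counter()
for group_num in range(75):
    res = xr.open_dataset("groups.nc", group=f"group{group_num}", chunks="auto")
print(f"opened 75 groups in {time.perf_counter() - start:.1f} s")

# the variables are dask arrays, so no array data has been read yet
print(isinstance(res["foo"].data, da.Array))  # True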

While debugging this, I realized that the file was opened 75 times, once for each group. That seems to come from a new CachingFileManager being created for each group, which in turn reopens the file every time.
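
For anyone who wants to check the open count themselves, here is a rough debugging sketch: wrap netCDF4.Dataset in a counting subclass and patch the module attribute. This should work as long as the netCDF4 backend looks up netCDF4.Dataset at open time (it appears to today, but this is debugging code, not a supported API):

import netCDF4
import xarray as xr

_RealDataset = netCDF4.Dataset
open_count = 0

class CountingDataset(_RealDataset):
    # counts every time the file is physically opened
    def __init__(self, *args, **kwargs):
        global open_count
        open_count += 1
        super().__init__(*args, **kwargs)

netCDF4.Dataset = CountingDataset
try:
    for group_num in range(75):
        xr.open_dataset("groups.nc", group=f"group{group_num}", chunks="auto")
finally:
    netCDF4.Dataset = _RealDataset

print(open_count)  # expected: 75 with the current behaviour, one open per group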

I made a quick fix that creates the CachingFileManager only once per file (so only once in this example): not only does the reading still work, it now takes only 7 seconds, cutting the time by more than half!
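
In the meantime, a user-level workaround along the same lines seems possible: open the file once with netCDF4 and hand the already-open handle to xarray through NetCDF4DataStore, which, as far as I can tell, accepts an open netCDF4.Dataset together with a group name (a sketch, not something I'd rely on as a stable pattern):

import netCDF4
import xarray as xr

nc = netCDF4.Dataset("groups.nc")  # open the file once

datasets = []
for group_num in range(75):
    # each store reuses the shared handle instead of reopening the file
    store = xr.backends.NetCDF4DataStore(nc, group=f"group{group_num}")
    datasets.append(xr.open_dataset(store, chunks="auto"))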

So the question is: do you think this is a viable modification to the behaviour of CachingFileManager, or am I missing something?

If you think this is a valid approach, I can try to make a PR for it.

@mraspaud added the needs triage label on Jun 14, 2024
@mraspaud (Contributor, Author)

Just saw #8994, sorry for the noise.
