
Low performance on reading netcdf file with many groups #9126

Closed
mraspaud opened this issue Jun 14, 2024 · 1 comment
Labels
needs triage: Issue that has not been reviewed by an xarray team member

Comments

@mraspaud (Contributor)

What is your issue?

Really excited about the ongoing integration of xarray-datatree, I started testing a few things, among them reading files from upcoming satellite missions that we are working with for real-time processing.
Since reading these files was painfully slow, I started investigating and ended up in xarray, looking at how CachingFileManager is used. I have an idea for speeding this up that I would like to discuss with someone who knows xarray better than I do.
But first, a minimal example:

import xarray as xr

filename = "groups.nc"

# open each group of the nc file; chunks="auto" keeps the arrays lazy
for group_num in range(75):
    res = xr.open_dataset(filename, group=f"group{group_num}", chunks="auto")

print("done reading")
Creating the file
import numpy as np
import xarray as xr

filename = "groups.nc"

# create the nc file, one group at a time
for group_num in range(75):
    ds = xr.Dataset(
        {
            "foo": (("x", "y"), np.random.rand(150, 5000)),
            "bar": (("x", "y"), np.random.rand(150, 5000)),
        },
        coords={
            "x": range(150),
            "y": range(5000),
        },
    )

    # write the first group in "w" mode, append the rest
    mode = "w" if group_num == 0 else "a"
    ds.to_netcdf(filename, group=f"group{group_num}", mode=mode)

The reading in that example takes about 15 seconds on my laptop. Note that I'm not actually reading any of the array data: chunks="auto" makes xarray create lazy dask arrays, so only the metadata is touched.
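
For reference, the timing above can be reproduced with something along these lines (assuming dask is installed, since chunks="auto" needs it):

import time

import dask.array as da
import xarray as xr

start = time.perf_counter()
for group_num in range(75):
    res = xr.open_dataset("groups.nc", group=f"group{group_num}", chunks="auto")
print(f"opened 75 groups in {time.perf_counter() - start:.1f} s")

# the variables are dask arrays, so no array data has been read yet
print(isinstance(res["foo"].data, da.Array))  # True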

While debugging this, I realized that the file was opened 75 times, once for each group. That seems to come from a new CachingFileManager being created for each group, which in turn reopens the file every time.
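
For anyone who wants to check the open count themselves, here is a rough debugging sketch: wrap netCDF4.Dataset in a counting subclass and patch the module attribute. This should work as long as the netCDF4 backend looks up netCDF4.Dataset at open time (it appears to today, but this is debugging code, not a supported API):

import netCDF4
import xarray as xr

_RealDataset = netCDF4.Dataset
open_count = 0

class CountingDataset(_RealDataset):
    # counts every time the file is physically opened
    def __init__(self, *args, **kwargs):
        global open_count
        open_count += 1
        super().__init__(*args, **kwargs)

netCDF4.Dataset = CountingDataset
try:
    for group_num in range(75):
        xr.open_dataset("groups.nc", group=f"group{group_num}", chunks="auto")
finally:
    netCDF4.Dataset = _RealDataset

print(open_count)  # expected: 75 with the current behaviour, one open per group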

I made a quick fix that creates the CachingFileManager only once per file (so only once in this example): not only does the reading still work, it now takes only 7 seconds, cutting the time by more than half!
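
In the meantime, a user-level workaround along the same lines seems possible: open the file once with netCDF4 and hand the already-open handle to xarray through NetCDF4DataStore, which, as far as I can tell, accepts an open netCDF4.Dataset together with a group name (a sketch, not something I'd rely on as a stable pattern):

import netCDF4
import xarray as xr

nc = netCDF4.Dataset("groups.nc")  # open the file once

datasets = []
for group_num in range(75):
    # each store reuses the shared handle instead of reopening the file
    store = xr.backends.NetCDF4DataStore(nc, group=f"group{group_num}")
    datasets.append(xr.open_dataset(store, chunks="auto"))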

So the question is: do you think this is a viable modification to the behaviour of CachingFileManager, or am I missing something?

If you think this is a valid approach, I can try to make a PR for it.

@mraspaud added the needs triage label on Jun 14, 2024
@mraspaud (Contributor, Author)

Just saw #8994, sorry for the noise.
