Really excited about the ongoing integration of xarray-datatree, I started testing a few things, among them reading some files from upcoming satellite missions that we are working with for real-time processing.
Since reading these files was painfully slow, I started investigating and ended up looking at how CachingFileManager is used in xarray. I have an idea for how to speed this up that I would like to discuss with someone who knows xarray better than I do.
But first, a minimal example:
The reading in that example takes about 15 seconds on my laptop. Note that I'm not actually reading any of the array data: thanks to the chunking, we only create lazy (dask) arrays.
While debugging this, I realized that the file was opened 75 times, i.e. once for each group. That seems to come from the CachingFileManager being recreated for each group, which in turn leads to the file being reopened for each group.
I made a quick fix that creates the CachingFileManager only once per file (so only once in this example). Not only does the reading still work, it now takes only 7 seconds, cutting the time by more than half!
So the question is: do you think this is a viable change to the behaviour of the CachingFileManager, or am I missing something?
If you think this is a valid approach I can try to make a PR about it.