Poor performance downloading using byte-range requests #1848
According to ncdump -hs, the chunking for that variable is:
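The ncdump output itself did not survive in this extract, but for reference, chunking can also be inspected from netCDF4-python; a minimal sketch, assuming a local copy of the HadCRUT file and the temperature_anomaly variable discussed later in the thread:

```python
import netCDF4

# Assumes a local copy of the file referenced elsewhere in this thread.
nc = netCDF4.Dataset('HadCRUT.4.6.0.0.median.nc')
var = nc.variables['temperature_anomaly']

print(var.chunking())   # either 'contiguous' or a list of chunk sizes, one per dimension
print(var.filters())    # compression/shuffle settings, which also affect per-chunk cost
nc.close()
```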
We need something to compare against. Is there any way
To be clear, when I asked about chunks, what I really meant is: how many bytes are asked for in each request? It feels like the low performance is due to too many small requests rather than getting e.g. 1 MB chunks.
Is |
Also, just a note that on Unidata/netcdf4-python#1043 (comment) I was just opening the file as an xarray dataset (not downloading, at least not intentionally anyway). That should just read some metadata and the coordinates, but not the data variables. But it was still very slow.
The file is a netcdf-4 file apparently, so the decision about how much to read
@rsignell-usgs So when I tested, I tried just opening with netCDF4-python, and didn't have the horrendous slow-down. The quickest way for me to reproduce it was to try to get data, which was unreasonably slow. The original xarray issue might instead be downloading coordinates or something. Either way, it'd be nice to find a way to address a 40x slowdown.
To be clear, I am not surprised by these numbers. We know that, inherently, access using
One important point, maybe.
So I went ahead and ncdump-ed the
and can confirm my suspicions that getting the coordinates (specifically time) was the xarray problem. Getting the data values for
Doing the same for
@DennisHeimbigner I expected it to be worse, but it seems like HDF5 is doing the naive approach and reading each chunk as an individual request. You could greatly speed this up by reading 512 kB or 1 MB at a time.
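To make the request pattern concrete (this is an illustration, not what HDF5 actually does internally), compare one HTTP range request per small chunk with a single coalesced request that is sliced locally; the chunk size and count here are placeholders:

```python
import requests

URL = 'https://coawst-public.s3-us-west-2.amazonaws.com/testing/HadCRUT.4.6.0.0.median.nc'
CHUNK = 16 * 1024   # placeholder chunk size
NCHUNKS = 32        # 32 * 16 kB = 512 kB total

def range_get(offset, nbytes):
    """Fetch one byte range with a single HTTP request."""
    headers = {'Range': f'bytes={offset}-{offset + nbytes - 1}'}
    r = requests.get(URL, headers=headers)
    r.raise_for_status()
    return r.content

# Naive pattern: one request per chunk; connection overhead is paid 32 times.
chunks = [range_get(i * CHUNK, CHUNK) for i in range(NCHUNKS)]

# Coalesced pattern: one 512 kB request, sliced into chunks in memory.
block = range_get(0, NCHUNKS * CHUNK)
chunks = [block[i * CHUNK:(i + 1) * CHUNK] for i in range(NCHUNKS)]
```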
Possibly relevant: time is an UNLIMITED dimension, while latitude and longitude are not. |
I think |
Not sure I follow. The chunks are not necessarily contiguous on disk
5 MB is the typical default, but this is configurable or can be turned off (no caching beyond the specific read).
Martin - does reading large amounts help or hurt when the file is being
Definitely relevant, since that's why
That depends on the connection establishment time (slow for SSL) versus the connection throughput. 5 MB was specifically chosen to be "worthwhile", so that the connection overhead becomes relatively small compared to the total time; of course, if most of the bytes are useless, you would be better off only getting those you need. Note: fsspec allows concurrent downloads, to memory or disc, of multiple URLs, but this is not yet implemented for byte ranges or random access. For the latter it probably never will be, because it is unclear how to handle the file instance's internal state.
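As a hedged illustration of the concurrent-download capability mentioned here (whole URLs, not byte ranges), assuming a reasonably recent fsspec; the second URL is a placeholder:

```python
import fsspec

fs = fsspec.filesystem('http')

urls = [
    'https://coawst-public.s3-us-west-2.amazonaws.com/testing/HadCRUT.4.6.0.0.median.nc',
    'https://example.com/another-file.nc',   # placeholder URL, for illustration only
]

# Returns {url: bytes}; for the HTTP filesystem, recent fsspec versions issue
# these requests concurrently. This works per whole file, not per byte range.
contents = fs.cat(urls)
```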
There are at least two experiments that need to run:
I can undertake #2, although it is possible that it would be quicker to test
Some caching reference for fsspec: https://filesystem-spec.readthedocs.io/en/latest/api.html#read-buffering (and https://filesystem-spec.readthedocs.io/en/latest/api.html#fsspec.core.BaseCache, the no-cache option). The type is chosen by passing cache_type= when opening a file.
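A minimal sketch of that configuration (values illustrative, not recommendations): the buffering strategy and its granularity are chosen per file via cache_type and block_size.

```python
import fsspec

url = 'https://coawst-public.s3-us-west-2.amazonaws.com/testing/HadCRUT.4.6.0.0.median.nc'

# cache_type picks the read-buffering strategy ('none', 'readahead', 'bytes', ...);
# block_size controls how much is fetched per underlying request.
of = fsspec.open(url, mode='rb', cache_type='readahead', block_size=512 * 1024)
with of as f:
    signature = f.read(8)   # only the covering block is actually requested
```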
So some relevant experiments from Python land. Running this code:

```python
%%time
import fsspec      # imports added here for completeness
import h5netcdf

fobj = fsspec.open(
    'https://coawst-public.s3-us-west-2.amazonaws.com/testing/HadCRUT.4.6.0.0.median.nc',
    cache_type='none')
with fobj as f:
    nc = h5netcdf.File(f, mode='r')
    a = nc.variables['temperature_anomaly'][:]
```

which uses the no-cache option.
'readahead':
'bytes':
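The timing numbers themselves were lost in this extract; a sketch of how the comparison across cache types can be reproduced, using the same URL and variable as the code above:

```python
import time
import fsspec
import h5netcdf

url = 'https://coawst-public.s3-us-west-2.amazonaws.com/testing/HadCRUT.4.6.0.0.median.nc'

for cache_type in ('none', 'readahead', 'bytes'):
    start = time.perf_counter()
    with fsspec.open(url, mode='rb', cache_type=cache_type) as f:
        nc = h5netcdf.File(f, mode='r')
        data = nc.variables['temperature_anomaly'][:]
    print(f"{cache_type:>10}: {time.perf_counter() - start:.1f} s")
```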
@martindurant I'm not sure if you'll find it surprising (but I did) that
That does sound wrong, worth investigating.
Interesting; so I would say that readahead being significantly faster
Well, there's more locality in this file, which only has a couple of unlimited variables. I'd bet the locality is much worse in a file with many unlimited variables--but then again it depends on your access pattern. What I will say is that I can get much better performance just by using block sizes of even 128 kB or 512 kB. Given that those block sizes are cheap to access even on a mediocre cell phone connection, it'd be worth making a moderate blocksize the default if there's some way to do that. @martindurant Would you like me to open an issue over at intake/filesystem_spec or are you already planning on taking care of that?
You are welcome to propose that, but there are counterarguments: actually loading longer blocks of data is the norm for most formats (indeed, I assume this is true for HDF-stored data chunks too). Pushing the connection overhead fraction down is important! Also note that (obviously) most big-data, high-performance work happens in the cloud, where latency is indeed better, but bandwidth is much better. Would it be reasonable to have different file objects for metadata and data, with different caching and block sizes? zarr allows this, for instance.
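One way to read that suggestion, as a hedged sketch with today's fsspec: two independently configured file objects over the same URL, one tuned for scattered metadata reads and one for bulk data reads (the actual metadata/data split would still have to happen inside the HDF5/netCDF reading layer):

```python
import fsspec

url = 'https://coawst-public.s3-us-west-2.amazonaws.com/testing/HadCRUT.4.6.0.0.median.nc'

# Small blocks and a byte cache for scattered metadata reads...
meta_f = fsspec.open(url, mode='rb', cache_type='bytes', block_size=64 * 1024).open()
# ...large read-ahead blocks for the bulk data reads.
data_f = fsspec.open(url, mode='rb', cache_type='readahead', block_size=4 * 2**20).open()

try:
    signature = meta_f.read(8)   # e.g. peeking at the HDF5 superblock signature
    # data_f would be handed to whatever layer does the heavy variable reads
finally:
    meta_f.close()
    data_f.close()
```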
Fix for blockcache: fsspec/filesystem_spec#420
Apologies @martindurant, I was unclear (prose isn't working so well for me today... then again, neither is code). The only issue I was going to open was regarding blockcache, but I see you've got that well in hand. My proposal was more of a minimal improvement to make (hopefully) to the netcdf-c library.
Understood, and sorted :)
re: Issue Unidata#1848 The existing Virtual File Driver built to support byte-range read-only file access is quite old. It turns out to be extremely slow (reason unknown at the moment). Starting with HDF5 1.10.6, the HDF5 library has its own version of such a file driver. The HDF5 developers have better knowledge about building such a driver and what incantations are needed to get good performance. This PR modifies the byte-range code in hdf5open.c so that if the HDF5 file driver is available, then it is used in preference to the one written by the Netcdf group. Misc. Other Changes: 1. Moved all of nc4print code to ncdump to keep appveyor quiet.
I have checked in a PR (#1849) that
So if I try to get the data for a variable using byte-range requests, it performs pretty terribly:
During this transfer, my system showed it was sustaining about 180 kB/s. If I instead just download the whole file:
I get that there's some overhead with byte range requests, but a difference of 40x is so slow as to make the byte-range support useless.
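For context, the access pattern being timed looks roughly like the following (a hedged sketch; the exact commands and numbers from the original report are not preserved in this extract). netcdf-c enables byte-range access when the URL carries a #mode=bytes fragment, and netCDF4-python passes such URLs straight through:

```python
import netCDF4

url = ('https://coawst-public.s3-us-west-2.amazonaws.com/testing/'
       'HadCRUT.4.6.0.0.median.nc#mode=bytes')

nc = netCDF4.Dataset(url)                       # opened via netcdf-c's byte-range support
data = nc.variables['temperature_anomaly'][:]   # each chunk read becomes an HTTP range request
nc.close()
```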
Do we know what chunk size is being requested?
See Unidata/netcdf4-python#1043 for the original issue that provoked this.