Bulk write speed using zarr and GCS #619
@rabernat I noticed that you have a PR (#252) which implements a version of GCSStore using the google storage API! After noticing the performance difference, I took the important bits from that PR to write this GCSStore:

```python
from collections.abc import MutableMapping

from google.cloud import storage


class GCSStore(MutableMapping):
    def __init__(self, bucket, prefix):
        # project() and credentials() are helpers from my own project
        client = storage.Client(project=project(), credentials=credentials())
        self.bucket = storage.Bucket(client, bucket)
        if not prefix.endswith("/"):
            prefix = prefix + "/"
        self.prefix = prefix

    def _full_name(self, key):
        return f"{self.prefix}{key}"

    def __contains__(self, key):
        name = self._full_name(key)
        return self.bucket.get_blob(name) is not None

    def __iter__(self):
        blobs = self.bucket.list_blobs(prefix=self.prefix)
        for blob in blobs:
            yield blob.name[len(self.prefix):]

    def __setitem__(self, key, value):
        name = self._full_name(key)
        # hsize("1gb") is a helper that converts "1gb" to a byte count
        blob = self.bucket.blob(name, chunk_size=hsize("1gb"))
        blob.upload_from_string(value, content_type="application/octet-stream")

    def __len__(self):
        iterator = self.bucket.list_blobs(prefix=self.prefix)
        return sum(1 for _ in iterator)

    def __getitem__(self, key):
        name = self._full_name(key)
        blob = self.bucket.get_blob(name)
        if not blob:
            raise KeyError(f"Blob {name} not found")
        return blob.download_as_bytes()

    def __delitem__(self, key):
        name = self._full_name(key)
        self.bucket.delete_blob(name)
```

Running the code with this change:

```python
store = GCSStore(str(_bucketname), "zarr-test/data.zarr")
root = zarr.group(store=store)
```

I get the following timings:
Since this is a significant speedup over gcsfs (which is taking ~21 s), I am going to continue using this in my project. (Note that I am passing 1 GB as the chunk size to my blob to get the best throughput for my data.)

@martindurant I spent a significant part of the afternoon going over the code in gcsfs (and fsspec), and I think some of the abstractions there are fantastic. I think the majority of the time is spent in this call (not sure if there are knobs to make this faster).
I don't think this call is affected by
Whilst it can be good to have a specific and simple implementation that does the job well, I would prefer that we backport the performance improvement to gcsfs, and I hope you can help with this, @skgbanga. Two points to note: zarr writes to the backend via `__setitem__`, which currently calls open/write but ought to call pipe, as follows:

```diff
--- a/fsspec/mapping.py
+++ b/fsspec/mapping.py
@@ -148,8 +148,7 @@ class FSMap(MutableMapping):
         """Store value in key"""
         key = self._key_to_str(key)
         self.fs.mkdirs(self.fs._parent(key), exist_ok=True)
-        with self.fs.open(key, "wb") as f:
-            f.write(value)
+        self.fs.pipe_file(key, value)
 
     def __iter__(self):
         return (self._str_to_key(x) for x in self.fs.find(self.root))
```

Secondly, see #606 for the application of concurrent/async methods to zarr - but that is read-only for now. The same thing should/will be done for writing too, and it would help you. Finally, it is a shame that GCS does not support chunkwise uploads - each new chunk can only start when the previous one is finished.
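To illustrate what the patch changes, here is a minimal sketch of the two write paths side by side (the project, bucket, and object names are placeholders; `pipe_file` is the fsspec method the patch switches to):

```python
import gcsfs

fs = gcsfs.GCSFileSystem(project="my-project")  # placeholder project
payload = b"\x00" * (16 * 2**20)                # 16 MB of dummy data

# Buffered path used by the unpatched FSMap.__setitem__:
# open() sets up an upload buffer and write() feeds it.
with fs.open("my-bucket/zarr-test/data.zarr/0.0", "wb") as f:
    f.write(payload)

# One-shot path after the patch: pipe_file hands the whole
# value to the backend in a single call.
fs.pipe_file("my-bucket/zarr-test/data.zarr/0.0", payload)
```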
btw: the first thing
@martindurant Thanks for the reply. I will take a more detailed look, but when I was stepping through all the code, one of the first things I wanted to do was make fsspec's default block size 100 MB. With the default size, I was getting numbers of ~14 s with GCSStore.
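For what it's worth, changing the block size should not require patching fsspec: gcsfs exposes a `block_size` keyword on the filesystem constructor and on `open()` (a sketch; the bucket name is a placeholder, and 100 MB is just the figure discussed above):

```python
import gcsfs

# Use a 100 MB block size for all files opened on this filesystem instance
fs = gcsfs.GCSFileSystem(project="my-project", block_size=100 * 2**20)

# ...or override it for a single file
with fs.open("my-bucket/zarr-test/chunk.bin", "wb", block_size=100 * 2**20) as f:
    f.write(b"\x00" * (100 * 2**20))
```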
This suggests that it is not memory copies that are at fault, at least not within the gcsfs code.
Yeah, I was pretty sure it was not memory copies; that's why I mentioned this:

I have a naive question though. Is there any reason you are not using Google Cloud
Originally, it was because the google code was so convoluted, a situation which has improved somewhat since. We find some features very valuable that can only be achieved with a separate implementation:

In addition, gcloud's dependencies are hard to manage, and we want as few direct hooks into their auth as possible. I cannot immediately find any references to asyncio having performance issues for streaming an upload.
I will do some more testing over the weekend.
So I added timing calls around this function:

```python
import time

now = time.time()
async with self.session.request(
    method=method,
    url=path,
    params=kwargs,
    json=jsonin,
    headers=headers,
    data=datain,
    timeout=self.requests_timeout,
) as r:
    spent = time.time() - now
    print("Time spent in request call is", spent)
```

and I can say that out of ~21 seconds, more than 18 s are spent in this function. Unfortunately, this is the first time I am looking at async stuff, so I don't think I will be able to help further :( I think it might just come down to how the two methods are doing their network calls. For the google storage API, I noticed this via strace:
For gcsfs, I noticed this (doing strace on this method is slightly insidious because it uses threading):
No clue why asyncio's HTTP request method is calling epoll_wait every time it sends some data.
I don't think there's a way for me to delve into the innards of aiohttp to change the network layer :| I may have a closer look at the gcloud code - I'm pretty surprised if it's using a nonstandard transport. I do notice that the gcsfs writes are apparently larger, so maybe the wait happens in the gcloud stack too, just after many more writes - and the relative sizes of the network buffers might be important here. Did you try after replacement of the
hey @martindurant, sorry for the late reply.

I did notice that. Those writes are even bigger than the 65K TCP window size, so there is definitely heavy flow control going on (I think I will do a TCP capture to verify that). I will try to do that over the weekend.
Can we consider this issue resolved? It has become quite broad...
Please consider this closed from my end. @martindurant Unfortunately, I haven't been able to work on your stuff.
This question is similar to #595, but instead of streaming, it concerns the zarr array write path using GCS as the store backend.
Consider the following code:
I am creating a 1G numpy array and assigning that to a zarr array (which has the same shape and only 1 chunk of the same size).
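The original snippet did not survive this copy of the thread; a minimal sketch of the setup described above might look like this (the project and bucket names are placeholders, and the exact array shape is an assumption):

```python
import time

import gcsfs
import numpy as np
import zarr

fs = gcsfs.GCSFileSystem(project="my-project")  # placeholder project
store = fs.get_mapper("my-bucket/data.zarr")    # placeholder bucket

data = np.random.randint(0, 255, size=2**30, dtype=np.uint8)  # 1 GB array

# One chunk covering the whole array, so the write is a single upload
z = zarr.open(store, mode="w", shape=data.shape,
              chunks=data.shape, dtype=data.dtype)

start = time.time()
z[:] = data  # bulk write through gcsfs
print(f"zarr write via gcsfs took {time.time() - start:.1f}s")
```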
When I run the above code, I get the following numbers:
Compare this time with manually uploading the zarr file (created with store as DirectoryStore) directly:
This is surprising.
To make the comparison more 'fair', I used google's python API to upload this chunk of data using the following code:
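Again, the exact snippet is missing from this copy; based on the GCSStore code earlier in the thread, it was presumably something like the following (bucket and object names are placeholders):

```python
import time

import numpy as np
from google.cloud import storage

data = np.random.randint(0, 255, size=2**30, dtype=np.uint8)  # same 1 GB payload

client = storage.Client()                                # default credentials
bucket = client.bucket("my-bucket")                      # placeholder bucket
blob = bucket.blob("data.zarr/0.0", chunk_size=1024**3)  # 1 GB transfer chunk

start = time.time()
blob.upload_from_string(data.tobytes(),
                        content_type="application/octet-stream")
print(f"google-cloud-storage upload took {time.time() - start:.1f}s")
```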
Running the above gives the following result:
To summarize:
@rabernat Did you ever observe something similar?