WIP: google cloud storage class #252
Conversation
Have you tried using this in your current setup already? Does it yield the speedup that you would expect? |
What is the recommended way to test the code locally? Is there a dummy server one can use or do we have to mock the google.cloud.storage Python functions? |
Thanks Ryan, personally I'm very happy to entertain this. We're actually giving some serious thought to setting up a pangeo-like environment in the cloud to support our genomics work, so having an efficient interface to cloud object storage will become important for us I expect.
There is also precedent for moving storage classes into zarr to allow for optimisation and increased convenience. E.g., both the DirectoryStore and ZipStore classes replicate a lot of what was already provided in the zict package, but I ended up re-implementing because there were optimisations I wanted to add and it was easier to do that within zarr.
I just took a browse through the gcsfs code and what you have implemented is a lot leaner, which is very nice. Out of interest, do you see a performance gain over using gcsfs, and do you know what is giving the gain?
$64,000 question: if we did merge this, would you be willing to join as a core developer, with no commitment other than to maintain the gcs-related code? |
So far, in my simple benchmarks, the performance is exactly the same as gcsfs. However, my hope is that it will be easier to debug when problems occur (see e.g. pangeo-data/pangeo#166, fsspec/gcsfs#91, fsspec/gcsfs#90, fsspec/gcsfs#89, fsspec/gcsfs#61). Also, I expect that having tighter coupling between zarr and gcs will allow us to optimize performance more going forward.
I don't know about this. Finding a way to mock the google.cloud.storage would be useful. However, running the test suite on the real google service also has major advantages.
Sure. I would like to do a lot more testing before moving forward. In particular, I want to understand how this behaves in the context of distributed. Serializing this store is a bit tricky. This is pretty mature in gcsfs. |
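A minimal sketch of one common way to make such a store picklable for use with dask.distributed, assuming the bucket_name/prefix/client_kwargs attributes used in this PR; initialize_bucket is a hypothetical helper that recreates the client, so treat this as an illustration rather than the PR's actual code:
def __getstate__(self):
    # drop the live client/bucket: authenticated HTTP sessions generally don't pickle
    state = self.__dict__.copy()
    state.pop('bucket', None)
    return state
def __setstate__(self, state):
    self.__dict__.update(state)
    # rebuild the client and bucket on the worker after unpickling
    self.initialize_bucket()  # hypothetical helper wrapping storage.Client(...)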
On Fri, 30 Mar 2018, 18:43 Ryan Abernathey, ***@***.***> wrote:
> So far, in my simple benchmarks, the performance is exactly the same as gcsfs. However, my hope is that it will be easier to debug when problems occur (see e.g. pangeo-data/pangeo#166, fsspec/gcsfs#91, fsspec/gcsfs#90, fsspec/gcsfs#89, fsspec/gcsfs#61). Also, I expect that having tighter coupling between zarr and gcs will allow us to optimize performance more going forward.
Scanning those issues it looks like there's a lot still to be understood around authn/authz and performance. If having GCS support in Zarr helps unpick those issues then I'm all for it, even if it's only a temporary thing and the resolutions ultimately get implemented back in gcsfs.
> I don't know about this. Finding a way to mock the google.cloud.storage would be useful. However, running the test suite on the *real* google service also has major advantages.
I guess CI services would need to run tests against a mocked google.cloud.storage; I don't know of any way of testing against the real service from within CI that wouldn't be costing someone money, unless there is a free CI service that runs inside Google cloud. But it would be essential to verify somehow that the full test suite runs against the real service.
> Sure. I would like to do a lot more testing before moving forward. In particular, I want to understand how this behaves in the context of distributed. Serializing this store is a bit tricky. This is pretty mature in gcsfs.
Sounds good.
|
It seems like this would be a boon for Google, as they would be making it easier for developers to build up an ecosystem around their service, thus bringing in more customers reliant on that ecosystem. Makes me wonder if it really does exist and it will just take a little searching to find it. Even if they don't advertise it, we can probably ask and they may give us some free instance for testing. |
Reading https://cloudplatform.googleblog.com/2018/03/optimizing-your-Cloud-Storage-performance-Google-Cloud-Performance-Atlas.html?m=1, it says that uploads are load balanced via the path of the object. If objects have similar paths (as chunks within a Zarr array would) then load balancing is not good. In this case faster parallel uploads can be achieved by modifying paths, e.g., prepending a hash. I've heard this is the same for S3 also.
Might be worth trying this, i.e., modifying getitem and setitem so they transform keys by prepending a hash before interacting with GCS (a rough sketch follows below).
There would be some consequences to prepending a hash, in terms of making other operations less convenient or efficient, but maybe worth exploring.
|
I'll be happy to help here if there's anything I can do. |
Thanks Martin.
|
I tried installing with this commit using:
which worked, and installed as:
However when I tried running ds.to_zarr(), I got the following error:
Any ideas on how to fix this? Thanks! |
Probably need to merge with latest master. |
Okay thanks. I will go ahead and fork from @rabernat and merge master to test this. |
Actually the right way to do this was to fork upstream and then add @rabernat's repo as a remote and merge his gcs_store branch. Git wranglin! |
Now when I try ds.to_zarr('pangeo-asdf2'), I get:
|
@tjcrone: your error is unrelated to this PR. In order to use this store, you need to create a GCSStore explicitly and pass it to to_zarr:
gcsstore = zarr.storage.GCSStore('zarr-test', 'zarr-gcs-store', client_kwargs={'project': 'pangeo-181919'})
ds.to_zarr(gcsstore)
Note that the branch is highly experimental, work in progress, and consequently highly error prone. |
Thanks @rabernat. I'll make sure to test this all locally inside the notebook server for now and not involve workers. When trying your suggestion (swapping in my own project id), I got the following error:
I feel like the project was passed. |
zarr/storage.py:
self.bucket_name = bucket_name
self.prefix = normalize_storage_path(prefix)
self.client_kwargs = {}
@tjcrone I guess because client_kwargs is not actually initialized properly!
Thanks @rabernat. This fix worked! Nice. The only comment I have at this stage is that when a group within the bucket already exists we get:
ValueError: path '' contains a group
which might be better if it was more explanatory, and also better if it pulled the path into the error. |
If path is an empty string, that means the root path within the store. That isn't obvious and could be improved.
Zarr errs on the side of caution and will not overwrite an existing group or array by default. You can force it to overwrite by passing overwrite=True to most creation functions, as in the sketch below.
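For example, a quick sketch using a local DirectoryStore as a stand-in for the GCS-backed store discussed here:
import zarr
store = zarr.DirectoryStore('example.zarr')
root = zarr.group(store=store)                   # first run creates the group
# zarr.group(store=store)                        # second run raises ValueError: path '' contains a group
root = zarr.group(store=store, overwrite=True)   # force zarr to overwrite the existing group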
|
I think it is worth noting that for this particular extension, when the path is not empty, it shows up as empty in the error:
ValueError: path '' contains a group
It is also worth noting that an indication that the path "already" contains a group would go a long way to making this error more helpful. Or saying that the group "already exists" also works. |
On Sat, 14 Apr 2018, 01:19 Tim Crone, ***@***.***> wrote:
> I think it is worth noting that for this particular extension, when the path is not empty, it shows up as empty in the error: ValueError: path '' contains a group
Ah OK. Would you mind posting the code that generates this error, from the point where you create the GCSStore?
> It is also worth noting that an indication that the path "already" contains a group would go a long way to making this error more helpful. Or saying that the group "already exists" also works.
Yes, that would be better.
|
Sorry for not including an example. This code works fine when it runs the first time, as long as the rte-pangeo-data bucket exists (it will not create a bucket):
But when run a second time it gives:
|
In case anyone wants to test this with multiple Dask workers, it is possible to create an OAuth credentials object using the following:
import google.auth
credentials, project = google.auth.default()
and then passing this credentials object when creating the gcsstore:
gcsstore = zarr.storage.GCSStore('rte-pangeo-data', 'test1', client_kwargs={'project': 'pangeo-198314', 'credentials': credentials})
For this to work as written it would be necessary to authenticate using the Google Cloud SDK:
gcloud auth application-default login --no-launch-browser
So far in all of my testing this code is working great! |
Can credentials be pickled?
|
Interesting question. How would I test this? Note that this credentials object is secret and allows a lot of access to my own GCP resources. I'm not sure I would want to pickle and distribute it. There are ways of creating credentials with reduced permissions: https://google-auth.readthedocs.io/en/latest/user-guide.html. |
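One quick, local way to check (a sketch; nothing leaves the machine, but the credentials object still holds secrets):
import pickle
import google.auth

credentials, project = google.auth.default()
try:
    restored = pickle.loads(pickle.dumps(credentials))
    print('credentials pickled OK:', type(restored).__name__)
except Exception as exc:  # e.g. TypeError or PicklingError if not picklable
    print('credentials are not picklable:', exc)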
FWIW opened issue ( https://github.com/zarr-developers/zarr/issues/290 ) to discuss cloud support generally. Please feel free to share anything relevant there. |
A couple of comments from experiences trying this out and looking at performance for retrieving small objects.
`default credentials <https://cloud.google.com/sdk/gcloud/reference/auth/application-default/login>`_.
"""

def __init__(self, bucket_name, prefix=None, client_kwargs={}):
Might be worth adding an option to use an anonymous client. E.g., add an anonymous=False keyword argument, then make use of storage.Client.create_anonymous_client() when it comes to creating the client if the user has provided anonymous=True.
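A possible sketch of that option (illustrative only, not the PR's code; create_anonymous_client() is the documented way to get an unauthenticated client for public buckets):
from google.cloud import storage

def make_client(client_kwargs=None, anonymous=False):
    # anonymous=True skips credentials entirely, which is enough for public buckets
    if anonymous:
        return storage.Client.create_anonymous_client()
    return storage.Client(**(client_kwargs or {}))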
from google.cloud import storage
# run `gcloud auth application-default login` from shell
client = storage.Client(**self.client_kwargs)
self.bucket = client.get_bucket(self.bucket_name)
Note that it's also possible to do:
self.bucket = storage.Bucket(client, name=self.bucket_name)
...which involves no network communication. Not sure this is a good idea in general, as we may want to retrieve the bucket info, but just mentioning it.
|
def __getitem__(self, key):
    blob_name = self.full_path(key)
    blob = self.bucket.get_blob(blob_name)
An alternative here is to do:
from google.cloud import storage
blob = storage.Blob(blob_name, self.bucket)
...which involves less network communication (profiling shows the number of calls to method 'read' of '_ssl._SSLSocket' objects goes from 3 down to 1) and reduces the time to retrieve small objects by around 50%.
If this change was made, some rethinking of error handling may be needed, as the point at which a non-existing blob is detected might change.
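A sketch of how __getitem__ might look with that change (full_path and self.bucket are the attributes already used in this PR); note that error handling moves to the download call:
from google.cloud import storage
from google.cloud.exceptions import NotFound

def __getitem__(self, key):
    blob_name = self.full_path(key)
    # build a Blob reference locally rather than calling get_blob(), avoiding an extra metadata request
    blob = storage.Blob(blob_name, self.bucket)
    try:
        return blob.download_as_string()
    except NotFound:
        # a missing blob is only detected here, at download time
        raise KeyError(key)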
@rabernat just to say that, FWIW, I think this is worth pursuing. I know @martindurant has just added in some improvements to gcsfs to reduce latency and use checksums to verify content, which is great, and so both the performance and error reporting issues that have come up with gcsfs may be resolved. But while we are still gaining experience with GCS, also having a mapping implementation based on the google cloud client library I think would be valuable, so we can compare performance and see if issues replicate across both this and gcsfs or not. Obviously contingent on you (or someone else) having time and inclination, but I'm currently also starting to use google cloud storage and so would be happy to chip in. |
Great! I agree it is a good way forward. Moving forward, I think a good question (related to #290) is whether we want to have a generic base class for object stores and then extend that for GCS, S3, ABS, etc. Maybe this is overcomplicating things though... The reality is that I am teaching two classes this semester and am unlikely to have the time to dig into this deeply any time soon. You, @martindurant, @tjcrone, and @jakirkham are all clearly qualified to pick up where I left off here. |
Having a mapping class over generic file-system implementations was one of the points of fsspec, which will, of course, look rather familiar. |
Thanks @martindurant, FWIW I absolutely see the virtues of the fsspec approach.
For where we are right now, from a purely pragmatic point of view, I think it would be useful to have both gcsfs and the mapping implementation in this PR. We can compare performance, see if issues replicate with both, and compare how much code is needed and how it's organised. Also we have some very helpful contacts at Google genomics who are standing by to help with any issues that come up as we start to test at scale, so having a code path that uses Google's own client library will be useful I think.
So I'm not trying to champion one approach over the other. Given that this PR seems very close to being complete, I think it's worth finishing it and releasing it as part of the Zarr experimental API. Long-term it might get deprecated in favour of gcsfs.
All that said, my own time to work on Zarr is going to be quite limited for the rest of the year, so I don't know yet if I'll have time to work on this.
|
Btw it looks like there is no local emulator for GCS, and this is an open issue for the google cloud Python client library: https://github.com/googleapis/google-cloud-python/issues/4840 (see also googleapis/google-cloud-python#4897). |
@alimanfoo - well aware of this, and had to jump through a number of uncomfortable hoops to test gcsfs using vcrpy (which can record and mock any urllib calls, but not easily). moto3 and azurite really help for s3 and azure-datalake/blob in this respect. |
@martindurant thanks, yes one of your comments was what prompted me to do a bit of digging. I guess it might be worth trying to put a bit of pressure on Google folks, it doesn't look like they've prioritised this highly. |
Is this still interesting, @rabernat? FWIW I'd be +1 on getting this integrated. Probably still some things to address before merging though. |
Btw I was recently looking at the API docs for the Google cloud storage SDK and I believe there is support for local emulation now, which could make unit testing much easier.
|
Where did you see that? It would make testing gcsfs much easier! |
I knew I should have posted a link. I thought it would be easy to find again but I can't seem to find anything now, must have dreamt it! Sorry for the confusion.
|
Before adding any new storage classes, I think we should address #414 and #301. The storage module is already super long and, at this point, pretty random. We have direct support for Azure and Redis but not S3? I basically stopped working on this because I wanted to see how filesystemspec evolved. Maybe now is the time to assess whether we want to continue implementing more storage classes or add filesystemspec as a dependency. Someone also recently posted a link to another key/value abstraction we could look at, but I can't find the link.
This is also tied to the spec discussion. As we start dumping zarr into more and more stores, how do we ensure that these will be readable by all zarr implementations? My view is that we need more explicit specs for file and cloud storage that will allow other libraries to implement them in a compatible way. |
Should we discuss this explicitly at the next zarr meeting? I think a little effort should be spent integrating gcsfs, s3fs and adlfs (or blob) completely with fsspec to complete the picture. |
What follows are merely my opinions, please feel free to agree or disagree with them as you see fit.
Refactoring out the existing storage layers into an independent library is probably already a worthwhile endeavor. There seem to be other people out there looking for or trying similar things, so it would be useful to engage them at that level. It will probably also make it easier to handle optional dependencies, as was needed recently for Azure. S3 already works without any effort on our end. Personally I'm not convinced that Zarr needs to use fsspec directly.
Are you thinking of ...?
It's reasonable to be concerned about spec impact, though I think the key-value stores are the least of our concerns as they are already well-articulated and unlikely to change (unless we start adding links and references). The concerning part from the spec point of view is how we handle translations of data to key-value pairs in these stores. For example, object types are something that is already an obstacle to compatibility that we will need to figure out. There are a few other examples as well. |
S3Map and GCSMap are currently used for accessing zarr (which is why I was not convinced of the need for a google mapping implementation or, in fact, azure blob). Indeed, some access patterns were changed in gcsfs specifically because of zarr. This wholly substandard function exists for passing a URL to Dask and picking the right mapper for zarr loading/saving. Taking that out of the hands of Dask without replicating it in zarr or elsewhere is exactly the sort of thing fsspec is for. It would always be optional, though, just as the current file-system backends are optional: the user would always be allowed to use their own mapping-compatible store. |
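For reference, the gcsfs-based route mentioned here looks roughly like this (bucket and project names are placeholders):
import gcsfs
import zarr

gcs = gcsfs.GCSFileSystem(project='my-project', token='google_default')
store = gcsfs.GCSMap('my-bucket/my-dataset.zarr', gcs=gcs, check=False)
root = zarr.open_group(store=store, mode='a')  # read/write a zarr group backed by GCS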
@rabernat, is there anything that needs resurrecting from this? or safe to close? |
Definitely safe to close. Gcsfs meets all our needs here. |
First, apologies for submitting an unsolicited pull request. I know that is against the contributor guidelines. I thought this idea would be easier to discuss with a concrete implementation to look at.
In my highly opinionated view, the killer feature of zarr is its ability to efficiently store array data in cloud storage. Currently, the recommended way to do this is via outside packages (e.g. s3fs, gcsfs), which provide a MutableMapping that zarr can store things in. In this PR, I have implemented an experimental google cloud storage class directly within zarr.
Why did I do this? Because, in the pangeo project, we are now making heavy use of the xarray -> zarr -> gcsfs -> cloud storage stack. I have come to the conclusion that a tighter coupling between zarr and gcs, via the google.cloud.storage API, may prove advantageous. In addition to performance benefits and easier debugging, I think there are social advantages to having cloud storage as a first-class part of zarr. Lots of people want to store arrays in the cloud, and if zarr can provide this capability more natively, it could increase adoption.
Thoughts?
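To illustrate the MutableMapping contract that both approaches implement, note that a plain dict already works as a zarr store (a toy, non-GCS example):
import numpy as np
import zarr

store = {}  # any MutableMapping can back a zarr array
z = zarr.open_array(store=store, mode='w', shape=(100, 100), chunks=(10, 10), dtype='f8')
z[:] = np.random.random((100, 100))
print(sorted(store)[:3])  # keys like '.zarray', '0.0', '0.1', ...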
These tests require GCP credentials and the google cloud storage package. It is possible to add encrypted credentials to travis, but I haven't done that yet. Tests are (mostly) working locally for me.
TODO:
- Unit tests and doctests pass under Python 3.6 (tox -e py36 or pytest -v --doctest-modules zarr)
- Unit tests pass under Python 2.7 (tox -e py27 or pytest -v zarr)
- PEP8 checks pass (tox -e py36 or flake8 --max-line-length=100 zarr)
- Doctests in the tutorial pass (tox -e py36 or python -m doctest -o NORMALIZE_WHITESPACE -o ELLIPSIS docs/tutorial.rst)
- Docs build locally (tox -e docs)