Add an API method to give us a streaming file object #29
Comments
@dmsolow Hmm, |
I don't think so. The situation is that it's often useful to start processing a file as it downloads instead of waiting until it's finished. For example, if there's a 1 GB CSV file in Google Storage, it should be possible to parse it line by line as it's downloaded. It's fairly common for network libraries to offer this kind of functionality. For example, in the standard library:
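(The original snippet was not preserved in this thread; here is a minimal standard-library sketch of the idea, with a hypothetical URL:)

```python
import csv
import io
from urllib.request import urlopen

# The HTTP response is a file-like object; TextIOWrapper decodes it
# incrementally, so csv.reader sees rows as the bytes arrive.
response = urlopen("https://example.com/big.csv")  # hypothetical URL
for row in csv.reader(io.TextIOWrapper(response, encoding="utf-8")):
    print(row)
```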
This parses the CSV as it's downloaded. I'd like to get the same functionality from Google Storage. If there's already a good way to do this with the current library, please let me know. |
Hmm, you'll need to have the "stream consumer" running in a separate thread / process to do much good. You can make this work using Python's pipes; running a script along those lines shows the reader consuming chunks while the download is still in progress:

```
$ bin/python pipe_test.py
reader: start
reader: read one chunk
reader: read one chunk
...
reader: read one chunk
reader: read 800000 bytes
```
|
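The pipe_test.py script itself was not preserved in this thread; the following is a minimal reconstruction of the idea, assuming os.pipe plus a reader thread (the chunk size and byte count are assumptions chosen to match the output above):

```python
import os
import threading

CHUNK = 8000     # assumed chunk size
N_CHUNKS = 100   # 100 * 8000 = 800000 bytes, matching the output above

def reader(fd):
    print("reader: start")
    total = 0
    while True:
        data = os.read(fd, CHUNK)
        if not data:  # writer closed its end: EOF
            break
        total += len(data)
        print("reader: read one chunk")
    os.close(fd)
    print("reader: read %d bytes" % total)

read_fd, write_fd = os.pipe()
thread = threading.Thread(target=reader, args=(read_fd,))
thread.start()

# The writer side stands in for blob.download_to_file(...): it pushes
# bytes into the pipe while the reader thread consumes them concurrently.
with os.fdopen(write_fd, "wb") as pipe_in:
    for _ in range(N_CHUNKS):
        pipe_in.write(b"x" * CHUNK)

thread.join()
```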
Using a separate thread kind of feels like a hack to me, but it is surely one way to do it. I think the ability to do this without extra threads would be widely useful, but I don't know how hard it would be to implement. |
OK, looking at the underlying implementation, `download_to_file` only ever calls `write()` on the object it is given. You could therefore pass in an instance of your own class which wraps the underlying stream, e.g.:

```python
from google.cloud.storage import Client

class ChunkParser(object):
    """Tees each downloaded chunk into a file and a parsing hook."""

    def __init__(self, fileobj):
        self._fileobj = fileobj

    def write(self, chunk):
        self._fileobj.write(chunk)
        self._do_something_with(chunk)

    def _do_something_with(self, chunk):
        pass  # parse / process the chunk here

client = Client()
bucket = client.get_bucket('my_bucket_name')
blob = bucket.blob('my_blob.xml')

with open('my_blob.xml', 'wb') as blob_file:
    parser = ChunkParser(blob_file)
    blob.download_to_file(parser)
```
|
This was requested many times but was at some point turned down (googleapis/google-cloud-python#3903). As an alternative, one can use the … |
It's a shame that this was turned down. It's a feature that every Python dev is going to expect from a library like this, as evidenced by the fact that it keeps coming up. |
Unfortunately this doesn't work with uploading streams. Are there known workarounds? |
@akuzminsky The line you've linked to is in the implementation of …

@dmsolow Does my file-emulating wrapper class solution work for you? |
@tseaver No. I would like something that is a "file-like object". This means something that supports standard Python `io` methods like `read()`. |
I was really surprised to see that not only is this feature not available, but it has also been brought up and closed in the past. It seems like an obvious and important feature to have. Fortunately, … But … |
@thnee you should check back, |
The lack of a simple streaming interface is a challenge when implementing a cloud function that reads/writes large files. I need the ability to read an object in from Cloud Storage, manipulate it, and write it out to another object. Since the only filestore available to GCF is /tmp, which lives in the function's memory space, you are limited to files of less than 2 GB. |
Well, if this new method is so much wanted, I'd propose a solution: a class that inherits one of the standard `io` base classes and translates reads into ranged downloads (see the sketch below). Looks like it'll work, because (as far as I know) most file methods work through a small core of read primitives. |
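A minimal sketch of such a wrapper, assuming `io.RawIOBase` and the library's ranged `Blob.download_as_bytes(start=..., end=...)` call; the class name and the bucket/object names are illustrative:

```python
import io
from google.cloud.storage import Client

class BlobRawReader(io.RawIOBase):
    """Read-only file-like object that fetches a blob in byte ranges."""

    def __init__(self, blob):
        self._blob = blob
        self._pos = 0
        self._size = blob.size  # requires metadata, e.g. via get_blob()

    def readable(self):
        return True

    def readinto(self, buf):
        if self._pos >= self._size:
            return 0  # EOF
        end = min(self._pos + len(buf), self._size) - 1  # inclusive end
        data = self._blob.download_as_bytes(start=self._pos, end=end)
        buf[: len(data)] = data
        self._pos += len(data)
        return len(data)

client = Client()
blob = client.bucket("my_bucket").get_blob("big.csv")  # populates blob.size
stream = io.BufferedReader(BlobRawReader(blob), buffer_size=1024 * 1024)
for line in io.TextIOWrapper(stream, encoding="utf-8"):
    print(line)  # processed as it arrives, never fully buffered
```

Because `io.RawIOBase` supplies `read()` on top of `readinto()`, wrapping the raw reader in `io.BufferedReader` and `io.TextIOWrapper` yields the standard file methods for free.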
TensorFlow has an implementation that provides a file-like object for GCS blobs: https://www.tensorflow.org/api_docs/python/tf/io/gfile/GFile. I'm not sure whether it actually streams, though. |
smart_open now has support for streaming files to/from GCS.

```python
from smart_open import open

# stream from GCS
for line in open('gs://my_bucket/my_file.txt'):
    print(line)

# stream content *into* GCS (write mode):
with open('gs://my_bucket/my_file.txt', 'wb') as fout:
    fout.write(b'hello world')
```
|
@petedannemann great work - any ETA for an official release? |
@rocketbitz No idea, but for now you could install from GitHub:

```
pip install git+https://github.com/RaRe-Technologies/smart_open
```
|
I've implemented gs-chunked-io to satisfy my own needs for GS read/write streams. It's designed to complement the Google Python API. |
- annoyingly, GCS doesn't support file-like objects: googleapis/python-storage#29
- use a small library for doing file-like object support for GCS: https://github.com/xbrianh/gs-chunked-io
Release 1.10 last night included GCS functionality |
Any update on this? |
It doesn't look like there's a way to get a streaming download from Google Storage in the Python API. We have `download_to_file`, `download_to_string`, and `download_to_filename`, but I don't see anything that returns a file-like object that can be streamed. This is a disadvantage for many file types which can usefully be processed as they download. Can a method like this be added?