Google Cloud Storage (GCS) #404
Conversation
Hi, thank you for your contribution.
I skimmed the reader part and left you some comments here and there. Please have a look.
My main concern is that you're loading the whole blob into memory when reading. That's what the download_as_string method appears to do. We definitely want to avoid this, because the remote object could be much larger than what can fit into memory.
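For illustration only, a minimal sketch of how ranged reads could avoid pulling the whole blob into memory; this is not the PR's implementation, and the helper name, chunk size, and bucket/blob names are placeholders:

# Hypothetical sketch: stream a GCS blob in fixed-size chunks via ranged
# downloads, so the whole object never has to fit into memory at once.
from google.cloud import storage

def iter_blob_chunks(bucket_name, blob_name, chunk_size=256 * 1024):
    client = storage.Client()
    blob = client.get_bucket(bucket_name).get_blob(blob_name)
    start = 0
    while start < blob.size:
        end = min(start + chunk_size, blob.size) - 1  # end offset is inclusive
        yield blob.download_as_string(start=start, end=end)
        start = end + 1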
Thanks for the review. A lot of these issues came from reusing code written for the s3 module, but it sounds like that's not the approach we want to take here. I'll work on addressing these issues, especially the one regarding memory.
That module evolved a bit over time, so you don't necessarily have to mimic everything it does. Try to work around the memory issue (from memory, the S3 module doesn't suffer from this problem, so that particular part may provide some hints for you) and let me know when you're ready for another round of review. Thanks!
@mpenkov @piskvorky I've been trying to refactor the GCS mocks to imitate how moto works and provide test coverage for them. Unfortunately, I think this will be a significantly larger endeavor than I originally thought. I'm also worried about their maintainability, and coupling them with the smart_open library doesn't seem like the right way of doing it. I see two options:
What do you think?
I agree that 1) is probably too much for any of us to take on right now. I think 2) is more of a step in the right direction. If the goal is to prevent regression, then your existing tests and mocks are sufficient. To prove correctness, we will have to use actual GCS buckets (integration testing).
@mpenkov I retract my last comment and think the mocks are actually the best testing methodology for this situation. I think I got caught up in scope creep from looking at all of the functionality of moto, but the mocks here only need functionality as it relates to the tests they're being used with. I added tests for the mocks and am ready for another review.
OK, I'll have a look within a week and let you know.
Looking very good. Left you some minor comments.
Also, you're missing tests for the top-level module (smart_open_lib). These don't have to be super-detailed, but you want to make sure that smart_open.open("gs://bucket/key") actually works.
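A hedged sketch of the kind of top-level test meant here, assuming the gs:// scheme is dispatched to a smart_open.gcs.open entry point (the patched attribute and the fake payload are assumptions, not necessarily the PR's final layout):

import io
import unittest
from unittest import mock

import smart_open

class OpenGcsTest(unittest.TestCase):
    # Patch the (assumed) GCS submodule entry point so no real GCS calls are made.
    @mock.patch('smart_open.gcs.open', return_value=io.BytesIO(b'hello'))
    def test_gs_uri_is_dispatched_to_gcs(self, mock_gcs_open):
        with smart_open.open('gs://bucket/key', 'rb') as fin:
            self.assertEqual(fin.read(), b'hello')
        self.assertTrue(mock_gcs_open.called)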
smart_open/gcs.py
Outdated
bucket_id,
blob_id,
mode,
buffering=DEFAULT_BUFFER_SIZE,
I think this parameter name-clashes with the same parameter to the io.open function. How are we handling this clash?
Changed to buffer_size to avoid this issue and mimic the s3 module.
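So the reader's signature presumably ends up along these lines; this is only an illustrative sketch, and the default value shown is an assumption:

DEFAULT_BUFFER_SIZE = 256 * 1024  # assumed value, for illustration only

def open(bucket_id, blob_id, mode, buffer_size=DEFAULT_BUFFER_SIZE):
    # buffer_size avoids clashing with the buffering argument of io.open
    ...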
smart_open/gcs.py
Outdated
#
def _upload_next_part(self):
    part_num = self._total_parts + 1
    logger.info("uploading part #%i, %i bytes (total %.3fGB)",
Please do this:
logger.info(
    param1,
    param2,
    param3,
)
This is called a hanging indent - this is the preferred indenting method for smart_open.
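Applied to the call in the diff above, that would look roughly like this; the arguments after the format string are placeholders, since the diff only shows the opening line:

logger.info(
    "uploading part #%i, %i bytes (total %.3fGB)",
    part_num,
    part_size,      # placeholder name
    total_size_gb,  # placeholder name
)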
Thanks @mpenkov! I addressed your comments and added those tests.
OK, looks good to me. Merging.
Thank you for your contribution!
Thanks for your help and patience!
How does it handle authentication? gcloud and gsutil are well-configured and working well but
@wookayin Can you please open a new ticket to describe your problem? Be sure to fill out the full issue template.
@wookayin it uses Google's google-cloud-storage package, so please refer to the google-cloud-python authentication documentation. The preferred method is to create a service account key file and set your GOOGLE_APPLICATION_CREDENTIALS environment variable to the path of that file.
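For example, something along these lines should work once the key file exists; the path and bucket/key names below are placeholders:

import os

# Point the Google client libraries at a service account key file (placeholder path).
# This must be set before the storage client is created.
os.environ['GOOGLE_APPLICATION_CREDENTIALS'] = '/path/to/service-account.json'

import smart_open

# smart_open picks up the default credentials via google-cloud-storage
with smart_open.open('gs://my-bucket/my-key.txt') as fin:
    print(fin.read())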
@petedannemann Is that documented anywhere, like the readme or howto.md? If not, it'd be good to add some documentation to point people in the right direction (the above comment is already a solid start).
It is not. I will work on a PR to add this missing documentation soon and consider adding some examples of how you can use other authentication methods by passing in your own |
Google Cloud Storage (GCS) Support
Motivation
Checklist
Before you create the PR, please make sure you have:
We will need to figure out how we plan to deal with integration testing on GCP. Would RaRe be willing to host the bucket? If so, we will need to update Travis to include those tests.
EDIT: Removed comment about the testing timeout issue. Since fixing the memory issue with reads, it has gone away.