
Google Cloud Storage (GCS) #404

Merged: 87 commits into piskvorky:master, Jan 24, 2020
Conversation

@petedannemann (Contributor) commented on Jan 5, 2020

Google Cloud Storage (GCS) Support

Motivation

Checklist

Before you create the PR, please make sure you have:

  • Picked a concise, informative and complete title
  • Clearly explained the motivation behind the PR
  • Linked to any existing issues that your PR will be solving
  • Included tests for any new functionality
  • Checked that all unit tests pass

We will need to figure out how we plan to handle integration testing on GCP. Would RaRe be willing to host the bucket? If so, we will need to update Travis to include those tests.

EDIT: Removed the comment about the testing timeout issue; it went away after the memory issue with reads was fixed.

@petedannemann changed the title from Gcs to WIP Gcs on Jan 5, 2020
@petedannemann changed the title from WIP Gcs to WIP Google Cloud Storage (GCS) on Jan 5, 2020
@mpenkov (Collaborator) left a comment

Hi, thank you for your contribution.

I skimmed the reader part and left you some comments here and there. Please have a look.

My main concern is that you're loading the whole blob into memory when reading. That's what the download_as_string method appears to do. We definitely want to avoid this, because the remote object could be much larger than what can fit into memory.
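
For illustration, a rough sketch of range-based reads that avoid buffering the whole blob, assuming google-cloud-storage's start/end byte-range support on download_as_string (the chunk size, bucket, and blob names are illustrative):

```python
from google.cloud import storage

CHUNK_SIZE = 256 * 1024  # illustrative read buffer

client = storage.Client()
blob = client.bucket('my-bucket').get_blob('my-blob')

position = 0
while position < blob.size:
    end = min(position + CHUNK_SIZE, blob.size) - 1  # end offset is inclusive
    chunk = blob.download_as_string(start=position, end=end)
    position += len(chunk)
    # hand ``chunk`` to the caller instead of holding the whole blob in memory
```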

(Inline review comments on README.rst, integration-tests/test_gcs.py, smart_open/gcs.py, and smart_open/smart_open_lib.py, all resolved.)
@petedannemann (Contributor, Author) replied:

> Hi, thank you for your contribution.
>
> I skimmed the reader part and left you some comments here and there. Please have a look.
>
> My main concern is that you're loading the whole blob into memory when reading. That's what the download_as_string method appears to do. We definitely want to avoid this, because the remote object could be much larger than what can fit into memory.

Thanks for the review. A lot of these issues came from reusing code patterned on how the s3 module was written, but it looks like that's not the approach we want to take here. I'll work on addressing these issues, especially the one regarding memory.

@mpenkov (Collaborator) commented on Jan 5, 2020

That module evolved a bit over time, so you don't necessarily have to mimic everything it does.

Try to work around the memory issue (from memory, the S3 module doesn't suffer from this problem, so that particular part may provide some hints for you) and let me know when you're ready for another round of review.

Thanks!

@petedannemann (Contributor, Author) commented on Jan 12, 2020

@mpenkov @piskvorky I've been trying to refactor the GCS mocks to imitate how moto works and provide test coverage for them. Unfortunately, I think this will be a significantly larger endeavor than I originally thought. I'm also worried about their maintainability, and coupling them to the smart_open library doesn't seem like the right approach. I see two options:

  1. Develop a Python package for mocking GCS that lives outside of smart_open (this will probably take me months and would be a risk for smart_open)
  2. Refactor all GCS tests to simply assert calls to the google.cloud.storage library, along the lines of the sketch after this list. These tests will not be as good as the current ones, but will still yield some testing benefit, primarily preventing regressions. I could also move some of the current unit tests to the integration tests file to improve coverage.
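
A rough sketch of what option 2 could look like, asserting calls against a patched client instead of emulating GCS behaviour (the module path and all names here are illustrative assumptions):

```python
from unittest import mock

from smart_open import gcs  # module added in this PR; path assumed

@mock.patch('google.cloud.storage.Client')
def test_open_fetches_blob(mock_client_cls):
    mock_bucket = mock_client_cls.return_value.bucket.return_value
    mock_bucket.get_blob.return_value.size = 0

    gcs.open('my-bucket', 'my-blob', 'rb')

    # assert calls to the google.cloud.storage library, nothing more
    mock_client_cls.return_value.bucket.assert_called_once_with('my-bucket')
    mock_bucket.get_blob.assert_called_once_with('my-blob')
```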

What do you think?

@mpenkov (Collaborator) commented on Jan 13, 2020

I agree that 1) is probably too much for any of us to take on right now.

I think 2) is more of a step in the right direction. If the goal is to prevent regression, then your existing tests and mocks are sufficient. To prove correctness, we will have to use actual GCS buckets (integration testing).

@petedannemann (Contributor, Author) commented

@mpenkov I retract my last comment; I now think the mocks are actually the best testing methodology for this situation. I think I got caught up in scope creep from looking at all of moto's functionality, but the mocks here only need the functionality exercised by the tests they're used with. I've added tests for the mocks and am ready for another review.

@mpenkov (Collaborator) commented on Jan 21, 2020

OK, I'll have a look within a week and let you know.

@mpenkov (Collaborator) left a comment

Looking very good. Left you some minor comments.

Also, you're missing tests for the top-level module (smart_open_lib). These don't have to be super-detailed, but you want to make sure that smart_open.open("gs://bucket/key") actually works.
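
For example, a minimal round-trip test along these lines (assuming the PR's in-memory GCS mocks are active; the bucket and key names are illustrative):

```python
import smart_open

def test_gs_roundtrip():
    # relies on the mocked GCS backend set up by the test suite
    with smart_open.open('gs://bucket/key', 'wb') as fout:
        fout.write(b'hello')

    with smart_open.open('gs://bucket/key', 'rb') as fin:
        assert fin.read() == b'hello'
```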

(Inline comment on integration-tests/test_gcs.py, outdated and resolved.)
    bucket_id,
    blob_id,
    mode,
    buffering=DEFAULT_BUFFER_SIZE,
@mpenkov (Collaborator):

I think this parameter name clashes with the parameter of the same name in the io.open function. How are we handling this clash?

@petedannemann (Contributor, Author):

Changed to buffer_size to avoid this issue and to mimic the s3 module.
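
Roughly, after the rename (a sketch; other parameters are elided and the default is an assumption):

```python
DEFAULT_BUFFER_SIZE = 256 * 1024  # assumed module-level default

def open(bucket_id, blob_id, mode, buffer_size=DEFAULT_BUFFER_SIZE, **kwargs):
    ...
```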

def _upload_next_part(self):
    part_num = self._total_parts + 1
    logger.info("uploading part #%i, %i bytes (total %.3fGB)",
@mpenkov (Collaborator):

Please do this:

logger.info(
    param1,
    param2,
    param3,
)

This is called a hanging indent; it is the preferred indentation style for smart_open.
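
Applied to the call above, that would look something like this (the arguments after part_num are illustrative):

```python
logger.info(
    "uploading part #%i, %i bytes (total %.3fGB)",
    part_num,
    content_length,
    total_bytes / 1024.0 ** 3,
)
```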

(Further inline comments on smart_open/gcs.py and smart_open/tests/test_gcs.py, all resolved.)
@petedannemann (Contributor, Author) commented on Jan 23, 2020

> Looking very good. Left you some minor comments.
>
> Also, you're missing tests for the top-level module (smart_open_lib). These don't have to be super-detailed, but you want to make sure that smart_open.open("gs://bucket/key") actually works.

Thanks @mpenkov! I addressed your comments and added those tests.

@mpenkov (Collaborator) left a comment

OK, looks good to me. Merging.

Thank you for your contribution!

@mpenkov merged commit a9aa466 into piskvorky:master on Jan 24, 2020
@petedannemann (Contributor, Author) commented

Thanks for your help and patience!

@petedannemann mentioned this pull request on Jan 24, 2020
@wookayin commented

How does it handle authentication? gcloud and gsutil are configured and working fine, but open('gs://...') does not work.

@mpenkov (Collaborator) commented on Jan 28, 2020

@wookayin Can you please open a new ticket to describe your problem? Be sure to fill out the full issue template.

@petedannemann (Contributor, Author) commented on Jan 28, 2020

@wookayin it uses Google's google-cloud-storage package, so please refer to the google-cloud-python authentication documentation. The preferred method is to create a service account key file and set the GOOGLE_APPLICATION_CREDENTIALS environment variable to the path of that file.
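
For example (a minimal sketch; the key path and object names are illustrative):

```python
import os

# google-cloud-storage reads this when the client is created
os.environ['GOOGLE_APPLICATION_CREDENTIALS'] = '/path/to/service-account.json'

import smart_open

with smart_open.open('gs://my-bucket/my-blob.txt', 'rb') as fin:
    data = fin.read()
```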

@mpenkov (Collaborator) commented on Jan 28, 2020

@petedannemann Is that documented anywhere, like the readme or howto.md? If not, it'd be good to add some documentation to point people in the right direction (the above comment is already a solid start).

@petedannemann (Contributor, Author) commented

It is not. I will work on a PR to add this missing documentation soon, and I will consider adding examples of how to use other authentication methods by passing your own google.cloud.storage.Client object via transport_params.
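
Something along these lines (a sketch, assuming the GCS submodule accepts a client entry in transport_params; the paths and names are illustrative):

```python
from google.cloud import storage
import smart_open

# authenticate explicitly instead of relying on environment variables
client = storage.Client.from_service_account_json('/path/to/service-account.json')

params = {'client': client}
with smart_open.open('gs://my-bucket/my-blob.txt', 'rb', transport_params=params) as fin:
    data = fin.read()
```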

@petedannemann mentioned this pull request on Jan 28, 2020