
Allow impersonation of Google Cloud service accounts with BigQuery #2672

Closed
preston-hf opened this issue Jul 30, 2020 · 2 comments · Fixed by #2677
Labels
bigquery enhancement New feature or request good_first_issue Straightforward + self-contained changes, good for new contributors!

Comments

@preston-hf

Describe the feature

I recently read this article about using the new IAM Credentials API to impersonate a service account by generating OAuth2 bearer tokens that identify as it. Adding this ability to dbt would improve security and allow for additional abstraction when running BigQuery queries on sensitive data. This way, we don't need to run dbt with a service account key, or worry about starting a node identified by a particular service account.

This would allow us to invert control: service accounts control access to different types of data, and users are granted access to the service accounts (the "abstraction") instead of all users having IAM access to the datasets directly. This way you get the benefits of federated identity with the flexibility of service accounts and their keys, which are typically required for this workflow but have many downsides, as described in the article.

From an end-user perspective, I propose that dbt add an impersonate_service_account option to the profiles file for the BigQuery adapter. Using this functionality would look like this:

```yaml
dbt:
  target: dev
  outputs:
    dev:
      type: bigquery
      method: oauth
      project: myproject
      dataset: dbt_preston
      threads: 4
      timeout_seconds: 300
      location: US
      priority: interactive
      retries: 1
      impersonate_service_account: foobar@project.iam.gserviceaccount.com
```

In this case, the oauth, service-account, and service-account-file connection methods would each be compatible. Instead of using the existing Credentials() directly, as is the current functionality, when the impersonate_service_account option is present, credentials would be generated using the new IAM Credentials API (example).

It seems like this would be pretty easy to send a patch for, since the Google Auth library already supports this method. There is one big difference we may want to handle: these tokens are very short-lived compared to ADC or service account credentials. By default, these OAuth2 tokens expire after 1 hour. It is possible to request a longer lifetime, but 1 hour is a soft limit that can only be raised by adding an organization policy on the service account. That is usually only possible for GCP org admins, so not your target audience. For an MVP, I think it would be alright for the 1-hour limit to be the fallback, but ideally there would be a way to detect expired credentials and request a new set while dbt is running. This may already be possible; it's not clear how often new clients are requested. If a new client is requested often, this may not even be a concern, since each job would get a fresh token when it was created. In that case, we'd want to set the expiration much shorter than the default.
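For reference, here is a minimal sketch of what generating impersonated credentials could look like with the google-auth library's impersonated_credentials module. The function name and scope list are assumptions for illustration, not dbt's actual code:

```python
# Sketch only: wraps a user's source credentials (e.g. ADC) in impersonated
# credentials for a target service account. The function name and SCOPES
# are hypothetical; only google.auth.impersonated_credentials is real API.

SCOPES = ["https://www.googleapis.com/auth/bigquery"]

def impersonated_bigquery_credentials(source_credentials, target_sa, lifetime=3600):
    # Imported lazily so the sketch only needs google-auth when actually called.
    from google.auth import impersonated_credentials

    return impersonated_credentials.Credentials(
        source_credentials=source_credentials,  # e.g. from google.auth.default()
        target_principal=target_sa,             # foobar@project.iam.gserviceaccount.com
        target_scopes=SCOPES,
        lifetime=lifetime,                      # seconds; 1 hour is the soft limit
    )
```

The source credentials could come from any of the existing connection methods, which is why all three would remain compatible with this option.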

Describe alternatives you've considered

The only other option is to have GCE or Cloud Run instances run as service accounts. Cloud Run currently times out after 15 minutes, though, so that is a non-starter, and GCE isn't really designed for this purpose. There are a bunch of other downsides as well that I'm happy to go into if the value isn't obvious.

Additional context

This would be just for BigQuery users.

Who will this benefit?

It will benefit BigQuery users with many datasets that each need to be, and remain, isolated. Any operation concerned with security would prefer to use this, and most of us are.

Are you interested in contributing this feature?

Yes, I am interested.

@preston-hf preston-hf added enhancement New feature or request triage labels Jul 30, 2020
@jtcohen6 jtcohen6 added bigquery good_first_issue Straightforward + self-contained changes, good for new contributors! and removed triage labels Jul 31, 2020
@jtcohen6
Contributor

@preston-hf Thank you for the detailed write-up! I totally buy the value here in managing permissions via service accounts, while still enabling developers to run dbt locally via oauth instead of downloading long-lived keys.

Thanks also for offering to contribute—it's all yours! I think the change here should be simple, since it just requires an update to get_bigquery_credentials.

As far as token expiration: I believe that dbt opens a separate connection for each thread, and opens a new connection each time a thread begins running a new node. (Each time, it would call get_bigquery_credentials.) An hour should be plenty of time; the issue with decreasing that window significantly, though, is the case of long-running models. What do you think about making the token expiration equal to the existing profile config timeout_seconds?

@preston-hf
Author

That is what I was hoping for. I'll give it a shot and see how it works in practice. I will also look into any potential issues that could arise from making an auth request for each model. I'm not too worried about models once they are started; I would be surprised if Google cancelled a running job because an auth token expired. timeout_seconds seems like an easy default too, though I would want to cap it at an hour in case someone has a very high timeout.
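The default-plus-cap behavior discussed here could be as simple as a small helper; the names below are hypothetical, sketched from the discussion above:

```python
# Hypothetical helper: use the profile's timeout_seconds as the requested
# token lifetime, capped at GCP's 1-hour soft limit for impersonated tokens.
SOFT_LIMIT_SECONDS = 3600

def token_lifetime(timeout_seconds):
    return min(int(timeout_seconds), SOFT_LIMIT_SECONDS)
```

A profile with timeout_seconds of 300 would request 5-minute tokens, while a very high timeout would still fall back to the 1-hour limit.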
