Adding BigQuery authentication to credentials.yml #1621

Closed
jacobweiss2305 opened this issue Jun 14, 2022 · 10 comments

Comments

@jacobweiss2305
Contributor

Description

How do I authenticate using credentials.yml for pandas.GBQQueryDataSet?

I am trying to add BigQuery authentication to the credentials.yml file, but I can't load data from BigQuery. The documentation says I need to add an object or dictionary, but I am unsure what that looks like in credentials.yml.

Does anyone have an example of how to do this?

Steps to Reproduce

The kedro documentation says:

credentials: Credentials for accessing Google APIs.
    Either ``google.auth.credentials.Credentials`` object or dictionary with
    parameters required to instantiate ``google.oauth2.credentials.Credentials``.
    Here you can find all the arguments:
    https://google-auth.readthedocs.io/en/latest/reference/google.oauth2.credentials.html

Here is what I tried in credentials.yml:

gbq-creds: "~/.ssh/creds.json"

And in the corresponding catalog.yml:

some_table:
  type: pandas.GBQQueryDataSet
  sql: select * from some_table
  project: project_name
  credentials: gbq-creds 

But here is the error:

kedro.io.core.DataSetError: 
This library only supports credentials from google-auth-library-python. See https://google-auth.readthedocs.io/en/latest/ for help on authentication with this library..
Failed to instantiate DataSet 'query_pull_data_to_match' of type `kedro.extras.datasets.pandas.gbq_dataset.GBQQueryDataSet`.

Your Environment

  • kedro 0.18.0
  • ubuntu 20.04
  • python 3.9.6
@jacobweiss2305
Contributor Author

BTW this is my temp fix:

import os
os.environ["GOOGLE_APPLICATION_CREDENTIALS"] = f"{os.path.expanduser('~')}/.ssh/creds.json"

@antonymilne
Contributor

Hello @jacobweiss2305. If you want to provide credentials for this dataset using credentials.yml then you need to effectively copy and paste what's in your creds.json file into there, e.g.

gbq-creds:
  token: ...
  client_id: ...

This dictionary gets passed directly into Credentials so you can see the arguments it takes there.

If you want to continue using your .ssh/creds.json file then you'll need to make a custom dataset that does that, but it should be very easy to do so:

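import json
from pathlib import Path

from kedro.extras.datasets.pandas import GBQQueryDataSet


class GBQQueryDataSetJSONCredentials(GBQQueryDataSet):
    def __init__(self, *args, **kwargs):
        # Load up json credentials from filepath ("~" is expanded to the home directory).
        kwargs["credentials"] = json.loads(
            Path(kwargs["credentials"]).expanduser().resolve().read_text(encoding="utf-8")
        )
        super().__init__(*args, **kwargs)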

@jacobweiss2305
Contributor Author

@AntonyMilneQB , thank you so much, this is very helpful.

Does the credentials.yml need a specific format?

The creds.json file looks like this:

  {
    "type": "...",
    "project_id": "...",
    "private_key_id": "...",
    "private_key": "...",
    "client_email": "...",
    "client_id": "...",
    "auth_uri": "https://accounts.google.com/o/oauth2/auth",
    "token_uri": "https://oauth2.googleapis.com/token",
    "auth_provider_x509_cert_url": "https://www.googleapis.com/oauth2/v1/certs",
    "client_x509_cert_url": "https://www.googleapis.com/...."
  }

I translated that into the credentials.yml file:

gbq-creds:
  type: "..."
  project_id: "..."
  private_key_id: "..."
  private_key: "..."
  client_email: "..."
  client_id: "..."
  auth_uri: "https://accounts.google.com/o/oauth2/auth"
  token_uri: "https://oauth2.googleapis.com/token"
  auth_provider_x509_cert_url: "https://www.googleapis.com/oauth2/v1/certs"
  client_x509_cert_url: "https://www.googleapis.com/..."

But I get this error:

DataSetError: 
__init__() got an unexpected keyword argument 'type'.
DataSet 'query_pull_data_to_match' must only contain arguments valid for the constructor of `kedro.extras.datasets.pandas.gbq_dataset.GBQQueryDataSet`.

I also tried this in the credentials.yml:

gbq-creds:
  token:
    type: "..."
    project_id: "..."
    private_key_id: "..."
    private_key: "..."
    client_email: "..."
    client_id: "..."
    auth_uri: "https://accounts.google.com/o/oauth2/auth"
    token_uri: "https://oauth2.googleapis.com/token"
    auth_provider_x509_cert_url: "https://www.googleapis.com/oauth2/v1/certs"
    client_x509_cert_url: "https://www.googleapis.com/..."

But received this error:

The credentials have been revoked or expired, please re-run the application to re-authorize

Which doesn't make sense because I can get it to work through environment variables.

Do you see any obvious mistake I am making, or can you recommend any further tests?
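The "unexpected keyword argument 'type'" error from the flat dictionary is consistent with the dictionary being unpacked straight into a constructor that has no such parameter. The following is a minimal sketch for checking that before instantiating anything. The helper `unexpected_kwargs` and the stand-in `fake_oauth2_credentials` are hypothetical names invented for illustration, and the stand-in's parameter list is an assumption, not the real signature of `google.oauth2.credentials.Credentials`.

```python
import inspect
from typing import Any, Callable, Mapping


def unexpected_kwargs(func: Callable[..., Any], kwargs: Mapping[str, Any]) -> list[str]:
    """Return the keys of ``kwargs`` that ``func`` would reject as keyword arguments."""
    params = inspect.signature(func).parameters.values()
    # A **kwargs catch-all accepts any keyword, so nothing would be rejected.
    if any(p.kind is inspect.Parameter.VAR_KEYWORD for p in params):
        return []
    accepted = {
        p.name
        for p in params
        if p.kind in (inspect.Parameter.POSITIONAL_OR_KEYWORD, inspect.Parameter.KEYWORD_ONLY)
    }
    return [key for key in kwargs if key not in accepted]


# Hypothetical stand-in for a constructor that takes OAuth2-style arguments.
def fake_oauth2_credentials(token, client_id=None, token_uri=None):
    return (token, client_id, token_uri)


# Keys copied from the service-account style dictionary shown above (values elided).
service_account_style = {"type": "...", "project_id": "...", "client_id": "...", "token_uri": "..."}
rejected = unexpected_kwargs(fake_oauth2_credentials, service_account_style)
print(rejected)
```

With google-auth installed, passing the real `Credentials` class instead of the stand-in should list which keys of a credentials.yml entry its constructor does not accept.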

@antonymilne
Contributor

Hmm, this is a bit confusing. I feel like the 1st unnested way of doing things should be correct, but clearly the 2nd is actually the correct format 🤔 But the fact that also doesn't work is weird. What the kedro dataset is doing here is actually pretty simple:

if isinstance(credentials, dict):
    credentials = Credentials(**credentials)
self._credentials = credentials
self._client = bigquery.Client(
    project=self._project_id,
    credentials=self._credentials,
    location=self._load_args.get("location"),
)

where Credentials is imported with from google.oauth2.credentials import Credentials. So I think the best thing to try to debug further is to run a Python script a bit like this:

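import json
from pathlib import Path

import yaml
from google.oauth2.credentials import Credentials

# credentials.yml maps entry names to dictionaries, so pick out the "gbq-creds" entry.
yml_credentials = yaml.safe_load(Path("conf/local/credentials.yml").read_text(encoding="utf-8"))["gbq-creds"]
# "~" is not expanded by resolve(), so expand it explicitly.
json_credentials = json.loads(Path("~/.ssh/creds.json").expanduser().read_text(encoding="utf-8"))

Credentials(**yml_credentials)
Credentials(**json_credentials)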

Hopefully playing around with that will shed some light on what's going on. e.g. if the json_credentials also don't work but the GOOGLE_APPLICATION_CREDENTIALS environment variable does then maybe there's some other magic going on behind the scenes when you set that variable beyond just loading up Credentials with that path.
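One possible explanation for the environment variable behaving differently (an assumption, not something confirmed in this thread) is that google-auth reads the JSON file and chooses the credentials class from its `type` field, rather than unpacking the file into one fixed constructor. The sketch below only classifies a credentials dictionary by that field; `credentials_kind` is a hypothetical helper name, and the sample dictionary uses elided placeholder values.

```python
import json
from typing import Any, Mapping


def credentials_kind(info: Mapping[str, Any]) -> str:
    """Classify a credentials dictionary by its ``type`` field."""
    kind = info.get("type")
    if kind == "service_account":
        # Would be loaded with service_account.Credentials.from_service_account_info
        return "service_account"
    if kind == "authorized_user":
        # Would be loaded with credentials.Credentials.from_authorized_user_info
        return "authorized_user"
    # No recognised "type" key, e.g. a bare dictionary of constructor arguments.
    return "unknown"


# Placeholder content shaped like the creds.json shown earlier in the thread.
sample = json.loads('{"type": "service_account", "project_id": "..."}')
kind = credentials_kind(sample)
print(kind)
```

Running this on both the parsed credentials.yml entry and the parsed creds.json would show whether the two are the same kind of credential before either is passed to a constructor.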

@jacobweiss2305
Contributor Author

jacobweiss2305 commented Jun 15, 2022

I also noticed that kedro's kedro.extras.datasets.pandas.GBQQueryDataSet is not using service_account.

I was able to get this to work:

from google.oauth2 import service_account
import pandas as pd
import pandas_gbq

credentials = service_account.Credentials.from_service_account_file(
    '/home/ubuntu/.ssh/creds.json',
)

sql = "select * from dataset.table"

pd.read_gbq(sql , project_id="project", credentials=credentials)

from kedro.extras.datasets.pandas import GBQQueryDataSet

GBQQueryDataSet(sql, project='project', credentials=credentials).load()

@jacobweiss2305
Contributor Author

jacobweiss2305 commented Jun 15, 2022

The other test I ran is reauth in catalog.yml:

table_name:
  type: pandas.GBQTableDataSet
  dataset: dataset
  table_name: table_name
  project: project
  credentials: creds
  load_args:
    reauth: True

But it forces you to go to the web and copy/paste an authentication code (this works to load in data).

I think that the fix is to add service_account.Credentials.from_service_account_file to kedro.extras.datasets.pandas.GBQQueryDataSet.

Would you agree?
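To make the proposal concrete, here is a rough sketch of the conversion step such a change might perform before the credentials reach the BigQuery client. It is not kedro's implementation: `resolve_credentials` is a hypothetical helper, and treating every dictionary as service-account info is an assumption that would not suit OAuth2 user credentials.

```python
from pathlib import Path
from typing import Any


def resolve_credentials(credentials: Any) -> Any:
    """Turn a key-file path or a service-account dict into a credentials object."""
    if isinstance(credentials, (str, Path)):
        # Imported lazily so the pass-through case works without google-auth installed.
        from google.oauth2 import service_account

        key_file = Path(credentials).expanduser()
        return service_account.Credentials.from_service_account_file(str(key_file))
    if isinstance(credentials, dict):
        from google.oauth2 import service_account

        # Assumption: the dictionary has the same keys as a service-account JSON key file.
        return service_account.Credentials.from_service_account_info(credentials)
    # None or an already-built credentials object: hand it over unchanged.
    return credentials


# An object that is neither a path nor a dict is returned as-is.
sentinel = object()
passthrough = resolve_credentials(sentinel)
```

The returned object could then be passed as `credentials=` in the same way as the service-account object in the working snippet above.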

@antonymilne
Contributor

I'm not sure. I don't understand much (anything) about Google credentials, but as far as I can tell there are two different questions here:

  1. Should we use google.oauth2.credentials.Credentials as we currently do or google.oauth2.service_account.Credentials? I don't understand the different uses of these well enough to comment.
  2. How should we specify those credentials in a .yml file?

For the 2nd of these, it seems that both classes can be instantiated in 3 different ways:

# directly using the constructor
credentials.Credentials()  # how we currently do it in kedro
service_account.Credentials()  # unusual to do this apparently

# from a dictionary
credentials.Credentials.from_authorized_user_info()
service_account.Credentials.from_service_account_info()

# from a file
credentials.Credentials.from_authorized_user_file()
service_account.Credentials.from_service_account_file()

"from a file" feels like not such a kedro way to do it, but I think one of the first two options should work correctly with credentials.yml.

@antonymilne
Contributor

I think it's worth playing around with the script I showed in #1621 (comment) to understand what the correct way of loading a dictionary of credentials would be. e.g. if you use the "from a dictionary" option, does it work with credentials.Credentials() or service_account.Credentials()?

Overall I think once we've figured out the right way to get this working outside kedro, modifying the kedro dataset to match should be straightforward. But it's potentially a breaking change also.

@merelcht
Member

merelcht commented Jan 3, 2023

I'm closing this issue since there hasn't been any recent activity. Feel free to re-open this if you're still facing problems!

@merelcht merelcht closed this as completed Jan 3, 2023
@astrojuanlu astrojuanlu closed this as not planned Jul 24, 2024
@astrojuanlu
Member

For the record, I implemented my own BigQuery dataset and this is how it looks:

import typing as t

from google.cloud import bigquery
from kedro.io import AbstractDataset

class PolarsBigQueryDataset(AbstractDataset):
    def __init__(self, sql: str, credentials: dict[str, t.Any] | None = None):
        self._sql = sql

        self._client = bigquery.Client.from_service_account_info(credentials)

so that the configs can look like this:

# catalog.yml
pypi_kedro:
  type: kedro_pypi_monitor.datasets.PolarsBigQueryDataset
  sql: ...
  credentials: gbq_credentials

# credentials.yml
gbq_credentials:
  type: service_account
  project_id: kedro-pypi-stats
  private_key_id: ...
  private_key: "-----BEGIN PRIVATE KEY-----\n...
  client_email: ...
  ...

Less than optimal but a good workaround. File-based credentials in Kedro aren't really developed at the moment.
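Copying a service-account key into credentials.yml by hand is error-prone because of the multi-line private key. A small sketch of one way to generate the entry from the parsed JSON follows. It relies on the assumption that JSON string escapes (such as `\n`) are also valid inside YAML double-quoted scalars; `to_credentials_yml` is a hypothetical helper, and the key material in the example is a made-up placeholder.

```python
import json
from typing import Any, Mapping


def to_credentials_yml(entry_name: str, info: Mapping[str, Any]) -> str:
    """Render a parsed JSON key as one top-level entry for credentials.yml."""
    lines = [f"{entry_name}:"]
    for key, value in info.items():
        # json.dumps quotes and escapes each value, so newlines become a literal \n.
        lines.append(f"  {key}: {json.dumps(value)}")
    return "\n".join(lines) + "\n"


# Placeholder values only; a real key file has more fields.
example = {
    "type": "service_account",
    "private_key": "-----BEGIN PRIVATE KEY-----\nplaceholder\n-----END PRIVATE KEY-----\n",
}
text = to_credentials_yml("gbq_credentials", example)
print(text)
```

The printed text can be pasted into conf/local/credentials.yml; each value stays on a single line, with the private key's line breaks kept as escape sequences.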
