
Authentication with AAD tokens in Databricks provider #19335

Merged (4 commits) on Nov 8, 2021

Conversation

alexott (Contributor) commented Oct 30, 2021

Many organizations don't allow the use of personal access tokens and instead require native platform authentication. This PR adds the ability to authenticate to Azure Databricks workspaces using Azure Active Directory tokens generated from an Azure Service Principal's ID and secret. It supports authentication whether the service principal is a user inside the workspace or outside of it.
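For background, the AAD client-credentials exchange that this kind of authentication relies on can be sketched as follows. This is an illustrative sketch, not the provider's actual code: the helper only builds the token request rather than sending it, and the resource ID below is the well-known AAD application ID for Azure Databricks.

```python
import urllib.parse

# Well-known AAD resource (application) ID for Azure Databricks.
AZURE_DATABRICKS_RESOURCE_ID = "2ff814a6-3304-4ab8-85cb-cd0e6f879c1d"


def build_aad_token_request(tenant_id: str, client_id: str, client_secret: str):
    """Build (url, form_body) for the AAD client-credentials token endpoint."""
    url = f"https://login.microsoftonline.com/{tenant_id}/oauth2/token"
    form_body = urllib.parse.urlencode(
        {
            "grant_type": "client_credentials",
            "client_id": client_id,          # the Service Principal's ID
            "client_secret": client_secret,  # the Service Principal's secret
            "resource": AZURE_DATABRICKS_RESOURCE_ID,
        }
    )
    return url, form_body
```

POSTing `form_body` to `url` returns a JSON document whose `access_token` field is the AAD token presented to the Databricks workspace.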

boring-cyborg bot commented Oct 30, 2021

Congratulations on your first Pull Request and welcome to the Apache Airflow community! If you have any issues or are unsure about anything, please check our Contribution Guide (https://github.com/apache/airflow/blob/main/CONTRIBUTING.rst)
Here are some useful points:

  • Pay attention to the quality of your code (flake8, mypy and type annotations). Our pre-commits will help you with that.
  • In case of a new feature, add useful documentation (in docstrings or in the docs/ directory). Adding a new operator? Check this short guide, and consider adding an example DAG that shows how users should use it.
  • Consider using the Breeze environment for testing locally; it's a heavy Docker setup, but it ships with a working Airflow and a lot of integrations.
  • Be patient and persistent. It might take some time to get a review or get the final approval from Committers.
  • Please follow ASF Code of Conduct for all communication including (but not limited to) comments on Pull Requests, Mailing list and Slack.
  • Be sure to read the Airflow Coding style.
    Apache Airflow is a community-driven project and together we are making it better 🚀.
    In case of doubts contact the developers at:
    Mailing List: dev@airflow.apache.org
    Slack: https://s.apache.org/airflow-slack

@alexott alexott force-pushed the databricks-support-aad-token branch from 7511075 to 611ef98 Compare October 31, 2021 10:22
@alexott alexott changed the title New functionality: Authentication with AAD tokens in Databricks provider [DRAFT] New functionality: Authentication with AAD tokens in Databricks provider Oct 31, 2021
@alexott alexott changed the title [DRAFT] New functionality: Authentication with AAD tokens in Databricks provider New functionality: Authentication with AAD tokens in Databricks provider Nov 1, 2021
Many organizations don't allow the use of personal access tokens and instead require
native platform authentication.  This PR adds the ability to authenticate to Azure
Databricks workspaces using Azure Active Directory tokens generated from an Azure
Service Principal's ID and secret.
alexott (Contributor, Author) commented Nov 6, 2021

@mik-laj @vikramkoka would it be possible to review this? I don't understand why the static check tests are failing

@alexott alexott changed the title New functionality: Authentication with AAD tokens in Databricks provider Authentication with AAD tokens in Databricks provider Nov 6, 2021
setup.py (outdated review thread, resolved)

* ``token``: Specify the PAT to use.

The following parameters are necessary when using authentication with an AAD token:

* ``azure_client_id``: ID of the Azure Service Principal
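For illustration only (hypothetical, non-real values), the connection's Extra field with these AAD parameters might look like:

```python
import json

# Hypothetical Service Principal values; not real credentials.
extra = {
    "azure_client_id": "00000000-0000-0000-0000-000000000000",
    "azure_client_secret": "my-sp-secret",
    "azure_tenant_id": "11111111-1111-1111-1111-111111111111",
}

# This JSON string is what would be stored in the connection's Extra field.
extra_json = json.dumps(extra)
```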
Contributor:
Why don't we put the client_id and client_secret into the user and password fields of the connection? Ultimately, it's just a user and a password.

alexott (Contributor, Author) replied:
changed

elif (
'azure_client_id' in self.databricks_conn.extra_dejson
and 'azure_client_secret' in self.databricks_conn.extra_dejson
and 'azure_tenant_id' in self.databricks_conn.extra_dejson
Contributor:
Enforcing all three values to be present may result in unexpected behavior. Maybe we should raise an exception if at least one, but not all three, of the azure_* configs are set?

alexott (Contributor, Author) replied on Nov 7, 2021:

good idea, although it isn't required when we change to login/password

self.log.info('Using AAD Token for SPN. ')

if 'host' in self.databricks_conn.extra_dejson:
host = self._parse_host(self.databricks_conn.extra_dejson['host'])
Contributor:

To be honest, I did not understand why host is allowed as an extra variable for token authentication; I suspect it's something about backward compatibility. For user/password authentication, host may only be submitted as a normal connection argument. I would prefer to do it the same way with SP-based authentication.

alexott (Contributor, Author) replied:

yeah, it looks like it's more for compatibility...

@alexott alexott force-pushed the databricks-support-aad-token branch 2 times, most recently from 106f07e to 55279fe Compare November 7, 2021 10:15
"resource": resource,
}
resp = requests.get(
"http://169.254.169.254/metadata/identity/oauth2/token",
Member:

Why fixed IP address here? This sounds very wrong

Member:

I understand this is Azure's metadata server? But is there no better way to reach the metadata server (and shouldn't we only use it if we check in the environment that we are running on an Azure-managed VM)?

For example, in Google's VMs you can use the metadata.google.internal name: https://cloud.google.com/compute/docs/metadata/overview

alexott (Contributor, Author) replied:

it's a fixed address for the internal metadata service; see the linked documentation

potiuk (Member) replied on Nov 7, 2021:

Yeah, but see my comment above. At the very least (if we cannot get a symbolic name), this address should be extracted as a constant, e.g. "AZURE_METADATA_IP", and explained in comments, without the need to actually read the docs.

Member:

And we should likely check whether we are in an Azure VM (and fail if we aren't) without even reaching out to the metadata server. Not really necessary, but it would be nice to check (via env vars, I guess) beforehand; otherwise you might get strange errors when you enable it by mistake on a non-Azure managed-identity server. Google's metadata servers, I think, have the same IP address, so the responses from them might be confusing.

Member:
See, for example, the comment here (the accepted answer is to use the metadata server, but the comment is about long timeouts and confusing behaviour on non-Azure instances): https://stackoverflow.com/questions/54913218/detect-whether-running-on-azure-iaas-vm-net-application
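To tie the thread together, a stdlib-only sketch of an IMDS token call that addresses both points raised above: the fixed link-local address extracted as a named, commented constant, and a short timeout so non-Azure hosts fail fast instead of hanging. Helper names are illustrative, not the provider's code.

```python
import json
import urllib.parse
import urllib.request

# Fixed link-local address of the Azure Instance Metadata Service (IMDS);
# Azure exposes no symbolic hostname for it.
AZURE_METADATA_IP = "169.254.169.254"
IMDS_TOKEN_URL = f"http://{AZURE_METADATA_IP}/metadata/identity/oauth2/token"


def build_imds_token_request(resource: str) -> urllib.request.Request:
    """Build (but do not send) the IMDS managed-identity token request."""
    query = urllib.parse.urlencode({"api-version": "2018-02-01", "resource": resource})
    # IMDS rejects requests that lack the Metadata header, which also guards
    # against requests accidentally reaching other hosts.
    return urllib.request.Request(
        f"{IMDS_TOKEN_URL}?{query}", headers={"Metadata": "true"}
    )


def get_managed_identity_token(resource: str, timeout: float = 2.0) -> str:
    # A short timeout fails fast on non-Azure machines instead of hanging.
    with urllib.request.urlopen(
        build_imds_token_request(resource), timeout=timeout
    ) as resp:
        return json.load(resp)["access_token"]
```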

potiuk (Member) commented Nov 7, 2021

@mik-laj @vikramkoka would it be possible to review this? I don't understand why the static check tests are failing

tests/providers/databricks/hooks/test_databricks.py:24:1: F401 'unittest.mock.call' imported but unused

potiuk (Member) commented Nov 7, 2021

I recommend installing pre-commit; this way you can run all the static checks locally before sending them to CI

also removed the dependency on azure-identity
@alexott alexott force-pushed the databricks-support-aad-token branch from 55279fe to b03306c Compare November 7, 2021 18:04
potiuk (Member) left a comment:

I do not know too much about AAD in Databricks, but it looks reasonable!

@github-actions github-actions bot added the okay to merge It's ok to merge this PR as it does not require more tests label Nov 8, 2021
github-actions bot commented Nov 8, 2021

The PR is likely OK to be merged with just a subset of tests for the default Python and database versions, without running the full matrix of tests, because it does not modify the core of Airflow. If the committers decide that the full test matrix is needed, they will add the label 'full tests needed'. Then you should rebase to the latest main or amend the last commit of the PR, and push it with --force-with-lease.

potiuk (Member) commented Nov 8, 2021

@freget - wdyt ?

freget (Contributor) left a comment:

LGTM

Labels: area:providers, kind:documentation, okay to merge
Projects: None yet
3 participants