Authentication with AAD tokens in Databricks provider #19335
Conversation
Congratulations on your first Pull Request and welcome to the Apache Airflow community! If you have any issues or are unsure about anything, please check our Contribution Guide (https://github.com/apache/airflow/blob/main/CONTRIBUTING.rst)
(force-pushed 7511075 to 611ef98)
Many organizations don't allow the use of personal access tokens and instead require native platform authentication. This PR adds the ability to authenticate to Azure Databricks workspaces using Azure Active Directory tokens generated from an Azure Service Principal's ID and secret.
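For context, acquiring such an AAD token boils down to an OAuth2 client-credentials request against the AAD token endpoint. The sketch below is illustrative, not the PR's actual code; the resource ID used is the well-known Azure Databricks first-party application ID, and the helper names are assumptions:

```python
import requests

AZURE_DEFAULT_AD_ENDPOINT = "https://login.microsoftonline.com"
# Well-known resource ID of the Azure Databricks service (see Azure docs).
DATABRICKS_RESOURCE_ID = "2ff814a6-3304-4ab8-85cb-cd0e6f879c1d"


def build_token_request(tenant_id: str, client_id: str, client_secret: str):
    """Build the URL and form payload for the AAD client-credentials flow."""
    url = f"{AZURE_DEFAULT_AD_ENDPOINT}/{tenant_id}/oauth2/token"
    data = {
        "grant_type": "client_credentials",
        "client_id": client_id,
        "client_secret": client_secret,
        "resource": DATABRICKS_RESOURCE_ID,
    }
    return url, data


def get_aad_token(tenant_id: str, client_id: str, client_secret: str) -> str:
    """Exchange SP credentials for an AAD access token (network call)."""
    url, data = build_token_request(tenant_id, client_id, client_secret)
    resp = requests.post(url, data=data, timeout=10)
    resp.raise_for_status()
    return resp.json()["access_token"]
```

The resulting access token is then sent to Databricks as a bearer token in place of a PAT.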
@mik-laj @vikramkoka would it be possible to review this? I don't understand why the static check tests are failing.
* ``token``: Specify PAT to use.

Following parameters are necessary if using authentication with AAD token:

* ``azure_client_id``: ID of the Azure Service Principal
Why don't we put the client_id and client_secret into the user and password fields of the connection? Ultimately, it's just a user and a password.
changed
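The change suggested above (credentials in the standard connection fields) could look roughly like this. This is a sketch with an assumed minimal stand-in for Airflow's Connection object, not the provider's actual code:

```python
from typing import NamedTuple


class Connection(NamedTuple):
    """Minimal stand-in for Airflow's Connection model (illustrative only)."""
    login: str          # holds the SP client id
    password: str       # holds the SP client secret
    extra_dejson: dict  # only the tenant id remains in extras


def get_sp_credentials(conn: Connection):
    """Read SP credentials from login/password instead of extras."""
    return conn.login, conn.password, conn.extra_dejson["azure_tenant_id"]
```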
elif (
    'azure_client_id' in self.databricks_conn.extra_dejson
    and 'azure_client_secret' in self.databricks_conn.extra_dejson
    and 'azure_tenant_id' in self.databricks_conn.extra_dejson
Enforcing all three values to be present may result in unexpected behavior. Maybe we should raise an Exception if at least one, but not all three azure_* configs are set?
good idea, although it isn't required once we change to login/password
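The validation proposed above (fail loudly if only some of the `azure_*` extras are set) could be sketched as follows; the function name is an assumption for illustration:

```python
AZURE_KEYS = ("azure_client_id", "azure_client_secret", "azure_tenant_id")


def check_azure_extras(extra: dict) -> bool:
    """Return True if AAD auth is fully configured.

    Raises ValueError if only a subset of the azure_* keys is present,
    instead of silently falling back to another auth method.
    """
    present = [k for k in AZURE_KEYS if k in extra]
    if present and len(present) != len(AZURE_KEYS):
        missing = sorted(set(AZURE_KEYS) - set(present))
        raise ValueError(f"Incomplete AAD configuration, missing: {missing}")
    return bool(present)
```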
self.log.info('Using AAD Token for SPN.')

if 'host' in self.databricks_conn.extra_dejson:
    host = self._parse_host(self.databricks_conn.extra_dejson['host'])
To be honest, I did not understand why `host` is allowed as an extra variable for token authentication - I suspected something about backward compatibility. For user/password authentication, `host` may only be submitted as a normal connection argument. I would prefer to do it the same way with SP-based authentication.
yeah, it looks like it's there more for compatibility...
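For reference, a `_parse_host` helper along these lines (a simplified sketch, not the provider's exact code) is what makes both the bare hostname and the full URL form work:

```python
from urllib.parse import urlparse


def parse_host(host: str) -> str:
    """Strip a scheme from the host if present.

    Accepts either "xx.cloud.databricks.com" or
    "https://xx.cloud.databricks.com" and returns the bare hostname.
    """
    parsed = urlparse(host)
    # Without a scheme, urlparse puts everything in `path` and
    # `hostname` is None, so fall back to the raw value.
    return parsed.hostname if parsed.hostname else host
```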
(force-pushed 106f07e to 55279fe)
    "resource": resource,
}
resp = requests.get(
    "http://169.254.169.254/metadata/identity/oauth2/token",
Why a fixed IP address here? This sounds very wrong
I understand this is Azure's metadata server? But is there no better way to reach the metadata server (and should we only use it after checking in the environment that we are running on an Azure-managed VM)? For example, in Google's VMs you can use the metadata.google.internal name: https://cloud.google.com/compute/docs/metadata/overview
it's a fixed address for internal metadata service - see the linked documentation
Yeah - but see my comment above. At the very least (if we cannot get a symbolic name), this address should be extracted as "AZURE_METADATA_IP", for example, and explained in comments, without the need to actually read the docs.
And we should likely check if we are on an Azure VM (and fail if we aren't) without even reaching out to the metadata server. Not strictly necessary, but it would be nice to check it (via env vars, I guess) beforehand - otherwise you might get strange errors when you enable it by mistake on a non-Azure managed-identity server. Google's metadata servers, I think, have the same IP address, so the responses from it might be confusing.
See for example the comment here (the accepted answer is to use the metadata server, but the comment is about long timeouts and confusing behaviour on non-Azure instances): https://stackoverflow.com/questions/54913218/detect-whether-running-on-azure-iaas-vm-net-application
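A version incorporating the suggestions above might look like this sketch: the IMDS address pulled out as a named constant, the `Metadata: true` header the service requires, and a short timeout so failures on non-Azure hosts surface quickly. The function name is an assumption:

```python
import requests

# Link-local address of the Azure Instance Metadata Service (IMDS).
# It is a fixed, non-routable address only reachable from within an
# Azure VM; extracted as a constant so the code is self-explanatory.
AZURE_METADATA_SERVICE_URL = (
    "http://169.254.169.254/metadata/identity/oauth2/token"
)


def get_managed_identity_token(resource: str) -> str:
    """Fetch a token from IMDS for an Azure managed identity.

    Only meaningful on an Azure VM; the short timeout keeps failures
    fast on non-Azure hosts, per the linked discussion.
    """
    resp = requests.get(
        AZURE_METADATA_SERVICE_URL,
        params={"api-version": "2018-02-01", "resource": resource},
        headers={"Metadata": "true"},  # required by IMDS
        timeout=2,
    )
    resp.raise_for_status()
    return resp.json()["access_token"]
```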
I recommend to install
also removed the dependency on azure-identity
(force-pushed 55279fe to b03306c)
I do not know too much about AAD in Databricks, but it looks reasonable!
The PR is likely OK to be merged with just a subset of tests for default Python and Database versions without running the full matrix of tests, because it does not modify the core of Airflow. If the committers decide that the full tests matrix is needed, they will add the label 'full tests needed'. Then you should rebase to the latest main or amend the last commit of the PR, and push it with --force-with-lease.
@freget - wdyt?
LGTM
Many organizations don't allow the use of personal access tokens and instead require native platform authentication. This PR adds the ability to authenticate to Azure Databricks workspaces using Azure Active Directory tokens generated from an Azure Service Principal's ID and secret. It supports authentication whether the SP is a user inside the workspace or outside of it.