
[INTEGRATION][SPARK] Support Azure Databricks Credential Passthrough #595

Merged

Conversation


@wjohnson wjohnson commented Mar 9, 2022

Problem

Azure Databricks provides a feature called "credential passthrough" on Azure Blob File System (ABFS) paths. The underlying mechanism is fully supported in the Hadoop standard, so this is not something Databricks is doing on its own: it may occur with any other Spark implementation that connects to ABFS using a custom authentication method.

❌ When reading a file on ABFS and the credential pass through / custom auth implementation is enabled, OpenLineage throws an exception that it is unable to find the ADLS Gen 2 token.

✅ When writing a file with the credential passthrough / custom auth implementation enabled, lineage continues to be emitted correctly (assuming the inputs are not protected by credential passthrough).

Closes: #576

Solution

Fall back to extracting HadoopFsRelation paths from the relation's URIs only if the normal extraction throws an exception.
Azure Databricks Credential Passthrough uses a custom auth provider when connecting to ABFSS.
As a result, the SparkListener does not run in a context where the custom auth can obtain the current user's identity, so path extraction fails with an exception (com.databricks.backend.daemon.data.client.adl.AzureCredentialNotFoundException).

Defining datasets based off of relation URI paths is similar to InsertIntoHadoopFsRelationVisitor, which works well with the custom auth used by Databricks.
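The fallback described above can be sketched as follows. This is an illustrative outline, not the actual OpenLineage code: the helper name `datasetPaths` and the `Supplier`-based stand-in for the relation's file listing are our assumptions; only the exception class name comes from the PR.

```java
import java.util.List;
import java.util.function.Supplier;

public class CredentialPassthroughFallback {

  // The exception class is private to the Databricks runtime, so it is
  // matched by fully qualified name rather than with instanceof.
  static final String AZURE_CRED_EXCEPTION =
      "com.databricks.backend.daemon.data.client.adl.AzureCredentialNotFoundException";

  /**
   * Try the normal file listing first; if it fails with the Databricks-private
   * credential exception, fall back to the relation's root URI paths.
   * Any other exception is rethrown so that real errors still surface in logs.
   */
  public static List<String> datasetPaths(
      Supplier<List<String>> listFiles, List<String> uriFallback) {
    try {
      return listFiles.get(); // may invoke the custom auth and throw
    } catch (RuntimeException e) {
      if (AZURE_CRED_EXCEPTION.equals(e.getClass().getName())) {
        return uriFallback; // define the dataset from relation URI paths
      }
      throw e; // unrelated failures must not be swallowed
    }
  }
}
```

The rethrow branch matters: without it, the catch-all would hide unrelated failures, which is exactly the concern raised in review below.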


Checklist

  • You've signed-off your work
  • Your pull request title follows our guidelines
  • Your changes are accompanied by tests (if relevant)
  • Your change contains a small diff and is self-contained
  • You've updated any relevant documentation (if relevant)
  • You've updated the CHANGELOG.md with details about your change under the "Unreleased" section (if relevant, depending on the change, this may not be necessary)
  • You've versioned the core OpenLineage model or facets according to SchemaVer (if relevant)

@wjohnson wjohnson force-pushed the feature/handleDatabricksCredPassThru branch from d831358 to c8672df on March 9, 2022, 17:28

wjohnson commented Mar 9, 2022

I swear I ran ./gradlew :spotlessApply before pushing this 🤦‍♂️

Review comment on this diff hunk:

          .collect(Collectors.toList());
        })
        .orElse(Collections.emptyList());
    } catch (Exception e) {
pawel-big-lebowski (Collaborator):

Is there any way we could distinguish the kind of exception you expect from all the other possible ones? Otherwise we may miss important logs when something else goes wrong.

wjohnson (Contributor, Author):

@pawel-big-lebowski the specific exception that I'm trying to catch is a private class inside of Databricks.

Would it be worth catching the exception, checking whether its class name matches that specific exception's name, and rethrowing if it does NOT match?

wjohnson (Contributor, Author):

I added a check based on the exception class name in my latest push. Thank you for reviewing!
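A name-based check of this kind can be sketched as below. The helper name and the decision to walk the cause chain are our illustrative assumptions, not necessarily what the PR's final code does; only the target class name is taken from the discussion above.

```java
public class ExceptionNameCheck {

  // Databricks-private class, so instanceof is unavailable at compile time.
  static final String TARGET =
      "com.databricks.backend.daemon.data.client.adl.AzureCredentialNotFoundException";

  /**
   * Returns true if the throwable, or any exception in its cause chain,
   * has the Databricks credential-passthrough exception's class name.
   */
  public static boolean isCredentialPassthroughError(Throwable t) {
    for (Throwable cur = t; cur != null; cur = cur.getCause()) {
      if (TARGET.equals(cur.getClass().getName())) {
        return true;
      }
    }
    return false;
  }
}
```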

pawel-big-lebowski (Collaborator):

Ok. Now I get the problem. Well, making the global OpenLineage class aware of AzureCredentialNotFoundException is not super clean, but I cannot see any better approach to handle this (except for creating a custom package with a custom LogicalRelationDatasetBuilder).

Would love to get @collado-mike opinion on that.

wjohnson (Contributor, Author):

@collado-mike Gentle nudge for your opinion on how to handle catching an exception that only shows up on Databricks because they're using a custom class to authenticate against our cloud storage.

collado-mike (Contributor):

Sorry. My typical preference for this kind of thing is to try to identify what distinguishes this case from the other case and see if we can pull out a dedicated visitor. Will it always apply to Hadoop FS relations with an ABFS URL? Sometimes the file path points to a mount on the filesystem, right? So we can't always detect ahead of time when this will happen?

wjohnson (Contributor, Author):

@collado-mike - Thank you for the reply! This credential pass through is supposed to be transparent. It's based on this standard in Hadoop.

I don't know if we can tell, at the Spark level, that a particular path will be affected by this setting. In addition, if you have this configured, it might apply to only ONE of the paths you're using. For example, I might have one ABFS path that I access via a storage key, another ABFS path that I access via OAuth, and a third that uses the credential passthrough feature. All three might be used in the same query.

Does OpenLineage already handle this credentials passthrough idea on AWS by any chance?

collado-mike (Contributor):

Thanks for the details here. No, we don't have any special handling for AWS credential passthrough either. I don't like the reflection-based exception handling, but until we have more general use cases, I think it's the most straightforward way to handle this case. It would be great to file the AWS use case as an issue so we can test and perhaps come up with a more general way of addressing this.

wjohnson (Contributor, Author):

@collado-mike - Does that mean we can accept this PR for now? I'll add an issue around "Handling pass through credentials more generically in S3 and ABFS filesystems".

collado-mike (Contributor):

Yeah, I haven't looked thoroughly at this, but I think the custom exception handling is fine as is.

@wjohnson wjohnson force-pushed the feature/handleDatabricksCredPassThru branch from c8672df to 137f7cd on March 26, 2022, 03:48
Commit message:

Closes #576 by attempting to extract HadoopFsRelation paths only if there is an exception.
Azure Databricks Credential Passthrough uses a custom auth provider when connecting to ABFSS.
As a result, the SparkListener does not run in a context where the custom auth can obtain the current user's identity, resulting in an exception (com.databricks.backend.daemon.data.client.adl.AzureCredentialNotFoundException).

Defining datasets based off of relation URI paths is similar to InsertIntoHadoopFsRelationVisitor.

Signed-off-by: Will Johnson <will@willj.co>
@wjohnson wjohnson force-pushed the feature/handleDatabricksCredPassThru branch from 137f7cd to 378a640 on May 4, 2022, 15:32
@wjohnson wjohnson merged commit 7c3c160 into OpenLineage:main May 4, 2022
@wjohnson wjohnson deleted the feature/handleDatabricksCredPassThru branch May 4, 2022 16:17
Successfully merging this pull request may close these issues.

[INTEGRATION][SPARK] OL fails to emit lineage when using Databricks Credential Passthrough
3 participants