[INTEGRATION][SPARK] Support Azure Databricks Credential Passthrough #595
Conversation
force-pushed from d831358 to c8672df
I swear I ran

```java
      .collect(Collectors.toList());
    })
    .orElse(Collections.emptyList());
} catch (Exception e) {
```
Is there any way we could distinguish the kind of exception you expect from all the other possible ones? Otherwise we may miss important logs when something else goes wrong.
@pawel-big-lebowski the specific exception that I'm trying to catch is a private class inside Databricks.
Would it be worth catching the exception, checking whether the exception's class name matches that specific name, and rethrowing it if NOT?
I added a check based on the exception class name in my latest push. Thank you for reviewing!
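For reference, a minimal sketch of that class-name check. The `HadoopFsRelationLike` interface is a hypothetical stand-in for the Spark relation; only the Databricks exception name comes from this PR, everything else is illustrative:

```java
import java.util.Collections;
import java.util.List;

class CredentialPassthroughGuard {
  // The Databricks exception is a private class, so it cannot be imported
  // and caught directly; match on the fully qualified class name instead.
  private static final String AZURE_CREDENTIAL_NOT_FOUND =
      "com.databricks.backend.daemon.data.client.adl.AzureCredentialNotFoundException";

  // Hypothetical stand-in for the Spark relation being inspected.
  interface HadoopFsRelationLike {
    List<String> inputFiles();
  }

  static List<String> safeInputFiles(HadoopFsRelationLike relation) {
    try {
      // May trigger the Databricks custom auth on the listener thread.
      return relation.inputFiles();
    } catch (Exception e) {
      // Swallow only the expected Databricks exception; rethrow anything
      // else so real failures still show up in the logs.
      if (e.getClass().getName().equals(AZURE_CREDENTIAL_NOT_FOUND)) {
        return Collections.emptyList();
      }
      throw e;
    }
  }
}
```

Matching on `getClass().getName()` avoids a compile-time dependency on the private Databricks class, while rethrowing everything else addresses the concern about losing important logs.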
OK. Now I get the problem. Well, making the global OpenLineage class aware of AzureCredentialNotFoundException is not super clean, but I cannot see any better approach to handle this (except for creating a custom package with a custom LogicalRelationDatasetBuilder).
Would love to get @collado-mike's opinion on that.
@collado-mike Gentle nudge for your opinion on how to handle catching an exception that only shows up on Databricks because they're using a custom class to authenticate against our cloud storage.
Sorry. My typical preference for this kind of thing is to try to identify what distinguishes this case from the others and see if we can pull out a dedicated visitor. Will it always apply to Hadoop FS relations with an ABFS URL? Sometimes the file path points to a mount on the filesystem, right? So we can't always detect ahead of time when this will happen?
@collado-mike - Thank you for the reply! This credential passthrough is supposed to be transparent. It's based on this standard in Hadoop.
I don't know if we can tell, at the Spark level, that a particular path is going to be affected by this setting. In addition, if you have this configured, it might apply to only ONE of the paths you're using. For example, I might have an ABFS path that I'm accessing via a storage key, another ABFS path I'm accessing via OAuth, and finally a third that is using this credential passthrough feature. All three might be used in the same query.
Does OpenLineage already handle this credentials passthrough idea on AWS by any chance?
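To illustrate the mixed-auth scenario described above, here is a sketch using the standard per-account hadoop-azure (ABFS) override keys; the storage account names acct1/acct2/acct3 are hypothetical:

```java
import org.apache.spark.SparkConf;

public class MixedAbfsAuthExample {
  public static void main(String[] args) {
    SparkConf conf = new SparkConf()
        // acct1: accessed with a storage account key
        .set("spark.hadoop.fs.azure.account.auth.type.acct1.dfs.core.windows.net", "SharedKey")
        .set("spark.hadoop.fs.azure.account.key.acct1.dfs.core.windows.net", "<storage-key>")
        // acct2: accessed via OAuth client credentials
        .set("spark.hadoop.fs.azure.account.auth.type.acct2.dfs.core.windows.net", "OAuth")
        // acct3: custom token provider, e.g. Databricks credential passthrough
        .set("spark.hadoop.fs.azure.account.auth.type.acct3.dfs.core.windows.net", "Custom");
    // All three accounts can then be read in the same query, each resolving
    // credentials through a different mechanism.
  }
}
```

Nothing in the path itself reveals which mechanism applies, which is why the failure only surfaces when the path is actually accessed.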
Thanks for the details here. No, we don't have any special handling for the AWS credentials passthrough either. I don't like the reflection-based exception handling, but I think that until we have more general use cases, it's the most straightforward way to handle this case. It would be great to add the AWS use case to the list of issues so we can test and maybe come up with a more general way of addressing this.
@collado-mike - Does that mean we can accept this PR for now? I'll add an issue around "Handling pass through credentials more generically in S3 and ABFS filesystems".
Yeah, I haven't looked thoroughly at this, but I think the custom exception handling is fine as is
force-pushed from c8672df to 137f7cd
Closes #576 by attempting to extract HadoopFsRelation paths only if there is an exception. The Azure Databricks Credential Passthrough attempts to use a custom auth when connecting to ABFSS. As a result, the SparkListener is not running in a context in which the custom auth can get the current user's identity, resulting in an exception (com.databricks.backend.daemon.data.client.adl.AzureCredentialNotFoundException). Defining datasets based on relation URI paths is similar to InsertIntoHadoopFsRelationVisitor. Signed-off-by: Will Johnson <will@willj.co>
force-pushed from 137f7cd to 378a640
Problem
Azure Databricks provides a feature called "credential passthrough" on Azure Blob File System (ABFS) paths. The underlying mechanism is fully supported by the Hadoop standard, so this isn't something Databricks is doing on its own; it may occur with any other Spark implementation that connects to ABFS and uses a custom authentication method.
❌ When reading a file on ABFS with credential passthrough / custom auth enabled, OpenLineage throws an exception because it is unable to find the ADLS Gen 2 token.
✅ When writing a file with credential passthrough / custom auth enabled, OpenLineage continues to emit lineage correctly (assuming the inputs are not protected by credential passthrough).
Closes: #576
Solution
Attempt to extract HadoopFsRelation paths directly only when an exception is thrown.
The Azure Databricks Credential Passthrough attempts to use a custom auth when connecting to ABFSS.
As a result, the SparkListener is not running in a context in which the custom auth can get the current user's identity, resulting in an exception (com.databricks.backend.daemon.data.client.adl.AzureCredentialNotFoundException).
Defining datasets based on relation URI paths mirrors InsertIntoHadoopFsRelationVisitor, which works well with the custom auth used by Databricks.
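A simplified sketch of that fallback, using a hypothetical `FsRelation` interface in place of Spark's actual `HadoopFsRelation`; the exception class name is the one reported above:

```java
import java.net.URI;
import java.util.List;
import java.util.stream.Collectors;

class HadoopFsDatasetFallback {
  private static final String AZURE_CREDENTIAL_NOT_FOUND =
      "com.databricks.backend.daemon.data.client.adl.AzureCredentialNotFoundException";

  // Hypothetical view of the relation: enumerating files may need
  // credentials, while root paths come from the query plan itself.
  interface FsRelation {
    List<URI> enumerateInputFiles(); // may call into the custom ABFS auth
    List<URI> rootPaths();           // static metadata, no auth required
  }

  List<String> datasetNames(FsRelation relation) {
    List<URI> uris;
    try {
      uris = relation.enumerateInputFiles();
    } catch (Exception e) {
      if (!e.getClass().getName().equals(AZURE_CREDENTIAL_NOT_FOUND)) {
        throw e; // unrelated failures should still propagate
      }
      // Credential passthrough blocked file listing on the listener
      // thread; fall back to the relation's root paths instead.
      uris = relation.rootPaths();
    }
    return uris.stream().map(URI::toString).collect(Collectors.toList());
  }
}
```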
Checklist
- Updated CHANGELOG.md with details about the change under the "Unreleased" section (if relevant; depending on the change, this may not be necessary)