Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[object_store] support azure managed and workload identities #3581

Merged
merged 9 commits into from
Jan 25, 2023

Conversation

roeap
Copy link
Contributor

@roeap roeap commented Jan 21, 2023

Which issue does this PR close?

Closes #3580

Rationale for this change

see linked issue

What changes are included in this PR?

added implementations for a managed identity and federated workload credential for azure along with corresponding configuration on the azure builder.

Are there any user-facing changes?

users can now use managed identities and workload credentials to authenticate object store requests.

@github-actions github-actions bot added the object-store Object Store Interface label Jan 21, 2023
@tustvold
Copy link
Contributor

I intend to review this first thing tomorrow

Comment on lines +657 to +658
assert_eq!(req.uri().path(), format!("/{}/oauth2/v2.0/token", tenant));
assert_eq!(req.method(), &Method::POST);
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One thing I really wanted to check here is if the right content from the federated token file is included in the client_assertion. However getting the body bytes is an async operation. We could probably get a tokio runtime internally, but I was wondering if there is a better way to achieve this.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You could use futures::block_on

Copy link
Contributor

@tustvold tustvold left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Mostly just some minor nits, looks good thank you 👍

@@ -294,45 +306,241 @@ impl ClientSecretOAuthProvider {
pub fn new(
client_id: String,
client_secret: String,
tenant_id: String,
tenant_id: impl AsRef<str>,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

/// Fetch a token
pub async fn fetch_token(
async fn fetch_token(&self, client: &Client, retry: &RetryConfig) -> Result<String> {
self.cache
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What do you think about moving the TokenCache onto CredentialProvider::TokenProvider?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

makes absolute sense - done.

#[derive(serde::Deserialize, Debug)]
#[async_trait::async_trait]
pub trait TokenCredential: std::fmt::Debug + Send + Sync + 'static {
async fn fetch_token(&self, client: &Client, retry: &RetryConfig) -> Result<String>;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the client should be passed at credential construction, this avoids issues with clients not permitting https

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I passed in a client for the managed identity, but I was unsure about the other credentials as they always use ssl. So we just ignore the client for the managed identity. If however the "savings" of not having a redundant instance of the client are not relevant, I'm happy to move it for all credentials.

(None, None, Some(msi_res_id)) => {
query_items.push(("msi_res_id", msi_res_id))
}
_ => (),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this silently ignores parameters if multiple are specified, I can't see why this would be the case. If this is intentional I think we should return an error for this case

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

True - I went for just having a hierarchy. Since object id and msi_res_id are more specific (only used by managed identity) and less likely to just "end up" in the environment, I thought they would get precedence.

.header("metadata", "true")
.query(&query_items);

if let Ok(val) = std::env::var(MSI_SECRET_ENV_KEY) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it would be better to provide this in the constructor, and delegate to the config key system

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was not sure about this value. From what I understood, its rotated at runtime.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unless I'm mistaken, environment variables can't reliably be altered at runtime by an external process. So I think it should be fine to move it outside

Comment on lines 641 to 649
let fs = LocalFileSystem::new_with_prefix(root.path()).unwrap();
let tenant = "tenant";
let tokenfile = root.path().join("tokenfile");
fs.put(
&crate::path::Path::from("tokenfile"),
bytes::Bytes::from("federated-token"),
)
.await
.unwrap();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why not just use NamedTempFile here?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

good point :)

Comment on lines +657 to +658
assert_eq!(req.uri().path(), format!("/{}/oauth2/v2.0/token", tenant));
assert_eq!(req.method(), &Method::POST);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You could use futures::block_on

object_store/src/azure/mod.rs Outdated Show resolved Hide resolved
msi_endpoint: Option<String>,
) -> Self {
let msi_endpoint = msi_endpoint.unwrap_or_else(|| {
"http://169.254.169.254/metadata/identity/oauth2/token".to_owned()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A problem we had to workaround with the AWS provider, is ensuring that the client used for this is not https only. In practice this meant creating a separate client for it

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

completely missed that - using a dedicated client here as well now.

@@ -824,18 +947,40 @@ impl MicrosoftAzureBuilder {
Ok(credential::CredentialProvider::AccessKey(bearer_token))
} else if let Some(access_key) = self.access_key {
Ok(credential::CredentialProvider::AccessKey(access_key))
} else if self.use_managed_identity {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not familiar with Azure SDKs, but the AWS and Google SDKs default to using managed identity if nothing else is specified. I'm not sure if this is something we want to do here?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The default azure credential does a lot of opinionated things that I think we do not want to replicate here :). That said, it makes sense to use the metadata as default since it may work without any configuration.

@tustvold
Copy link
Contributor

I don't have the means to test this, so I am going to take it on faith that this works as advertised 😄

@tustvold tustvold merged commit 73ce076 into apache:master Jan 25, 2023
@ursabot
Copy link

ursabot commented Jan 25, 2023

Benchmark runs are scheduled for baseline = f0be9da and contender = 73ce076. 73ce076 is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
Conbench compare runs links:
[Skipped ⚠️ Benchmarking of arrow-rs-commits is not supported on ec2-t3-xlarge-us-east-2] ec2-t3-xlarge-us-east-2
[Skipped ⚠️ Benchmarking of arrow-rs-commits is not supported on test-mac-arm] test-mac-arm
[Skipped ⚠️ Benchmarking of arrow-rs-commits is not supported on ursa-i9-9960x] ursa-i9-9960x
[Skipped ⚠️ Benchmarking of arrow-rs-commits is not supported on ursa-thinkcentre-m75q] ursa-thinkcentre-m75q
Buildkite builds:
Supported benchmarks:
ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python, R. Runs only benchmarks with cloud = True
test-mac-arm: Supported benchmark langs: C++, Python, R
ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java

@roeap
Copy link
Contributor Author

roeap commented Jan 25, 2023

@tustvold - I also have to set up a real-world test and am in the process of doing this right now. Will open up a follow up PR if I see that changes need to be made. Also wanted to move that last variable :). Sorry for not mentioning that sooner!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
object-store Object Store Interface
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[object_store] support more identity based auth flows for azure
3 participants