Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

CORE-2369: Fix/azure managed identities hns detection #17840

Merged

Conversation

andijcr
Copy link
Contributor

@andijcr andijcr commented Apr 12, 2024

summary:

  • tunable cluster property cloud_storage_azure_hierarchical_namespace_enabled optional. If set, will skip abs_client::self_configure and use the value of the property (set it to true if the azure_container has HNS enabled)
  • if using oauth on azure (cloud_storage_credentials_source= azure_aks_oidc_federation or azure_vm_instance_metadata) , discover if HNS is enabled via Set Blob Expiry endpoint
  • bugfixes in configuration validation

HNS (Hierarchical Namespace) is an ABS feature that changes the behavior of some APIs, like deleting/listing.

The current implementation for ABS checks with Get Account Information at startup if HNS is enabled for this account/container
#12632
https://learn.microsoft.com/en-us/rest/api/storageservices/get-account-information?tabs=shared-access-signatures

This rest endpoint is not compatible with OAuth, causing the cluster configured with
cloud_storage_credentials_source = azure_aks_oidc_federation | azure_vm_managed_identity
to fail at a startup

This PR uses a different API, Set Blob Expirty, if the authentication is OAuth.
This API will return 400 if HNS is not enabled and 200/404 if enabled. It is run on a non-existing object, so it's expected to return 404 (not found)

related to this issue, cloud_storage_enable=true would cause cloud_storage_credentials_source = azure_aks_oidc_federation | azure_vm_managed_identity to check for cloud_storage_region/cloud_storage_bucket instead of their azure counterparts.

Fixes: CORE-2369
Fixes: CORE-2375

Backports Required

  • none - not a bug fix
  • none - this is a backport
  • none - issue does not exist in previous branches
  • none - papercut/not impactful enough to backport
  • v23.3.x
  • v23.2.x

Release Notes

  • none

@andijcr
Copy link
Contributor Author

andijcr commented Apr 12, 2024

/ci-repeat 1

@dotnwat
Copy link
Member

dotnwat commented Apr 13, 2024

Can you create a ticket and add it to the sprint?

@andijcr andijcr marked this pull request as ready for review April 15, 2024 16:14
@andijcr andijcr added this to the 24.1 milestone Apr 15, 2024
@vbotbuildovich
Copy link
Collaborator

vbotbuildovich commented Apr 15, 2024

new failures in https://buildkite.com/redpanda/redpanda/builds/47810#018ee2dc-1c1f-43a1-9d7d-ae07faa7fcfd:

"rptest.tests.cluster_config_test.ClusterConfigTest.test_cloud_validation"

new failures in https://buildkite.com/redpanda/redpanda/builds/47810#018ee2cc-df92-4f2b-8d57-4d08923eb84c:

"rptest.tests.cluster_config_test.ClusterConfigTest.test_cloud_validation"

@andijcr andijcr changed the title Fix/azure managed identities hns detection CORE-2369 Fix/azure managed identities hns detection Apr 15, 2024
@andijcr andijcr changed the title CORE-2369 Fix/azure managed identities hns detection CORE-2369: Fix/azure managed identities hns detection Apr 15, 2024
andijcr added 3 commits April 17, 2024 14:33
the property requires cloud_storage_azure_storage_account and
cloud_storage_azure_container, not the s3/gcp variants
the property needs cloud_storage_azure_storage_account and
cloud_storage_azure_container, not the s3/gcp variants
@andijcr andijcr force-pushed the fix/azure_managed_identities_hns_detection branch from e5f36bb to ceb6b5a Compare April 17, 2024 16:05
@andijcr andijcr requested a review from a team as a code owner April 17, 2024 16:05
@andijcr andijcr force-pushed the fix/azure_managed_identities_hns_detection branch from ceb6b5a to f330020 Compare April 17, 2024 21:40
andijcr added 7 commits April 18, 2024 00:03
This property is only meaninful for Azure, it's needed to decide if HNS
(Hierachical Namespace) is active for a storage account.

currently, if redpanda uses ABS, a REST endpoint is called at startup to
determine if HNS is active. This endpoint requires signed requests, so
cloud_storage_azure_shared_key has to be set and
cloud_storage_credentials_source has to be config_file.

To support cloud_storage_credential_source=azure_aks_oidc_federation or
azure_vm_instance_metadata, this property can be set to inform if the
HNS is active; this is needed because because these two credentials
source use OAuth instead of signing.

The property is an optional<bool>. if it's set,
abs_client::self_configure will return its value
use cloud_storage_azure_hierarchical_namespace_enabled to skip
self_configure, if set
specialize test_cloud_validation for azure abs new configuration source
query to determine if a credential applier is based on access tokens.

This is useful when using API that do not support oauth
Set Expiry api can be used to set autodelete on a file.

It's available only if Hierarchical Namespace are enabled on the
container, So it can be uses to test for HNS

200 or 404: HNS available
400: HNS not enabled
test if set_expiry is available on the container. try it on a file that
does not exists:

if HNS is not enabled, it will return 400 Bad Request

otherwise 200 / 404 if HNS is enabled
used in case the credential applier uses oauth
@andijcr andijcr force-pushed the fix/azure_managed_identities_hns_detection branch from f330020 to a15bf10 Compare April 17, 2024 22:04
@andijcr
Copy link
Contributor Author

andijcr commented Apr 17, 2024

tested on azure vm
with storage account without and with HNS enabled and a user assigned managed identity assigned as Storage Blob Data Contributor
using these configs

cloud_storage_azure_storage_account ${AZURE_ACCOUNT}
cloud_storage_azure_container ${AZURE_CONTAINER}
cloud_storage_azure_managed_identity_id ${AZURE_MANAGED_IDENTITY_CLIENT_ID}
cloud_storage_credentials_source azure_vm_instance_metadata
cloud_storage_enabled true

test image https://hub.docker.com/layers/andijcrrp/redpanda/azure_vm/images/sha256-69bc878972235e3b9ee98181742a56c559f2d24d93d1116d6a3f622b16116fd5?tab=layers

Copy link
Member

@dotnwat dotnwat left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this looks good to me. can you add some detail about the expiry-test example and where you learned that?

Comment on lines +289 to +292
/// \brief test if Hierarchical Namespace is enabled in the storage account
/// by creating a test file and calling Set Expiry on this. Set Expiry is
/// available only when HNS is enabled. use the result to infer if HNS is
/// enabled.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is this a strategy you've seen used elsewhere, or documented somewhere? where can i learn more?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

updated the cover letter with a couple of links and I'll add details in the confluence page for this feature

tests/rptest/tests/cluster_config_test.py Show resolved Hide resolved
auto header = http::client::request_header{};

header.method(boost::beast::http::verb::put);
header.target(fmt::format("/{}/{}?comp=expiry", name(), key().string()));
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

follow-up add unit test to ensure that the target is properly formatted.
an improperly formatted target might result in 400 instead of 404

(an example:

key()          -> testingcontainer/"testHNS"?comp=expiry -> 400
key().string() -> testingcontainer/testHNS?comp=expiry   -> 404

src/v/cloud_storage_clients/abs_client.cc Show resolved Hide resolved
src/v/cloud_storage_clients/abs_client.h Show resolved Hide resolved
@andijcr andijcr merged commit 3b597d7 into redpanda-data:dev Apr 18, 2024
19 checks passed
@andijcr andijcr deleted the fix/azure_managed_identities_hns_detection branch April 18, 2024 09:29
"cloud_storage_azure_hierarchical_namespace_enabled",
"Whether or not Azure Hierarchical Namespaces are enabled on the "
"cloud_storage_azure_storage_account. If this property is not set, "
"cloud_storage_azure_shared_key must be set, and each node will try to "
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If this property is not set, cloud_storage_azure_shared_key must be set

Is cloud_storage_azure_shared_key mandatory? From reading the test code it doesn't seem so. It is just an override and can be left nullopt for auto inference in both cases.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is not test covering this property. ⚠️

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is cloud_storage_azure_shared_key mandatory

it was mandatory before the introduction of cloud_storage_credential_sources = azure_aks_oidc_federation / azure_vm_instance_metadata

There is not test covering this property. ⚠️

test_abs_cloud_validation (in this PR) is a positive test for valid configurations, but I'll add a negative test

src/v/redpanda/admin/server.cc Show resolved Hide resolved
/// available only when HNS is enabled. use the result to infer if HNS is
/// enabled.
ss::future<storage_account_info>
do_test_set_expiry_on_dummy_file(ss::lowres_clock::duration timeout);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This method is called do_test_set_expiry_on_dummy_file but it returns storage_account_info. That just doesn't make sense. Why not make it return a bool?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it's an alternative for
do_get_account_info() and it returns the same type.
maybe I can rename the caller to get_storage_account_info() , and the two functions to something like
do_storage_account_info_via_get_account_info and do_storage_account_info_via_set_expiry ? less confusing?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is not an alternative as it won't ever return account information. We just infer one property of the account. Instead of upgrading the "is hns supported test" to "account info" I suggest downgrading the existing "account info" method to something like "is hns supported" too.

@@ -826,7 +876,66 @@ ss::future<abs_client::list_bucket_result> abs_client::do_list_objects(

ss::future<result<abs_client::storage_account_info, error_outcome>>
abs_client::get_account_info(ss::lowres_clock::duration timeout) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd rename this to check_hns_active. We don't really get the account info anymore. The naming is confusing. Imagine the frustraction of someone in the future trying to use this thinking that you really get account info in there

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

alternative: #17840 (comment)

vassert(response_stream->is_header_done(), "Header is not received");
auto const& headers = response_stream->get_headers();

if (headers.result() == boost::beast::http::status::bad_request) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What's our strategy for testing that this code is correct? In production? :)

I'm worried that we might break this request by accident and infer the wrong value for whether HNS gets enabled.

Also, do they return a good error code in the body we can also check?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What's our strategy for testing that this code is correct? In production? :)

waiting for CDT to be updated to run on azure with managed identities, so for now local emulation + manual testing.

I'm worried that we might break this request by accident and infer the wrong value for whether HNS gets enabled.

yes that's the reason for cloud_storage_azure_hierarchical_namespace_enabled

Also, do they return a good error code in the body we can also check?

testing shows x-ms-error-code: HierarchicalNamespaceNotEnabled in the header and
<Error><Code>HierarchicalNamespaceNotEnabled</Code></Error>
in the body, but it's not a code that's documented

we have to strategies: if we get a 400 but not the exact error code

  1. log error (warn?) but assume no HNS
  2. log error and stop redpanda from starting.

for 1, redpanda will start and if we are wrong there will be further errors

TRACE 2024-04-17 21:19:00,210 [shard 0:main] abs - abs_client.cc:840 - send https request:
GET /testingcontainer?restype=container&comp=list&prefix=cluster_metadata/1461eab4-668b-46bc-ae1e-97f188d46842/manifests/&delimiter=/ HTTP/1.1
Host: testingimds2ab.blob.core.windows.net
x-ms-version: 2023-01-03
Authorization: [secret]
 
ERROR 2024-04-17 21:19:00,240 [shard 0:main] abs - util.cc:108 - Unexpected error cloud_storage_clients::xml_parse_exception (Invalid state: parsing Last-Modified when not in Blob tag)
WARN  2024-04-17 21:19:00,240 [shard 0:main] cloud_storage - [fiber71~0|1|3599970ms] - remote.cc:1444 - ListObjectsV2 testingcontainer, unexpected error: cloud_storage_clients::error_outcome:1
WARN  2024-04-17 21:19:00,240 [shard 0:main] cluster - uploader.cc:377 - Manifest download failed in term 13: 2 0 List objects failedList objects failed
INFO  2024-04-17 21:19:00,240 [shard 0:main] cluster - uploader.cc:368 - Syncing cluster metadata manifest in term 13

in this case cluster/cloud_metadata/uploader.cc::upload_until_term_change() is not making progress because listing works differently under HNS

now, looking at the code it seems that the issue with listing might be salvageable with some tricks to tolerate the presence of keys for empty folders, and we might do the same for delete_object (the other method using adls endpoints to deal with directories),

so the end result would be a log with errors and potentially orphaned objects, but redpanda should keep working, if we guessed wrongly that HNS is disabled.

I'll check ifthis makes sense and if we have test coverage for delete/listing

@nvartolomei
Copy link
Contributor

nvartolomei commented Apr 18, 2024 via email

piyushredpanda added a commit that referenced this pull request Apr 19, 2024
CORE-2452 followup for azure managed identities #17840
nvartolomei added a commit to nvartolomei/redpanda that referenced this pull request Jul 31, 2024
…abled

Since redpanda-data#17840 we should be
able to always infer the correct setting. Update the docs accordingly to
avoid confusing users into thinking that they must set this setting if
`cloud_storage_azure_shared_key` is not set.
nvartolomei added a commit to nvartolomei/redpanda that referenced this pull request Jul 31, 2024
…abled

Since redpanda-data#17840 we should be
able to always infer the correct setting. Update the docs accordingly to
avoid confusing users into thinking that they must set this setting if
`cloud_storage_azure_shared_key` is not set.
nvartolomei added a commit to nvartolomei/redpanda that referenced this pull request Aug 1, 2024
…abled

Since redpanda-data#17840 we should be
able to always infer the correct setting. Update the docs accordingly to
avoid confusing users into thinking that they must set this setting if
`cloud_storage_azure_shared_key` is not set.
vbotbuildovich pushed a commit to vbotbuildovich/redpanda that referenced this pull request Aug 8, 2024
…abled

Since redpanda-data#17840 we should be
able to always infer the correct setting. Update the docs accordingly to
avoid confusing users into thinking that they must set this setting if
`cloud_storage_azure_shared_key` is not set.

(cherry picked from commit f32c541)
vbotbuildovich pushed a commit to vbotbuildovich/redpanda that referenced this pull request Aug 8, 2024
…abled

Since redpanda-data#17840 we should be
able to always infer the correct setting. Update the docs accordingly to
avoid confusing users into thinking that they must set this setting if
`cloud_storage_azure_shared_key` is not set.

(cherry picked from commit f32c541)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants