-
Notifications
You must be signed in to change notification settings - Fork 593
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
CORE-2369: Fix/azure managed identities hns detection #17840
CORE-2369: Fix/azure managed identities hns detection #17840
Conversation
/ci-repeat 1 |
ducktape was retried in https://buildkite.com/redpanda/redpanda/builds/47748#018ed433-9767-49fa-b5b6-8efa8d7e1c0d ducktape was retried in https://buildkite.com/redpanda/redpanda/builds/47810#018ee2cc-df96-4e00-9cf7-98e9c8463e21 ducktape was retried in https://buildkite.com/redpanda/redpanda/builds/47810#018ee2dc-1c25-467f-b270-fda369f0682b |
Can you create a ticket and add it to the sprint? |
new failures in https://buildkite.com/redpanda/redpanda/builds/47810#018ee2dc-1c1f-43a1-9d7d-ae07faa7fcfd:
new failures in https://buildkite.com/redpanda/redpanda/builds/47810#018ee2cc-df92-4f2b-8d57-4d08923eb84c:
|
the property requires cloud_storage_azure_storage_account and cloud_storage_azure_container, not the s3/gcp variants
the property needs cloud_storage_azure_storage_account and cloud_storage_azure_container, not the s3/gcp variants
e5f36bb
to
ceb6b5a
Compare
ceb6b5a
to
f330020
Compare
This property is only meaninful for Azure, it's needed to decide if HNS (Hierachical Namespace) is active for a storage account. currently, if redpanda uses ABS, a REST endpoint is called at startup to determine if HNS is active. This endpoint requires signed requests, so cloud_storage_azure_shared_key has to be set and cloud_storage_credentials_source has to be config_file. To support cloud_storage_credential_source=azure_aks_oidc_federation or azure_vm_instance_metadata, this property can be set to inform if the HNS is active; this is needed because because these two credentials source use OAuth instead of signing. The property is an optional<bool>. if it's set, abs_client::self_configure will return its value
use cloud_storage_azure_hierarchical_namespace_enabled to skip self_configure, if set
specialize test_cloud_validation for azure abs new configuration source
query to determine if a credential applier is based on access tokens. This is useful when using API that do not support oauth
Set Expiry api can be used to set autodelete on a file. It's available only if Hierarchical Namespace are enabled on the container, So it can be uses to test for HNS 200 or 404: HNS available 400: HNS not enabled
test if set_expiry is available on the container. try it on a file that does not exists: if HNS is not enabled, it will return 400 Bad Request otherwise 200 / 404 if HNS is enabled
used in case the credential applier uses oauth
f330020
to
a15bf10
Compare
tested on azure vm
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this looks good to me. can you add some detail about the expiry-test example and where you learned that?
/// \brief test if Hierarchical Namespace is enabled in the storage account | ||
/// by creating a test file and calling Set Expiry on this. Set Expiry is | ||
/// available only when HNS is enabled. use the result to infer if HNS is | ||
/// enabled. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is this a strategy you've seen used elsewhere, or documented somewhere? where can i learn more?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
updated the cover letter with a couple of links and I'll add details in the confluence page for this feature
auto header = http::client::request_header{}; | ||
|
||
header.method(boost::beast::http::verb::put); | ||
header.target(fmt::format("/{}/{}?comp=expiry", name(), key().string())); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
follow-up add unit test to ensure that the target is properly formatted.
an improperly formatted target might result in 400 instead of 404
(an example:
key() -> testingcontainer/"testHNS"?comp=expiry -> 400
key().string() -> testingcontainer/testHNS?comp=expiry -> 404
"cloud_storage_azure_hierarchical_namespace_enabled", | ||
"Whether or not Azure Hierarchical Namespaces are enabled on the " | ||
"cloud_storage_azure_storage_account. If this property is not set, " | ||
"cloud_storage_azure_shared_key must be set, and each node will try to " |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If this property is not set, cloud_storage_azure_shared_key must be set
Is cloud_storage_azure_shared_key
mandatory? From reading the test code it doesn't seem so. It is just an override and can be left nullopt for auto inference in both cases.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There is not test covering this property.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is cloud_storage_azure_shared_key mandatory
it was mandatory before the introduction of cloud_storage_credential_sources = azure_aks_oidc_federation / azure_vm_instance_metadata
There is not test covering this property.
⚠️
test_abs_cloud_validation (in this PR) is a positive test for valid configurations, but I'll add a negative test
/// available only when HNS is enabled. use the result to infer if HNS is | ||
/// enabled. | ||
ss::future<storage_account_info> | ||
do_test_set_expiry_on_dummy_file(ss::lowres_clock::duration timeout); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This method is called do_test_set_expiry_on_dummy_file
but it returns storage_account_info
. That just doesn't make sense. Why not make it return a bool?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it's an alternative for
do_get_account_info()
and it returns the same type.
maybe I can rename the caller to get_storage_account_info()
, and the two functions to something like
do_storage_account_info_via_get_account_info
and do_storage_account_info_via_set_expiry
? less confusing?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It is not an alternative as it won't ever return account information. We just infer one property of the account. Instead of upgrading the "is hns supported test" to "account info" I suggest downgrading the existing "account info" method to something like "is hns supported" too.
@@ -826,7 +876,66 @@ ss::future<abs_client::list_bucket_result> abs_client::do_list_objects( | |||
|
|||
ss::future<result<abs_client::storage_account_info, error_outcome>> | |||
abs_client::get_account_info(ss::lowres_clock::duration timeout) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd rename this to check_hns_active
. We don't really get the account info anymore. The naming is confusing. Imagine the frustraction of someone in the future trying to use this thinking that you really get account info in there
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
alternative: #17840 (comment)
vassert(response_stream->is_header_done(), "Header is not received"); | ||
auto const& headers = response_stream->get_headers(); | ||
|
||
if (headers.result() == boost::beast::http::status::bad_request) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What's our strategy for testing that this code is correct? In production? :)
I'm worried that we might break this request by accident and infer the wrong value for whether HNS gets enabled.
Also, do they return a good error code in the body we can also check?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What's our strategy for testing that this code is correct? In production? :)
waiting for CDT to be updated to run on azure with managed identities, so for now local emulation + manual testing.
I'm worried that we might break this request by accident and infer the wrong value for whether HNS gets enabled.
yes that's the reason for cloud_storage_azure_hierarchical_namespace_enabled
Also, do they return a good error code in the body we can also check?
testing shows x-ms-error-code: HierarchicalNamespaceNotEnabled
in the header and
<Error><Code>HierarchicalNamespaceNotEnabled</Code></Error>
in the body, but it's not a code that's documented
we have to strategies: if we get a 400 but not the exact error code
- log error (warn?) but assume no HNS
- log error and stop redpanda from starting.
for 1, redpanda will start and if we are wrong there will be further errors
TRACE 2024-04-17 21:19:00,210 [shard 0:main] abs - abs_client.cc:840 - send https request:
GET /testingcontainer?restype=container&comp=list&prefix=cluster_metadata/1461eab4-668b-46bc-ae1e-97f188d46842/manifests/&delimiter=/ HTTP/1.1
Host: testingimds2ab.blob.core.windows.net
x-ms-version: 2023-01-03
Authorization: [secret]
ERROR 2024-04-17 21:19:00,240 [shard 0:main] abs - util.cc:108 - Unexpected error cloud_storage_clients::xml_parse_exception (Invalid state: parsing Last-Modified when not in Blob tag)
WARN 2024-04-17 21:19:00,240 [shard 0:main] cloud_storage - [fiber71~0|1|3599970ms] - remote.cc:1444 - ListObjectsV2 testingcontainer, unexpected error: cloud_storage_clients::error_outcome:1
WARN 2024-04-17 21:19:00,240 [shard 0:main] cluster - uploader.cc:377 - Manifest download failed in term 13: 2 0 List objects failedList objects failed
INFO 2024-04-17 21:19:00,240 [shard 0:main] cluster - uploader.cc:368 - Syncing cluster metadata manifest in term 13
in this case cluster/cloud_metadata/uploader.cc::upload_until_term_change() is not making progress because listing works differently under HNS
now, looking at the code it seems that the issue with listing might be salvageable with some tricks to tolerate the presence of keys for empty folders, and we might do the same for delete_object (the other method using adls endpoints to deal with directories),
so the end result would be a log with errors and potentially orphaned objects, but redpanda should keep working, if we guessed wrongly that HNS is disabled.
I'll check ifthis makes sense and if we have test coverage for delete/listing
I would stop Redpanda from starting in that case. We can override the
config locally (set manually hns override) in that case right?
…On Thu, 18 Apr 2024 at 12:01, Andrea Barbadoro ***@***.***> wrote:
***@***.**** commented on this pull request.
------------------------------
In src/v/cloud_storage_clients/abs_client.cc
<#17840 (comment)>
:
> + abs_log.error,
+ "Failed to create set_expiry header: {}",
+ set_expiry_header.error());
+ throw std::system_error(set_expiry_header.error());
+ }
+
+ vlog(abs_log.trace, "send https request:\n{}", set_expiry_header.value());
+
+ auto response_stream = co_await _client.request(
+ std::move(set_expiry_header.value()), timeout);
+
+ co_await response_stream->prefetch_headers();
+ vassert(response_stream->is_header_done(), "Header is not received");
+ auto const& headers = response_stream->get_headers();
+
+ if (headers.result() == boost::beast::http::status::bad_request) {
What's our strategy for testing that this code is correct? In production?
:)
waiting for CDT to be updated to run on azure with managed identities, so
for now local emulation + manual testing.
I'm worried that we might break this request by accident and infer the
wrong value for whether HNS gets enabled.
yes that's the reason for
cloud_storage_azure_hierarchical_namespace_enabled
Also, do they return a good error code in the body we can also check?
testing shows x-ms-error-code: HierarchicalNamespaceNotEnabled in the
header and
<Error><Code>HierarchicalNamespaceNotEnabled</Code></Error>
in the body, but it's not a code that's documented
we have to strategies: if we get a 400 but not the exact error code
1. log error (warn?) but assume no HNS
2. log error and stop redpanda from starting.
for 1, redpanda will start and if we are wrong there will be further errors
TRACE 2024-04-17 21:19:00,210 [shard 0:main] abs - abs_client.cc:840 - send https request:
GET /testingcontainer?restype=container&comp=list&prefix=cluster_metadata/1461eab4-668b-46bc-ae1e-97f188d46842/manifests/&delimiter=/ HTTP/1.1
Host: testingimds2ab.blob.core.windows.net
x-ms-version: 2023-01-03
Authorization: [secret]
ERROR 2024-04-17 21:19:00,240 [shard 0:main] abs - util.cc:108 - Unexpected error cloud_storage_clients::xml_parse_exception (Invalid state: parsing Last-Modified when not in Blob tag)
WARN 2024-04-17 21:19:00,240 [shard 0:main] cloud_storage - [fiber71~0|1|3599970ms] - remote.cc:1444 - ListObjectsV2 testingcontainer, unexpected error: cloud_storage_clients::error_outcome:1
WARN 2024-04-17 21:19:00,240 [shard 0:main] cluster - uploader.cc:377 - Manifest download failed in term 13: 2 0 List objects failedList objects failed
INFO 2024-04-17 21:19:00,240 [shard 0:main] cluster - uploader.cc:368 - Syncing cluster metadata manifest in term 13
in this case
cluster/cloud_metadata/uploader.cc::upload_until_term_change() is not
making progress because listing works differently under HNS
now, looking at the code it *seems* that the issue with listing might be
salvageable with some tricks to tolerate the presence of keys for empty
folders, and we *might* do the same for delete_object (the other method
using adls endpoints to deal with directories),
so the end result would be a log with errors and potentially orphaned
objects, but redpanda should keep working, if we guessed wrongly that HNS
is disabled.
I'll check ifthis makes sense and if we have test coverage for
delete/listing
—
Reply to this email directly, view it on GitHub
<#17840 (comment)>,
or unsubscribe
<https://github.com/notifications/unsubscribe-auth/AAEETWNFOPCRPMF3ENVNZ5DY56RW5AVCNFSM6AAAAABGEXJJVCVHI2DSMVQWIX3LMV43YUDVNRWFEZLROVSXG5CSMV3GSZLXHMZDAMBYGU2DCNJQGE>
.
You are receiving this because your review was requested.Message ID:
***@***.***>
|
CORE-2452 followup for azure managed identities #17840
…abled Since redpanda-data#17840 we should be able to always infer the correct setting. Update the docs accordingly to avoid confusing users into thinking that they must set this setting if `cloud_storage_azure_shared_key` is not set.
…abled Since redpanda-data#17840 we should be able to always infer the correct setting. Update the docs accordingly to avoid confusing users into thinking that they must set this setting if `cloud_storage_azure_shared_key` is not set.
…abled Since redpanda-data#17840 we should be able to always infer the correct setting. Update the docs accordingly to avoid confusing users into thinking that they must set this setting if `cloud_storage_azure_shared_key` is not set.
…abled Since redpanda-data#17840 we should be able to always infer the correct setting. Update the docs accordingly to avoid confusing users into thinking that they must set this setting if `cloud_storage_azure_shared_key` is not set. (cherry picked from commit f32c541)
…abled Since redpanda-data#17840 we should be able to always infer the correct setting. Update the docs accordingly to avoid confusing users into thinking that they must set this setting if `cloud_storage_azure_shared_key` is not set. (cherry picked from commit f32c541)
summary:
cloud_storage_azure_hierarchical_namespace_enabled
optional. If set, will skip abs_client::self_configure and use the value of the property (set it to true if the azure_container has HNS enabled)cloud_storage_credentials_source= azure_aks_oidc_federation or azure_vm_instance_metadata
) , discover if HNS is enabled via Set Blob Expiry endpointHNS (Hierarchical Namespace) is an ABS feature that changes the behavior of some APIs, like deleting/listing.
The current implementation for ABS checks with Get Account Information at startup if HNS is enabled for this account/container
#12632
https://learn.microsoft.com/en-us/rest/api/storageservices/get-account-information?tabs=shared-access-signatures
This rest endpoint is not compatible with OAuth, causing the cluster configured with
cloud_storage_credentials_source = azure_aks_oidc_federation | azure_vm_managed_identity
to fail at a startup
This PR uses a different API, Set Blob Expirty, if the authentication is OAuth.
This API will return 400 if HNS is not enabled and 200/404 if enabled. It is run on a non-existing object, so it's expected to return 404 (not found)
related to this issue, cloud_storage_enable=true would cause
cloud_storage_credentials_source = azure_aks_oidc_federation | azure_vm_managed_identity
to check for cloud_storage_region/cloud_storage_bucket instead of their azure counterparts.Fixes: CORE-2369
Fixes: CORE-2375
Backports Required
Release Notes