-
Notifications
You must be signed in to change notification settings - Fork 1.8k
fix: AWS Credential Refresh in Provider Chain for MSK IAM and add TLS support #11211
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
Conversation
|
Note Other AI code review bot(s) detectedCodeRabbit has detected other AI code review bot(s) in this pull request and will avoid duplicating their findings in the review comments. This may lead to a less comprehensive review. WalkthroughAttach a persistent TLS instance and AWS provider to the MSK IAM context; refactor payload builder to accept explicit credentials from that provider; force immediate credential refresh (clear cache + set next_refresh = 1) in EC2/STS/EKS/HTTP refresh paths; initialize provider/TLS at registration (sync → async) and destroy them on teardown; adjust shared-credentials ENOENT logging to debug. Changes
Sequence Diagram(s)sequenceDiagram
participant Reg as register()
participant MSK as flb_aws_msk_iam (context)
participant Prov as AWS Provider
participant OAuth as oauthbearer_token_refresh_cb
participant Payload as build_msk_iam_payload(creds)
participant Meta as Metadata/STS/EKS/HTTP
Note over Reg,MSK: register() creates & sync-initializes cred_tls and Provider
Reg->>MSK: attach cred_tls & Provider (init sync)
Reg->>Prov: switch provider to async
OAuth->>Prov: request credentials
alt force refresh path
Prov->>Prov: destroy cached creds
Prov->>Prov: set implementation->next_refresh = 1
end
Prov->>Meta: obtain temporary credentials (IMDS/ST S/EKS/HTTP)
Meta-->>Prov: return credentials
Prov-->>OAuth: provide creds
OAuth->>Payload: build payload using creds
Payload->>Meta: signed POST (OAuth exchange)
Meta-->>Payload: token
Payload-->>OAuth: token returned
OAuth->>Prov: destroy local creds
Note over MSK: Provider & cred_tls persist until destroy()
Reg->>MSK: destroy() tears down Provider and cred_tls
Estimated code review effort🎯 4 (Complex) | ⏱️ ~45 minutes
Poem
Pre-merge checks and finishing touches❌ Failed checks (1 warning)
✅ Passed checks (2 passed)
✨ Finishing touches
🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 1
🧹 Nitpick comments (3)
src/aws/flb_aws_msk_iam.c (3)
219-220: Consider usingflb_debugfor per-refresh logging.This log statement executes on every OAuth token refresh (~12 minutes). For production environments with multiple output instances,
flb_infomay be too verbose. The PR objective mentions reducing endpoint selection logs to debug level—this payload generation log could follow the same pattern.Also, the log message references
build_msk_iam_payload_with_credsbut the actual function name isbuild_msk_iam_payload.- flb_info("[aws_msk_iam] build_msk_iam_payload_with_creds: generating payload for host: %s, region: %s", + flb_debug("[aws_msk_iam] build_msk_iam_payload: generating payload for host: %s, region: %s", host, config->region);
638-638: Consider aligning log level with other endpoint logs.Lines 626, 630, and 635 use
flb_debugfor endpoint selection, but line 638 still usesflb_info. For consistency with the PR's goal of reducing verbosity, this could also beflb_debug.- flb_info("[aws_msk_iam] requesting MSK IAM payload for region: %s, host: %s", config->region, host); + flb_debug("[aws_msk_iam] requesting MSK IAM payload for region: %s, host: %s", config->region, host);
745-761: TLS debug level may be verbose in production.The
debugparameter is set toFLB_LOG_DEBUG, which could generate verbose TLS-level logging in production environments. Consider using a lower debug level (e.g.,0orFLB_FALSE) to reduce log noise, or making this configurable.ctx->cred_tls = flb_tls_create(FLB_TLS_CLIENT_MODE, FLB_TRUE, - FLB_LOG_DEBUG, + 0, /* TLS debug level - 0 for production */ NULL, /* vhost */
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
src/aws/flb_aws_msk_iam.c(10 hunks)
🧰 Additional context used
🧠 Learnings (3)
📚 Learning: 2025-08-31T12:46:11.940Z
Learnt from: ThomasDevoogdt
Repo: fluent/fluent-bit PR: 9277
File: .github/workflows/pr-compile-check.yaml:147-151
Timestamp: 2025-08-31T12:46:11.940Z
Learning: In fluent-bit, the correct CMake flag for using system librdkafka is `FLB_PREFER_SYSTEM_LIB_KAFKA=ON`.
Applied to files:
src/aws/flb_aws_msk_iam.c
📚 Learning: 2025-08-31T12:46:11.940Z
Learnt from: ThomasDevoogdt
Repo: fluent/fluent-bit PR: 9277
File: .github/workflows/pr-compile-check.yaml:147-151
Timestamp: 2025-08-31T12:46:11.940Z
Learning: In fluent-bit CMakeLists.txt, the system library preference flags are defined as FLB_PREFER_SYSTEM_LIB_ZSTD and FLB_PREFER_SYSTEM_LIB_KAFKA with the FLB_ prefix.
Applied to files:
src/aws/flb_aws_msk_iam.c
📚 Learning: 2025-08-29T06:25:27.250Z
Learnt from: shadowshot-x
Repo: fluent/fluent-bit PR: 10794
File: tests/internal/aws_compress.c:93-107
Timestamp: 2025-08-29T06:25:27.250Z
Learning: In Fluent Bit, ZSTD compression is enabled by default and is treated as a core dependency, not requiring conditional compilation guards like `#ifdef FLB_HAVE_ZSTD`. Unlike some other optional components such as ARROW/PARQUET (which use `#ifdef FLB_HAVE_ARROW` guards), ZSTD support is always available and doesn't need build-time conditionals. ZSTD headers are included directly without guards across multiple plugins and core components.
Applied to files:
src/aws/flb_aws_msk_iam.c
🧬 Code graph analysis (1)
src/aws/flb_aws_msk_iam.c (5)
src/aws/flb_aws_credentials.c (3)
flb_aws_credentials_destroy(752-767)flb_standard_chain_provider_create(269-327)flb_aws_provider_destroy(769-791)src/tls/flb_tls.c (2)
flb_tls_create(183-232)flb_tls_destroy(258-277)src/flb_sds.c (1)
flb_sds_destroy(389-399)include/fluent-bit/flb_mem.h (1)
flb_free(126-128)src/aws/flb_aws_util.c (1)
flb_aws_client_generator(280-283)
🔇 Additional comments (3)
src/aws/flb_aws_msk_iam.c (3)
41-48: LGTM - Struct changes align with the persistent provider design.The addition of
cred_tlsandproviderfields enables the credential lifecycle fix described in the PR objectives. The inline comments clearly document their purpose.
640-674: Credential lifecycle implementation looks correct.The pattern of fetching credentials from the persistent provider, using them for payload generation, and immediately destroying them prevents credential leakage and ensures fresh credentials on each refresh. This addresses the core issue of stale/expired IRSA credentials.
765-791: I'll help you verify the potential double-free issue with TLS ownership. Let me search the codebase for the relevant implementation details.
<function_calls>
#!/bin/bashSearch for flb_standard_chain_provider_create function definition
rg -n "flb_standard_chain_provider_create" src/aws/ -A 20 -B 2
Also search for where this function is defined (likely in a provider file)
fd ".provider.\.c$" src/aws/ | head -20
</function_calls>
| /* Destroy AWS provider (provider manages its own credential caching) */ | ||
| if (ctx->provider) { | ||
| flb_aws_provider_destroy(ctx->provider); | ||
| } | ||
|
|
||
| /* Clean up TLS instance */ | ||
| if (ctx->cred_tls) { | ||
| flb_tls_destroy(ctx->cred_tls); | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Double-free risk: verify TLS destruction order.
If flb_aws_provider_destroy also destroys the TLS instance (as shown in the relevant snippet where it calls flb_tls_destroy(provider->cred_tls)), then calling flb_tls_destroy(ctx->cred_tls) on line 821 after destroying the provider would cause a double-free.
If the provider does take ownership of the TLS, remove the explicit TLS destruction here:
/* Destroy AWS provider (provider manages its own credential caching) */
if (ctx->provider) {
flb_aws_provider_destroy(ctx->provider);
}
- /* Clean up TLS instance */
- if (ctx->cred_tls) {
- flb_tls_destroy(ctx->cred_tls);
- }
+ /* Note: TLS instance is owned and destroyed by the provider */If the provider does not take ownership, the current code is correct and this comment can be dismissed.
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| /* Destroy AWS provider (provider manages its own credential caching) */ | |
| if (ctx->provider) { | |
| flb_aws_provider_destroy(ctx->provider); | |
| } | |
| /* Clean up TLS instance */ | |
| if (ctx->cred_tls) { | |
| flb_tls_destroy(ctx->cred_tls); | |
| } | |
| /* Destroy AWS provider (provider manages its own credential caching) */ | |
| if (ctx->provider) { | |
| flb_aws_provider_destroy(ctx->provider); | |
| } | |
| /* Note: TLS instance is owned and destroyed by the provider */ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is not a double-free issue. The current code is correct.
After thorough analysis of the codebase, I can confirm that there is no double-free risk in the current implementation. Here's why:
TLS Ownership Model
There are two different credential provider creation patterns in the codebase with different ownership semantics:
-
flb_standard_chain_provider_create(used by MSK IAM):- Does NOT take ownership of the TLS instance
- The TLS is only passed as a parameter to sub-providers (e.g., EKS provider) for their use
- The
provider->cred_tlsfield is never set in this path - The caller remains responsible for managing the TLS lifecycle
-
flb_managed_chain_provider_create(used elsewhere):- Takes ownership of the TLS instance
- Stores it in
provider->cred_tls flb_aws_provider_destroyautomatically destroys it
Code Evidence
In src/aws/flb_aws_credentials.c, the flb_aws_provider_destroy function only destroys the TLS if it's present:
void flb_aws_provider_destroy(struct flb_aws_provider *provider)
{
if (provider) {
// ...
/* free managed dependencies */
if (provider->cred_tls) {
flb_tls_destroy(provider->cred_tls);
}
// ...
}
}Since flb_standard_chain_provider_create never assigns the TLS to provider->cred_tls, calling flb_aws_provider_destroy will not attempt to destroy the TLS instance.
Conclusion
The MSK IAM module correctly manages the TLS instance it creates. The explicit call to flb_tls_destroy(ctx->cred_tls) in flb_aws_msk_iam_destroy is necessary and correct because the provider does not take ownership of the TLS in this code path.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🧩 Analysis chain
🏁 Script executed:
#!/bin/bash
# Find and examine the implementation of flb_standard_chain_provider_create
echo "=== Searching for flb_standard_chain_provider_create implementation ==="
rg -n -A 30 "flb_standard_chain_provider_create\s*\(" --type c
echo ""
echo "=== Searching for flb_managed_chain_provider_create implementation ==="
rg -n -A 30 "flb_managed_chain_provider_create\s*\(" --type c
echo ""
echo "=== Examining flb_aws_provider_destroy for TLS cleanup ==="
rg -n -B 5 -A 10 "if \(provider->cred_tls\)" --type c
Length of output: 45861
🏁 Script executed:
#!/bin/bash
# Find the implementation of standard_chain_create to verify TLS ownership
echo "=== Searching for standard_chain_create implementation ==="
rg -n -B 5 -A 50 "^static.*standard_chain_create\s*\(" --type c src/aws/
echo ""
echo "=== Looking for assignments to provider->cred_tls in standard_chain_create context ==="
rg -n "provider->cred_tls\s*=" --type c src/aws/
Length of output: 8230
🏁 Script executed:
#!/bin/bash
# Get more context around the cred_tls assignments
echo "=== Context around provider->cred_tls assignments (lines 460-490) ==="
rg -n -C 15 "aws_provider->cred_tls\s*=" --type c src/aws/flb_aws_credentials.c
echo ""
echo "=== Verify standard_chain_create does not set cred_tls ==="
rg -n -A 100 "^static struct flb_aws_provider \*standard_chain_create" --type c src/aws/flb_aws_credentials.c | head -120
Length of output: 6596
@kalavt, thank you for the thorough analysis! After examining the codebase, I can confirm that you are absolutely correct—there is no double-free risk in this code.
Verification Results
The code evidence clearly shows two distinct ownership patterns:
1. flb_standard_chain_provider_create (used by MSK IAM):
- Does NOT set
provider->cred_tls - The TLS instance is only passed to sub-providers (like EKS provider) for their use
- The caller retains ownership and is responsible for cleanup
2. flb_managed_chain_provider_create (used elsewhere):
- DOES set
aws_provider->cred_tls = cred_tls;(line 471) - Takes ownership of the TLS instance
flb_aws_provider_destroywill automatically clean it up
Why This Code is Safe
In flb_aws_provider_destroy (lines 782-784):
if (provider->cred_tls) {
flb_tls_destroy(provider->cred_tls);
}Since flb_standard_chain_provider_create never sets provider->cred_tls, the condition evaluates to false, and the TLS is not destroyed by the provider. The explicit call to flb_tls_destroy(ctx->cred_tls) in MSK IAM's destroy function is necessary and correct.
The error-handling paths in MSK IAM's initialization (lines 775, 786) also demonstrate the correct ownership model by explicitly destroying ctx->cred_tls after provider cleanup.
Your analysis is spot-on, and the current implementation properly follows the ownership semantics of the credential provider system. No changes are needed.
🧬 Code Graph Analysis Results
-
File: src/aws/flb_aws_credentials.c
- Snippet: flb_aws_provider_destroy function (complete)
-
Summary/Context:
- Description: Destroys an AWS provider and its managed TLS credential (cred_tls) if present.
- Parameters: struct flb_aws_provider *provider
- Return: void
- Behavior:
- If provider is non-NULL, destroys provider implementation, mutex, and any managed dependencies.
- If provider->cred_tls is non-NULL, calls flb_tls_destroy(provider->cred_tls).
- If provider->sts_tls is non-NULL, calls flb_tls_destroy(provider->sts_tls).
- Frees the provider structure.
-
Code:
void flb_aws_provider_destroy(struct flb_aws_provider *provider)
{
if (provider) {
if (provider->implementation) {
provider->provider_vtable->destroy(provider);
}pthread_mutex_destroy(&provider->lock); /* free managed dependencies */ if (provider->base_aws_provider) { flb_aws_provider_destroy(provider->base_aws_provider); } if (provider->cred_tls) { flb_tls_destroy(provider->cred_tls); } if (provider->sts_tls) { flb_tls_destroy(provider->sts_tls); } flb_free(provider); }}
-
- Snippet: flb_aws_provider_destroy function (complete)
-
File: src/aws/flb_aws_credentials.c
- Snippet: flb_standard_chain_provider_create function (complete)
-
Summary/Context:
- Description: Creates a standard AWS credentials provider chain. Accepts an optional TLS instance and region, and may wrap the base provider with additional layers (e.g., EKS Pod execution role for Fargate). The function may create a base provider via standard_chain_create and, in the EKS Pod Fargate path, wraps it with an STS provider.
- Parameters:
- config: struct flb_config*
- tls: struct flb_tls* (TLS instance; ownership semantics depend on caller path)
- region: char* (region string)
- sts_endpoint: char* (STS endpoint)
- proxy: char* (proxy)
- generator: struct flb_aws_client_generator*
- profile: char* (profile)
- Return: struct flb_aws_provider*
- Exceptions/Errors: Returns NULL on allocation/creation failure; ensures proper cleanup on error paths.
- Important implementation details:
- If EKS Pod Execution Role environment variable is set, it creates a temporary base provider with TLS but does not transfer TLS ownership to the base provider; instead, it wraps with an STS pod provider.
- In non-EKS-Fargate path, it creates a standard chain provider using the provided TLS and region.
-
Code:
struct flb_aws_provider *flb_standard_chain_provider_create(struct flb_config
*config,
struct flb_tls *tls,
char *region,
char *sts_endpoint,
char *proxy,
struct
flb_aws_client_generator
*generator,
char *profile)
{
struct flb_aws_provider *provider;
struct flb_aws_provider *tmp_provider;
char *eks_pod_role = NULL;
char *session_name;eks_pod_role = getenv(EKS_POD_EXECUTION_ROLE); if (eks_pod_role && strlen(eks_pod_role) > 0) { /* * eks fargate * standard chain will be base provider used to * assume the EKS_POD_EXECUTION_ROLE */ flb_debug("[aws_credentials] Using EKS_POD_EXECUTION_ROLE=%s", eks_pod_role); tmp_provider = standard_chain_create(config, tls, region, sts_endpoint, proxy, generator, FLB_FALSE, profile); if (!tmp_provider) { return NULL; } session_name = flb_sts_session_name(); if (!session_name) { flb_error("Failed to generate random STS session name"); flb_aws_provider_destroy(tmp_provider); return NULL; } provider = flb_sts_provider_create(config, tls, tmp_provider, NULL, eks_pod_role, session_name, region, sts_endpoint, NULL, generator); if (!provider) { flb_error("Failed to create EKS Fargate Credential Provider"); flb_aws_provider_destroy(tmp_provider); return NULL; } /* session name can freed after provider is created */ flb_free(session_name); session_name = NULL; return provider; } /* standard case- not in EKS Fargate */ provider = standard_chain_create(config, tls, region, sts_endpoint, proxy, generator, FLB_TRUE, profile); return provider;}
-
- Snippet: flb_standard_chain_provider_create function (complete)
-
File: src/aws/flb_aws_msk_iam.c
-
Snippet: MSK IAM destroy logic (summary)
-
Summary/Context:
- Description: Cleans up MSK IAM context. Destroys the AWS provider (which manages credential caching) and then destroys the TLS instance used for AWS credentials if present.
- Parameters: struct flb_aws_msk_iam *ctx
- Return: void
- Important implementation details:
- Calls flb_aws_provider_destroy(ctx->provider) to clean up the provider (which may destroy cred_tls if owned by the provider).
- Destroys ctx->cred_tls with flb_tls_destroy if it exists.
-
Code (excerpt from the file):
void flb_aws_msk_iam_destroy(struct flb_aws_msk_iam *ctx)
{
if (!ctx) {
return;
}flb_info("[aws_msk_iam] destroying MSK IAM config"); /* Destroy AWS provider (provider manages its own credential caching) */ if (ctx->provider) { flb_aws_provider_destroy(ctx->provider); } /* Clean up TLS instance */ if (ctx->cred_tls) { flb_tls_destroy(ctx->cred_tls); } /* Clean up other resources */ if (ctx->region) { flb_sds_destroy(ctx->region); } if (ctx->cluster_arn) { flb_sds_destroy(ctx->cluster_arn); } flb_free(ctx); flb_info("[aws_msk_iam] MSK IAM config destroyed");}
-
-
Snippet: MSK IAM provider/provider lifecycle and TLS ownership notes (from file)
- Context:
- The code creates a TLS instance for AWS credentials (STS) and passes that TLS handle to the standard credential provider creation path. In this path, the provider does not take ownership of the TLS; the TLS is owned by the MSK IAM module and is explicitly destroyed in its destroy function.
- This aligns with the described ownership model where flb_standard_chain_provider_create does not assign the TLS to provider->cred_tls, whereas flb_managed_chain_provider_create would take ownership and be able to destroy it via flb_aws_provider_destroy.
- Context:
-
Signed-off-by: Arbin <arbin.cheng@coins.ph>
Signed-off-by: Arbin <arbin.cheng@coins.ph>
Signed-off-by: Arbin <arbin.cheng@coins.ph>
5864292 to
1f9b74a
Compare
💡 Codex Reviewfluent-bit/src/aws/flb_aws_msk_iam.c Lines 782 to 784 in 5864292
The provider is now created and initialized during Kafka configuration, but lines 782‑784 call ℹ️ About Codex in GitHubCodex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you
If Codex has suggestions, it will comment; otherwise it will react with 👍. When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback". |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 0
♻️ Duplicate comments (1)
src/aws/flb_aws_msk_iam.c (1)
805-822: Confirmcred_tlsownership to avoid a potential TLS double-freeThe destroy path now calls:
if (ctx->provider) { flb_aws_provider_destroy(ctx->provider); } if (ctx->cred_tls) { flb_tls_destroy(ctx->cred_tls); }A previous review noted that
flb_aws_provider_destroymight already destroy its internalcred_tls. If that’s still true, the explicitflb_tls_destroy(ctx->cred_tls)here would double-free the TLS object. If the provider does not own/destroy the TLS, this code is correct as-is.Please double-check
flb_aws_provider_destroyand any provider implementations used byflb_standard_chain_provider_createto confirm ownership, and then either keep or remove the explicit TLS destroy accordingly.You can run this from the repo root to inspect the destroy behavior:
#!/bin/bash # Inspect provider destroy and TLS ownership rg -n "flb_aws_provider_destroy" -C5 src # Also look for any flb_tls_destroy calls tied to provider fields rg -n "cred_tls" -C3 src
🧹 Nitpick comments (4)
src/aws/flb_aws_msk_iam.c (4)
41-48: MSK IAM context extension with TLS/provider looks good; unusedflb_configis a minor nitThe new
cred_tlsandproviderfields make sense for a persistent provider+TLS design, and the struct remains straightforward.ctx->flb_configis now only stored and never read, and the “lightweight config” description is a bit outdated given the embedded provider/TLS; consider either usingflb_configor dropping it and updating the comment to avoid confusion.
168-231: Creds-based payload builder is correct; align log naming / verbosityThe refactor to pass
struct flb_aws_credentials *credsexplicitly and the added NULL / completeness checks are solid. Two small nits:
- The log prefix
build_msk_iam_payload_with_credsdoesn’t match the actual function name, which can make grepping harder.- This
flb_infofires on every token refresh; consider downgrading toflb_debugif logs get noisy in large deployments.
640-687: Provider-backed credential retrieval & lifetime management are soundThe new flow—
get_credentialsfrom the persistent provider, building the payload with those creds, passingcreds->access_key_idas the principal, and then destroying the credentials exactly once in both success and failure paths—looks correct and avoids the previous “temporary provider per refresh” pattern. This assumesconfig->providerand its vtable are always valid once registration succeeds; if you expect any teardown concurrency between Kafka and the MSK IAM ctx, it may be worth adding a defensive NULL check or assertion aroundconfig->providerbefore dereferencing.
706-794: TLS + standard-chain provider init and error paths look good; stale comments can be updatedCreating a dedicated
cred_tlswithFLB_TLS_CLIENT_MODEand wiring it intoflb_standard_chain_provider_create, followed by an explicitprovider_vtable->init, gives a clear, single provider lifecycle, and the error paths correctly unwind TLS, region, ARN, and ctx on failures.Minor maintainability points:
- The comments at Lines 706–707 and 713 (“NO AWS provider!” / “on-demand provider creation”) no longer reflect the actual behavior now that the provider is created eagerly here.
- With the provider created only once,
ctx->flb_configappears unused beyond assignment; consider either using it (if you anticipate future re-init logic) or removing it to keep the context minimal.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
src/aws/flb_aws_msk_iam.c(10 hunks)
🧰 Additional context used
🧠 Learnings (3)
📚 Learning: 2025-08-31T12:46:11.940Z
Learnt from: ThomasDevoogdt
Repo: fluent/fluent-bit PR: 9277
File: .github/workflows/pr-compile-check.yaml:147-151
Timestamp: 2025-08-31T12:46:11.940Z
Learning: In fluent-bit, the correct CMake flag for using system librdkafka is `FLB_PREFER_SYSTEM_LIB_KAFKA=ON`.
Applied to files:
src/aws/flb_aws_msk_iam.c
📚 Learning: 2025-08-31T12:46:11.940Z
Learnt from: ThomasDevoogdt
Repo: fluent/fluent-bit PR: 9277
File: .github/workflows/pr-compile-check.yaml:147-151
Timestamp: 2025-08-31T12:46:11.940Z
Learning: In fluent-bit CMakeLists.txt, the system library preference flags are defined as FLB_PREFER_SYSTEM_LIB_ZSTD and FLB_PREFER_SYSTEM_LIB_KAFKA with the FLB_ prefix.
Applied to files:
src/aws/flb_aws_msk_iam.c
📚 Learning: 2025-08-29T06:25:27.250Z
Learnt from: shadowshot-x
Repo: fluent/fluent-bit PR: 10794
File: tests/internal/aws_compress.c:93-107
Timestamp: 2025-08-29T06:25:27.250Z
Learning: In Fluent Bit, ZSTD compression is enabled by default and is treated as a core dependency, not requiring conditional compilation guards like `#ifdef FLB_HAVE_ZSTD`. Unlike some other optional components such as ARROW/PARQUET (which use `#ifdef FLB_HAVE_ARROW` guards), ZSTD support is always available and doesn't need build-time conditionals. ZSTD headers are included directly without guards across multiple plugins and core components.
Applied to files:
src/aws/flb_aws_msk_iam.c
🧬 Code graph analysis (1)
src/aws/flb_aws_msk_iam.c (3)
src/aws/flb_aws_credentials.c (2)
flb_aws_credentials_destroy(752-767)flb_standard_chain_provider_create(269-327)src/tls/flb_tls.c (2)
flb_tls_create(183-232)flb_tls_destroy(258-277)src/aws/flb_aws_util.c (1)
flb_aws_client_generator(280-283)
🔇 Additional comments (1)
src/aws/flb_aws_msk_iam.c (1)
616-636: MSK generic endpoint selection and downgraded logging look correctUsing
kafka-serverless.<region>.amazonaws.comwhen the ARN ends with-s3andkafka.<region>.amazonaws.comotherwise matches the expected MSK/Serverless split, and switching the endpoint logs toflb_debugis appropriate to avoid log spam during frequent refreshes.
- Switch provider to sync mode before initialization to prevent hanging - Initialize provider with sync mode (required before event loop is available) - Switch back to async mode after successful initialization - Follows pattern used by other AWS credential providers This fixes potential credential initialization failures in IRSA/EKS deployments where HTTP requests during init would hang without the event loop. Signed-off-by: Arbin <arbin.cheng@coins.ph>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 0
🧹 Nitpick comments (3)
src/aws/flb_aws_msk_iam.c (3)
219-220: Minor: Log message references non-existent function name.The log message says
build_msk_iam_payload_with_credsbut the actual function is namedbuild_msk_iam_payload. Consider updating for consistency.- flb_info("[aws_msk_iam] build_msk_iam_payload_with_creds: generating payload for host: %s, region: %s", + flb_info("[aws_msk_iam] build_msk_iam_payload: generating payload for host: %s, region: %s", host, config->region);
412-458: Redundant code: First base64 encoding is discarded.This entire block encodes
presigned_urlintopayload, but the result is immediately discarded whenpayloadis destroyed at line 483 and re-encoded with the User-Agent appended. The meaningful encoding happens at lines 479-517.Remove lines 412-458 and proceed directly to building the URL with User-Agent:
query = tmp; - /* Build the complete presigned URL */ - presigned_url = flb_sds_create_size(16384); - if (!presigned_url) { - goto error; - } - - presigned_url = flb_sds_printf(&presigned_url, "https://%s/?%s", host, query); - if (!presigned_url) { - goto error; - } - - /* Base64 URL encode the presigned URL */ - url_len = flb_sds_len(presigned_url); - encoded_len = ((url_len + 2) / 3) * 4 + 1; /* Base64 encoding size + null terminator */ - - payload = flb_sds_create_size(encoded_len); - if (!payload) { - goto error; - } - - encode_result = flb_base64_encode((unsigned char*) payload, encoded_len, &actual_encoded_len, - (const unsigned char*) presigned_url, url_len); - if (encode_result == -1) { - flb_error("[aws_msk_iam] build_msk_iam_payload: failed to base64 encode URL"); - goto error; - } - flb_sds_len_set(payload, actual_encoded_len); - - /* Convert to Base64 URL encoding (replace + with -, / with _, remove padding =) */ - p = payload; - while (*p) { - if (*p == '+') { - *p = '-'; - } - else if (*p == '/') { - *p = '_'; - } - p++; - } - - /* Remove padding */ - len = flb_sds_len(payload); - while (len > 0 && payload[len-1] == '=') { - len--; - } - flb_sds_len_set(payload, len); - payload[len] = '\0'; - /* Build the complete presigned URL */ - flb_sds_destroy(presigned_url); presigned_url = flb_sds_create_size(16384);
746-754: Consider TLS debug level for production.The TLS instance is created with
FLB_LOG_DEBUGas the debug parameter (line 748). This may produce verbose TLS debugging output in production. If this was intentional for diagnosing the credential expiration issue, consider reducing it to a lower verbosity level before release.ctx->cred_tls = flb_tls_create(FLB_TLS_CLIENT_MODE, FLB_TRUE, - FLB_LOG_DEBUG, + 0, /* disable TLS debug output */ NULL, /* vhost */
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
src/aws/flb_aws_msk_iam.c(10 hunks)
🧰 Additional context used
🧠 Learnings (3)
📚 Learning: 2025-08-31T12:46:11.940Z
Learnt from: ThomasDevoogdt
Repo: fluent/fluent-bit PR: 9277
File: .github/workflows/pr-compile-check.yaml:147-151
Timestamp: 2025-08-31T12:46:11.940Z
Learning: In fluent-bit, the correct CMake flag for using system librdkafka is `FLB_PREFER_SYSTEM_LIB_KAFKA=ON`.
Applied to files:
src/aws/flb_aws_msk_iam.c
📚 Learning: 2025-08-31T12:46:11.940Z
Learnt from: ThomasDevoogdt
Repo: fluent/fluent-bit PR: 9277
File: .github/workflows/pr-compile-check.yaml:147-151
Timestamp: 2025-08-31T12:46:11.940Z
Learning: In fluent-bit CMakeLists.txt, the system library preference flags are defined as FLB_PREFER_SYSTEM_LIB_ZSTD and FLB_PREFER_SYSTEM_LIB_KAFKA with the FLB_ prefix.
Applied to files:
src/aws/flb_aws_msk_iam.c
📚 Learning: 2025-08-29T06:25:27.250Z
Learnt from: shadowshot-x
Repo: fluent/fluent-bit PR: 10794
File: tests/internal/aws_compress.c:93-107
Timestamp: 2025-08-29T06:25:27.250Z
Learning: In Fluent Bit, ZSTD compression is enabled by default and is treated as a core dependency, not requiring conditional compilation guards like `#ifdef FLB_HAVE_ZSTD`. Unlike some other optional components such as ARROW/PARQUET (which use `#ifdef FLB_HAVE_ARROW` guards), ZSTD support is always available and doesn't need build-time conditionals. ZSTD headers are included directly without guards across multiple plugins and core components.
Applied to files:
src/aws/flb_aws_msk_iam.c
🧬 Code graph analysis (1)
src/aws/flb_aws_msk_iam.c (3)
src/aws/flb_aws_credentials.c (3)
flb_aws_credentials_destroy(752-767)flb_standard_chain_provider_create(269-327)flb_aws_provider_destroy(769-791)src/tls/flb_tls.c (2)
flb_tls_create(183-232)flb_tls_destroy(258-277)src/aws/flb_aws_util.c (1)
flb_aws_client_generator(280-283)
🔇 Additional comments (4)
src/aws/flb_aws_msk_iam.c (4)
41-48: LGTM! Struct now properly encapsulates persistent provider and TLS state.The addition of
cred_tlsandproviderfields enables the core fix: maintaining a long-lived credential provider that handles automatic refresh, rather than creating ephemeral providers on each token refresh.
640-674: Well-structured credential lifecycle.The credential handling follows best practices:
- Credentials are obtained from the persistent provider (which handles caching/refresh internally)
- Used immediately for payload generation and token setting
- Destroyed immediately after use (line 673) to minimize exposure window
This pattern properly addresses the original bug where ephemeral providers could return stale credentials.
782-794: Provider initialization pattern looks correct.The sync → init → async sequence ensures the provider is properly initialized before the event loop takes over. This is the correct pattern for credential providers that need synchronous initialization during startup.
817-825: Cleanup logic is correct.As previously confirmed in the review discussion,
flb_standard_chain_provider_createdoes not take ownership of the TLS instance (unlikeflb_managed_chain_provider_create). The explicit cleanup of both the provider and the TLS instance in this order is correct and necessary.
- Add force refresh logic to EC2, STS, and EKS credential providers - Set next_refresh to 0 in refresh functions to ensure immediate credential update - Fixes MSK IAM authentication failures after ~1 hour due to stale credentials - Aligns with AWS SDK behavior where refresh() means force refresh This resolves the issue where OAuth token refresh (every ~15 minutes) would not actually refresh AWS credentials until next_refresh time was reached (typically 1 hour later), causing MSK connection failures with 'Access denied' errors. The fix ensures that every OAuth callback will fetch fresh credentials from AWS, matching the behavior of official AWS SDKs (Python, Java). Signed-off-by: Arbin <arbin.cheng@coins.ph>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
💡 Codex Review
Here are some automated review suggestions for this pull request.
ℹ️ About Codex in GitHub
Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 1
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
src/aws/flb_aws_msk_iam.c (1)
578-695: Potential use-after-free risk: librdkafka does not explicitly guarantee that token and principal strings are copiedThe review comment's assumption that librdkafka copies token and principal strings is not confirmed by official documentation. Web search of librdkafka C API reference shows:
- The
extensionsarray is explicitly noted as copied by librdkafka- The
token_valueandmd_principal_nameparameters have no explicit copy guarantee in the C API reference- Passing
creds->access_key_id(fromcreds) asmd_principal_name, then immediately destroyingcreds, could introduce a use-after-free if the strings are not copiedRequired actions:
Verify string lifetime guarantee against the exact librdkafka version in use, or consult the maintainers directly. If strings are NOT copied, the code must either:
- Copy the principal string to a persistent buffer before destroy, or
- Use a static/cached principal name if acceptable for your use case
Add refresh() error logging (original suggestion remains valid):
int rc = config->provider->provider_vtable->refresh(config->provider); if (rc < 0) { flb_warn("[aws_msk_iam] AWS provider refresh() failed (rc=%d), continuing to get_credentials()", rc); }This surfaces credential refresh failures earlier without changing behavior.
🧹 Nitpick comments (3)
src/aws/flb_aws_credentials_ec2.c (1)
128-136: Comment wording vsnext_refreshsemantics
implementation->next_refresh = 0;doesn't actually "mark as expired" (the auto-refresh path only triggers whennext_refresh > 0 && time() > next_refresh), it effectively disables auto-refresh untilget_creds_ec2sets a new positive deadline. Behavior is fine here sincerefresh_fn_ec2unconditionally callsget_creds_ec2on lock success, but consider rephrasing the comment to avoid confusion about how expiry is signaled.src/aws/flb_aws_credentials_sts.c (1)
173-189: Clarifynext_refresh = 0intent in STS/EKS refresh pathsIn both STS and EKS providers,
implementation->next_refresh = 0;does not really "mark as expired"; it disables the auto-refresh check (next_refresh > 0 && time() > next_refresh) until the explicit refresh call finishes and writes a new positive deadline. Functionally this is fine (explicit refresh still forces a new STS/EKS call), but the comment is a bit misleading—consider rephrasing to something like “reset auto-refresh deadline; will be updated by the refresh call”.Also applies to: 481-495
src/aws/flb_aws_msk_iam.c (1)
697-813: TLS and provider lifecycle look correct; update stale commentThe new registration path (create TLS → create standard-chain provider → init in sync mode → switch to async → register callback) and the corresponding destroy logic (
flb_aws_provider_destroythenflb_tls_destroythen SDS/free) are consistent with the ownership model offlb_standard_chain_provider_createand avoid the double-free concern discussed previously.One nit: the comment at Line 713 still says “NO AWS provider!”, but this function now does create and own an AWS provider. Updating that comment will prevent confusion for future readers.
Also applies to: 815-844
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (4)
src/aws/flb_aws_credentials_ec2.c(1 hunks)src/aws/flb_aws_credentials_profile.c(1 hunks)src/aws/flb_aws_credentials_sts.c(2 hunks)src/aws/flb_aws_msk_iam.c(10 hunks)
🧰 Additional context used
🧠 Learnings (3)
📚 Learning: 2025-08-31T12:46:11.940Z
Learnt from: ThomasDevoogdt
Repo: fluent/fluent-bit PR: 9277
File: .github/workflows/pr-compile-check.yaml:147-151
Timestamp: 2025-08-31T12:46:11.940Z
Learning: In fluent-bit, the correct CMake flag for using system librdkafka is `FLB_PREFER_SYSTEM_LIB_KAFKA=ON`.
Applied to files:
src/aws/flb_aws_msk_iam.c
📚 Learning: 2025-08-31T12:46:11.940Z
Learnt from: ThomasDevoogdt
Repo: fluent/fluent-bit PR: 9277
File: .github/workflows/pr-compile-check.yaml:147-151
Timestamp: 2025-08-31T12:46:11.940Z
Learning: In fluent-bit CMakeLists.txt, the system library preference flags are defined as FLB_PREFER_SYSTEM_LIB_ZSTD and FLB_PREFER_SYSTEM_LIB_KAFKA with the FLB_ prefix.
Applied to files:
src/aws/flb_aws_msk_iam.c
📚 Learning: 2025-08-29T06:25:27.250Z
Learnt from: shadowshot-x
Repo: fluent/fluent-bit PR: 10794
File: tests/internal/aws_compress.c:93-107
Timestamp: 2025-08-29T06:25:27.250Z
Learning: In Fluent Bit, ZSTD compression is enabled by default and is treated as a core dependency, not requiring conditional compilation guards like `#ifdef FLB_HAVE_ZSTD`. Unlike some other optional components such as ARROW/PARQUET (which use `#ifdef FLB_HAVE_ARROW` guards), ZSTD support is always available and doesn't need build-time conditionals. ZSTD headers are included directly without guards across multiple plugins and core components.
Applied to files:
src/aws/flb_aws_msk_iam.c
🧬 Code graph analysis (1)
src/aws/flb_aws_msk_iam.c (3)
src/aws/flb_aws_credentials.c (3)
flb_aws_credentials_destroy(752-767)flb_standard_chain_provider_create(269-327)flb_aws_provider_destroy(769-791)src/tls/flb_tls.c (2)
flb_tls_create(183-232)flb_tls_destroy(258-277)src/aws/flb_aws_util.c (1)
flb_aws_client_generator(280-283)
🔇 Additional comments (1)
src/aws/flb_aws_credentials_profile.c (1)
664-672: LGTM: less noisy logging on missing shared credentials fileDowngrading the ENOENT case to
AWS_CREDS_DEBUGkeeps the control flow and error signaling intact while avoiding spurious error logs when the shared credentials file is legitimately absent and other providers are expected to handle credentials.
…d fails Signed-off-by: Arbin <arbin.cheng@coins.ph>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 0
🧹 Nitpick comments (3)
src/aws/flb_aws_msk_iam.c (3)
168-578: Refactored payload builder correctly uses explicit creds and tightens cleanupThe new
build_msk_iam_payload(config, host, creds)flow looks sound: it validates inputs, uses the provided credentials consistently in the credential string and signing key, and the updated cleanup logic (includingempty_payload_hexand earlykeydestruction) avoids the earlier leak/double-free edge cases.If you want to polish further, consider extracting the repeated “Base64‑encode + URL‑safe conversion + padding removal” sequence (lines ~432–459 and ~479–517) into a small helper to reduce duplication and keep the signing logic easier to audit.
581-698: OAuth refresh now correctly pulls fresh creds from the shared providerThe callback’s switch to using the persistent provider (
refresh()+get_credentials()) and destroying credentials immediately after building the token is correct and should eliminate the earlier “stale IAM creds after ~1h” failure mode. Endpoint selection and logging changes (debug vs info) are also appropriate.If the
refreshvtable entry returns a status code in your implementation, you might optionally log failures there before attemptingget_credentials, but the current behavior is acceptable sinceget_credentialsis still the authoritative source.
700-816: TLS/provider initialization at registration is robust with clean error pathsCreating
cred_tlsonce, wiring it intoflb_standard_chain_provider_create, and synchronously initializing the provider (sync→init→async) during registration gives a clear, predictable lifecycle. All intermediate failure paths correctly releasectx,region,cluster_arn,cred_tls, andprovider, so there are no obvious leaks here.If
ctx->flb_configisn’t needed later, you could drop it to keep the context minimal, but it’s fine to leave as-is if you plan to reuse it.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
src/aws/flb_aws_msk_iam.c(10 hunks)
🧰 Additional context used
🧠 Learnings (3)
📚 Learning: 2025-08-31T12:46:11.940Z
Learnt from: ThomasDevoogdt
Repo: fluent/fluent-bit PR: 9277
File: .github/workflows/pr-compile-check.yaml:147-151
Timestamp: 2025-08-31T12:46:11.940Z
Learning: In fluent-bit, the correct CMake flag for using system librdkafka is `FLB_PREFER_SYSTEM_LIB_KAFKA=ON`.
Applied to files:
src/aws/flb_aws_msk_iam.c
📚 Learning: 2025-08-31T12:46:11.940Z
Learnt from: ThomasDevoogdt
Repo: fluent/fluent-bit PR: 9277
File: .github/workflows/pr-compile-check.yaml:147-151
Timestamp: 2025-08-31T12:46:11.940Z
Learning: In fluent-bit CMakeLists.txt, the system library preference flags are defined as FLB_PREFER_SYSTEM_LIB_ZSTD and FLB_PREFER_SYSTEM_LIB_KAFKA with the FLB_ prefix.
Applied to files:
src/aws/flb_aws_msk_iam.c
📚 Learning: 2025-08-29T06:25:27.250Z
Learnt from: shadowshot-x
Repo: fluent/fluent-bit PR: 10794
File: tests/internal/aws_compress.c:93-107
Timestamp: 2025-08-29T06:25:27.250Z
Learning: In Fluent Bit, ZSTD compression is enabled by default and is treated as a core dependency, not requiring conditional compilation guards like `#ifdef FLB_HAVE_ZSTD`. Unlike some other optional components such as ARROW/PARQUET (which use `#ifdef FLB_HAVE_ARROW` guards), ZSTD support is always available and doesn't need build-time conditionals. ZSTD headers are included directly without guards across multiple plugins and core components.
Applied to files:
src/aws/flb_aws_msk_iam.c
🧬 Code graph analysis (1)
src/aws/flb_aws_msk_iam.c (3)
src/aws/flb_aws_credentials.c (3)
flb_aws_credentials_destroy(752-767)flb_standard_chain_provider_create(269-327)flb_aws_provider_destroy(769-791)src/tls/flb_tls.c (2)
flb_tls_create(183-232)flb_tls_destroy(258-277)src/aws/flb_aws_util.c (1)
flb_aws_client_generator(280-283)
🔇 Additional comments (2)
src/aws/flb_aws_msk_iam.c (2)
31-48: MSK IAM context extension for TLS/provider looks consistentIncluding
flb_tls.hand addingcred_tlsandprovidertostruct flb_aws_msk_iamaligns with the new design of a persistent AWS provider plus dedicated TLS for credential calls. Ownership is clearly confined to this context and is cleaned up in the destroy path below.
818-847: Destroy path correctly mirrors provider/TLS ownership semanticsThe teardown sequence—
flb_aws_provider_destroy(ctx->provider)followed byflb_tls_destroy(ctx->cred_tls)and SDS/ctx cleanup—matches the standard-chain provider’s ownership model (provider does not manage the caller’s TLS). This avoids leaks without introducing a double free and keeps the lifecycle symmetric with the registration logic above.
Signed-off-by: Arbin <arbin.cheng@coins.ph>
|
Concern: Potential use-after-free when passing Analysis Result: No use-after-free risk exists. I've verified this by examining the librdkafka 2.10.1 source code at rwlock_wrlock(&handle->lock);
RD_IF_FREE(handle->md_principal_name, rd_free);
handle->md_principal_name = rd_strdup(md_principal_name); // ✅ Copied
RD_IF_FREE(handle->token_value, rd_free);
handle->token_value = rd_strdup(token_value); // ✅ Copied
// Extensions are also copied via rd_strtup_new()
for (i = 0; i + 1 < extension_size; i += 2)
rd_list_add(&handle->extensions,
rd_strtup_new(extensions[i], extensions[i + 1])); // ✅ CopiedConclusion: librdkafka makes deep copies of all strings, so destroying credentials immediately after the call is safe. The current code is correct. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 0
🧹 Nitpick comments (2)
src/aws/flb_aws_msk_iam.c (2)
31-48: Struct extensions and TLS include look correct; minor cleanup opportunityIncluding
flb_tls.hand extendingstruct flb_aws_msk_iamwithcred_tlsandprovideris consistent with how they're allocated inflb_aws_msk_iam_register_oauth_cband destroyed inflb_aws_msk_iam_destroy. The ownership story is coherent within this file.The only nit is that
ctx->flb_configis now written once and never read, and comments still talk about “lightweight config / on‑demand provider creation”, which no longer matches the implementation. Consider either removingflb_config(since the struct is private to this TU) or updating its usage/comment so future readers don’t assume a lazy/provider‑per‑refresh model still exists.
719-810: TLS/provider creation and initialization sequence is robust; refresh commentsThe new initialization path in
flb_aws_msk_iam_register_oauth_cblooks solid:
ctx->cred_tlsis created once withFLB_TLS_CLIENT_MODEandverify=FLB_TRUE, then reused byflb_standard_chain_provider_create, addressing the missing‑TLS issue for STS/IMDS.ctx->provideris created exactly once and initialized viasync→init→async, with comprehensive cleanup on each failure path (destroy provider, TLS, SDS fields, thenctx).- On success, the lifetime is clearly paired with
flb_aws_msk_iam_destroy.Two minor cleanups to consider:
- The comments
/* Allocate lightweight config - NO AWS provider! */and/* Store the flb_config for on-demand provider creation */are now misleading, since this function eagerly creates the provider instead of deferring it. Updating these comments to describe the current model (persistent provider with internal caching) would reduce confusion.- If
ctx->flb_configis no longer needed under the new design, you could drop it from the struct and the assignment at Line 727.These are documentation/clarity nits; behavior looks correct.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
src/aws/flb_aws_msk_iam.c(10 hunks)
🧰 Additional context used
🧠 Learnings (2)
📚 Learning: 2025-08-31T12:46:11.940Z
Learnt from: ThomasDevoogdt
Repo: fluent/fluent-bit PR: 9277
File: .github/workflows/pr-compile-check.yaml:147-151
Timestamp: 2025-08-31T12:46:11.940Z
Learning: In fluent-bit, the correct CMake flag for using system librdkafka is `FLB_PREFER_SYSTEM_LIB_KAFKA=ON`.
Applied to files:
src/aws/flb_aws_msk_iam.c
📚 Learning: 2025-08-31T12:46:11.940Z
Learnt from: ThomasDevoogdt
Repo: fluent/fluent-bit PR: 9277
File: .github/workflows/pr-compile-check.yaml:147-151
Timestamp: 2025-08-31T12:46:11.940Z
Learning: In fluent-bit CMakeLists.txt, the system library preference flags are defined as FLB_PREFER_SYSTEM_LIB_ZSTD and FLB_PREFER_SYSTEM_LIB_KAFKA with the FLB_ prefix.
Applied to files:
src/aws/flb_aws_msk_iam.c
🧬 Code graph analysis (1)
src/aws/flb_aws_msk_iam.c (4)
src/aws/flb_aws_credentials.c (3)
flb_aws_credentials_destroy(752-767)flb_standard_chain_provider_create(269-327)flb_aws_provider_destroy(769-791)src/tls/flb_tls.c (2)
flb_tls_create(183-232)flb_tls_destroy(258-277)include/fluent-bit/flb_mem.h (1)
flb_free(126-128)src/aws/flb_aws_util.c (1)
flb_aws_client_generator(280-283)
🔇 Additional comments (3)
src/aws/flb_aws_msk_iam.c (3)
168-225: Payload builder refactor and leak fix look soundChanging
build_msk_iam_payloadto takestruct flb_aws_credentials *credsexplicitly, plus the added validation/logging forconfig,host, andcreds, gives a clearer contract and fits the new “persistent provider + explicit creds” model.The memory management also looks correct now:
empty_payload_hexis freed on the success path (Lines 322–323) and again guarded in the error cleanup (Lines 573–575), so there’s no leak or double‑free.keyis destroyed immediately after use and nulled (Lines 373–375), with the error path only destroying it when non‑NULL.- All allocated SDS values (
credential,credential_enc,query,canonical,hexhash,string_to_sign,hexsig,action_enc,session_token_enc,presigned_url,payload) are either returned or freed on every exit path.No changes needed here.
Also applies to: 300-378, 519-575
581-701: OAuth refresh now correctly uses persistent provider and cleans up credsThe updated
oauthbearer_token_refresh_cblooks well‑structured:
- Validates the opaque/context and region before doing any work.
- Picks the correct generic MSK or serverless endpoint based on the ARN suffix and downgrades endpoint logging to debug, reducing noise.
- Proactively calls
provider_vtable->refresh()and thenget_credentials(), logging but not hard‑failing on refresh errors while still guarding againstcreds == NULL.- Builds the payload from those credentials, passes
creds->access_key_idintord_kafka_oauthbearer_set_token, and uses a fixed 900‑second lifetime.- Ensures
flb_aws_credentials_destroy(creds)is called on both success and payload‑generation failure paths, andpayloadis always freed before return.This matches the PR’s goal of avoiding expired IAM credentials without introducing leaks or double‑frees.
821-850: Destroy routine correctly mirrors allocation and avoids leaks
flb_aws_msk_iam_destroymirrors the successful registration path:
- Checks
ctxfor NULL.- Destroys the AWS provider (which handles its own internal resources) and then destroys
ctx->cred_tls, matching how TLS is created and owned in this module.- Frees
ctx->regionandctx->cluster_arnSDS values and finallyctxitself, with logging before and after.Given the ownership model of
flb_standard_chain_provider_createfor TLS in this path (provider does not owncred_tls), this cleanup order is appropriate and does not introduce double‑free risks.
…ials Signed-off-by: Arbin <arbin.cheng@coins.ph>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
💡 Codex Review
Here are some automated review suggestions for this pull request.
ℹ️ About Codex in GitHub
Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".
src/aws/flb_aws_credentials_sts.c
Outdated
| flb_aws_credentials_destroy(implementation->creds); | ||
| implementation->creds = NULL; | ||
| } | ||
| /* Set to 1 (epoch start) to trigger immediate refresh via time check */ | ||
| implementation->next_refresh = 1; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Refresh clears cached STS creds before new ones succeed
In refresh_fn_sts, the refresh path destroys implementation->creds and marks the cache expired before calling sts_assume_role_request(). With the MSK OAuth callback now invoking provider->refresh() on every token refresh, any transient STS failure leaves the provider with no credentials at all even if the previous ones were still valid, causing OAuth token generation to fail unnecessarily. Cached credentials should remain usable until a new set is successfully obtained.
Useful? React with 👍 / 👎.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed by 05ecb6d
Signed-off-by: Arbin <arbin.cheng@coins.ph>
|
@coderabbitai generate docstrings |
✅ Actions performedInitiated docstring generation; will generate only if new commits exist. |
|
Note Docstrings generation - SUCCESS |
Docstrings generation was requested by @kalavt. * #11211 (comment) The following files were modified: * `src/aws/flb_aws_credentials_ec2.c` * `src/aws/flb_aws_credentials_http.c` * `src/aws/flb_aws_credentials_profile.c` * `src/aws/flb_aws_credentials_sts.c` * `src/aws/flb_aws_msk_iam.c`
Fix AWS MSK IAM Authentication Failures After Credential Expiration
Summary
This PR fixes critical AWS MSK IAM authentication failures that occur after AWS temporary credentials expire (typically after 1 hour). The fix addresses two root causes: missing TLS support for AWS credential endpoints and a broken credential refresh mechanism in provider functions.
Problem Statement
Symptom
After ~1 hour of successful operation, AWS MSK IAM authentication fails with:
Root Causes
Issue 1: Missing TLS Support
The MSK IAM implementation lacked TLS configuration for AWS credential services (STS, IMDS). This caused:
Issue 2: Incomplete Credential Refresh Logic
MSK IAM OAuth tokens refresh every ~12-15 minutes, but AWS temporary credentials expire after ~1 hour. The credential refresh mechanism was broken:
The previous fix set
next_refresh = 0inrefresh_fn_*()to force refresh, butget_credentials()didn't recognize this signal:Timeline:
Solution
Part 1: MSK IAM Lifecycle Optimization
Changes in
src/aws/flb_aws_msk_iam.c:Benefits:
Part 2: Correct Provider refresh() Implementation
Fixed in credential providers:
Why this works:
next_refresh = 1ensuresget_credentials()detects expiration viaif (time(NULL) > 1)→ always TRUEnext_refresh = expiration - 60sApplied consistently to:
refresh_fn_sts()- STS AssumeRole providerrefresh_fn_eks()- EKS IRSA (Web Identity Token) providerrefresh_fn_ec2()- EC2 IMDS providerFiles Modified
Testing
Environment
Results
Benefits
Reliability
Performance
Security
Scope
Backward Compatibility
Why MSK IAM Needs Active refresh()
Unlike other AWS plugins (S3, CloudWatch) which rely on passive credential refresh when approaching expiration, MSK IAM actively calls
refresh()because:Documentation
Backporting
Recommended for stable releases - This is a critical reliability fix with no breaking changes.
Checklist
I understand that Fluent Bit is licensed under Apache 2.0, and by submitting this pull request, I acknowledge that this code will be released under the terms of that license.