out_prometheus_remote_write: Refresh expired AWS credentials #9765

Open
Tradunsky wants to merge 1 commit into fluent:master from Tradunsky:issue-9670

@Tradunsky
Contributor

@Tradunsky Tradunsky commented Dec 25, 2024

Handle the HTTP 403 error returned when AWS credentials have expired by refreshing the credentials from ~/.aws/credentials.

Fixes: #9670
Similar implementation already exists for kinesis_streams:

aws_client->provider->provider_vtable->


Steps to reproduce:

  1. Create temporary credentials with the minimum session duration of 900 seconds (any short duration works; the goal is to reproduce quickly). Put the credentials from the command output into ~/.aws/credentials under the default profile.
aws sts assume-role --role-arn arn:aws:iam::<account_number>:role/prometheus_role --role-session-name tmp --duration-seconds 900
# or
aws sts get-session-token --duration-seconds 900 --serial-number arn:aws:iam::$(aws sts get-caller-identity --query Account --output text):mfa/role_name --token-code <MFA code>
  1. Start fluent-bit with the following configuration file:
[SERVICE]
    Flush                      1
    Log_Level                  DEBUG

[INPUT]
    Name                       node_exporter_metrics
    Tag                        metrics
    Scrape_interval            30

[OUTPUT]
    Name                       prometheus_remote_write
    Match                      metrics
    Host                       aps-workspaces.us-west-2.amazonaws.com
    Port                       443
    Uri                        /workspaces/ws-<your workspaceid>/api/v1/remote_write
    AWS_Auth                   true
    AWS_region                 us-west-2
    Tls                        On
    Tls.verify                 On
    add_label                  test test
./bin/fluent-bit -c fluent-bit.conf
  1. Wait until the credentials expire and the fluent-bit prometheus_remote_write output plugin starts to fail with HTTP 403 (expired security token), as shown in this example:
[2024/12/24 16:31:49] [error] [output:prometheus_remote_write:prometheus_remote_write.1] aps-workspaces.us-west-2.amazonaws.com:443, HTTP status=403
{"message":"The security token included in the request is expired"}
  1. Repeat the first step to write fresh credentials to ~/.aws/credentials (this is usually done by automation).

Before the PR fix: Fluent-bit keeps failing with 403 because it keeps using the expired credentials cached in memory.

[error] [output:prometheus_remote_write:prometheus_remote_write.1] aps-workspaces.us-west-2.amazonaws.com:443, HTTP status=403
{"message":"The security token included in the request is expired"}

After the PR fix: Fluent-bit picks up fresh credentials without downtime.

[2024/12/24 18:45:21] [ info] [output:prometheus_remote_write:prometheus_remote_write.0] auth error, refreshing creds
[2024/12/24 18:45:21] [debug] [aws_credentials] Refresh called on the env provider
[2024/12/24 18:45:21] [debug] [aws_credentials] Refresh called on the profile provider
[2024/12/24 18:45:21] [debug] [aws_credentials] Reading shared config file.
[2024/12/24 18:45:21] [debug] [aws_credentials] Reading shared credentials file.
[2024/12/24 18:45:21] [debug] [upstream] KA connection #89 to aps-workspaces.us-west-2.amazonaws.com:443 is now available
[2024/12/24 18:45:21] [debug] [output:prometheus_remote_write:prometheus_remote_write.0] http_post result FLB_RETRY
...

[2024/12/24 18:45:43] [debug] [output:prometheus_remote_write:prometheus_remote_write.0] signing request with AWS Sigv4
[2024/12/24 18:45:43] [debug] [output:prometheus_remote_write:prometheus_remote_write.0] aps-workspaces.us-west-2.amazonaws.com:443, HTTP status=200
[2024/12/24 18:45:43] [debug] [upstream] KA connection #88 to aps-workspaces.us-west-2.amazonaws.com:443 is now available
[2024/12/24 18:45:43] [debug] [output:prometheus_remote_write:prometheus_remote_write.0] http_post result FLB_OK
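
The two log excerpts above can be read as two flush attempts: the first gets a 403, refreshes, and returns a retry; the second succeeds with the reloaded token. Below is a minimal simulation of that sequence. Every name in it (`sim_state`, `sim_post`, `sim_refresh`, `sim_flush`) is a hypothetical stand-in, not a Fluent Bit API, and the "endpoint" is just a string comparison.

```c
#include <stdbool.h>
#include <string.h>

/* Hypothetical stand-ins; none of these are Fluent Bit APIs. */
struct sim_state {
    const char *valid_token;   /* token the mock endpoint accepts */
    const char *file_token;    /* token currently stored in the credentials file */
    const char *cached_token;  /* token the plugin holds in memory */
};

/* Mock endpoint: 200 if the cached token is valid, 403 otherwise. */
static int sim_post(const struct sim_state *s)
{
    return strcmp(s->cached_token, s->valid_token) == 0 ? 200 : 403;
}

/* Mock refresh: replace the in-memory token with the one from the file. */
static void sim_refresh(struct sim_state *s)
{
    s->cached_token = s->file_token;
}

/* One flush attempt: true on success, false if the chunk must be retried. */
static bool sim_flush(struct sim_state *s)
{
    int status = sim_post(s);

    if (status == 403) {
        /* Auth error: reload credentials so the next attempt can succeed. */
        sim_refresh(s);
        return false;
    }
    return status >= 200 && status <= 205;
}
```

With an expired cached token and a fresh token in the file, the first `sim_flush` call returns false (retry) and the second returns true, which matches the order of the `FLB_RETRY` and `FLB_OK` lines in the log.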

Testing
Before we can approve your change, please submit the following in a comment:

  • Example configuration file for the change
  • Debug log output from testing the change
  • [N/A] Attached Valgrind output that shows no leaks or memory corruption was found

If this is a change to packaging of containers or native binaries then please confirm it works for all targets.

  • [N/A] Run local packaging test showing all targets (including any new ones) build.
  • Set ok-package-test label to test for all targets (requires maintainer to do).

Documentation

  • [N/A] Documentation required for this feature

Backporting

  • [N/A] Backport to latest stable release.

Fluent Bit is licensed under Apache 2.0. By submitting this pull request I understand that this code will be released under the terms of that license.

Summary by CodeRabbit

  • Bug Fixes
    • Improved reliability for Prometheus Remote Write with AWS authentication: HTTP 403 responses now trigger an AWS credential refresh and a retry, reducing failures due to expired or invalid credentials.
    • Refined status handling to treat 403 separately from other non-OK responses; existing behaviors remain (success on 200–205, retry on other non-OK statuses, error on 400).
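
The status handling summarized above can be sketched as one small classification function. This is an illustration only: `enum outcome` and `classify_status` are hypothetical names standing in for the plugin's `FLB_OK` / `FLB_RETRY` / `FLB_ERROR` results, and the real code in remote_write.c may be structured differently.

```c
#include <stdbool.h>

/* Hypothetical outcome codes standing in for FLB_OK / FLB_RETRY / FLB_ERROR. */
enum outcome { OUTCOME_OK, OUTCOME_RETRY, OUTCOME_ERROR };

/*
 * Classify an HTTP response status.
 * *refresh_creds is set to true only for a 403 when AWS auth is enabled.
 */
static enum outcome classify_status(int status, bool has_aws_auth,
                                    bool *refresh_creds)
{
    *refresh_creds = false;

    if (status == 403) {
        /* Possible auth failure: refresh (if AWS auth is on) and retry. */
        *refresh_creds = has_aws_auth;
        return OUTCOME_RETRY;
    }
    if (status == 400) {
        /* Treated as unrecoverable: do not retry. */
        return OUTCOME_ERROR;
    }
    if (status < 200 || status > 205) {
        /* Any other non-OK status is retried. */
        return OUTCOME_RETRY;
    }
    return OUTCOME_OK;
}
```

Keeping the 403 check first makes it impossible for the generic non-OK branch to swallow it, which is the separation the summary describes.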

@Tradunsky
Contributor Author

Hi team,
@edsiper , @leonardo-albertovich , @fujimotos , @koleini

Hope you had a great holiday season! 🤗

Please let me know if I can do anything else.

@Tradunsky
Contributor Author

Hi folks,

Please let me know if I can do anything about this PR to get it merged. 🙏🏻

@github-actions
Contributor

github-actions bot commented Sep 7, 2025

This PR is stale because it has been open 45 days with no activity. Remove stale label or comment or this will be closed in 10 days.

@github-actions github-actions bot added the Stale label Sep 7, 2025
@Tradunsky
Contributor Author

Pending replies

@coderabbitai

coderabbitai bot commented Sep 15, 2025

Walkthrough

Adds explicit HTTP 403 handling in Prometheus Remote Write: on 403, trigger AWS credentials refresh when SigV4 is enabled and mark request for retry. Adjusts the generic non-2xx path to exclude 403. Fixes typos in status-handling comments. No API signature changes.

Changes

Cohort / File(s) | Summary
HTTP status handling + AWS refresh: plugins/out_prometheus_remote_write/remote_write.c | Add explicit 403 branch: if has_aws_auth, call aws_provider->refresh(...), then set FLB_RETRY. Exclude 403 from generic non-2xx path. Keep 400 as error, 200–205 as OK, others retry.
Comment/typo corrections: plugins/out_prometheus_remote_write/remote_write.c | Fix typos in comments for 203 and 400 handling.
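
The "refresh via provider vtable" call in the summary is function-pointer dispatch: the plugin does not need to know which provider (env, profile, EC2, and so on) is behind the pointer. The sketch below shows only that pattern. The `mock_*` types and `refresh_via_vtable` are hypothetical and only loosely modeled on the `provider_vtable->refresh(provider)` call quoted in this PR; they are not the Fluent Bit definitions.

```c
#include <stddef.h>

struct mock_provider;

/* Table of operations a credential provider implements. */
struct mock_provider_vtable {
    int (*refresh)(struct mock_provider *provider);
};

struct mock_provider {
    const struct mock_provider_vtable *provider_vtable;
    int refresh_count;  /* how many times refresh ran (illustration only) */
};

/* Toy refresh implementation: pretend to reload credentials from a file. */
static int mock_profile_refresh(struct mock_provider *provider)
{
    provider->refresh_count++;
    return 0;
}

static const struct mock_provider_vtable mock_profile_vtable = {
    .refresh = mock_profile_refresh,
};

/* Call refresh through the vtable; returns -1 if nothing can be called. */
static int refresh_via_vtable(struct mock_provider *provider)
{
    if (provider == NULL || provider->provider_vtable == NULL ||
        provider->provider_vtable->refresh == NULL) {
        return -1;
    }
    return provider->provider_vtable->refresh(provider);
}
```

The NULL checks are an addition for the sketch; whether the plugin needs them depends on how the provider is initialized when AWS auth is enabled.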

Sequence Diagram(s)

sequenceDiagram
  autonumber
  actor FB as Fluent Bit PW Output
  participant S as Remote Endpoint
  participant AWS as AWS Provider (optional)

  FB->>S: HTTP POST /remote_write (metrics)
  S-->>FB: HTTP Response (status)

  alt 200–205
    FB->>FB: Mark success (OK)
  else 400
    FB->>FB: Log error and do not retry
  else 403 (new path)
    opt SigV4 enabled
      FB->>AWS: Refresh credentials
      Note right of AWS: Refresh via provider vtable
    end
    FB->>FB: Set outcome = RETRY
  else Other non-OK
    FB->>FB: Set outcome = RETRY (optional payload log)
  end

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~10 minutes

Poem

I hop between the codes at night,
403s that gave a fright—
I nibble creds, refresh the key,
then try again, resiliently.
Small typos trimmed, paths set aright—
metrics bounce back into the light. 🐇📈

Pre-merge checks and finishing touches

❌ Failed checks (1 warning)
Check name | Status | Explanation | Resolution
Docstring Coverage | ⚠️ Warning | Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%. | You can run @coderabbitai generate docstrings to improve docstring coverage.
✅ Passed checks (4 passed)
Check name | Status | Explanation
Linked Issues Check | ✅ Passed | The changes address the core requirement from the linked issue [#9670] by treating HTTP 403 as an AWS auth failure when SigV4 is enabled, calling the AWS provider's refresh() path, logging the refresh, and returning FLB_RETRY so the request is retried with refreshed credentials; this implements automatic reload of rotated credentials without restarting fluent-bit. The implementation aligns with the issue's desired behavior and mirrors the referenced kinesis approach, and I see no coding-related objectives from the issue left unfulfilled.
Out of Scope Changes Check | ✅ Passed | The diff is narrowly scoped to plugins/out_prometheus_remote_write/remote_write.c and includes targeted 403 handling plus minor comment typos; there are no other module or public API changes in the provided summary, so no out-of-scope functional changes were introduced.
Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled.
Title Check | ✅ Passed | The title precisely names the affected plugin and succinctly describes the primary change of refreshing expired AWS credentials, matching the pull request’s core functionality.

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

🧹 Nitpick comments (3)
plugins/out_prometheus_remote_write/remote_write.c (3)

186-186: Use the standard HTTP reason phrase for 203.

Replace with “Non-Authoritative Information” for clarity.

-         * - 203: no authoritative resp
+         * - 203: Non-Authoritative Information

206-207: Fix grammar in comment.

“an error” instead of “a error”.

-            /* Returned 400 status means unrecoverable. Immediately
-             * returning as a error. */
+            /* Returned 400 status means unrecoverable. Immediately
+             * returning as an error. */

220-226: Optional: log payload on 403 and consider 401 as a refresh trigger.

  • When log_response_payload is enabled, emit the response body at info/debug to aid diagnosis (e.g., expired token vs. access denied).
  • If you see 401 from some backends, optionally treat it like 403 for AWS-auth cases.

Would you like me to generate a small follow-up patch for payload logging and an opt-in 401 handling flag?
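
As a rough picture of the opt-in 401 idea, the refresh decision could be isolated in one predicate. This is a sketch only: `should_refresh_credentials` and the `treat_401_as_auth_error` flag are hypothetical, and no such configuration option exists in the plugin as described in this PR.

```c
#include <stdbool.h>

/*
 * Decide whether a response status should trigger a credential refresh.
 * treat_401_as_auth_error is a hypothetical opt-in flag, not an existing
 * plugin setting.
 */
static bool should_refresh_credentials(int status, bool has_aws_auth,
                                       bool treat_401_as_auth_error)
{
    if (!has_aws_auth) {
        /* Without AWS auth there are no credentials to refresh. */
        return false;
    }
    if (status == 403) {
        return true;
    }
    return status == 401 && treat_401_as_auth_error;
}
```

Keeping 401 behind a flag preserves the current behavior by default, so only users whose backends return 401 for expired tokens would need to opt in.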

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 60db310 and 7c16c17.

📒 Files selected for processing (1)
  • plugins/out_prometheus_remote_write/remote_write.c (3 hunks)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (58)
  • GitHub Check: pr-windows-build / call-build-windows-package (Windows 64bit (Arm64), amd64_arm64, -DCMAKE_SYSTEM_NAME=Windows -DCMA...
  • GitHub Check: pr-windows-build / call-build-windows-package (Windows 64bit, x64, x64-windows-static, 3.31.6)
  • GitHub Check: pr-windows-build / call-build-windows-package (Windows 32bit, x86, x86-windows-static, 3.31.6)
  • GitHub Check: pr-windows-build / call-build-windows-package (Windows 64bit (Arm64), amd64_arm64, -DCMAKE_SYSTEM_NAME=Windows -DCMA...
  • GitHub Check: pr-windows-build / call-build-windows-package (Windows 32bit, x86, x86-windows-static, 3.31.6)
  • GitHub Check: pr-windows-build / call-build-windows-package (Windows 64bit, x64, x64-windows-static, 3.31.6)
  • GitHub Check: run-ubuntu-unit-tests (-DFLB_ARROW=On, 3.31.6, gcc, g++)
  • GitHub Check: run-ubuntu-unit-tests (-DFLB_SIMD=Off, 3.31.6, gcc, g++)
  • GitHub Check: run-ubuntu-unit-tests (-DFLB_SANITIZE_MEMORY=On, 3.31.6, clang, clang++)
  • GitHub Check: run-ubuntu-unit-tests (-DFLB_SIMD=On, 3.31.6, gcc, g++)
  • GitHub Check: run-ubuntu-unit-tests (-DFLB_SIMD=Off, 3.31.6, gcc, g++)
  • GitHub Check: run-ubuntu-unit-tests (-DFLB_SIMD=Off, 3.31.6, clang, clang++)
  • GitHub Check: run-ubuntu-unit-tests (-DFLB_SANITIZE_THREAD=On, 3.31.6, gcc, g++)
  • GitHub Check: run-ubuntu-unit-tests (-DFLB_ARROW=On, 3.31.6, gcc, g++)
  • GitHub Check: pr-compile-system-libs (-DFLB_PREFER_SYSTEM_LIBS=On, 3.31.6, gcc, g++, ubuntu-24.04, clang-14)
  • GitHub Check: run-ubuntu-unit-tests (-DFLB_SIMD=Off, 3.31.6, clang, clang++)
  • GitHub Check: pr-compile-system-libs (-DFLB_PREFER_SYSTEM_LIBS=On, 3.31.6, clang, clang++, ubuntu-24.04, clang-14)
  • GitHub Check: run-ubuntu-unit-tests (-DFLB_SIMD=On, 3.31.6, clang, clang++)
  • GitHub Check: run-ubuntu-unit-tests (-DFLB_SIMD=On, 3.31.6, clang, clang++)
  • GitHub Check: run-ubuntu-unit-tests (-DFLB_COVERAGE=On, 3.31.6, gcc, g++)
  • GitHub Check: run-ubuntu-unit-tests (-DFLB_JEMALLOC=On, 3.31.6, clang, clang++)
  • GitHub Check: pr-compile-system-libs (-DFLB_PREFER_SYSTEM_LIBS=On, 3.31.6, clang, clang++, ubuntu-22.04, clang-12)
  • GitHub Check: run-ubuntu-unit-tests (-DFLB_SANITIZE_MEMORY=On, 3.31.6, clang, clang++)
  • GitHub Check: pr-compile-system-libs (-DFLB_PREFER_SYSTEM_LIBS=On, 3.31.6, gcc, g++, ubuntu-22.04, clang-12)
  • GitHub Check: run-ubuntu-unit-tests (-DSANITIZE_ADDRESS=On, 3.31.6, clang, clang++)
  • GitHub Check: run-ubuntu-unit-tests (-DFLB_SIMD=On, 3.31.6, gcc, g++)
  • GitHub Check: run-ubuntu-unit-tests (-DFLB_COVERAGE=On, 3.31.6, gcc, g++)
  • GitHub Check: run-ubuntu-unit-tests (-DFLB_SANITIZE_THREAD=On, 3.31.6, clang, clang++)
  • GitHub Check: run-ubuntu-unit-tests (-DFLB_SANITIZE_THREAD=On, 3.31.6, clang, clang++)
  • GitHub Check: run-ubuntu-unit-tests (-DFLB_SANITIZE_MEMORY=On, 3.31.6, gcc, g++)
  • GitHub Check: run-ubuntu-unit-tests (-DFLB_SANITIZE_THREAD=On, 3.31.6, gcc, g++)
  • GitHub Check: run-ubuntu-unit-tests (-DSANITIZE_UNDEFINED=On, 3.31.6, clang, clang++)
  • GitHub Check: run-ubuntu-unit-tests (-DFLB_SMALL=On, 3.31.6, gcc, g++)
  • GitHub Check: run-ubuntu-unit-tests (-DFLB_SMALL=On, 3.31.6, clang, clang++)
  • GitHub Check: run-ubuntu-unit-tests (-DFLB_JEMALLOC=On, 3.31.6, clang, clang++)
  • GitHub Check: run-ubuntu-unit-tests (-DFLB_SANITIZE_MEMORY=On, 3.31.6, gcc, g++)
  • GitHub Check: run-ubuntu-unit-tests (-DFLB_JEMALLOC=Off, 3.31.6, gcc, g++)
  • GitHub Check: run-ubuntu-unit-tests (-DSANITIZE_UNDEFINED=On, 3.31.6, gcc, g++)
  • GitHub Check: run-ubuntu-unit-tests (-DSANITIZE_UNDEFINED=On, 3.31.6, gcc, g++)
  • GitHub Check: pr-compile-centos-7
  • GitHub Check: run-ubuntu-unit-tests (-DFLB_SMALL=On, 3.31.6, clang, clang++)
  • GitHub Check: run-ubuntu-unit-tests (-DSANITIZE_ADDRESS=On, 3.31.6, clang, clang++)
  • GitHub Check: run-ubuntu-unit-tests (-DFLB_JEMALLOC=On, 3.31.6, gcc, g++)
  • GitHub Check: run-ubuntu-unit-tests (-DFLB_JEMALLOC=Off, 3.31.6, clang, clang++)
  • GitHub Check: run-ubuntu-unit-tests (-DSANITIZE_ADDRESS=On, 3.31.6, gcc, g++)
  • GitHub Check: run-ubuntu-unit-tests (-DFLB_SMALL=On, 3.31.6, gcc, g++)
  • GitHub Check: run-ubuntu-unit-tests (-DFLB_JEMALLOC=Off, 3.31.6, clang, clang++)
  • GitHub Check: run-ubuntu-unit-tests (-DSANITIZE_ADDRESS=On, 3.31.6, gcc, g++)
  • GitHub Check: run-ubuntu-unit-tests (-DSANITIZE_UNDEFINED=On, 3.31.6, clang, clang++)
  • GitHub Check: run-ubuntu-unit-tests (-DFLB_JEMALLOC=Off, 3.31.6, gcc, g++)
  • GitHub Check: run-ubuntu-unit-tests (-DFLB_JEMALLOC=On, 3.31.6, gcc, g++)
  • GitHub Check: pr-compile-system-libs (-DFLB_PREFER_SYSTEM_LIBS=On, 3.31.6, clang, clang++, ubuntu-22.04, clang-12)
  • GitHub Check: pr-compile-system-libs (-DFLB_PREFER_SYSTEM_LIBS=On, 3.31.6, clang, clang++, ubuntu-24.04, clang-14)
  • GitHub Check: pr-compile-system-libs (-DFLB_PREFER_SYSTEM_LIBS=On, 3.31.6, gcc, g++, ubuntu-24.04, clang-14)
  • GitHub Check: PR - fuzzing test
  • GitHub Check: pr-compile-centos-7
  • GitHub Check: pr-compile-system-libs (-DFLB_PREFER_SYSTEM_LIBS=On, 3.31.6, gcc, g++, ubuntu-22.04, clang-12)
  • GitHub Check: PR - fuzzing test
🔇 Additional comments (2)
plugins/out_prometheus_remote_write/remote_write.c (2)

191-193: Good exclusion of 403 from the generic retry path.

This cleanly routes 403s to the specialized handler below.


220-226: Resolved — provider refresh is already synchronized

The AWS provider exposes a per-provider mutex and refresh implementations use try_lock_provider()/unlock_provider() (see include/fluent-bit/flb_aws_credentials.h and examples in src/aws/flb_aws_credentials_ec2.c and src/aws/flb_aws_credentials_profile.c); provider creation initializes the mutex. Calling ctx->aws_provider->provider_vtable->refresh(ctx->aws_provider) is safe — no extra mutex/rate-limit needed in this plugin.
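
The try-lock pattern described above can be sketched with plain pthreads. This is an assumption-laden illustration: `locked_provider`, `locked_provider_init`, and `locked_provider_refresh` are hypothetical names, and the sketch only mirrors the idea behind `try_lock_provider()` / `unlock_provider()`, not their actual implementation.

```c
#include <pthread.h>
#include <stdbool.h>

/* Hypothetical provider with its own mutex, created at init time. */
struct locked_provider {
    pthread_mutex_t lock;
    int refresh_count;  /* stand-in for "credentials were reloaded" */
};

static void locked_provider_init(struct locked_provider *p)
{
    pthread_mutex_init(&p->lock, NULL);
    p->refresh_count = 0;
}

/* Returns true if the refresh ran, false if the lock was already held. */
static bool locked_provider_refresh(struct locked_provider *p)
{
    if (pthread_mutex_trylock(&p->lock) != 0) {
        /* Someone else is refreshing; skip instead of blocking. */
        return false;
    }
    p->refresh_count++;  /* stand-in for re-reading the credentials */
    pthread_mutex_unlock(&p->lock);
    return true;
}
```

Because a caller that loses the race returns immediately, concurrent 403 responses would produce at most one refresh at a time, which is why no extra locking is suggested in the plugin itself.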

@Tradunsky Tradunsky changed the title ISSUE 9670: Adds AWS credentials refresh to out_prometheus_remote_write out_prometheus_remote_write: Adds AWS credentials refresh to out_prometheus_remote_write Oct 6, 2025
@Tradunsky
Contributor Author

Followed the code style.
Simplified the change.

Please review.

@Tradunsky Tradunsky changed the title out_prometheus_remote_write: Adds AWS credentials refresh to out_prometheus_remote_write out_prometheus_remote_write: refresh expired AWS credentials Oct 6, 2025
@Tradunsky Tradunsky changed the title out_prometheus_remote_write: refresh expired AWS credentials out_prometheus_remote_write: Refresh expired AWS credentials Oct 6, 2025
This fix handles the case of fluent-bit getting stuck retrying with expired AWS credentials.

Signed-off-by: Tradunsky <tradunskih@gmail.com>
@Tradunsky
Contributor Author

Ping.

@github-actions github-actions bot removed the Stale label Oct 24, 2025
@Tradunsky
Contributor Author

ping

Development

Successfully merging this pull request may close these issues.

AWS rolling credentials from file support for Prometheus