fix(promtail): Fix bug with Promtail config reloading getting stuck indefinitely #12795

Merged
merged 4 commits into from
May 9, 2024

Contributor

@ptodev ptodev commented Apr 25, 2024

What this PR does / why we need it:

Recently, a memory issue was reported with the Agent in Static mode. The Agent's memory usage crept up steadily until the process eventually ran out of memory (OOM). That Agent was having its config reloaded every 30 seconds.

A goroutine dump indicated that these calls have been taking a long time:

goroutine 152484 [chan receive, 1214 minutes]:
github.com/grafana/loki/clients/pkg/promtail/targets/file.(*FileTarget).Stop(...)
	/go/pkg/mod/github.com/grafana/loki@v1.6.2-0.20231004111112-07cbef92268a/clients/pkg/promtail/targets/file/filetarget.go:159

goroutine 152424 [chan send, 1220 minutes]:
github.com/grafana/loki/clients/pkg/promtail/targets/file.(*FileTarget).startWatching(0xc0030ab5f0, 0xc002d9fe48?)
	/go/pkg/mod/github.com/grafana/loki@v1.6.2-0.20231004111112-07cbef92268a/clients/pkg/promtail/targets/file/filetarget.go:314 +0x20a

goroutine 152426 [chan send, 1220 minutes]:
github.com/grafana/loki/clients/pkg/promtail/targets/file.(*FileTarget).startWatching(0xc0030ab6c0, 0xc002e31e48?)
	/go/pkg/mod/github.com/grafana/loki@v1.6.2-0.20231004111112-07cbef92268a/clients/pkg/promtail/targets/file/filetarget.go:314 +0x20a

goroutine 152428 [chan send, 1210 minutes]:
github.com/grafana/loki/clients/pkg/promtail/targets/file.(*FileTarget).stopWatching(0xc0030ab790, 0xc002da1d88?)
	/go/pkg/mod/github.com/grafana/loki@v1.6.2-0.20231004111112-07cbef92268a/clients/pkg/promtail/targets/file/filetarget.go:327 +0x20a

What is probably happening is that FileTargetManager begins a Stop() but doesn't yet close the targetEventHandler channel. As a result, startWatching and stopWatching get stuck sending on that channel. This prevents the sync call from ever completing, which in turn means that the FileTarget's Stop() function can't complete either.

The memory buildup is probably due to many calls to the config reload function that never complete.

cc @paul1r who recently committed similar fixes.

Should I add a changelog entry? And do you think there is a way to test this? Also, I haven't yet tested with the customer. If you think the code looks ok, we could merge it and verify later that it does fix the customer issue?

Checklist

  • Reviewed the CONTRIBUTING.md guide (required)
  • Documentation added
  • Tests updated
  • Title matches the required conventional commits format, see here
  • Changes that require user attention or interaction to upgrade are documented in docs/sources/setup/upgrade/_index.md
  • For Helm chart changes bump the Helm chart version in production/helm/loki/Chart.yaml and update production/helm/loki/CHANGELOG.md and production/helm/loki/README.md. Example PR
  • If the change is deprecating or removing a configuration option, update the deprecated-config.yaml and deleted-config.yaml files respectively in the tools/deprecated-config-checker directory. Example PR

@ptodev ptodev requested a review from a team as a code owner April 25, 2024 17:28
@ptodev ptodev changed the title Fix issue with stopping a target during a sync Fix bug with Promtail config reloading getting stuck indefinitely Apr 25, 2024
@cstyan
Contributor

cstyan commented Apr 25, 2024

Should I add a changelog entry? Also, I haven't yet tested with the customer.

Changelog entries will be auto generated now via conventional commit format for the PR title, see the failed check here

And do you think there is a way to test this?

We definitely need a test. I imagine we could add a test that calls Stop() or startWatching/stopWatching in a goroutine and uses a timeout to fail if those functions don't return within X seconds. It should fail without the changes in your current commit and pass with them.

If you think the code looks ok, we could merge it and verify later that it does fix the customer issue?

It looks okay but why can't we at least verify out of band before we consider merging? I would assume that if the Agent or Alloy codebase is still pulling in upstream Promtail code, then this change could be hacked in somehow (via go mod, I guess, since I think you guys are not using a vendor directory) and deployed, so that we can trigger a config reload and see if there's still a deadlock.

…ndefinitely

Signed-off-by: Paulin Todev <paulin.todev@gmail.com>
@ptodev ptodev force-pushed the ptodev/fix-target-stop branch from d97954b to 86ad684 Compare April 26, 2024 18:23
@pull-request-size pull-request-size bot added size/L and removed size/M labels Apr 26, 2024
@ptodev
Contributor Author

ptodev commented Apr 26, 2024

@cstyan thank you so much for the quick and thorough feedback!

The problem with the test is that we need to make sure FileTarget has already started a sync, but has not yet sent all its data to the channel. If we call Stop() after it already sent the data on the channel, or before it kicked off a sync, then the test wouldn't be valid.

I updated the PR with a test which I believe works.

It looks okay but why can't we at least verify out of band before we consider merging?

I'm just not sure if I can replicate the circumstances required for this bug in real life. I think it's most likely to reproduce when there is a long list of directories to watch and the config is reloaded very frequently. I could try replicating it next week, but if I'm not successful within a few hours, I think we should just merge the PR. I do believe that the PR fixes a real bug anyway.

@ptodev
Contributor Author

ptodev commented Apr 26, 2024

The problem with the test is that we need to make sure FileTarget has already started a sync, but has not yet sent all its data to the channel.

One way to do this is to call sync directly, just like some other tests do. However, I want to avoid this because I don't want to make assumptions about what sync's internals are.

@ptodev ptodev changed the title Fix bug with Promtail config reloading getting stuck indefinitely fix(promtail): Fix bug with Promtail config reloading getting stuck indefinitely Apr 26, 2024
Signed-off-by: Paulin Todev <paulin.todev@gmail.com>
@ptodev ptodev force-pushed the ptodev/fix-target-stop branch from 86ad684 to 0bcaabc Compare April 26, 2024 18:38
@ptodev ptodev requested a review from cstyan April 30, 2024 17:43
Contributor

@cstyan cstyan left a comment

approved 👍 just one last nit to be fixed

👍 thanks for your patience and continued effort with various promtail upstream work @ptodev

@ptodev ptodev requested a review from cstyan May 8, 2024 17:12
@ptodev
Contributor Author

ptodev commented May 8, 2024

@cstyan No worries, sorry for the late reply - I removed the "continue" comment just now.

@cstyan cstyan merged commit 4d761ac into main May 9, 2024
58 checks passed
@cstyan cstyan deleted the ptodev/fix-target-stop branch May 9, 2024 17:56
@grafanabot
Collaborator

Hello @MasslessParticle!
Backport pull requests need to be either:

  • Pull requests which address bugs,
  • Urgent fixes which need product approval, in order to get merged,
  • Docs changes.

Please, if the current pull request addresses a bug fix, label it with the type/bug label.
If it already has the product approval, please add the product-approved label. For docs changes, please add the type/docs label.
If the pull request modifies CI behaviour, please add the type/ci label.
If none of the above applies, please consider removing the backport label and target the next major/minor release.
Thanks!

@MasslessParticle MasslessParticle added the type/bug Something is not working as expected label May 10, 2024
@grafanabot
Collaborator

The backport to k190 failed:

The process '/usr/bin/git' failed with exit code 1

To backport manually, run these commands in your terminal:

# Fetch latest updates from GitHub
git fetch
# Create a new branch
git switch --create backport-12795-to-k190 origin/k190
# Cherry-pick the merged commit of this pull request and resolve the conflicts
git cherry-pick -x 4d761acd85b90cbdcafdf8d2547f0db14f6ae4dd

When the conflicts are resolved, stage and commit the changes:

git add . && git cherry-pick --continue

If you have the GitHub CLI installed:

# Push the branch to GitHub:
git push --set-upstream origin backport-12795-to-k190
# Create the PR body template
PR_BODY=$(gh pr view 12795 --json body --template 'Backport 4d761acd85b90cbdcafdf8d2547f0db14f6ae4dd from #12795{{ "\n\n---\n\n" }}{{ index . "body" }}')
# Create the PR on GitHub
echo "${PR_BODY}" | gh pr create --title "chore: [k190] fix(promtail): Fix bug with Promtail config reloading getting stuck indefinitely" --body-file - --label "size/L" --label "type/bug" --label "backport" --base k190 --milestone k190 --web

Or, if you don't have the GitHub CLI installed (we recommend you install it!):

# Push the branch to GitHub:
git push --set-upstream origin backport-12795-to-k190

# Create a pull request where the `base` branch is `k190` and the `compare`/`head` branch is `backport-12795-to-k190`.

# Remove the local backport branch
git switch main
git branch -D backport-12795-to-k190

MasslessParticle pushed a commit that referenced this pull request May 10, 2024
…ndefinitely (#12795)

Signed-off-by: Paulin Todev <paulin.todev@gmail.com>
(cherry picked from commit 4d761ac)

4 participants