-
Notifications
You must be signed in to change notification settings - Fork 3.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
fix(promtail): Fix bug with Promtail config reloading getting stuck indefinitely #12795
Conversation
Changelog entries will be auto generated now via conventional commit format for the PR title, see the failed check here
We definitely need a test. I imagine if we added a test that calls
It looks okay but why can't we at least verify out of band before we consider merging? I would assume that if the agent or alloy codebase is still pulling in upstream promtail code then this change could be hacked in somehow (via go mod I guess since I think you guys are not using a vendor directory) and deployed so that we can trigger a config reload and see if there's still a deadlock. |
…ndefinitely Signed-off-by: Paulin Todev <paulin.todev@gmail.com>
d97954b
to
86ad684
Compare
@cstyan thank you so much for the quick and thorough feedback! The problem with the test is that we need to make sure I updated the PR with a test which I believe works.
I'm just not sure if I can replicate the circumstances required for this bug in real life. I think it's most likely to replicate if there is a long list of directories to watch, and a very quick config reload frequency. I could try replicating it next week, but if I'm not successful in a few hours I think we should just merge the PR. I do believe that the PR fixes a real bug anyway. |
One way to do this is to call |
Signed-off-by: Paulin Todev <paulin.todev@gmail.com>
86ad684
to
0bcaabc
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
approved 👍 just one last nit to be fixed
👍 thanks for your patience and continued effort with various promtail upstream work @ptodev
@cstyan No worries, sorry for the late reply - I removed the "continue" comment just now. |
Hello @MasslessParticle!
Please, if the current pull request addresses a bug fix, label it with the |
The backport to
To backport manually, run these commands in your terminal: # Fetch latest updates from GitHub
git fetch
# Create a new branch
git switch --create backport-12795-to-k190 origin/k190
# Cherry-pick the merged commit of this pull request and resolve the conflicts
git cherry-pick -x 4d761acd85b90cbdcafdf8d2547f0db14f6ae4dd When the conflicts are resolved, stage and commit the changes:
If you have the GitHub CLI installed: # Push the branch to GitHub:
git push --set-upstream origin backport-12795-to-k190
# Create the PR body template
PR_BODY=$(gh pr view 12795 --json body --template 'Backport 4d761acd85b90cbdcafdf8d2547f0db14f6ae4dd from #12795{{ "\n\n---\n\n" }}{{ index . "body" }}')
# Create the PR on GitHub
echo "${PR_BODY}" | gh pr create --title "chore: [k190] fix(promtail): Fix bug with Promtail config reloading getting stuck indefinitely" --body-file - --label "size/L" --label "type/bug" --label "backport" --base k190 --milestone k190 --web Or, if you don't have the GitHub CLI installed (we recommend you install it!): # Push the branch to GitHub:
git push --set-upstream origin backport-12795-to-k190
# Create a pull request where the `base` branch is `k190` and the `compare`/`head` branch is `backport-12795-to-k190`.
# Remove the local backport branch
git switch main
git branch -D backport-12795-to-k190 |
What this PR does / why we need it:
Recently, a memory issue was reported with the Agent Static mode. The memory of the Agent was creeping up steadily, until it eventually OOMs. That Agent was having its config reloaded every 30 seconds.
A goroutine dump indicated that these calls have been taking a long time:
What is probably happening is that
FileTargetManager
begins aStop()
, but doesn't yet close thetargetEventHandler
channel. As a result,startWatching
andstopWatching
seem stuck with sending on the channel. This causes thesync
call to never complete, which on the other hand means that theFileTarget
'sStop()
function can't complete.The memory build up is probably due to lots of calls to the config reload function which never complete.
cc @paul1r who recently committed similar fixes.
Should I add a changelog entry? And do you think there is a way to test this? Also, I haven't yet tested with the customer. If you think the code looks ok, we could merge it and verify later that it does fix the customer issue?
Checklist
CONTRIBUTING.md
guide (required)docs/sources/setup/upgrade/_index.md
production/helm/loki/Chart.yaml
and updateproduction/helm/loki/CHANGELOG.md
andproduction/helm/loki/README.md
. Example PRdeprecated-config.yaml
anddeleted-config.yaml
files respectively in thetools/deprecated-config-checker
directory. Example PR