Loki Ruler: panic with 'fatal error: concurrent map read and map write' #11569
Comments
@els0r thanks a lot for this detailed bug report!
…#11601) **What this PR does / why we need it**: A ruler handling many hundreds of rules can provoke a situation where the WAL appender reads & modifies tenant configs concurrently in an unsafe way; this PR protects that with a mutex. **Which issue(s) this PR fixes**: Fixes #11569 (cherry picked from commit cd3cf62)
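For a concrete picture of the fix described above: the pattern is to guard the per-tenant config map with a mutex so concurrent appends cannot read and write it at the same time. The sketch below is illustrative only; the type, field, and function names are hypothetical and not taken from the Loki source.

```go
// Illustrative sketch of mutex-protected tenant configs (not Loki's code).
package ruler

import "sync"

type tenantConfig struct {
	RemoteWriteURL string
}

type appenderRegistry struct {
	mtx     sync.Mutex
	configs map[string]*tenantConfig // keyed by tenant ID
}

func newAppenderRegistry() *appenderRegistry {
	return &appenderRegistry{configs: map[string]*tenantConfig{}}
}

// getTenantConfig lazily creates and returns a tenant's config while holding
// the mutex, so concurrent rule evaluations cannot race on the map.
func (r *appenderRegistry) getTenantConfig(tenant string) *tenantConfig {
	r.mtx.Lock()
	defer r.mtx.Unlock()

	cfg, ok := r.configs[tenant]
	if !ok {
		cfg = &tenantConfig{}
		r.configs[tenant] = cfg
	}
	return cfg
}
```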
@els0r would you be willing to try a nightly build to see if your problem is addressed?
@dannykopping: thanks a lot for the quick turnaround. My colleague @verejoel is in the process of deploying your fix in our prod environment. We'll be back with feedback within the next few hours.
We're running pretty stable so far.
Super 👍 thanks for testing it out.
Expands on #11601 **What this PR does / why we need it**: Turns out the previous tests didn't expose all possible causes for data races (another one occurs at https://github.com/grafana/loki/blob/5a55158cc751465846383bc758aa0c169363b292/pkg/ruler/registry.go#L204). Moving the mutex to the calling function adds more safety. **Which issue(s) this PR fixes**: Fixes #11569 Signed-off-by: Danny Kopping <danny.kopping@grafana.com>
Expands on #11601 **What this PR does / why we need it**: Turns out the previous tests didn't expose all possible causes for data races (another one occurs at https://github.com/grafana/loki/blob/5a55158cc751465846383bc758aa0c169363b292/pkg/ruler/registry.go#L204). Moving the mutex to the calling function adds more safety. **Which issue(s) this PR fixes**: Fixes #11569 Signed-off-by: Danny Kopping <danny.kopping@grafana.com> (cherry picked from commit 61a4205)
#11714) Backport 61a4205 from #11612 --- Expands on #11601 **What this PR does / why we need it**: Turns out the previous tests didn't expose all possible causes for data races (another one occurs at https://github.com/grafana/loki/blob/5a55158cc751465846383bc758aa0c169363b292/pkg/ruler/registry.go#L204). Moving the mutex to the calling function adds more safety. **Which issue(s) this PR fixes**: Fixes #11569 Co-authored-by: Danny Kopping <danny.kopping@grafana.com>
#11715) Backport 61a4205 from #11612 --- Expands on #11601 **What this PR does / why we need it**: Turns out the previous tests didn't expose all possible causes for data races (another one occurs at https://github.com/grafana/loki/blob/5a55158cc751465846383bc758aa0c169363b292/pkg/ruler/registry.go#L204). Moving the mutex to the calling function adds more safety. **Which issue(s) this PR fixes**: Fixes #11569 Co-authored-by: Danny Kopping <danny.kopping@grafana.com>
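The follow-up change described in #11612 is subtler: rather than each helper locking around its own map access, the calling function takes the lock once so that every map read and write in the sequence is covered. A rough sketch of that structure follows; the names are hypothetical and not taken from the actual registry.go.

```go
// Illustrative sketch of "moving the mutex to the calling function".
package ruler

import "sync"

type tenantConfig struct {
	RemoteWriteEnabled bool
}

type walRegistry struct {
	mtx     sync.Mutex
	configs map[string]*tenantConfig
}

func newWALRegistry() *walRegistry {
	return &walRegistry{configs: map[string]*tenantConfig{}}
}

// Appender-facing entry point: the lock taken here covers every map access
// performed by the helper below.
func (w *walRegistry) getTenantConfig(tenant string) *tenantConfig {
	w.mtx.Lock()
	defer w.mtx.Unlock()
	return w.getTenantConfigLocked(tenant)
}

// getTenantConfigLocked must only be called with w.mtx held; it can then
// read and write the map freely without taking the lock itself.
func (w *walRegistry) getTenantConfigLocked(tenant string) *tenantConfig {
	cfg, ok := w.configs[tenant]
	if !ok {
		cfg = &tenantConfig{RemoteWriteEnabled: true}
		w.configs[tenant] = cfg
	}
	return cfg
}
```

Races of this kind are what Go's race detector (`go test -race`) is designed to flag, which may be why the additional race at registry.go#L204 only surfaced once the locking was exercised from more call paths.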
Describe the bug
After running for about 15 minutes, the Loki ruler suddenly panics with `fatal error: concurrent map read and map write`.
We evaluate around 600 rules every minute. The crashes occur regularly: we have been seeing frequent container restarts since 25.12.2023, 12:00 CET.
The stack trace suggests that this has to do with WAL handling.
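For context on the error itself: `fatal error: concurrent map read and map write` is raised by the Go runtime whenever a plain map is accessed from multiple goroutines without synchronization, and unlike an ordinary panic it cannot be recovered. A minimal standalone program (unrelated to Loki) that triggers the same failure:

```go
// Minimal reproduction of the runtime's concurrent-map check.
package main

func main() {
	configs := map[string]int{"tenant-a": 1}

	// Writer goroutine: mutates the map continuously.
	go func() {
		for i := 0; ; i++ {
			configs["tenant-a"] = i
		}
	}()

	// Reader in the main goroutine: the runtime eventually detects the
	// unsynchronized read/write pair and aborts the whole process.
	for {
		_ = configs["tenant-a"]
	}
}
```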
To Reproduce
Steps to reproduce the behavior:
kubectl -n loki logs -f ruler-<num>
Expected behavior
Concurrency-safe map access; the ruler keeps running.
Environment:
Screenshots, Promtail config, or terminal output
Panic (please let me know if you require the full stack trace)
Restart pattern from `kubectl -n loki get pods`:
Please let me know if there's more information required to look at this in detail.
Thanks and happy new year folks! 🥳