-
Notifications
You must be signed in to change notification settings - Fork 2.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Recover from panics in Series calls #6077
Conversation
a55f72c
to
6b2819d
Compare
6b2819d
to
541ebb0
Compare
This commit adds panic recovery for Series calls in all Store servers. Signed-off-by: Filip Petkovski <filip.petkovsky@gmail.com>
541ebb0
to
4c5a616
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🙇
@bwplotka @GiedriusS feel free to suggestion other ways to fix this. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this looks good! 🙇
pkg/store/recover.go
Outdated
level.Error(r.logger).Log("err", err) | ||
} | ||
default: | ||
if err := srv.Send(storepb.NewWarnSeriesResponse(errors.New("Unknown error"))); err != nil { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we include e
here into the message?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am not sure if that will work if err
is not a runtime error or if it has no String
method. Not too familiar with panic recovery tbh.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why not? It could be a regular string. I'd suggest adding fmt.Sprintf("unknown error while processing Series: %v", e)
. Example: https://go.dev/play/p/czfvKKtvWXG
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Got it, makes sense 👍
Signed-off-by: Filip Petkovski <filip.petkovsky@gmail.com>
* Update Thanos engine to latest version (thanos-io#6069) This commit updates the Thanos PromQL engine to the latest version. Signed-off-by: Filip Petkovski <filip.petkovsky@gmail.com> Signed-off-by: Filip Petkovski <filip.petkovsky@gmail.com> * Receive: Tenants' external labels proposal (thanos-io#5720) * Receive external labels proposal Signed-off-by: haanhvu <haanh6594@gmail.com> * Restructure and edit proposal's content Signed-off-by: haanhvu <haanh6594@gmail.com> * Update proposal Signed-off-by: haanhvu <haanh6594@gmail.com> * Fix doc error Signed-off-by: haanhvu <haanh6594@gmail.com> Signed-off-by: haanhvu <haanh6594@gmail.com> * fixing doc CI (thanos-io#6072) Signed-off-by: Ben Ye <benye@amazon.com> Signed-off-by: Ben Ye <benye@amazon.com> * Fix stores filtering resets on reload (thanos-io#6063) * Fix stores filtering resets on reload `g0.store_matches` parameter appears in the url but doesn't applies in the frontend. Looks like it has been done on purpose and by removing a small piece of code fixes this issue. variable named `debugMode` is used for the store filtering checkbox which is an unappropriate name. Using `enableStoreFiltering` variable to represent the state of checkbox. Signed-off-by: Pradyumna Krishna <git@onpy.in> * Regenerate bindata.go Signed-off-by: Pradyumna Krishna <git@onpy.in> Signed-off-by: Pradyumna Krishna <git@onpy.in> * Store: Make initial sync more robust Added re-try mechanism for store inital sync, where if the initial sync fails, it tries to do the initial sync again for given timeout duration. Signed-off-by: Kartik-Garg <kartik.garg@infracloud.io> * Recover from panics in Series calls (thanos-io#6077) * Recover from panics in Series calls This commit adds panic recovery for Series calls in all Store servers. Signed-off-by: Filip Petkovski <filip.petkovsky@gmail.com> * Apply error suggestion Signed-off-by: Filip Petkovski <filip.petkovsky@gmail.com> --------- Signed-off-by: Filip Petkovski <filip.petkovsky@gmail.com> * query: reuse our own gate (thanos-io#6079) Do not call promgate directly but rather use our own wrapper that does everything we want - duration histogram, current in-flight calls, total calls. Signed-off-by: Giedrius Statkevičius <giedrius.statkevicius@vinted.com> * Store: Support disable cache index header file. (thanos-io#5773) * Store: Support disable cache index header file. Signed-off-by: wanjunlei <wanjunlei@kubesphere.io> * Store: add a seprate flag to disable caching index header file Signed-off-by: wanjunlei <wanjunlei@kubesphere.io> * Tools: add cleanup API for bucket web Signed-off-by: wanjunlei <wanjunlei@kubesphere.io> * resolve conversation Signed-off-by: wanjunlei <wanjunlei@kubesphere.io> * resolve confilcts Signed-off-by: wanjunlei <wanjunlei@kubesphere.io> * change the flag to `--cache-index-header` Signed-off-by: wanjunlei <wanjunlei@kubesphere.io> * Wrap mem writer in file writer Signed-off-by: wanjunlei <wanjunlei@kubesphere.io> * update CHANGELOG Signed-off-by: wanjunlei <wanjunlei@kubesphere.io> * update CHANGELOG Signed-off-by: wanjunlei <wanjunlei@kubesphere.io> * fix bug Signed-off-by: wanjunlei <wanjunlei@kubesphere.io> --------- Signed-off-by: wanjunlei <wanjunlei@kubesphere.io> Co-authored-by: wanjunlei <wanjunlei@yujnify.com> * CVE: Fix Receiver malicious tenant (thanos-io#5969) If running as root or with enough privileges, receiver can create a directory outside of the configured TenantHeader. This commit fixes it up by sanitizing the user input and explicity not allowing such behavior. Signed-off-by: Daniel Mellado <dmellado@redhat.com> * Add adopter Grupo MasMovil (thanos-io#6084) Signed-off-by: Pablo Moncada Isla <pablo.moncada@masmovil.com> * fix typo (thanos-io#6087) Signed-off-by: cyip <cyip@jackhenry.com> Co-authored-by: cyip <cyip@jackhenry.com> * optimize selector to string (thanos-io#6076) Signed-off-by: Kama Huang <kamatogo13@gmail.com> * Fix: Failure to close BlockSeriesClient cause store-gateway deadlock (thanos-io#6086) * Fix: Failure to close BlockSeriesClient cause store-gateway deadlock Signed-off-by: Alan Protasio <alanprot@gmail.com> * Adding tests Signed-off-by: Alan Protasio <alanprot@gmail.com> * reverting the change on get series Signed-off-by: Alan Protasio <alanprot@gmail.com> * fix lint Signed-off-by: Alan Protasio <alanprot@gmail.com> --------- Signed-off-by: Alan Protasio <alanprot@gmail.com> * Cut 0.30.2 (thanos-io#6081) * tracing: fixed panic because of nil sampler (thanos-io#6066) * fixed panic because of nil sampler Signed-off-by: Vasiliy Rumyantsev <4119114+xBazilio@users.noreply.github.com> * added CHANGELOG entry Signed-off-by: Vasiliy Rumyantsev <4119114+xBazilio@users.noreply.github.com> Signed-off-by: Vasiliy Rumyantsev <4119114+xBazilio@users.noreply.github.com> * bump version to 0.30.2 Signed-off-by: Ben Ye <benye@amazon.com> * Updates busybox SHA (thanos-io#6046) Signed-off-by: GitHub <noreply@github.com> Signed-off-by: GitHub <noreply@github.com> Co-authored-by: yeya24 <yeya24@users.noreply.github.com> * Use `e2edb.NewMinio` to disable SSE-S3 in e2e tests (thanos-io#6055) * Use e2edb.NewMinio to disable SSE Signed-off-by: Saswata Mukherjee <saswataminsta@yahoo.com> * Use temp fork for TLS Signed-off-by: Saswata Mukherjee <saswataminsta@yahoo.com> * Fix broken rules api fanout test Signed-off-by: Saswata Mukherjee <saswataminsta@yahoo.com> * Fix broken query compatibility test Signed-off-by: Saswata Mukherjee <saswataminsta@yahoo.com> * Remove fork Signed-off-by: Saswata Mukherjee <saswataminsta@yahoo.com> Signed-off-by: Saswata Mukherjee <saswataminsta@yahoo.com> --------- Signed-off-by: Vasiliy Rumyantsev <4119114+xBazilio@users.noreply.github.com> Signed-off-by: Ben Ye <benye@amazon.com> Signed-off-by: GitHub <noreply@github.com> Signed-off-by: Saswata Mukherjee <saswataminsta@yahoo.com> Co-authored-by: Vasiliy Rumyantsev <4119114+xBazilio@users.noreply.github.com> Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com> Co-authored-by: yeya24 <yeya24@users.noreply.github.com> Co-authored-by: Saswata Mukherjee <saswataminsta@yahoo.com> * cherry pick store gateway fix to release 0.30 (thanos-io#6089) * Fix: Failure to close BlockSeriesClient cause store-gateway deadlock (thanos-io#6086) * Fix: Failure to close BlockSeriesClient cause store-gateway deadlock Signed-off-by: Alan Protasio <alanprot@gmail.com> * Adding tests Signed-off-by: Alan Protasio <alanprot@gmail.com> * reverting the change on get series Signed-off-by: Alan Protasio <alanprot@gmail.com> * fix lint Signed-off-by: Alan Protasio <alanprot@gmail.com> --------- Signed-off-by: Alan Protasio <alanprot@gmail.com> * update changelog Signed-off-by: Ben Ye <benye@amazon.com> --------- Signed-off-by: Alan Protasio <alanprot@gmail.com> Signed-off-by: Ben Ye <benye@amazon.com> Co-authored-by: Alan Protasio <alanprot@gmail.com> * fix changelog entries Signed-off-by: Ben Ye <benye@amazon.com> * docs: improving the description for tsdb.retention on the receiver Signed-off-by: Victor Fernandes <victorhbfernandes@gmail.com> * Receiver: Use `intern` package when reallocating label strings (thanos-io#5926) * Cleanup go mod Signed-off-by: Matej Gera <matejgera@gmail.com> * Use string interning for labels realloc method Signed-off-by: Matej Gera <matejgera@gmail.com> * Enhance label realloc benchmarks Signed-off-by: Matej Gera <matejgera@gmail.com> * Make interning optional; put behind hiddend flag Signed-off-by: Matej Gera <matej.gera@coralogix.com> * Update CHANGELOG Signed-off-by: Matej Gera <matej.gera@coralogix.com> * Address feedback - Fix wrong condition - Adjust benchmarks Signed-off-by: Matej Gera <matej.gera@coralogix.com> --------- Signed-off-by: Matej Gera <matejgera@gmail.com> Signed-off-by: Matej Gera <matej.gera@coralogix.com> Signed-off-by: Matej Gera <38492574+matej-g@users.noreply.github.com> * Updaing README with drawing fixes and minor wording clarification (thanos-io#6078) * New drawing and wording for Thanos other deployment models Signed-off-by: Jonah Kowall <jkowall@kowall.net> * New drawing and wording for Thanos other deployment models Signed-off-by: Jonah Kowall <jkowall@kowall.net> * Added comments to README.md and updated the quick-tutorial.md with the same diagram updates and text to match Signed-off-by: Jonah Kowall <jkowall@kowall.net> * Ran make docs Signed-off-by: Jonah Kowall <jkowall@kowall.net> --------- Signed-off-by: Jonah Kowall <jkowall@kowall.net> * Compact: Remove spam of replica label removed log (thanos-io#6088) * Remove spam of replica label removed log Signed-off-by: Douglas Camata <159076+douglascamata@users.noreply.github.com> * Reduce amount of removed replica label instead of removing it Signed-off-by: Douglas Camata <159076+douglascamata@users.noreply.github.com> * Reformat code Signed-off-by: Douglas Camata <159076+douglascamata@users.noreply.github.com> --------- Signed-off-by: Douglas Camata <159076+douglascamata@users.noreply.github.com> * Store: Don't error when no stores are matched (thanos-io#6082) It's normal and not an error if a query does not match due to no downstream stores. This is common when querying with external labels and tiered query servers. This bug was introduced in thanos-io#5296 Fixes: thanos-io#5862 Signed-off-by: SuperQ <superq@gmail.com> * docs: Fix must have Ruler alerts definition (thanos-io#6058) * Fix must have Ruler alerts definition ThanosRuler missing rule intervals metric used the wrong comparator sign, confusing users trying to create the rule. Signed-off-by: Maxim Muzafarov <m.muzafarov@gmail.com> * Update docs/components/rule.md Co-authored-by: Saswata Mukherjee <saswataminsta@yahoo.com> Signed-off-by: Maxim Muzafarov <m.muzafarov@gmail.com> --------- Signed-off-by: Maxim Muzafarov <m.muzafarov@gmail.com> Co-authored-by: Saswata Mukherjee <saswataminsta@yahoo.com> * Fix conflicts Signed-off-by: haanhvu <haanh6594@gmail.com> * Specify overwriting behavior in flag and add validation Signed-off-by: haanhvu <haanh6594@gmail.com> * Add log and doc Signed-off-by: haanhvu <haanh6594@gmail.com> * Mixins(Rule): Fix query for long rule evaluations (thanos-io#6103) * mixin(Rule): Fix query for long rule evaluations Signed-off-by: Douglas Camata <159076+douglascamata@users.noreply.github.com> * Update changelog Signed-off-by: Douglas Camata <159076+douglascamata@users.noreply.github.com> --------- Signed-off-by: Douglas Camata <159076+douglascamata@users.noreply.github.com> --------- Signed-off-by: Filip Petkovski <filip.petkovsky@gmail.com> Signed-off-by: haanhvu <haanh6594@gmail.com> Signed-off-by: Ben Ye <benye@amazon.com> Signed-off-by: Pradyumna Krishna <git@onpy.in> Signed-off-by: Kartik-Garg <kartik.garg@infracloud.io> Signed-off-by: Giedrius Statkevičius <giedrius.statkevicius@vinted.com> Signed-off-by: wanjunlei <wanjunlei@kubesphere.io> Signed-off-by: Daniel Mellado <dmellado@redhat.com> Signed-off-by: Pablo Moncada Isla <pablo.moncada@masmovil.com> Signed-off-by: cyip <cyip@jackhenry.com> Signed-off-by: Kama Huang <kamatogo13@gmail.com> Signed-off-by: Alan Protasio <alanprot@gmail.com> Signed-off-by: Vasiliy Rumyantsev <4119114+xBazilio@users.noreply.github.com> Signed-off-by: GitHub <noreply@github.com> Signed-off-by: Saswata Mukherjee <saswataminsta@yahoo.com> Signed-off-by: Victor Fernandes <victorhbfernandes@gmail.com> Signed-off-by: Matej Gera <matejgera@gmail.com> Signed-off-by: Matej Gera <matej.gera@coralogix.com> Signed-off-by: Matej Gera <38492574+matej-g@users.noreply.github.com> Signed-off-by: Jonah Kowall <jkowall@kowall.net> Signed-off-by: Douglas Camata <159076+douglascamata@users.noreply.github.com> Signed-off-by: SuperQ <superq@gmail.com> Signed-off-by: Maxim Muzafarov <m.muzafarov@gmail.com> Signed-off-by: Sebastian Rabenhorst <sebastian.rabenhorst@shopify.com> Co-authored-by: Filip Petkovski <filip.petkovsky@gmail.com> Co-authored-by: Ha Anh Vu <75315486+haanhvu@users.noreply.github.com> Co-authored-by: Ben Ye <benye@amazon.com> Co-authored-by: Pradyumna Krishna <git@onpy.in> Co-authored-by: Kartik-Garg <kartik.garg@infracloud.io> Co-authored-by: Giedrius Statkevičius <giedrius.statkevicius@vinted.com> Co-authored-by: wanjunlei <53003665+wanjunlei@users.noreply.github.com> Co-authored-by: wanjunlei <wanjunlei@yujnify.com> Co-authored-by: Daniel Mellado <1313475+danielmellado@users.noreply.github.com> Co-authored-by: Pablo Moncada <pmoncadaisla@gmail.com> Co-authored-by: Chantel Yip <52993239+sshantel@users.noreply.github.com> Co-authored-by: cyip <cyip@jackhenry.com> Co-authored-by: Kama Huang <121007071+kama910@users.noreply.github.com> Co-authored-by: Alan Protasio <alanprot@gmail.com> Co-authored-by: Vasiliy Rumyantsev <4119114+xBazilio@users.noreply.github.com> Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com> Co-authored-by: yeya24 <yeya24@users.noreply.github.com> Co-authored-by: Saswata Mukherjee <saswataminsta@yahoo.com> Co-authored-by: Victor Fernandes <victorhbfernandes@gmail.com> Co-authored-by: Matej Gera <38492574+matej-g@users.noreply.github.com> Co-authored-by: Jonah Kowall <jkowall@kowall.net> Co-authored-by: Douglas Camata <159076+douglascamata@users.noreply.github.com> Co-authored-by: Ben Kochie <superq@gmail.com> Co-authored-by: Maxim Muzafarov <m.muzafarov@gmail.com> Co-authored-by: haanhvu <haanh6594@gmail.com>
* Recover from panics in Series calls This commit adds panic recovery for Series calls in all Store servers. Signed-off-by: Filip Petkovski <filip.petkovsky@gmail.com> * Apply error suggestion Signed-off-by: Filip Petkovski <filip.petkovsky@gmail.com> --------- Signed-off-by: Filip Petkovski <filip.petkovsky@gmail.com>
* Recover from panics in Series calls This commit adds panic recovery for Series calls in all Store servers. Signed-off-by: Filip Petkovski <filip.petkovsky@gmail.com> * Apply error suggestion Signed-off-by: Filip Petkovski <filip.petkovsky@gmail.com> --------- Signed-off-by: Filip Petkovski <filip.petkovsky@gmail.com>
This commit adds panic recovery for Series calls in all Store servers.
The PR is motivated by an issue we've started seeing on daily basis where Receive components crash during compaction.
A similar issue has been reported in #3497, and apparently fixed in prometheus/prometheus#8723, but I am not convinced the root cause is gone.
Changes
Verification