Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Recover from panics in Series calls #6077

Merged
merged 2 commits into from
Jan 27, 2023

Conversation

fpetkovski
Copy link
Contributor

This commit adds panic recovery for Series calls in all Store servers.

The PR is motivated by an issue we've started seeing on daily basis where Receive components crash during compaction.
A similar issue has been reported in #3497, and apparently fixed in prometheus/prometheus#8723, but I am not convinced the root cause is gone.

  • I added CHANGELOG entry for this change.
  • Change is not relevant to the end user.

Changes

Verification

@fpetkovski fpetkovski force-pushed the recoverable-server branch 2 times, most recently from a55f72c to 6b2819d Compare January 27, 2023 12:54
@pull-request-size pull-request-size bot added size/L and removed size/M labels Jan 27, 2023
This commit adds panic recovery for Series calls in all Store servers.

Signed-off-by: Filip Petkovski <filip.petkovsky@gmail.com>
matej-g
matej-g previously approved these changes Jan 27, 2023
Copy link
Collaborator

@matej-g matej-g left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🙇

@fpetkovski
Copy link
Contributor Author

@bwplotka @GiedriusS feel free to suggestion other ways to fix this.

GiedriusS
GiedriusS previously approved these changes Jan 27, 2023
Copy link
Member

@GiedriusS GiedriusS left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this looks good! 🙇

level.Error(r.logger).Log("err", err)
}
default:
if err := srv.Send(storepb.NewWarnSeriesResponse(errors.New("Unknown error"))); err != nil {
Copy link
Member

@GiedriusS GiedriusS Jan 27, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we include e here into the message?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not sure if that will work if err is not a runtime error or if it has no String method. Not too familiar with panic recovery tbh.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why not? It could be a regular string. I'd suggest adding fmt.Sprintf("unknown error while processing Series: %v", e). Example: https://go.dev/play/p/czfvKKtvWXG

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Got it, makes sense 👍

Signed-off-by: Filip Petkovski <filip.petkovsky@gmail.com>
@fpetkovski fpetkovski dismissed stale reviews from GiedriusS and matej-g via e287217 January 27, 2023 15:09
@matej-g matej-g enabled auto-merge (squash) January 27, 2023 15:36
@matej-g matej-g merged commit 30fd69d into thanos-io:main Jan 27, 2023
rabenhorst added a commit to rabenhorst/thanos that referenced this pull request Feb 8, 2023
* Update Thanos engine to latest version (thanos-io#6069)

This commit updates the Thanos PromQL engine to the latest version.

Signed-off-by: Filip Petkovski <filip.petkovsky@gmail.com>

Signed-off-by: Filip Petkovski <filip.petkovsky@gmail.com>

* Receive: Tenants' external labels proposal (thanos-io#5720)

* Receive external labels proposal

Signed-off-by: haanhvu <haanh6594@gmail.com>

* Restructure and edit proposal's content

Signed-off-by: haanhvu <haanh6594@gmail.com>

* Update proposal

Signed-off-by: haanhvu <haanh6594@gmail.com>

* Fix doc error

Signed-off-by: haanhvu <haanh6594@gmail.com>

Signed-off-by: haanhvu <haanh6594@gmail.com>

* fixing doc CI (thanos-io#6072)

Signed-off-by: Ben Ye <benye@amazon.com>

Signed-off-by: Ben Ye <benye@amazon.com>

* Fix stores filtering resets on reload (thanos-io#6063)

* Fix stores filtering resets on reload

`g0.store_matches` parameter appears in the url but doesn't applies
in the frontend. Looks like it has been done on purpose and by
removing a small piece of code fixes this issue.

variable named `debugMode` is used for the store filtering checkbox
which is an unappropriate name. Using `enableStoreFiltering`
variable to represent the state of checkbox.

Signed-off-by: Pradyumna Krishna <git@onpy.in>

* Regenerate bindata.go

Signed-off-by: Pradyumna Krishna <git@onpy.in>

Signed-off-by: Pradyumna Krishna <git@onpy.in>

* Store: Make initial sync more robust

Added re-try mechanism for store inital sync, where if the initial sync fails, it tries to do the initial sync again for given timeout duration.

Signed-off-by: Kartik-Garg <kartik.garg@infracloud.io>

* Recover from panics in Series calls (thanos-io#6077)

* Recover from panics in Series calls

This commit adds panic recovery for Series calls in all Store servers.

Signed-off-by: Filip Petkovski <filip.petkovsky@gmail.com>

* Apply error suggestion

Signed-off-by: Filip Petkovski <filip.petkovsky@gmail.com>

---------

Signed-off-by: Filip Petkovski <filip.petkovsky@gmail.com>

* query: reuse our own gate (thanos-io#6079)

Do not call promgate directly but rather use our own wrapper that does
everything we want - duration histogram, current in-flight calls, total
calls.

Signed-off-by: Giedrius Statkevičius <giedrius.statkevicius@vinted.com>

* Store: Support disable cache index header file. (thanos-io#5773)

* Store: Support disable cache index header file.

Signed-off-by: wanjunlei <wanjunlei@kubesphere.io>

* Store: add a seprate flag to disable caching index header file

Signed-off-by: wanjunlei <wanjunlei@kubesphere.io>

* Tools: add cleanup API for bucket web

Signed-off-by: wanjunlei <wanjunlei@kubesphere.io>

* resolve conversation

Signed-off-by: wanjunlei <wanjunlei@kubesphere.io>

* resolve confilcts

Signed-off-by: wanjunlei <wanjunlei@kubesphere.io>

* change the flag to `--cache-index-header`

Signed-off-by: wanjunlei <wanjunlei@kubesphere.io>

* Wrap mem writer in file writer

Signed-off-by: wanjunlei <wanjunlei@kubesphere.io>

* update CHANGELOG

Signed-off-by: wanjunlei <wanjunlei@kubesphere.io>

* update CHANGELOG

Signed-off-by: wanjunlei <wanjunlei@kubesphere.io>

* fix bug

Signed-off-by: wanjunlei <wanjunlei@kubesphere.io>

---------

Signed-off-by: wanjunlei <wanjunlei@kubesphere.io>
Co-authored-by: wanjunlei <wanjunlei@yujnify.com>

* CVE: Fix Receiver malicious tenant (thanos-io#5969)

If running as root or with enough privileges, receiver can create a
directory outside of the configured TenantHeader.

This commit fixes it up by sanitizing the user input and explicity not
allowing such behavior.

Signed-off-by: Daniel Mellado <dmellado@redhat.com>

* Add adopter Grupo MasMovil (thanos-io#6084)

Signed-off-by: Pablo Moncada Isla <pablo.moncada@masmovil.com>

* fix typo (thanos-io#6087)

Signed-off-by: cyip <cyip@jackhenry.com>
Co-authored-by: cyip <cyip@jackhenry.com>

* optimize selector to string (thanos-io#6076)

Signed-off-by: Kama Huang <kamatogo13@gmail.com>

* Fix: Failure to close BlockSeriesClient cause store-gateway deadlock (thanos-io#6086)

* Fix: Failure to close BlockSeriesClient cause store-gateway deadlock

Signed-off-by: Alan Protasio <alanprot@gmail.com>

* Adding tests

Signed-off-by: Alan Protasio <alanprot@gmail.com>

* reverting the change on get series

Signed-off-by: Alan Protasio <alanprot@gmail.com>

* fix lint

Signed-off-by: Alan Protasio <alanprot@gmail.com>

---------

Signed-off-by: Alan Protasio <alanprot@gmail.com>

* Cut 0.30.2 (thanos-io#6081)

* tracing: fixed panic because of nil sampler (thanos-io#6066)

* fixed panic because of nil sampler

Signed-off-by: Vasiliy Rumyantsev <4119114+xBazilio@users.noreply.github.com>

* added CHANGELOG entry

Signed-off-by: Vasiliy Rumyantsev <4119114+xBazilio@users.noreply.github.com>

Signed-off-by: Vasiliy Rumyantsev <4119114+xBazilio@users.noreply.github.com>

* bump version to 0.30.2

Signed-off-by: Ben Ye <benye@amazon.com>

* Updates busybox SHA (thanos-io#6046)

Signed-off-by: GitHub <noreply@github.com>

Signed-off-by: GitHub <noreply@github.com>
Co-authored-by: yeya24 <yeya24@users.noreply.github.com>

* Use `e2edb.NewMinio` to disable SSE-S3 in e2e tests (thanos-io#6055)

* Use e2edb.NewMinio to disable SSE

Signed-off-by: Saswata Mukherjee <saswataminsta@yahoo.com>

* Use temp fork for TLS

Signed-off-by: Saswata Mukherjee <saswataminsta@yahoo.com>

* Fix broken rules api fanout test

Signed-off-by: Saswata Mukherjee <saswataminsta@yahoo.com>

* Fix broken query compatibility test

Signed-off-by: Saswata Mukherjee <saswataminsta@yahoo.com>

* Remove fork

Signed-off-by: Saswata Mukherjee <saswataminsta@yahoo.com>

Signed-off-by: Saswata Mukherjee <saswataminsta@yahoo.com>

---------

Signed-off-by: Vasiliy Rumyantsev <4119114+xBazilio@users.noreply.github.com>
Signed-off-by: Ben Ye <benye@amazon.com>
Signed-off-by: GitHub <noreply@github.com>
Signed-off-by: Saswata Mukherjee <saswataminsta@yahoo.com>
Co-authored-by: Vasiliy Rumyantsev <4119114+xBazilio@users.noreply.github.com>
Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: yeya24 <yeya24@users.noreply.github.com>
Co-authored-by: Saswata Mukherjee <saswataminsta@yahoo.com>

* cherry pick store gateway fix to release 0.30 (thanos-io#6089)

* Fix: Failure to close BlockSeriesClient cause store-gateway deadlock (thanos-io#6086)

* Fix: Failure to close BlockSeriesClient cause store-gateway deadlock

Signed-off-by: Alan Protasio <alanprot@gmail.com>

* Adding tests

Signed-off-by: Alan Protasio <alanprot@gmail.com>

* reverting the change on get series

Signed-off-by: Alan Protasio <alanprot@gmail.com>

* fix lint

Signed-off-by: Alan Protasio <alanprot@gmail.com>

---------

Signed-off-by: Alan Protasio <alanprot@gmail.com>

* update changelog

Signed-off-by: Ben Ye <benye@amazon.com>

---------

Signed-off-by: Alan Protasio <alanprot@gmail.com>
Signed-off-by: Ben Ye <benye@amazon.com>
Co-authored-by: Alan Protasio <alanprot@gmail.com>

* fix changelog entries

Signed-off-by: Ben Ye <benye@amazon.com>

* docs: improving the description for tsdb.retention on the receiver

Signed-off-by: Victor Fernandes <victorhbfernandes@gmail.com>

* Receiver: Use `intern` package when reallocating label strings (thanos-io#5926)

* Cleanup go mod

Signed-off-by: Matej Gera <matejgera@gmail.com>

* Use string interning for labels realloc method

Signed-off-by: Matej Gera <matejgera@gmail.com>

* Enhance label realloc benchmarks

Signed-off-by: Matej Gera <matejgera@gmail.com>

* Make interning optional; put behind hiddend flag

Signed-off-by: Matej Gera <matej.gera@coralogix.com>

* Update CHANGELOG

Signed-off-by: Matej Gera <matej.gera@coralogix.com>

* Address feedback

- Fix wrong condition
- Adjust benchmarks

Signed-off-by: Matej Gera <matej.gera@coralogix.com>

---------

Signed-off-by: Matej Gera <matejgera@gmail.com>
Signed-off-by: Matej Gera <matej.gera@coralogix.com>
Signed-off-by: Matej Gera <38492574+matej-g@users.noreply.github.com>

* Updaing README with drawing fixes and minor wording clarification (thanos-io#6078)

* New drawing and wording for Thanos other deployment models

Signed-off-by: Jonah Kowall <jkowall@kowall.net>

* New drawing and wording for Thanos other deployment models

Signed-off-by: Jonah Kowall <jkowall@kowall.net>

* Added comments to README.md and updated the quick-tutorial.md with the same diagram updates and text to match

Signed-off-by: Jonah Kowall <jkowall@kowall.net>

* Ran make docs

Signed-off-by: Jonah Kowall <jkowall@kowall.net>

---------

Signed-off-by: Jonah Kowall <jkowall@kowall.net>

* Compact: Remove spam of replica label removed log (thanos-io#6088)

* Remove spam of replica label removed log

Signed-off-by: Douglas Camata <159076+douglascamata@users.noreply.github.com>

* Reduce amount of removed replica label instead of removing it

Signed-off-by: Douglas Camata <159076+douglascamata@users.noreply.github.com>

* Reformat code

Signed-off-by: Douglas Camata <159076+douglascamata@users.noreply.github.com>

---------

Signed-off-by: Douglas Camata <159076+douglascamata@users.noreply.github.com>

* Store: Don't error when no stores are matched (thanos-io#6082)

It's normal and not an error if a query does not match due to no
downstream stores. This is common when querying with external labels and
tiered query servers.

This bug was introduced in thanos-io#5296

Fixes: thanos-io#5862

Signed-off-by: SuperQ <superq@gmail.com>

* docs: Fix must have Ruler alerts definition (thanos-io#6058)

* Fix must have Ruler alerts definition

ThanosRuler missing rule intervals metric used the wrong comparator sign, confusing users trying to create the rule.



Signed-off-by: Maxim Muzafarov <m.muzafarov@gmail.com>

* Update docs/components/rule.md

Co-authored-by: Saswata Mukherjee <saswataminsta@yahoo.com>
Signed-off-by: Maxim Muzafarov <m.muzafarov@gmail.com>

---------

Signed-off-by: Maxim Muzafarov <m.muzafarov@gmail.com>
Co-authored-by: Saswata Mukherjee <saswataminsta@yahoo.com>

* Fix conflicts

Signed-off-by: haanhvu <haanh6594@gmail.com>

* Specify overwriting behavior in flag and add validation

Signed-off-by: haanhvu <haanh6594@gmail.com>

* Add log and doc

Signed-off-by: haanhvu <haanh6594@gmail.com>

* Mixins(Rule): Fix query for long rule evaluations (thanos-io#6103)

* mixin(Rule): Fix query for long rule evaluations

Signed-off-by: Douglas Camata <159076+douglascamata@users.noreply.github.com>

* Update changelog

Signed-off-by: Douglas Camata <159076+douglascamata@users.noreply.github.com>

---------

Signed-off-by: Douglas Camata <159076+douglascamata@users.noreply.github.com>

---------

Signed-off-by: Filip Petkovski <filip.petkovsky@gmail.com>
Signed-off-by: haanhvu <haanh6594@gmail.com>
Signed-off-by: Ben Ye <benye@amazon.com>
Signed-off-by: Pradyumna Krishna <git@onpy.in>
Signed-off-by: Kartik-Garg <kartik.garg@infracloud.io>
Signed-off-by: Giedrius Statkevičius <giedrius.statkevicius@vinted.com>
Signed-off-by: wanjunlei <wanjunlei@kubesphere.io>
Signed-off-by: Daniel Mellado <dmellado@redhat.com>
Signed-off-by: Pablo Moncada Isla <pablo.moncada@masmovil.com>
Signed-off-by: cyip <cyip@jackhenry.com>
Signed-off-by: Kama Huang <kamatogo13@gmail.com>
Signed-off-by: Alan Protasio <alanprot@gmail.com>
Signed-off-by: Vasiliy Rumyantsev <4119114+xBazilio@users.noreply.github.com>
Signed-off-by: GitHub <noreply@github.com>
Signed-off-by: Saswata Mukherjee <saswataminsta@yahoo.com>
Signed-off-by: Victor Fernandes <victorhbfernandes@gmail.com>
Signed-off-by: Matej Gera <matejgera@gmail.com>
Signed-off-by: Matej Gera <matej.gera@coralogix.com>
Signed-off-by: Matej Gera <38492574+matej-g@users.noreply.github.com>
Signed-off-by: Jonah Kowall <jkowall@kowall.net>
Signed-off-by: Douglas Camata <159076+douglascamata@users.noreply.github.com>
Signed-off-by: SuperQ <superq@gmail.com>
Signed-off-by: Maxim Muzafarov <m.muzafarov@gmail.com>
Signed-off-by: Sebastian Rabenhorst <sebastian.rabenhorst@shopify.com>
Co-authored-by: Filip Petkovski <filip.petkovsky@gmail.com>
Co-authored-by: Ha Anh Vu <75315486+haanhvu@users.noreply.github.com>
Co-authored-by: Ben Ye <benye@amazon.com>
Co-authored-by: Pradyumna Krishna <git@onpy.in>
Co-authored-by: Kartik-Garg <kartik.garg@infracloud.io>
Co-authored-by: Giedrius Statkevičius <giedrius.statkevicius@vinted.com>
Co-authored-by: wanjunlei <53003665+wanjunlei@users.noreply.github.com>
Co-authored-by: wanjunlei <wanjunlei@yujnify.com>
Co-authored-by: Daniel Mellado <1313475+danielmellado@users.noreply.github.com>
Co-authored-by: Pablo Moncada <pmoncadaisla@gmail.com>
Co-authored-by: Chantel Yip <52993239+sshantel@users.noreply.github.com>
Co-authored-by: cyip <cyip@jackhenry.com>
Co-authored-by: Kama Huang <121007071+kama910@users.noreply.github.com>
Co-authored-by: Alan Protasio <alanprot@gmail.com>
Co-authored-by: Vasiliy Rumyantsev <4119114+xBazilio@users.noreply.github.com>
Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: yeya24 <yeya24@users.noreply.github.com>
Co-authored-by: Saswata Mukherjee <saswataminsta@yahoo.com>
Co-authored-by: Victor Fernandes <victorhbfernandes@gmail.com>
Co-authored-by: Matej Gera <38492574+matej-g@users.noreply.github.com>
Co-authored-by: Jonah Kowall <jkowall@kowall.net>
Co-authored-by: Douglas Camata <159076+douglascamata@users.noreply.github.com>
Co-authored-by: Ben Kochie <superq@gmail.com>
Co-authored-by: Maxim Muzafarov <m.muzafarov@gmail.com>
Co-authored-by: haanhvu <haanh6594@gmail.com>
ngraham20 pushed a commit to ngraham20/thanos that referenced this pull request Apr 17, 2023
* Recover from panics in Series calls

This commit adds panic recovery for Series calls in all Store servers.

Signed-off-by: Filip Petkovski <filip.petkovsky@gmail.com>

* Apply error suggestion

Signed-off-by: Filip Petkovski <filip.petkovsky@gmail.com>

---------

Signed-off-by: Filip Petkovski <filip.petkovsky@gmail.com>
ngraham20 pushed a commit to ngraham20/thanos that referenced this pull request Apr 17, 2023
* Recover from panics in Series calls

This commit adds panic recovery for Series calls in all Store servers.

Signed-off-by: Filip Petkovski <filip.petkovsky@gmail.com>

* Apply error suggestion

Signed-off-by: Filip Petkovski <filip.petkovsky@gmail.com>

---------

Signed-off-by: Filip Petkovski <filip.petkovsky@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants