
fixing bug in chunk cache that can cause panic during shutdown #4398

Merged: 3 commits merged into cortexproject:master from rsteneteg/fix-ingester-panic on Aug 26, 2021

Conversation

rsteneteg (Contributor) opened this pull request:

Signed-off-by: Roger Steneteg <rsteneteg@ea.com>

What this PR does:
Fixes a bug in the chunk storage cache that can cause a panic in ingesters during shutdown

Which issue(s) this PR fixes:
Fixes #4397
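For context, the panic in question is Go's "send on closed channel" panic (per the commit message later in this PR); a minimal standalone illustration of that failure mode, not Cortex code:

```go
package main

func main() {
	ch := make(chan int)
	close(ch) // once a channel is closed...
	ch <- 1   // ...any send on it panics: "send on closed channel"
}
```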

Checklist

  • Tests updated
  • Documentation added
  • CHANGELOG.md updated - the order of entries should be [CHANGE], [FEATURE], [ENHANCEMENT], [BUGFIX]

Signed-off-by: Roger Steneteg <rsteneteg@ea.com>
@rsteneteg force-pushed the rsteneteg/fix-ingester-panic branch from f546446 to 2e6b9a5 on August 3, 2021 21:36
@bboreham (Contributor) left a comment:

Good catch! Are there any similar issues in other places where we do bounded parallelism?

One question before I approve.

pkg/chunk/cache/memcached.go (thread resolved)
@bboreham (Contributor) left a comment:

lgtm

The pull-request-size bot added the size/L label and removed the size/M label on Aug 4, 2021
@rsteneteg force-pushed the rsteneteg/fix-ingester-panic branch 2 times, most recently from 1a722ab to 46a1213, on August 4, 2021 18:56
@bboreham (Contributor) left a comment:

Couple of thoughts

select {
case <-c.quit:
	return
default:
A reviewer (Contributor) commented on the diff:

Can this happen: quit is not closed, so we go to default and block on send.
Then quit is closed and all workers exit.
Now we are hung.
Should the chan send be brought up to the select, so either can proceed each time round the loop?
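Roughly, the scenario described above looks like this sketch (simplified; function and variable names are assumed, not the actual Cortex code):

```go
// The quit check and the send are separate steps: a goroutine can take the
// default branch while quit is still open, then block on the send. If quit
// is closed afterwards and the workers exit, nothing ever reads from
// inputCh and this sender hangs.
func enqueue(quit <-chan struct{}, inputCh chan<- []string, batch []string) {
	select {
	case <-quit:
		return
	default:
	}
	inputCh <- batch // can block forever once the workers are gone
}
```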

rsteneteg (Contributor, Author) replied:

I was a bit unsure whether we could use both send and receive cases in the same select, but it seems to be OK since the channels are read/closed from other goroutines, so I moved the send up into a case.
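The reworked shape (again a sketch with assumed names, mirroring the change described above) puts the send and the quit receive in the same select, so whichever case becomes ready first proceeds:

```go
func enqueue(quit <-chan struct{}, inputCh chan<- []string, batch []string) {
	select {
	case <-quit:
		return // shutting down: drop the work instead of blocking
	case inputCh <- batch: // succeeds only while a worker is still reading
	}
}
```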

pkg/chunk/chunk_store_utils.go (thread resolved)
@rsteneteg force-pushed the rsteneteg/fix-ingester-panic branch from 46a1213 to 8b96709 on August 6, 2021 13:18
@rsteneteg marked this pull request as draft on August 6, 2021 14:09
adding separate quit channel for stopping chunkfetcher to avoid send on closed channel during stop

Signed-off-by: Roger Steneteg <rsteneteg@ea.com>
@rsteneteg force-pushed the rsteneteg/fix-ingester-panic branch from 8b96709 to 5a86c1a on August 6, 2021 17:54
@rsteneteg marked this pull request as ready for review on August 6, 2021 18:28
@rsteneteg requested a review from bboreham on August 6, 2021 18:28
@bboreham (Contributor) left a comment:

Thanks!

@pracucci (Contributor) left a comment:

LGTM, thanks!

@@ -239,11 +257,15 @@ func (c *Memcached) Store(ctx context.Context, keys []string, bufs [][]byte) {

 // Stop does nothing.
 func (c *Memcached) Stop() {
-	if c.inputCh == nil {
+	if c.quit == nil {
A reviewer (Contributor) commented on the diff:

[nit] c.quit is set in the new function and I can't see where we ever set it to nil so I'm not sure we need this check.

The reviewer (Contributor) added:

Not a blocker, we can merge anyway.

rsteneteg (Contributor, Author) replied:

If BatchSize or Parallelism is zero then we return the cache before we set the quit/inputCh channels:

if cfg.BatchSize == 0 || cfg.Parallelism == 0 { return c }
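In other words, when batching is disabled the constructor returns before the channels are ever created, so Stop has to tolerate nil channels. A rough sketch of that shape (simplified types and names, not the actual Cortex source):

```go
// Simplified sketch; field and type names are assumed for illustration.
type memcachedSketch struct {
	inputCh chan []string
	quit    chan struct{}
}

func newMemcachedSketch(batchSize, parallelism int) *memcachedSketch {
	c := &memcachedSketch{}
	if batchSize == 0 || parallelism == 0 {
		return c // quit and inputCh remain nil
	}
	c.inputCh = make(chan []string, parallelism)
	c.quit = make(chan struct{})
	// ...start `parallelism` background worker goroutines here...
	return c
}

func (c *memcachedSketch) Stop() {
	if c.quit == nil { // the check discussed above: batching was never enabled
		return
	}
	close(c.quit)
	// (the actual Stop also has to coordinate with the workers; omitted here)
}
```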

The reviewer (Contributor) replied:

Oh thanks! Didn't notice it.

@pracucci pracucci merged commit 2d4d060 into cortexproject:master Aug 26, 2021
@rsteneteg rsteneteg deleted the rsteneteg/fix-ingester-panic branch August 26, 2021 16:15
@bboreham (Contributor) commented on Oct 1, 2021:

I encountered a race warning at #4508, which seems to be a similar case; could you take a look at #4511 please @rsteneteg ?

alvinlin123 pushed a commit to ac1214/cortex that referenced this pull request on Jan 14, 2022:
fixing bug in chunk cache that can cause panic during shutdown (cortexproject#4398)

* fixing bug in chunk cache that can cause panic during shutdown

Signed-off-by: Roger Steneteg <rsteneteg@ea.com>

* adding separate quit channel for stopping chunkfetcher to avoid send on closed channel during stop

Signed-off-by: Roger Steneteg <rsteneteg@ea.com>
Signed-off-by: Alvin Lin <alvinlin@amazon.com>
Successfully merging this pull request may close these issues.

Ingesters panic during shutdown causing data loss
3 participants