Fail fast an ingester if unable to load existing TSDBs #3354
LGTM, but please take a look at the comment.
pkg/ingester/ingester_v2.go
if err != nil {
	level.Warn(util.Logger).Log("msg", "skipped filesystem entry when looking for existing TSDB to open", "path", path, "err", err)
	return filepath.SkipDir
If there is any error traversing the tree, shouldn't we return that error? (Especially since we're halting the ingester if we fail to open a TSDB.)
Yes, better to error out. Done.
@@ -1114,14 +1131,23 @@ func (i *Ingester) openExistingTSDB(ctx context.Context) error {
	return filepath.SkipDir // Don't descend into directories
})

if walkErr != nil {
This cannot happen currently, since walkFn will filter out any error. (see other comment)
Given I addressed the other comment, why can't it happen? Walk() stops and returns the error as soon as we return an error. Errors returned by Walk() itself are not filtered again via Walk().
Now it can.
Even before it could, in case os.Open(path) or f.Readdirnames(1) failed.
You are right.
Signed-off-by: Marco Pracucci <marco@pracucci.com>
8e70495 to 1d67d44
…rgid-ctx

* 'master' of github.com:cortexproject/cortex:
  Enforce integration tests default flags config to never be overwritten (cortexproject#3370)
  Avoid deletion of blocks which are not shipped (cortexproject#3346)
  Upgrade Thanos to latest master (cortexproject#3363)
  Migrate CircleCI workflows to GitHub Actions (2/3) (cortexproject#3341)
  Remove comments that doesn't seem right (cortexproject#3361)
  add ingester interface (cortexproject#3352)
  Fail fast an ingester if unable to load existing TSDBs (cortexproject#3354)
  Fixed Gossip memberlist members joining when addresses are configured using DNS-based service discovery (cortexproject#3360)
  Export distributor method to get ingester replication set (cortexproject#3356)
  Correct link for Block Storage reference (cortexproject#3234)
  Added section on Cleaner. (cortexproject#3327)
  Update prometheus vendor to master (cortexproject#3345)
  adding GHA CI env variable check (cortexproject#3351)
  Add ingesters shuffle sharding support on the read path (cortexproject#3252)
What this PR does:
Tonight we had an issue in one ingester whose TSDB head chunks were corrupted (root cause will be discussed separately). When a similar issue happens, the ingester skips the corrupted TSDB at startup, joins the ring in the ACTIVE state and, as soon as it receives a write request from the tenant with the corrupted TSDB, tries to reopen the TSDB for every single write request. This leads to an undesirable situation, which soon gets the ingester killed (due to OOM).

In this PR I'm proposing to fail fast the ingester if it is unable to load an existing TSDB. It's a loud and clear signal to the Cortex cluster operator and, in my opinion, it's better to fail fast an ingester before it starts receiving write requests than to have it fail a few minutes after starting.
Which issue(s) this PR fixes:
N/A
Checklist
CHANGELOG.md updated - the order of entries should be [CHANGE], [FEATURE], [ENHANCEMENT], [BUGFIX]