-
Notifications
You must be signed in to change notification settings - Fork 448
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix failover process after disk crash #637
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@maksm90 Thanks for your PR! Some comments inline.
@@ -641,7 +640,7 @@ func (p *PostgresKeeper) GetPGState(pctx context.Context) (*cluster.PostgresStat | |||
|
|||
initialized, err := p.pgm.IsInitialized() | |||
if err != nil { | |||
return nil, err | |||
return pgState, err |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We could just do as the rest of the function, return the current pgstate (that will have healthy set to false) without returning an error. For the same reason we should change the signature of the function removing the error return since it will never return an error:
func (p *PostgresKeeper) GetPGState(pctx context.Context) *cluster.PostgresState
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK, done below. But I'm confused that the function GetPGState
is called in stage of node resync and returned error interrupts the resync process obviously. How to deal with this issue? For now, I doesn't interrupt the resync process anyway.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@maksm90 you're right. Currently GetPGState pourpose is confusing. We want to gather all the possible available information and return them also if failed with the exception of that case were we wan't to handle the error.
So, sorry, just ignore my previous request. The first version where you just ignored the error was better...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@sgotti I have reverted the last commit and leave the first version
@@ -558,7 +558,6 @@ func (p *PostgresKeeper) updatePGState(pctx context.Context) { | |||
pgState, err := p.GetPGState(pctx) | |||
if err != nil { | |||
log.Errorw("failed to get pg state", zap.Error(err)) | |||
return |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For the other comment just remote the error check since GetPGState will become:
func (p *PostgresKeeper) GetPGState(pctx context.Context) *cluster.PostgresState
@maksm90 Thanks. Can you please squash in a single commit so I can merge this PR? |
Done |
@maksm90 Thanks a lot! Merging. |
Thanks too! |
I have encountered the problem on test mini installation (cluster based on raspberry pi) that autofailover didn't performed after disk (SD card, in case of raspberry) manual insulation.
The problem lies in
PostgresKeeper.updatePGState
routine (cmd/keeper/cmd/keeper.go:561) that Postgres state isn't updated in DCS (consul or etcd) whenPostgresKeeper.GetPGState
routine returns some not null error. Such error is emitted after check whether PGDATA is initialized (Manager.IsInitialized
routine, cmd/keeper/cmd/keeper.go:641) what is induced after disk file system operation failure (my case).The current pull request fixes this issue.