
Fix failover process after disk crash #637

Merged (1 commit into sorintlab:master on May 27, 2019)

Conversation

@maksm90 (Contributor) commented Apr 29, 2019

I have encountered a problem on a test mini-installation (a cluster based on Raspberry Pi): autofailover was not performed after manually isolating the disk (an SD card, in the Raspberry case).

The problem lies in the PostgresKeeper.updatePGState routine (cmd/keeper/cmd/keeper.go:561): the Postgres state is not updated in the DCS (Consul or etcd) when PostgresKeeper.GetPGState returns a non-nil error. Such an error is emitted by the check of whether PGDATA is initialized (the Manager.IsInitialized routine, cmd/keeper/cmd/keeper.go:641), and that check fails after a disk file system operation failure (my case).

The current pull request fixes this issue.
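
For readers without the source open, here is a minimal, self-contained sketch of the failure path described above. Everything in it (State, isInitialized, getPGState, updatePGState, publish) is a stand-in written for this illustration, not the real stolon code; the real routines live in cmd/keeper/cmd/keeper.go as cited above.

```go
package main

import (
	"errors"
	"fmt"
)

// State stands in for cluster.PostgresState; Healthy is the only field needed here.
type State struct{ Healthy bool }

// isInitialized stands in for Manager.IsInitialized: with a dead disk every
// filesystem access under PGDATA fails, so the check returns an error.
func isInitialized() (bool, error) {
	return false, errors.New("stat PGDATA: input/output error")
}

// getPGState mimics the pre-fix PostgresKeeper.GetPGState: the error from the
// initialization check is propagated instead of being turned into an unhealthy state.
func getPGState() (*State, error) {
	if _, err := isInitialized(); err != nil {
		return nil, err
	}
	return &State{Healthy: true}, nil
}

// updatePGState mimics the pre-fix caller: it bails out on error, so nothing is
// published and the store keeps the last (healthy) state it has seen.
func updatePGState(publish func(*State)) {
	st, err := getPGState()
	if err != nil {
		fmt.Println("failed to get pg state:", err)
		return // stale state stays in the DCS, so no failover is triggered
	}
	publish(st)
}

func main() {
	updatePGState(func(st *State) {
		fmt.Printf("published state: healthy=%v\n", st.Healthy)
	})
	// Only the error line is printed; an unhealthy state is never published.
}
```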

@sgotti (Member) left a comment

@maksm90 Thanks for your PR! Some comments inline.

@@ -641,7 +640,7 @@ func (p *PostgresKeeper) GetPGState(pctx context.Context) (*cluster.PostgresStat
 	initialized, err := p.pgm.IsInitialized()
 	if err != nil {
-		return nil, err
+		return pgState, err
sgotti (Member):

We could just do what the rest of the function does: return the current pgState (which will have healthy set to false) without returning an error. For the same reason we should change the function signature and remove the error return value, since it will never return an error:

func (p *PostgresKeeper) GetPGState(pctx context.Context) *cluster.PostgresState
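
A minimal, self-contained sketch of what that suggested shape could look like. State and isInitialized below are stand-ins for cluster.PostgresState and p.pgm.IsInitialized, condensed so the example compiles on its own; it is not the actual stolon implementation.

```go
package main

import (
	"errors"
	"log"
)

// Stand-in for cluster.PostgresState, reduced to the fields this sketch needs.
type State struct {
	Initialized bool
	Healthy     bool
}

// Stand-in for p.pgm.IsInitialized, simulating a failed disk.
func isInitialized() (bool, error) {
	return false, errors.New("open PGDATA: input/output error")
}

// getPGState follows the suggested shape: no error return at all. Any internal
// failure is logged and reflected in the returned state (Healthy stays false),
// so the caller always has something to publish to the DCS.
func getPGState() *State {
	st := &State{}
	initialized, err := isInitialized()
	if err != nil {
		log.Printf("cannot check the data directory: %v", err)
		return st // Healthy=false is exactly what should trigger a failover
	}
	st.Initialized = initialized
	if initialized {
		// ... connect to the instance and fill in the remaining fields ...
		st.Healthy = true
	}
	return st
}

func main() {
	st := getPGState()
	log.Printf("state to publish: initialized=%v healthy=%v", st.Initialized, st.Healthy)
}
```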

maksm90 (Contributor, Author):

OK, done below. But I'm confused: GetPGState is also called during node resync, and there the returned error obviously interrupts the resync process. How should this be dealt with? For now, I don't interrupt the resync process anyway.

sgotti (Member):

@maksm90 you're right. Currently the purpose of GetPGState is confusing. We want to gather all the available information and return it even on failure, except for the cases where we want to handle the error.

So, sorry, just ignore my previous request. The first version, where you just ignored the error, was better...

maksm90 (Contributor, Author):

@sgotti I have reverted the last commit and kept the first version.

@@ -558,7 +558,6 @@ func (p *PostgresKeeper) updatePGState(pctx context.Context) {
 	pgState, err := p.GetPGState(pctx)
 	if err != nil {
 		log.Errorw("failed to get pg state", zap.Error(err))
-		return
sgotti (Member):

As in the other comment, just remove the error check, since GetPGState will become:

func (p *PostgresKeeper) GetPGState(pctx context.Context) *cluster.PostgresState
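
Continuing the stand-in sketch from the previous example (same hypothetical State and getPGState), the caller would then shrink to the few lines below. Note that, per the exchange earlier in this thread, this signature change was ultimately withdrawn; the merged fix keeps the error return and simply stops bailing out on it.

```go
// updatePGState under the (later withdrawn) error-free signature: with no
// error left to check, whatever state was gathered, healthy or not, is
// always published to the store.
func updatePGState(publish func(*State)) {
	publish(getPGState())
}
```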

@sgotti (Member) commented May 23, 2019

@maksm90 Thanks. Can you please squash into a single commit so I can merge this PR?

@maksm90 (Contributor, Author) commented May 27, 2019

Can you please squash into a single commit so I can merge this PR?

Done

@sgotti (Member) commented May 27, 2019

@maksm90 Thanks a lot! Merging.

@sgotti sgotti merged commit 984dc87 into sorintlab:master May 27, 2019
@maksm90 (Contributor, Author) commented May 27, 2019

@maksm90 Thanks a lot! Merging.

Thanks too!

@sgotti sgotti added this to the v0.14.0 milestone Jun 6, 2019
@maksm90 maksm90 deleted the fix_disk_crash_failover branch January 25, 2020 10:35