Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Do not check SSR State when Snapshotter should be stopped #680

Merged
merged 3 commits into from
Mar 26, 2024

Conversation

avestuk
Copy link
Contributor

@avestuk avestuk commented Nov 14, 2023

What this PR does / why we need it:

As reported in #676 we sometimes see that multiple snapshotters can be running at one time in a cluster.

I believe that the reason for this is that the snapshotter is not set to active state until an initial delta snapshot or a fullsnapshot has been taken. If a leadership election takes place while the snapshot is being taken and the current leader loses that election, then handleSsrRequest does not send on the ssrStopCh. This in turn causes ssr.snapshotEventHandler to never return and snapshots would continue to be taken.

Secondly I'm propsing a change to close the ssrStopCh rather than sending on it. This is to make it clear to the reader that the channel is no longer being used at this point.

Lastly the function signature changes are simply adding further type annotations to the channels passed to functions to help readers understand where channels are being read from/sent to.

Which issue(s) this PR fixes:
Fixes #676

Special notes for your reviewer:

The reason why the ssrState was being checked is not apparent to me so I'd be delighted to be told why this might be an awful idea.

Release note:

Do not rely on the snapshotter state when stopping the snapshotter. The snapshotter will now always be closed when a member goes from being the leader to any other state. 

@avestuk avestuk requested a review from a team as a code owner November 14, 2023 16:05
@gardener-robot
Copy link

@avestuk Thank you for your contribution.

@CLAassistant
Copy link

CLAassistant commented Nov 14, 2023

CLA assistant check
All committers have signed the CLA.

@gardener-robot gardener-robot added needs/review Needs review size/xs Size of pull request is tiny (see gardener-robot robot/bots/size.py) labels Nov 14, 2023
@gardener-robot-ci-1
Copy link
Contributor

Thank you @avestuk for your contribution. Before I can start building your PR, a member of the organization must set the required label(s) {'reviewed/ok-to-test'}. Once started, you can check the build status in the PR checks section below.

@aaronfern aaronfern added the reviewed/ok-to-test Has approval for testing (check PR in detail before setting this label because PR is run on CI/CD) label Nov 17, 2023
@gardener-robot-ci-3 gardener-robot-ci-3 added needs/ok-to-test Needs approval for testing (check PR in detail before setting this label because PR is run on CI/CD) and removed reviewed/ok-to-test Has approval for testing (check PR in detail before setting this label because PR is run on CI/CD) labels Nov 17, 2023
@avestuk
Copy link
Contributor Author

avestuk commented Jan 2, 2024

@ishan16696 Anything I can do to help move this along?

@ishan16696
Copy link
Member

Hi @avestuk ,
Sorry for delay. I recognise the importance of the issue, I will try to increase the priority of the issue mentioned by you and analyse it as this PR involves modifications to our legacy code, it warrants a thorough analysis.

@ishan16696
Copy link
Member

Thanks @avestuk for PR,
Overall looks good. I'm just testing out some edge cases.
Kindly requesting you to rebase this PR on latest master.

Copy link
Member

@ishan16696 ishan16696 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM !!

@ishan16696
Copy link
Member

ishan16696 commented Mar 23, 2024

As requested in this comment: #680 (comment), please rebase this PR on latest master.

Alex Vest added 3 commits March 25, 2024 08:23
Allows type checking of channel interactions
I think that closing the channel makes it clearer that this channels purpose has been fulfilled and it no longer needs to be considered. All the places that use the channel are recieving all values from the channel so they will continue to function in the same way
The snapshotter is not set to active state until an initial delta
snapshot or a fullsnapshot has been taken. If a leadership election
takes place while the snapshot is being taken and the current leader
loses that election, then handleSsrRequest does not send on the
ssrStopCh. This in turn causes ssr.snapshotEventHandler to never return
and snapshots would continue to be taken.
@avestuk avestuk force-pushed the specify-channel-types branch from e59dd3f to c90aa01 Compare March 25, 2024 08:24
@avestuk
Copy link
Contributor Author

avestuk commented Mar 25, 2024

As requested in this comment: #680 (comment), please rebase this PR on latest master.

Thanks very much @ishan16696 I've rebased on latest master.

@ishan16696 ishan16696 added reviewed/ok-to-test Has approval for testing (check PR in detail before setting this label because PR is run on CI/CD) merge/squash Should be merged via 'Squash and merge' labels Mar 25, 2024
@gardener-robot-ci-3 gardener-robot-ci-3 removed the reviewed/ok-to-test Has approval for testing (check PR in detail before setting this label because PR is run on CI/CD) label Mar 25, 2024
@renormalize
Copy link
Member

/hold until #716 is merged.

@gardener-robot gardener-robot added the reviewed/do-not-merge Has no approval for merging as it may break things, be of poor quality or have (ext.) dependencies label Mar 26, 2024
Copy link
Member

@ishan16696 ishan16696 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!!, thanks for rebasing the PR

@ishan16696 ishan16696 removed the reviewed/do-not-merge Has no approval for merging as it may break things, be of poor quality or have (ext.) dependencies label Mar 26, 2024
@ishan16696
Copy link
Member

/hold until #716 is merged.

PR #716 will take some time as it also need to resolve merge conflicts, let's not block this PR. Talked to @shreyas-s-rao and he is fine with it.

@ishan16696 ishan16696 merged commit 0247d8d into gardener:master Mar 26, 2024
9 checks passed
@gardener-robot gardener-robot added the status/closed Issue is closed (either delivered or triaged) label Mar 26, 2024
@shreyas-s-rao shreyas-s-rao added this to the v0.29.0 milestone Mar 26, 2024
ishan16696 pushed a commit to ishan16696/etcd-backup-restore that referenced this pull request Jul 15, 2024
* Specify send/recv

Allows type checking of channel interactions

* Close instead of sending on channel

I think that closing the channel makes it clearer that this channels purpose has been fulfilled and it no longer needs to be considered. All the places that use the channel are recieving all values from the channel so they will continue to function in the same way

* Do not check ssr state in handleSsrStopRequest

The snapshotter is not set to active state until an initial delta
snapshot or a fullsnapshot has been taken. If a leadership election
takes place while the snapshot is being taken and the current leader
loses that election, then handleSsrRequest does not send on the
ssrStopCh. This in turn causes ssr.snapshotEventHandler to never return
and snapshots would continue to be taken.
ishan16696 added a commit that referenced this pull request Jul 15, 2024
* Specify send/recv

Allows type checking of channel interactions

* Close instead of sending on channel

I think that closing the channel makes it clearer that this channels purpose has been fulfilled and it no longer needs to be considered. All the places that use the channel are recieving all values from the channel so they will continue to function in the same way

* Do not check ssr state in handleSsrStopRequest

The snapshotter is not set to active state until an initial delta
snapshot or a fullsnapshot has been taken. If a leadership election
takes place while the snapshot is being taken and the current leader
loses that election, then handleSsrRequest does not send on the
ssrStopCh. This in turn causes ssr.snapshotEventHandler to never return
and snapshots would continue to be taken.

Co-authored-by: Alex Vest <avestuk@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
merge/squash Should be merged via 'Squash and merge' needs/ok-to-test Needs approval for testing (check PR in detail before setting this label because PR is run on CI/CD) needs/review Needs review size/xs Size of pull request is tiny (see gardener-robot robot/bots/size.py) status/closed Issue is closed (either delivered or triaged)
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[BUG] Two snapshotters running concurrently
9 participants