Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Make bootstrap operations deterministic #403

Merged
merged 13 commits into from
Sep 16, 2022

Conversation

adutra
Copy link
Contributor

@adutra adutra commented Sep 7, 2022

What this PR does:
Makes bootstrap order 100% balanced and deterministic.

Which issue(s) this PR fixes:
Fixes #381 .

Checklist

  • Changes manually tested
  • Automated Tests added/updated
  • Documentation added/updated
  • CHANGELOG.md updated (not required for documentation PRs)
  • CLA Signed: DataStax CLA

@burmanm
Copy link
Contributor

burmanm commented Sep 12, 2022

If you wish to write a test for this (using fakeclient), then it's possible with some helpers already in cass-operator. For example, https://github.com/k8ssandra/cass-operator/blob/master/pkg/reconciliation/reconcile_racks_test.go#L1625 calls startCassandra, but the more interesting for you is of course https://github.com/k8ssandra/cass-operator/blob/master/pkg/reconciliation/reconcile_racks_test.go#L1406

It shows how to mock the pod endpoint that the fakeclients then call, without doing too much coding (it's not intuitive at all, but copy-pasting parts of those existing tests usually ends up making one understand how to achieve something). And you can call a single test from the editor which makes iterating quite quick.

@adutra
Copy link
Contributor Author

adutra commented Sep 12, 2022

@burmanm thank you. I managed to create a unit test that is not too lame.

Would you mind reviewing this? It should be a quick review.

@adutra adutra marked this pull request as ready for review September 12, 2022 18:28
@adutra adutra requested a review from a team as a code owner September 12, 2022 18:29
@@ -617,11 +618,13 @@ func (rc *ReconciliationContext) CheckPodsReady(endpointData httphelper.CassMeta
return result.RequeueSoon(5)
}

needsMoreNodes, err := rc.startAllNodes(endpointData)
// step 4 - make sure all nodes are ready
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Now the step 3 comment is wrong as it used to be this step.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK will fix comment for step 3


labelSeedBeforeStart := readySeeds == 0 && len(rc.Datacenter.Spec.AdditionalSeeds) == 0 && externalSeedPoints == 0
for podRankWithinRack := 0; ; podRankWithinRack++ {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Out of interest, would you explain the benefit vs, lets say following algorithm (the problem here looks very much like a dynamic programming)

smallestRack := 0
for i := 0; i < len(rc.statefulSets); i++ {
    if rc.statefulSets[i].Status.ReadyReplicas < rc.statefulSets[smallestRack].Status.ReadyReplicas {
        smallestRack = i
    }
}

fmt.Printf("Next rack to start is: %d\n", smallestRack)

Also, is .Replicas really the number you want to compare to vs. ReadyReplicas?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

First, my understanding is that we want here to use the desired number of pods for each sts, thus we should use .Replicas.

Second, switching to .ReadyReplicas could indeed simplify the code, but can we guarantee that if a pod is counted as a ready replica, it is not only ready, but also started and has joined the ring? Because if that's not the case, we should NOT keep bootstrapping the next pods.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How would Replicas help with that behavior? Isn't .Spec.Replicas updated when we add the pods? They don't have to be ready at all. Our startup process should take care of preventing next startup (there's one pod that's Started-but-not-Ready) before starting next ones. Though for that reason, perhaps it doesn't matter which one we use, since when we get to this point, we should have no starting pods anymore.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Other than of course for the algorithm I mentioned, there the ReadyReplicas does matter since Replicas has wrong calculation.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So, what should I do here? Should I change as you suggested above?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, not necessarily. I only want to ensure that whoever reads it next can easily see what it's doing (and so avoid anything that looks pretty and complex). Would you think from that perspective the code is self-explanatory enough (the comment on the method only describes what we want to achieve, not how we achieve it) ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To me, it is self-explanatory as it is now, however that's a subjective notion. You are never going to get everybody on the same page on such matters.

@@ -1676,15 +1675,21 @@ func TestFailedStart(t *testing.T) {

func TestReconciliationContext_startAllNodes(t *testing.T) {

// A boolean representing the state of a pod (started or not).
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Very helpful. Thank you!

@burmanm burmanm merged commit c767dda into k8ssandra:master Sep 16, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

K8SSAND-1697 ⁃ Make Bootstrap Operations Deterministic
3 participants