
Fix reconciliation on Endpoints's subset changes #1674

Merged
merged 1 commit on Jul 31, 2024

Conversation

@abaguas abaguas (Collaborator) commented Jul 30, 2024

The idea was to keep triggering reconciliation cycles when resources change by checking whether there was a `generation` update. https://github.com/kubernetes/community/blob/master/contributors/devel/sig-architecture/api-conventions.md#metadata

> generation: a sequence number representing a specific generation of the desired state. Set by the system and monotonically increasing, per-resource. May be compared, such as for RAW and WAW consistency.

However, the change caused a degradation when watching Endpoints resources. These resources never get a `generation` increment because they have no spec (there is no desired state for the API server to version), so a reconciliation was only triggered on creation and deletion, not on subset updates. This made the controller too slow to react to Endpoints changes, leading to slow failovers and to failing e2e tests whenever the periodic GSLB reconcile did not happen to fall inside the expected window.

The solution proposed in this PR is to always trigger the `endpointMapHandler` function whenever an `Endpoints` resource changes. This is the same behavior as before #1652.
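
To make the difference concrete, here is a minimal sketch using controller-runtime's `predicate` package. It is not the actual k8gb wiring: `GenerationChangedPredicate` only stands in for the generation-based filtering described above (the code introduced in #1652 may differ), and the `endpointsSubsetChanged` name is made up for this example.

```go
package predicates

import (
	"sigs.k8s.io/controller-runtime/pkg/event"
	"sigs.k8s.io/controller-runtime/pkg/predicate"
)

// Generation-based filtering: update events are dropped unless
// metadata.generation changed. Endpoints never get a generation bump,
// so their subset updates are silently filtered out.
var generationOnly = predicate.GenerationChangedPredicate{}

// Permissive filtering for Endpoints: every create, update and delete event
// reaches the mapping handler (endpointMapHandler in this PR), so a subset
// change immediately enqueues the owning GSLB for reconciliation.
var endpointsSubsetChanged = predicate.Funcs{
	CreateFunc: func(event.CreateEvent) bool { return true },
	UpdateFunc: func(event.UpdateEvent) bool { return true },
	DeleteFunc: func(event.DeleteEvent) bool { return true },
}
```

Wiring a predicate like `endpointsSubsetChanged` into the Endpoints watch (for example via `builder.WithPredicates`; the exact `Watches` signature depends on the controller-runtime version) restores the pre-#1652 behavior where any Endpoints change triggers a reconcile.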


In addition, the BIND image `internetsystemsconsortium/bind9:9.16` pulled by the edgedns cluster was no longer available, so it is updated to version 9.18:

➜  ~ kubectl get pods --context k3d-edgedns
NAME                    READY   STATUS             RESTARTS   AGE
edge-5df69889cf-8vths   0/1     ImagePullBackOff   0          4m37s

@abaguas abaguas changed the title from "Fix reconciliation on Endpoints updates" to "Fix reconciliation on Endpoints's subset changes" on Jul 30, 2024
@abaguas abaguas marked this pull request as draft July 30, 2024 15:23
@abaguas abaguas force-pushed the fix/degradation branch 6 times, most recently from d0bc6f7 to f3c225e on July 31, 2024 09:30
The idea was to keep triggering reconciliation cycles when resources change by checking whether there was a `generation` update.
https://github.com/kubernetes/community/blob/master/contributors/devel/sig-architecture/api-conventions.md#metadata
> generation: a sequence number representing a specific generation of the desired state. Set by the system and monotonically increasing, per-resource. May be compared, such as for RAW and WAW consistency.

However, the change caused a degradation for `Endpoints` resources. These resources never get a `generation` increment because they have no spec (there is no desired state for the API server to version), so a reconciliation was only triggered on creation and deletion, not on subset updates.
This also degraded the behavior of the k8gb controller: it became too slow at monitoring application health, since `Endpoints` updates no longer triggered a reconcile of the GSLB resource. This led to slow failovers and failing e2e tests.

The solution proposed in this PR is to always trigger the `endpointMapHandler` function whenever an `Endpoints` resource changes. This is the same behavior as before k8gb-io#1652.

Signed-off-by: Andre Baptista Aguas <andre.aguas@protonmail.com>
@@ -169,7 +169,6 @@ func (r *GslbReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.
return result.RequeueError(fmt.Errorf("getting GSLB servers (%s)", err))
}
gslb.Status.Servers = servers
- fmt.Printf("got servers: %v\n", servers)
Collaborator Author

Debug statement that was forgotten from a previous commit

@@ -61,6 +61,7 @@ func (a *logrSinkAdapter) Info(_ int, msg string, keysAndValues ...interface{})
a.WithValues(keysAndValues)
if a.name != "" {
a.z.Info().Msgf("%s: %s %s", a.name, msg, a.valuesAsJSON())
+ return
Collaborator Author

The lack of this return was leading to duplicated log lines
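
To make the fall-through concrete, here is a minimal, runnable sketch. The struct fields and the name-less fallback log call below are assumptions for illustration that mirror the method signature in the diff above, not the actual k8gb adapter.

```go
package main

import (
	"os"

	"github.com/rs/zerolog"
)

// Stripped-down stand-in for the logr sink adapter touched by this PR.
type logrSinkAdapter struct {
	z    zerolog.Logger
	name string
}

func (a *logrSinkAdapter) valuesAsJSON() string { return "{}" } // placeholder

func (a *logrSinkAdapter) Info(_ int, msg string, _ ...interface{}) {
	if a.name != "" {
		a.z.Info().Msgf("%s: %s %s", a.name, msg, a.valuesAsJSON())
		return // without this return the named message falls through and is logged a second time
	}
	a.z.Info().Msgf("%s %s", msg, a.valuesAsJSON()) // assumed name-less fallback
}

func main() {
	a := &logrSinkAdapter{z: zerolog.New(os.Stdout), name: "gslb"}
	a.Info(0, "reconciled") // emitted exactly once thanks to the early return
}
```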

@@ -27,8 +27,6 @@ import (
)

func TestWeightsExistsInLocalDNSEndpoint(t *testing.T) {
- t.Parallel()
Collaborator Author

There were some random failures in this test, so I had to remove parallelism to make it stable. One cluster was taking too long to discover the DNS entries of the other cluster, but I am not sure why.
Before:

   --- FAIL: TestWeightsExistsInLocalDNSEndpoint (337.85s)
      --- FAIL: TestWeightsExistsInLocalDNSEndpoint/with_gslb_only (245.38s)
      --- PASS: TestWeightsExistsInLocalDNSEndpoint/with_ref (92.48s)

After:

   --- PASS: TestWeightsExistsInLocalDNSEndpoint (195.55s)
      --- PASS: TestWeightsExistsInLocalDNSEndpoint/with_gslb_only (128.23s)
      --- PASS: TestWeightsExistsInLocalDNSEndpoint/with_ref (67.32s)

@abaguas abaguas marked this pull request as ready for review July 31, 2024 10:01
@kuritka kuritka (Collaborator) left a comment

Hi, thanks, I'm glad you found the issue. I'm a bit worried that the tests may crash again at a different point in the future, with parallelism as the cause. Anyway, thanks, and I'm happy to see the correction.

@abaguas abaguas (Collaborator, Author) commented Jul 31, 2024

> Hi, thanks, I'm glad you found the issue. I'm a bit worried that the tests may crash again at a different point in the future, with parallelism as the cause. Anyway, thanks, and I'm happy to see the correction.

Hopefully, when we revamp the tests, we can migrate them one by one to the new framework and get to the bottom of these parallelism issues.

@ytsarev ytsarev (Member) left a comment

Great catch
