
Modify tags related functions and some test files to fix flaky assertion errors #51

Merged: 1 commit merged into aws-controllers-k8s:main from fix_flaky_tests on Feb 6, 2023

Conversation

kyriechen96
Contributor

@kyriechen96 kyriechen96 commented Jan 23, 2023

Issue #, if available:
The MemoryDB Controller has some flaky tests that randomly fail with assertion errors.
Root causes:

  1. Snapshot test files didn't clean up resources created by previous failed tests, so the test account eventually reached its node limit.
  2. The delta comparison is order-sensitive, so it treats two tag arrays as different even when they contain the same elements in a different order. The tag-update function didn't handle this corner case, which caused an infinite update loop. (An order-insensitive comparison is sketched below this list.)
  3. Some resources take a while to finish creation. The tag-fetching function returned an error because it couldn't find the resource yet, and this error masked the earlier ResourceNotFound error. The reconciler therefore didn't requeue, because it didn't recognize the new error.
  4. The cluster update validation test file has a step that updates and validates the Snapshot Window (the daily time range for automatic snapshots). If this step executes within the time range of the new Snapshot Window, the update succeeds (ResourceSynced returns true) because the two resources match, but the cluster starts snapshotting once its Snapshot Window is updated to the current time. The next step, which updates other cluster fields, then can't apply its updates in time, because the cluster's status is snapshotting and a snapshotting cluster cannot be updated.
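To illustrate root cause 2, here is a minimal sketch of an order-insensitive tag comparison. It is illustrative only, not the controller's actual code: it assumes the aws-sdk-go memorydb Tag type (with *string Key/Value fields) and unique tag keys, which AWS enforces. The real fix is tracked in community#1658.

```go
package main

import (
	"fmt"

	"github.com/aws/aws-sdk-go/aws"
	svcsdk "github.com/aws/aws-sdk-go/service/memorydb"
)

// sameTags reports whether two tag slices contain the same key/value
// pairs, ignoring element order (the corner case from root cause 2).
func sameTags(a, b []*svcsdk.Tag) bool {
	if len(a) != len(b) {
		return false
	}
	want := make(map[string]string, len(a))
	for _, t := range a {
		want[aws.StringValue(t.Key)] = aws.StringValue(t.Value)
	}
	for _, t := range b {
		if v, ok := want[aws.StringValue(t.Key)]; !ok || v != aws.StringValue(t.Value) {
			return false
		}
	}
	return true
}

func main() {
	a := []*svcsdk.Tag{
		{Key: aws.String("env"), Value: aws.String("test")},
		{Key: aws.String("team"), Value: aws.String("ack")},
	}
	b := []*svcsdk.Tag{
		{Key: aws.String("team"), Value: aws.String("ack")},
		{Key: aws.String("env"), Value: aws.String("test")},
	}
	fmt.Println(sameTags(a, b)) // true: same elements, different order
}
```

An order-sensitive delta compare reports these two slices as different and triggers another sdkUpdate, which is what produced the infinite update loop.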

Description of changes:

  1. Fix the updateTags function so it handles equal tag arrays. (It still wastes time calling sdkUpdate repeatedly; modifying the delta comparison of tags in code-generator would be better. A new PR has been created: Comparing tags using common acktags.Tag structs, community#1658.)
  2. Change the updateTags functions for parameter group and subnet group to match the updateTags functions in other resources.
  3. Add a status check for getTags: it is only called when the resource (cluster, snapshot, ACL, user) is active (see the sketch after this list).
  4. Increase the wait time for the next update step after the Snapshot Window update. The new wait time is enough to cover the snapshotting process.
  5. Use the same method to create the cluster for the snapshot in the snapshot_validate_tags.yaml test file.
  6. Correct user names and delete unused resources in test files.
  7. Fix some misleading test IDs, step IDs, and descriptions that made debugging really hard.
  8. Fix the import order of hook files.
  9. Modify the equalStrings function.
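To make change 3 concrete, here is a self-contained sketch of the status gate. All names are hypothetical stand-ins; the controller's real getTags and per-resource active checks live in the hook files.

```go
package main

import (
	"context"
	"errors"
	"fmt"
)

// Minimal stand-ins for the controller's types; these exist only to
// illustrate the gating logic.
type resource struct {
	status string
	tags   map[string]string
}

var errNotFound = errors.New("ResourceNotFound")

// getTags stands in for the real tag-listing API call, which fails
// while the resource is still being created.
func getTags(ctx context.Context, r *resource) (map[string]string, error) {
	if r.status != "active" {
		return nil, errNotFound
	}
	return r.tags, nil
}

// syncTags applies the gate from change 3: skip the tag lookup until
// the resource is active, so a transient lookup failure cannot mask
// the original ResourceNotFound error that drives the requeue.
func syncTags(ctx context.Context, r *resource) (map[string]string, error) {
	if r.status != "active" {
		return nil, nil // still creating; let the reconciler requeue
	}
	return getTags(ctx, r)
}

func main() {
	r := &resource{status: "creating"}
	tags, err := syncTags(context.Background(), r)
	fmt.Println(tags, err) // map[] <nil>: no spurious error while creating
}
```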

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

@ack-bot
Collaborator

ack-bot commented Jan 23, 2023

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: kyriechen96

The full list of commands accepted by this bot can be found here.

The pull request process is described here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

Comment on lines 35 to 37
aclName: open-access
numShards: 1
numReplicasPerShard: 0
Member

This should not be added. The whole idea is that the test should run once you fix the spec that has the issue, right?

Contributor Author

Yeah, the root cause is that the test account is full now. I'm deleting some clusters in the account.

Comment on lines 16 to 18
nodeType: cache.t4g.small
aclName: open-access
- numShards: 2
+ numShards: 1
Member

I don't think these changes are needed. cache.t4g.small is not a valid node type for MemoryDB.

Contributor Author

Ideally I agree with you, but this causes the node limit to be exceeded. We didn't have this issue before; it started happening recently, possibly caused by some code changes on the service side. The modification makes the test pass for now, so we still need to investigate the root cause.

@@ -58,12 +58,12 @@ steps:
userNames:
- userone$RANDOM_SUFFIX
- usertwo$RANDOM_SUFFIX
wait:
Member

wait:
  status:
    conditions:
      ACK.ResourceSynced:
        status: "True"
        timeout: 180

Why would hard-coding the value be an advantage over waiting on the status change? I prefer the previous approach, as it helps us catch any issue with the status flipping back and forth.

Contributor Author

The previous approach causes random assertion errors. My assumption is that there is something wrong with it, so I switched to the current approach to check whether it works. And I agree we need to investigate the root cause.

@kyriechen96 kyriechen96 changed the title Fixes flaky tests Modify method of wait for ACL/Snapshot creation and increase wait time for some steps in e2e test files to fix random assertion errors during e2e test Jan 25, 2023
@kyriechen96 kyriechen96 changed the title Modify method of wait for ACL/Snapshot creation and increase wait time for some steps in e2e test files to fix random assertion errors during e2e test Modify wait time for e2e tests to fix flaky assertion errors Jan 25, 2023
@kyriechen96 kyriechen96 force-pushed the fix_flaky_tests branch 3 times, most recently from 02c45db to c5c60a0 Compare January 26, 2023 16:20
@jaypipes
Contributor

@kyriechen96 Please see the error from the tests:

AssertionError: Condition status mismatch. Expected condition: ACK.ResourceSynced - {'status': 'True'} but found {'lastTransitionTime': '2023-01-26T17:22:26Z', 'message': 'Unable to determine if desired resource state matches latest observed state', 'reason': 'ACLNotFoundFault: acl-jl1ptly9i46hs93 is either not present or not available.', 'status': 'Unknown', 'type': 'ACK.ResourceSynced'}

The issue isn't timeouts. The issue seems to be a missing ACL dependency in one of the tests I think?

@kyriechen96
Contributor Author

We had a discussion about it. This happened during the condition check for ACL creation. I'm fixing it now.

@kyriechen96 kyriechen96 changed the title Modify wait time for e2e tests to fix flaky assertion errors Modify tags updating and some test files to fix flaky assertion errors Jan 31, 2023
@kyriechen96
Contributor Author

/retest

@ack-bot
Collaborator

ack-bot commented Jan 31, 2023

@kyriechen96: The following tests failed, say /retest to rerun all failed tests:

| Test name | Commit | Details | Rerun command |
| --- | --- | --- | --- |
| memorydb-release-test | 6a8b7ed | link | /test memorydb-release-test |
| memorydb-kind-e2e | 6a8b7ed | link | /test memorydb-kind-e2e |

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@kyriechen96 kyriechen96 force-pushed the fix_flaky_tests branch 6 times, most recently from bba90b9 to bc6af83 Compare February 1, 2023 22:09
@kyriechen96 kyriechen96 changed the title Modify tags updating and some test files to fix flaky assertion errors Modify tags related functions and some test files to fix flaky assertion errors Feb 1, 2023
@kyriechen96 kyriechen96 force-pushed the fix_flaky_tests branch 4 times, most recently from 068e1d0 to 79629a0 Compare February 2, 2023 20:08
@RedbackThomson
Contributor

I'm not super familiar with the memorydb snapshotting process. Would you mind explaining this a little more:

If this step executes within the time range of the new Snapshot Window, the update succeeds (ResourceSynced returns true) because the two resources match, but the cluster starts snapshotting once its Snapshot Window is updated to the current time. The next step, which updates other cluster fields, then can't apply its updates in time, because the cluster's status is snapshotting and a snapshotting cluster cannot be updated.

Member

@a-hilaly a-hilaly left a comment

Great stuff Kyrie!

Comment on lines +363 to +365
if hasSameKey {
	continue
}
Member

I see that you want to ignore duplicated keys +++

Comment on lines +33 to +34
condMsgCurrentlyDeleting = "cluster currently being deleted"
condMsgNoDeleteWhileUpdating = "cluster is being updated. cannot delete"
Member

Can we make the messages a bit more consistent? e.g. cluster currently being updated

Contributor Author

These two variables are not related to the flaky tests. I didn't modify or create them in this PR, so I will note this and change them in a new PR.

Contributor

Yes please. The English grammar of these two messages should be improved. cluster is currently being deleted and cluster is currently being updated, cannot delete are my suggestions.

@@ -15,6 +15,7 @@ package parameter_group

import (
	"context"
	ackutil "github.com/aws-controllers-k8s/runtime/pkg/util"
Member

Contributor Author

Yeah it was automatically added there. I will change the order manually.

Comment on lines 288 to 294
func equalStrings(a, b *string) bool {
	if a == nil {
		return b == nil || *b == ""
	}
	return (*a == "" && b == nil) || *a == *b
}
Member

This might not panic in the tags case, but it could with other fields. Correct implementation here: aws-controllers-k8s/community#1654
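For reference, a nil-safe variant along the lines that issue suggests (a sketch only; the authoritative version is in aws-controllers-k8s/community#1654):

```go
// equalStrings treats nil and the empty string as equal, but never
// dereferences a pointer without a nil check first.
func equalStrings(a, b *string) bool {
	if a == nil {
		return b == nil || *b == ""
	}
	if b == nil {
		return *a == ""
	}
	return *a == *b
}
```

Unlike the version above, this never evaluates *b when b is nil, which is the panic path this comment warns about.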

Contributor Author

I will fix it for all resource hooks in the new PR.

Contributor

You should be able to just copy the code from that issue into all of your equalStrings functions. It would be better to address it now rather than leave it for another PR, where we might forget.

Contributor Author

Fixed it for all hook files.

Comment on lines +25 to +30
	svcsdk "github.com/aws/aws-sdk-go/service/memorydb"
	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"

	svcapitypes "github.com/aws-controllers-k8s/memorydb-controller/apis/v1alpha1"
)
Member

👍

@@ -15,6 +15,7 @@ package subnet_group

import (
	"context"
	ackutil "github.com/aws-controllers-k8s/runtime/pkg/util"
Member

ditto

Contributor Author

Fixed import orders for all hooks.

Contributor

@RedbackThomson RedbackThomson left a comment

I think this is ready to go now! Running the soak tests will be what really determines whether all of these tests are solid, but for now this looks good.

I've left one nit that's applicable to all of the <resource>Active hook methods.

@@ -55,6 +59,14 @@ func (rm *resourceManager) validateUserNeedsUpdate(
	return nil, nil
}

// userActive returns true when the status of the given User is set to `active`
func (rm *resourceManager) userActive(
Contributor

nit: methods returning booleans should start with a verb, for example isUserActive

Member

@kyriechen96 wants to address this in a separate PR

Member

@a-hilaly a-hilaly left a comment

Great job on this.
/lgtm

@ack-prow ack-prow bot added the lgtm Indicates that a PR is ready to be merged. label Feb 6, 2023
@ack-prow

ack-prow bot commented Feb 6, 2023

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: A-Hilaly, kyriechen96, nmvk, RedbackThomson

The full list of commands accepted by this bot can be found here.

The pull request process is described here.

Needs approval from an approver in each of these files:
  • OWNERS [A-Hilaly,RedbackThomson,kyriechen96,nmvk]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@ack-prow ack-prow bot merged commit cab999c into aws-controllers-k8s:main Feb 6, 2023
kyriechen96 added a commit to kyriechen96/memorydb-controller that referenced this pull request Feb 6, 2023
Modify tags related functions and some test files to fix flaky assertion errors (aws-controllers-k8s#51)
Labels: approved, lgtm

6 participants