
Modify tags related functions and some test files to fix flaky assertion errors #51

Merged: 1 commit merged into aws-controllers-k8s:main from fix_flaky_tests on Feb 6, 2023

Conversation

kyriechen96
Contributor

@kyriechen96 kyriechen96 commented Jan 23, 2023

Issue #, if available:
The MemoryDB Controller has some flaky tests that randomly fail with assertion errors.
Root causes:

  1. Snapshot test files didn't clean up resources created by previous failed tests, so the test account eventually reached its node limit.
  2. The delta comparison is order-sensitive, so it treats two tag arrays as different even when they contain the same elements in a different order. The tag-update function didn't handle this corner case, which caused an infinite update loop. (An order-insensitive comparison is sketched below this list.)
  3. Some resources take a while to finish creation. The tag-fetching function returned an error because it couldn't find the resource yet, and this error masked the earlier ResourceNotFound error. The reconciler therefore didn't requeue, because it didn't recognize the new error.
  4. The cluster update validation test file has a step that updates and validates the Snapshot Window (the daily time range for automatic snapshots). If this step executes within the time range of the new Snapshot Window, the update succeeds (ResourceSynced returns true) because the two resources match, but the cluster starts snapshotting once its Snapshot Window is updated to the current time. The next step, which updates other cluster fields, then can't apply its updates in time, because the cluster's status is snapshotting and a snapshotting cluster cannot be updated.
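To illustrate root cause 2, here is a minimal sketch of an order-insensitive tag comparison. It is illustrative only, not the controller's actual code: it assumes the aws-sdk-go memorydb Tag type (with *string Key/Value fields) and unique tag keys, which AWS enforces. The real fix is tracked in community#1658.

```go
package main

import (
	"fmt"

	"github.com/aws/aws-sdk-go/aws"
	svcsdk "github.com/aws/aws-sdk-go/service/memorydb"
)

// sameTags reports whether two tag slices contain the same key/value
// pairs, ignoring element order (the corner case from root cause 2).
func sameTags(a, b []*svcsdk.Tag) bool {
	if len(a) != len(b) {
		return false
	}
	want := make(map[string]string, len(a))
	for _, t := range a {
		want[aws.StringValue(t.Key)] = aws.StringValue(t.Value)
	}
	for _, t := range b {
		if v, ok := want[aws.StringValue(t.Key)]; !ok || v != aws.StringValue(t.Value) {
			return false
		}
	}
	return true
}

func main() {
	a := []*svcsdk.Tag{
		{Key: aws.String("env"), Value: aws.String("test")},
		{Key: aws.String("team"), Value: aws.String("ack")},
	}
	b := []*svcsdk.Tag{
		{Key: aws.String("team"), Value: aws.String("ack")},
		{Key: aws.String("env"), Value: aws.String("test")},
	}
	fmt.Println(sameTags(a, b)) // true: same elements, different order
}
```

An order-sensitive delta compare reports these two slices as different and triggers another sdkUpdate, which is what produced the infinite update loop.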

Description of changes:

  1. Fix the updateTags function so it handles equal tag arrays. (It still wastes time calling sdkUpdate repeatedly; modifying the delta comparison of tags in code-generator would be better. A new PR has been created: Comparing tags using common acktags.Tag structs, community#1658.)
  2. Change the updateTags functions for parameter group and subnet group to match the updateTags functions in other resources.
  3. Add a status check for getTags: it is only called when the resource (cluster, snapshot, ACL, user) is active (see the sketch after this list).
  4. Increase the wait time for the next update step after the Snapshot Window update. The new wait time is enough to cover the snapshotting process.
  5. Use the same method to create the cluster for the snapshot in the snapshot_validate_tags.yaml test file.
  6. Correct user names and delete unused resources in test files.
  7. Fix some misleading test IDs, step IDs, and descriptions that made debugging really hard.
  8. Fix the import order of hook files.
  9. Modify the equalStrings function.
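To make change 3 concrete, here is a self-contained sketch of the status gate. All names are hypothetical stand-ins; the controller's real getTags and per-resource active checks live in the hook files.

```go
package main

import (
	"context"
	"errors"
	"fmt"
)

// Minimal stand-ins for the controller's types; these exist only to
// illustrate the gating logic.
type resource struct {
	status string
	tags   map[string]string
}

var errNotFound = errors.New("ResourceNotFound")

// getTags stands in for the real tag-listing API call, which fails
// while the resource is still being created.
func getTags(ctx context.Context, r *resource) (map[string]string, error) {
	if r.status != "active" {
		return nil, errNotFound
	}
	return r.tags, nil
}

// syncTags applies the gate from change 3: skip the tag lookup until
// the resource is active, so a transient lookup failure cannot mask
// the original ResourceNotFound error that drives the requeue.
func syncTags(ctx context.Context, r *resource) (map[string]string, error) {
	if r.status != "active" {
		return nil, nil // still creating; let the reconciler requeue
	}
	return getTags(ctx, r)
}

func main() {
	r := &resource{status: "creating"}
	tags, err := syncTags(context.Background(), r)
	fmt.Println(tags, err) // map[] <nil>: no spurious error while creating
}
```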

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

@ack-bot
Collaborator

ack-bot commented Jan 23, 2023

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: kyriechen96

The full list of commands accepted by this bot can be found here.

The pull request process is described here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

Comment on lines 35 to 37
aclName: open-access
numShards: 1
numReplicasPerShard: 0
Member

This should not be added. The whole idea is that the test should run once you fix the spec that has the issue, right?

Contributor Author

Yeah, the root cause is that the test account is full now. I'm deleting some clusters in the account.

Comment on lines 16 to 18
nodeType: cache.t4g.small
aclName: open-access
- numShards: 2
+ numShards: 1
Member

I don't think these changes are needed. cache.t4g.small is not a valid node type for MemoryDB.

Contributor Author

Ideally I agree with you, but this causes the node limit to be exceeded. We didn't have this issue before; it started happening recently, possibly caused by some code changes on the service side. The modification makes the test pass for now, so we still need to investigate the root cause.

@@ -58,12 +58,12 @@ steps:
userNames:
- userone$RANDOM_SUFFIX
- usertwo$RANDOM_SUFFIX
wait:
Member

wait:
  status:
    conditions:
      ACK.ResourceSynced:
        status: "True"
        timeout: 180

Why would hard-coding the value be an advantage over waiting on the status change? I prefer the previous approach, as it helps us catch any issue with the status flipping back and forth.

Contributor Author

The previous approach causes random assertion errors. My assumption is that there is something wrong with it, so I switched to the current approach to check whether it works. And I agree we need to investigate the root cause.

@kyriechen96 kyriechen96 changed the title Fixes flaky tests Modify method of wait for ACL/Snapshot creation and increase wait time for some steps in e2e test files to fix random assertion errors during e2e test Jan 25, 2023
@kyriechen96 kyriechen96 changed the title Modify method of wait for ACL/Snapshot creation and increase wait time for some steps in e2e test files to fix random assertion errors during e2e test Modify wait time for e2e tests to fix flaky assertion errors Jan 25, 2023
@kyriechen96 kyriechen96 force-pushed the fix_flaky_tests branch 3 times, most recently from 02c45db to c5c60a0 Compare January 26, 2023 16:20
@jaypipes
Contributor

@kyriechen96 Please see the error from the tests:

AssertionError: Condition status mismatch. Expected condition: ACK.ResourceSynced - {'status': 'True'} but found {'lastTransitionTime': '2023-01-26T17:22:26Z', 'message': 'Unable to determine if desired resource state matches latest observed state', 'reason': 'ACLNotFoundFault: acl-jl1ptly9i46hs93 is either not present or not available.', 'status': 'Unknown', 'type': 'ACK.ResourceSynced'}

The issue isn't timeouts. The issue seems to be a missing ACL dependency in one of the tests I think?

@kyriechen96
Contributor Author

We had a discussion about it. This happened during the condition check for ACL creation. I'm fixing it now.

@kyriechen96 kyriechen96 changed the title Modify wait time for e2e tests to fix flaky assertion errors Modify tags updating and some test files to fix flaky assertion errors Jan 31, 2023
@kyriechen96
Contributor Author

/retest

@ack-bot
Collaborator

ack-bot commented Jan 31, 2023

@kyriechen96: The following tests failed, say /retest to rerun all failed tests:

| Test name | Commit | Details | Rerun command |
| --- | --- | --- | --- |
| memorydb-release-test | 6a8b7ed | link | /test memorydb-release-test |
| memorydb-kind-e2e | 6a8b7ed | link | /test memorydb-kind-e2e |

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@kyriechen96 kyriechen96 force-pushed the fix_flaky_tests branch 6 times, most recently from bba90b9 to bc6af83 Compare February 1, 2023 22:09
@kyriechen96 kyriechen96 changed the title Modify tags updating and some test files to fix flaky assertion errors Modify tags related functions and some test files to fix flaky assertion errors Feb 1, 2023
@kyriechen96 kyriechen96 force-pushed the fix_flaky_tests branch 4 times, most recently from 068e1d0 to 79629a0 Compare February 2, 2023 20:08
@RedbackThomson
Contributor

I'm not super familiar with the memorydb snapshotting process. Would you mind explaining this a little more:

If this step executes within the time range of the new Snapshot Window, the update succeeds (ResourceSynced returns true) because the two resources match, but the cluster starts snapshotting once its Snapshot Window is updated to the current time. The next step, which updates other cluster fields, then can't apply its updates in time, because the cluster's status is snapshotting and a snapshotting cluster cannot be updated.

Member

@a-hilaly a-hilaly left a comment

Great stuff Kyrie!

Comment on lines +363 to +365
if hasSameKey {
	continue
}
Member

I see that you want to ignore duplicated keys +++

Comment on lines +33 to +34
condMsgCurrentlyDeleting = "cluster currently being deleted"
condMsgNoDeleteWhileUpdating = "cluster is being updated. cannot delete"
Member

Can we make the messages a bit more consistent? e.g. cluster currently being updated

Contributor Author

These two variables are not related to the flaky tests. I didn't modify or create them in this PR, so I will note this and change them in a new PR.

Contributor

Yes please. The English grammar of these two messages should be improved. cluster is currently being deleted and cluster is currently being updated, cannot delete are my suggestions.

@@ -15,6 +15,7 @@ package parameter_group

import (
	"context"
	ackutil "github.com/aws-controllers-k8s/runtime/pkg/util"
Member

Contributor Author

Yeah it was automatically added there. I will change the order manually.

Comment on lines 288 to 294
func equalStrings(a, b *string) bool {
	if a == nil {
		return b == nil || *b == ""
	}
	return (*a == "" && b == nil) || *a == *b
}
Member

This might not panic in the tags case, but it could with other fields. Correct implementation here: aws-controllers-k8s/community#1654
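For reference, a nil-safe variant along the lines that issue suggests (a sketch only; the authoritative version is in aws-controllers-k8s/community#1654):

```go
// equalStrings treats nil and the empty string as equal, but never
// dereferences a pointer without a nil check first.
func equalStrings(a, b *string) bool {
	if a == nil {
		return b == nil || *b == ""
	}
	if b == nil {
		return *a == ""
	}
	return *a == *b
}
```

Unlike the version above, this never evaluates *b when b is nil, which is the panic path this comment warns about.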

Contributor Author

I will fix it for all resource hooks in the new PR.

Contributor

You should be able to just copy the code from that issue into all of your equalStrings functions. It would be better to address it now rather than leave it for another PR, where we might forget.

Contributor Author

Fixed it for all hook files.

Comment on lines +25 to +30
	svcsdk "github.com/aws/aws-sdk-go/service/memorydb"
	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"

	svcapitypes "github.com/aws-controllers-k8s/memorydb-controller/apis/v1alpha1"
)
Member

👍

@@ -15,6 +15,7 @@ package subnet_group

import (
	"context"
	ackutil "github.com/aws-controllers-k8s/runtime/pkg/util"
Member

ditto

Contributor Author

Fixed import orders for all hooks.

Contributor

@RedbackThomson RedbackThomson left a comment

I think this is ready to go now! Running the soak tests will be what really determines whether all of these tests are solid, but for now this looks good.

I've left one nit that's applicable to all of the <resource>Active hook methods.

@@ -55,6 +59,14 @@ func (rm *resourceManager) validateUserNeedsUpdate(
	return nil, nil
}

// userActive returns true when the status of the given User is set to `active`
func (rm *resourceManager) userActive(
Contributor

nit: methods returning booleans should start with a verb, for example isUserActive

Member

@kyriechen96 wants to address this in a separate PR

Member

@a-hilaly a-hilaly left a comment

Great job on this.
/lgtm

@ack-prow ack-prow bot added the lgtm Indicates that a PR is ready to be merged. label Feb 6, 2023
@ack-prow

ack-prow bot commented Feb 6, 2023

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: A-Hilaly, kyriechen96, nmvk, RedbackThomson

The full list of commands accepted by this bot can be found here.

The pull request process is described here.

Needs approval from an approver in each of these files:
  • OWNERS [A-Hilaly,RedbackThomson,kyriechen96,nmvk]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@ack-prow ack-prow bot merged commit cab999c into aws-controllers-k8s:main Feb 6, 2023
kyriechen96 added a commit to kyriechen96/memorydb-controller that referenced this pull request Feb 6, 2023
Modify tags related functions and some test files to fix flaky assertion errors (aws-controllers-k8s#51)
Labels: approved, lgtm

6 participants