Refactor entity cleanup test to make it less flakey #2871

davidmrdavid · 2024-07-08T21:15:28Z

The DurableEntity_CleanEntityStorage test has been extremely flaky ever since our migration to GitHub Actions.
The test works broadly as follows.
It sets up 2 orchestrators: "A" and "B".
(1) Orchestrator "A" takes a lock on Entity "E", and completes without releasing that lock.
(2) Orchestrator "B" tries to take a lock on "E", and gets stuck until the lock that "A" holds on "E" is forcibly broken.
(3) The test sends a "release orphaned locks" request to "E" so that "B" may obtain the lock and complete.
(4) The test waits for "B" to complete, then performs some assertions.

The test would often fail on step (4), because "B" would never receive the lock. The cause was a race condition in step (3). The "release orphaned locks" external event is sent to "E" along with the entity's executionID, and that executionID may change if the entity is mid-processing: after every processed batch, an entity calls continue-as-new, which changes its executionID. Therefore, in many cases entity "E" would receive a "release orphaned locks" event carrying an older executionID and discard it, keeping itself locked and causing the test to fail.

To put it another way, with concrete values, this is the race condition that caused the test to fail:
(0) Entity currently has instanceID "123" with executionID "A"
(1) Client sends "fix orphaned locks" message to entity instanceId "123" with executionID "A"
(2) Entity receives some other request (not the "fix orphaned locks" message) and processes it. Therefore, it calls continue-as-new and changes its executionID to "B".
(3) Entity receives "Fix orphaned locks" message but it's for the wrong executionID. Therefore, it discards that message.
(4) Entity remains locked, the test fails
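
The interleaving above can be reproduced with a small in-memory model. This is only a sketch for illustration: `SimulatedEntity`, `RaceDemo`, and their members are hypothetical names and are not part of the Durable Task API. The model assumes just the two behaviors described in this PR, namely that an entity changes its executionID after every processed batch and that it discards a "fix orphaned locks" message addressed to an older executionID.

```csharp
using System;

// Hypothetical stand-in for an entity; not a Durable Task type.
public sealed class SimulatedEntity
{
    private int generation = 0;

    // Execution IDs are modeled as "A", "B", "C", ... for readability.
    public string ExecutionId => ((char)('A' + generation)).ToString();

    // The entity starts out holding an orphaned lock.
    public bool IsLocked { get; private set; } = true;

    // Any ordinary request ends with continue-as-new, which changes the executionID.
    public void ProcessOrdinaryRequest() => ContinueAsNew();

    // The release message is honored only if it targets the current executionID.
    public bool TryReleaseOrphanedLocks(string targetExecutionId)
    {
        if (targetExecutionId != ExecutionId)
        {
            return false; // stale executionID: the message is discarded
        }

        IsLocked = false;
        ContinueAsNew();
        return true;
    }

    private void ContinueAsNew() => generation++;
}

public static class RaceDemo
{
    // Returns true if the entity is still locked after the release message is delivered.
    public static bool Run(bool otherRequestArrivesFirst)
    {
        var entity = new SimulatedEntity();

        // Steps (0) and (1): the client captures executionID "A" and addresses its message to it.
        string capturedExecutionId = entity.ExecutionId;

        // Step (2): another request is processed first, so the executionID becomes "B".
        if (otherRequestArrivesFirst)
        {
            entity.ProcessOrdinaryRequest();
        }

        // Step (3): the release message arrives with the captured executionID.
        entity.TryReleaseOrphanedLocks(capturedExecutionId);

        // Step (4): report whether the lock survived.
        return entity.IsLocked;
    }

    public static void Main()
    {
        Console.WriteLine($"Other request first: still locked = {Run(true)}");
        Console.WriteLine($"Entity idle: still locked = {Run(false)}");
    }
}
```

In this model the lock survives only when another request is processed between the client capturing the executionID and the entity receiving the release message, which is the window the test has to avoid.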

This PR makes the test more resilient to this by waiting some time before sending the "release orphaned locks" request. This will ensure that the entity's executionID is stable (that it won't change) by the time we send the request, so that the test may succeed.

davidmrdavid · 2024-07-08T23:40:29Z

test/Common/HttpApiHandlerTests.cs

- Assert.True(stopwatch.Elapsed < TimeSpan.FromSeconds(10));
+ Assert.True(stopwatch.Elapsed < TimeSpan.FromSeconds(15));


GitHub Actions sometimes runs slightly slower, so this test needed the extra buffer to pass.

davidmrdavid · 2024-07-08T23:41:19Z

test/Common/DurableTaskEndToEndTests.cs

- // check that the empty entity record has been removed from storage
- result = await client.InnerClient.ListEntitiesAsync(query, CancellationToken.None);
- Assert.DoesNotContain(result.Entities, s => s.EntityId.Equals(emptyEntityId));
-


This was moved earlier in the test, to lines ~5095 to 5097.

davidmrdavid · 2024-07-08T23:42:37Z

test/Common/DurableTaskEndToEndTests.cs

+ // release orphaned locks to unblock orchestration B
+ // Note: do NOT remove empty entities yet: we want to keep the empty entity so it can unblock orchestration B
+ response = await client.InnerClient.CleanEntityStorageAsync(removeEmptyEntities: false, releaseOrphanedLocks: true, CancellationToken.None);
+ Assert.Equal(0, response.NumberOfEmptyEntitiesRemoved);


This is the main thing I changed. Previously, we were removing both empty entities and orphaned locks in the same call. Here we only want to release orphaned locks because, if we also remove empty entities, we could remove the entity that we need to unblock orchestration B, possibly causing this test to hang (as orchestration B would never complete).

davidmrdavid · 2024-07-08T23:43:49Z

test/Common/DurableTaskEndToEndTests.cs

+ var response = await client.InnerClient.CleanEntityStorageAsync(removeEmptyEntities: true, releaseOrphanedLocks: false, CancellationToken.None);
+ Assert.Equal(1, response.NumberOfEmptyEntitiesRemoved);
+ Assert.Equal(0, response.NumberOfOrphanedLocksRemoved);
+
+ // check that the empty entity record has been removed from storage
+ result = await client.InnerClient.ListEntitiesAsync(query, CancellationToken.None);
+ Assert.DoesNotContain(result.Entities, s => s.EntityId.Equals(emptyEntityId));


This used to occur later in the test, but there was no clear reason for that. So I've moved it earlier, which separates the test into 2 distinct parts: (1) testing the deletion of empty entities, and (2) testing the release of orphaned locks. This should make the test easier to read, I hope.

bachuv

Left one suggestion. Thanks for making this change!

bachuv · 2024-07-09T16:51:19Z

test/Common/DurableTaskEndToEndTests.cs

+ // release orphaned locks to unblock orchestration B
+ // Note: do NOT remove empty entities yet: we want to keep the empty entity so it can unblock orchestration B
+ response = await client.InnerClient.CleanEntityStorageAsync(removeEmptyEntities: false, releaseOrphanedLocks: true, CancellationToken.None);
+ Assert.Equal(0, response.NumberOfEmptyEntitiesRemoved);


Can we move this line to after Assert.Equal(1, response.NumberOfOrphanedLocksRemoved); just for readability? It helps when comparing this to the lines later in this test (5121, 5122)

Assert.Equal(0, response.NumberOfOrphanedLocksRemoved);
Assert.Equal(1, response.NumberOfEmptyEntitiesRemoved);

Pretty much just think that we should check NumberOfOrphanedLocksRemoved and NumberOfEmptyEntitiesRemoved in the same order throughout this test.

Incorporated! 9166608

unflake test

f4692f7

davidmrdavid marked this pull request as ready for review July 8, 2024 21:15

davidmrdavid added 3 commits July 8, 2024 15:22

reorder steps

b5b4720

add delay

e7e833f

make test slightly longer

b6c17b8

davidmrdavid changed the title ~~[WIP] Make entity cleanup test not flaky~~ Make entity cleanup test not flaky Jul 9, 2024

davidmrdavid changed the title ~~Make entity cleanup test not flaky~~ Refactor entity cleanup test to make it less flakey Jul 9, 2024

bachuv approved these changes Jul 9, 2024

incorporate feedback

9166608

davidmrdavid merged commit 08c385b into dev Jul 9, 2024
4 checks passed

davidmrdavid deleted the dajusto/fix-entity-deletion-test branch July 9, 2024 17:53

bachuv pushed a commit that referenced this pull request Jul 9, 2024

Refactor entity cleanup test to make it less flakey (#2871)

0647368
