
Query: Fixes SplitHandling bug caused by caches not getting refreshed #2004

Merged: 5 commits into master on Nov 17, 2020

Conversation

@bchong95 (Contributor) commented on Nov 16, 2020:

SplitHandling : Adds refresh cache fix.

Description

The pagination library was calling into GetTargetPartitionKeyRangeByFeedRangeAsync, which does not refresh the partition topology cache by default and offers no way for the caller to request a refresh. As a result, the client goes into a loop whenever it encounters a split: it targets partition A, and when A splits into B and C, it keeps retrying on A instead of B and C until the cache is invalidated. By default the cache is only invalidated in the background on a timer (roughly every 10 or 20 minutes).
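To make the failure mode concrete, here is a minimal, self-contained simulation of the stale-cache loop; the types and names are illustrative stand-ins, not the SDK's actual internals.

using System;
using System.Collections.Generic;

// Illustrative stand-in for the client's partition topology cache.
class TopologyCacheSimulation
{
	// What the service actually has after partition A splits.
	private readonly List<string> serverRanges = new List<string> { "B", "C" };

	// What the client cached before the split.
	private List<string> cachedRanges = new List<string> { "A" };

	public IReadOnlyList<string> GetTargetRanges(bool forceRefresh)
	{
		if (forceRefresh)
		{
			// With the fix: re-read the topology and pick up the child ranges.
			this.cachedRanges = new List<string>(this.serverRanges);
		}

		// Without forceRefresh, the stale entry ("A") keeps being returned until
		// the background timer eventually invalidates it, so every retry targets
		// the gone parent range.
		return this.cachedRanges;
	}
}

class Demo
{
	static void Main()
	{
		var cache = new TopologyCacheSimulation();
		Console.WriteLine(string.Join(",", cache.GetTargetRanges(forceRefresh: false))); // A
		Console.WriteLine(string.Join(",", cache.GetTargetRanges(forceRefresh: true)));  // B,C
	}
}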

Instead of adding a boolean flag (defaulting to false), we decided to expose a RefreshProvider method on IFeedRangeProvider to make it explicit when the cache needs to be refreshed.
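The PR description names the RefreshProvider method and the IFeedRangeProvider interface; the sketch below shows roughly what that shape looks like. The signature and the surrounding member are assumptions for illustration, not the exact code in this PR.

using System.Collections.Generic;
using System.Threading;
using System.Threading.Tasks;
using Microsoft.Azure.Cosmos;

// Sketch only: the real IFeedRangeProvider lives in the pagination library and
// its members may differ.
internal interface IFeedRangeProvider
{
	// Existing surface (illustrative): resolve feed ranges from the cached topology.
	Task<IEnumerable<FeedRange>> GetFeedRangesAsync(CancellationToken cancellationToken);

	// New: explicitly refresh the cached partition topology (e.g. after a split),
	// instead of threading a forceRefresh boolean through every lookup.
	Task RefreshProviderAsync(CancellationToken cancellationToken);
}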

Right now this PR has no automated test, since we don't have access to the Service Fabric emulator with partition splits. The InMemoryContainer can only validate the in-memory code path, which misses this case where the cache was not being invalidated.

For now we have manually validated this with the following test:

using System;
using System.Collections.Generic;
using System.IO;
using System.Text;
using System.Threading.Tasks;
using Microsoft.Azure.Cosmos;
using Microsoft.Azure.Cosmos.CosmosElements;
using Microsoft.VisualStudio.TestTools.UnitTesting;
using Cosmos = Microsoft.Azure.Cosmos;

[TestClass]
public class SplitTests
{
	[TestMethod]
	public async Task TestSplitAsync()
	{
		string endpoint = "REDACTED";
		string authKey = "REDACTED";
		string databaseId = $"Database{Guid.NewGuid()}";
		string containerId = $"Container{Guid.NewGuid()}";
		using (CosmosClient client = new CosmosClient(endpoint, authKey))
		{
			// Create a container with manual throughput low enough to start on a single partition.
			int throughput = 6000;
			Database database = await client.CreateDatabaseIfNotExistsAsync(databaseId);
			Container container = await database.CreateContainerAsync(
				new ContainerProperties(id: containerId, partitionKeyPath: "/id"),
				ThroughputProperties.CreateManualThroughput(throughput: throughput));

			// Seed the container with 100 documents spread across partition key values.
			for (int i = 0; i < 100; i++)
			{
				string content = CosmosObject.Create(new Dictionary<string, CosmosElement>()
				{
					{ "id", CosmosString.Create(i.ToString()) }
				}).ToString();

				await container.CreateItemStreamAsync(
					new MemoryStream(Encoding.UTF8.GetBytes(content)),
					new Cosmos.PartitionKey(i.ToString()));
			}

			// Start draining a query one item at a time, so the iterator holds a
			// continuation that spans the original partition topology.
			FeedIterator feedIterator = container.GetItemQueryStreamIterator(
				queryText: "SELECT * FROM c",
				requestOptions: new QueryRequestOptions() { MaxItemCount = 1 });

			Console.WriteLine($"{DateTime.Now}: Starting a query");
			ResponseMessage responseMessage = await feedIterator.ReadNextAsync(default);
			responseMessage.EnsureSuccessStatusCode();
			Console.WriteLine($"{DateTime.Now}: Finished a query");

			// Raise the throughput 10x to force the service to split the partition.
			throughput *= 10;
			Console.WriteLine($"{DateTime.Now}: Adjusting throughput to: {throughput}");
			await container.ReplaceThroughputAsync(throughput: throughput);

			// Give the service time to perform the split.
			await Task.Delay(1000 * 65);

			// Continue the query; before this fix, the client would keep retrying
			// the gone parent range instead of its children.
			Console.WriteLine($"{DateTime.Now}: Starting a query");
			ResponseMessage responseMessage2 = await feedIterator.ReadNextAsync(default);
			responseMessage2.EnsureSuccessStatusCode();
			Console.WriteLine($"{DateTime.Now}: Finished a query");

			await database.DeleteAsync();
		}
	}
}

@ealsur (Member) left a comment:

Please add UTs that verify the new APIs are called with the expected flags on a split

Review thread on the changed code:

try
{
	// We can refresh the cache by just getting all the ranges for this container using the force refresh flag
	_ = await this.cosmosQueryClient.TryGetOverlappingRangesAsync(
@ealsur (Member):

Since this is in the CosmosQueryClient API, can we add a UT that tests this method is called with forceRefresh: true as a Mock.Verify?
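For illustration, the kind of Mock.Verify test being asked for could look like the following sketch. ITopologyCache and SplitHandler are hypothetical stand-ins defined just for this example, since the real SDK types involved are internal.

using System.Collections.Generic;
using System.Threading.Tasks;
using Microsoft.VisualStudio.TestTools.UnitTesting;
using Moq;

public interface ITopologyCache
{
	Task<IReadOnlyList<string>> TryGetOverlappingRangesAsync(string containerRid, bool forceRefresh);
}

public class SplitHandler
{
	private readonly ITopologyCache cache;

	public SplitHandler(ITopologyCache cache) => this.cache = cache;

	// The behavior under test: a split must trigger a lookup with forceRefresh: true.
	public Task HandleSplitAsync(string containerRid) =>
		this.cache.TryGetOverlappingRangesAsync(containerRid, forceRefresh: true);
}

[TestClass]
public class SplitHandlerTests
{
	[TestMethod]
	public async Task HandleSplit_ForcesCacheRefresh()
	{
		Mock<ITopologyCache> mock = new Mock<ITopologyCache>();
		mock.Setup(c => c.TryGetOverlappingRangesAsync("rid", true))
			.ReturnsAsync(new List<string> { "B", "C" });

		await new SplitHandler(mock.Object).HandleSplitAsync("rid");

		// Assert the expectation: the cache lookup happened with forceRefresh: true.
		mock.Verify(c => c.TryGetOverlappingRangesAsync("rid", true), Times.Once);
	}
}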

@bchong95 (Contributor, author):

No amount of unit testing is going to solve this scenario. I did a manual integration test.

@ealsur (Member):

I disagree. You are adding an API that is expected to call cosmosQueryClient.TryGetOverlappingRangesAsync with the forceRefresh flag set to true. You can cover this with a UT that asserts the behavior and makes sure that expectation doesn't regress; that is the point of a unit test. And there is nothing blocking you from adding this to secure the behavior, since all involved types are mockable.

So if it can be done, and there is value in securing the behavior, why are we valuing engineering time over code quality/coverage?

@bchong95 (Contributor, author):

That's not what I said. The mindset that we can just keep adding unit tests to secure this code path is wrong. The fact that we are making this fix is evidence of that.

@ealsur (Member):

I'd argue that if, when the code was first introduced, we had validated the assumption that cache refreshes should happen on a split, we probably would have found the bug/missing path earlier, because we would have seen that no cache refreshes were happening. The goal of UTs is not to add them after the fact, but to set the expectations first and see if the scenario works.

This bug means that we either never set the expectations or didn't want to cover them.

Hence my ask: can we now add a UT to validate the expectation and avoid a future regression?

@rmandvikar commented Nov 20, 2020:

@bchong95 My 2c: I see that this PR is merged now. I definitely agree with @ealsur about a UT that asserts the behavior to prevent a future regression (especially since this bug caused a production issue on our side).

@bchong95 (Contributor, author) commented Nov 21, 2020:

@ealsur The problem with unit tests is that you have to know to add them during the testing phase. The problem with this code is that it has independently moving components with soft contracts; one will never know all the cases to cover. The standard approach is to first add an integration test to discover which soft contracts need to be tested for, and then go back and add unit tests, since they run faster and are easier to debug.

@rmandvikar Unit tests are not the way to go for this code path. Mocking out a bunch of assumptions for soft contracts is fragile, and those mocks are bound to break as the system evolves; maintaining them is basically a lifestyle. If I had a mock / unit test, I would have had to update it for this PR:

#2010

since the soft contract we are testing for is "only refresh when needed" rather than "refresh after every split". If an independent developer made that optimization and failed my random unit test, they would have to sit there and go through git history to figure out what the original soft contract was and whether it's okay for them to update the test to reflect the new soft contract.
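To make that brittleness argument concrete using the hypothetical SplitHandler sketch from earlier: a verification pinned to the old contract, such as

	mock.Verify(c => c.TryGetOverlappingRangesAsync(It.IsAny<string>(), true), Times.Once);

would start failing the moment the handler is optimized to refresh only when the cached ranges are actually stale, even though the externally observable behavior remains correct.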

Here is the PR with the proper way of stopping regressions:

#2026

It simulates the caching behavior that prod sees and asserts the soft contract, without the need for manual unit tests that will break in the future.

Again, an integration test would catch all these situations plus other soft-contract violations we have yet to come up with.

@sboshra (Contributor) left a comment:

:shipit:

@sboshra merged commit fce7045 into master on Nov 17, 2020
@sboshra deleted the users/brchon/ForceRefreshPartitionToplogy branch on Nov 17, 2020 at 03:24
@j82w changed the title from "SplitHandling : Adds refresh cache fix." to "Query: Fixes SplitHandling bug caused by caches not getting refreshed" on Nov 17, 2020
@ghost commented Dec 15, 2021:

Closing due to inactivity, please feel free to re-open.
