
Query: Fixes SplitHandling bug caused by caches not getting refreshed #2004

Merged: 5 commits into master on Nov 17, 2020

Conversation

@bchong95 (Contributor) commented on Nov 16, 2020:

SplitHandling : Adds refresh cache fix.

Description

The pagination library was calling into GetTargetPartitionKeyRangeByFeedRangeAsync, which does not refresh the partition topology cache by default and offers no way for the caller to request a refresh. As a result, the client goes into a loop whenever it encounters a split: it targets partition A, and when A splits into B and C, it keeps retrying on A instead of B and C until the cache is invalidated. By default the cache is only invalidated in the background on a timer (roughly every 10 or 20 minutes).
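To make the failure mode concrete, here is a minimal, self-contained simulation of the stale-cache loop; the types and names are illustrative stand-ins, not the SDK's actual internals.

using System;
using System.Collections.Generic;

// Illustrative stand-in for the client's partition topology cache.
class TopologyCacheSimulation
{
	// What the service actually has after partition A splits.
	private readonly List<string> serverRanges = new List<string> { "B", "C" };

	// What the client cached before the split.
	private List<string> cachedRanges = new List<string> { "A" };

	public IReadOnlyList<string> GetTargetRanges(bool forceRefresh)
	{
		if (forceRefresh)
		{
			// With the fix: re-read the topology and pick up the child ranges.
			this.cachedRanges = new List<string>(this.serverRanges);
		}

		// Without forceRefresh, the stale entry ("A") keeps being returned until
		// the background timer eventually invalidates it, so every retry targets
		// the gone parent range.
		return this.cachedRanges;
	}
}

class Demo
{
	static void Main()
	{
		var cache = new TopologyCacheSimulation();
		Console.WriteLine(string.Join(",", cache.GetTargetRanges(forceRefresh: false))); // A
		Console.WriteLine(string.Join(",", cache.GetTargetRanges(forceRefresh: true)));  // B,C
	}
}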

Instead of adding a boolean flag (defaulting to false), we decided to expose a RefreshProvider method on IFeedRangeProvider to make it explicit when the cache needs to be refreshed.
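The PR description names the RefreshProvider method and the IFeedRangeProvider interface; the sketch below shows roughly what that shape looks like. The signature and the surrounding member are assumptions for illustration, not the exact code in this PR.

using System.Collections.Generic;
using System.Threading;
using System.Threading.Tasks;
using Microsoft.Azure.Cosmos;

// Sketch only: the real IFeedRangeProvider lives in the pagination library and
// its members may differ.
internal interface IFeedRangeProvider
{
	// Existing surface (illustrative): resolve feed ranges from the cached topology.
	Task<IEnumerable<FeedRange>> GetFeedRangesAsync(CancellationToken cancellationToken);

	// New: explicitly refresh the cached partition topology (e.g. after a split),
	// instead of threading a forceRefresh boolean through every lookup.
	Task RefreshProviderAsync(CancellationToken cancellationToken);
}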

Right now this PR has no automated test, since we don't have access to the Service Fabric emulator with partition splits. The InMemoryContainer can only validate the in-memory code path, which misses this case where the cache was not being invalidated.

For now we have manually validated this with the following test:

using System;
using System.Collections.Generic;
using System.IO;
using System.Text;
using System.Threading.Tasks;
using Microsoft.Azure.Cosmos;
using Microsoft.Azure.Cosmos.CosmosElements;
using Microsoft.VisualStudio.TestTools.UnitTesting;
using Cosmos = Microsoft.Azure.Cosmos;

[TestClass]
public class SplitTests
{
	[TestMethod]
	public async Task TestSplitAsync()
	{
		string endpoint = "REDACTED";
		string authKey = "REDACTED";
		string databaseId = $"Database{Guid.NewGuid()}";
		string containerId = $"Container{Guid.NewGuid()}";
		using (CosmosClient client = new CosmosClient(endpoint, authKey))
		{
			// Create a container with manual throughput low enough to start on a single partition.
			int throughput = 6000;
			Database database = await client.CreateDatabaseIfNotExistsAsync(databaseId);
			Container container = await database.CreateContainerAsync(
				new ContainerProperties(id: containerId, partitionKeyPath: "/id"),
				ThroughputProperties.CreateManualThroughput(throughput: throughput));

			// Seed the container with 100 documents spread across partition key values.
			for (int i = 0; i < 100; i++)
			{
				string content = CosmosObject.Create(new Dictionary<string, CosmosElement>()
				{
					{ "id", CosmosString.Create(i.ToString()) }
				}).ToString();

				await container.CreateItemStreamAsync(
					new MemoryStream(Encoding.UTF8.GetBytes(content)),
					new Cosmos.PartitionKey(i.ToString()));
			}

			// Start draining a query one item at a time, so the iterator holds a
			// continuation that spans the original partition topology.
			FeedIterator feedIterator = container.GetItemQueryStreamIterator(
				queryText: "SELECT * FROM c",
				requestOptions: new QueryRequestOptions() { MaxItemCount = 1 });

			Console.WriteLine($"{DateTime.Now}: Starting a query");
			ResponseMessage responseMessage = await feedIterator.ReadNextAsync(default);
			responseMessage.EnsureSuccessStatusCode();
			Console.WriteLine($"{DateTime.Now}: Finished a query");

			// Raise the throughput 10x to force the service to split the partition.
			throughput *= 10;
			Console.WriteLine($"{DateTime.Now}: Adjusting throughput to: {throughput}");
			await container.ReplaceThroughputAsync(throughput: throughput);

			// Give the service time to perform the split.
			await Task.Delay(1000 * 65);

			// Continue the query; before this fix, the client would keep retrying
			// the gone parent range instead of its children.
			Console.WriteLine($"{DateTime.Now}: Starting a query");
			ResponseMessage responseMessage2 = await feedIterator.ReadNextAsync(default);
			responseMessage2.EnsureSuccessStatusCode();
			Console.WriteLine($"{DateTime.Now}: Finished a query");

			await database.DeleteAsync();
		}
	}
}

@ealsur (Member) left a comment:

Please add UTs that verify the new APIs are called with the expected flags on a split

Review thread on the changed code:

try
{
	// We can refresh the cache by just getting all the ranges for this container using the force refresh flag
	_ = await this.cosmosQueryClient.TryGetOverlappingRangesAsync(
@ealsur (Member):

Since this is in the CosmosQueryClient API, can we add a UT that tests this method is called with forceRefresh: true as a Mock.Verify?
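For illustration, the kind of Mock.Verify test being asked for could look like the following sketch. ITopologyCache and SplitHandler are hypothetical stand-ins defined just for this example, since the real SDK types involved are internal.

using System.Collections.Generic;
using System.Threading.Tasks;
using Microsoft.VisualStudio.TestTools.UnitTesting;
using Moq;

public interface ITopologyCache
{
	Task<IReadOnlyList<string>> TryGetOverlappingRangesAsync(string containerRid, bool forceRefresh);
}

public class SplitHandler
{
	private readonly ITopologyCache cache;

	public SplitHandler(ITopologyCache cache) => this.cache = cache;

	// The behavior under test: a split must trigger a lookup with forceRefresh: true.
	public Task HandleSplitAsync(string containerRid) =>
		this.cache.TryGetOverlappingRangesAsync(containerRid, forceRefresh: true);
}

[TestClass]
public class SplitHandlerTests
{
	[TestMethod]
	public async Task HandleSplit_ForcesCacheRefresh()
	{
		Mock<ITopologyCache> mock = new Mock<ITopologyCache>();
		mock.Setup(c => c.TryGetOverlappingRangesAsync("rid", true))
			.ReturnsAsync(new List<string> { "B", "C" });

		await new SplitHandler(mock.Object).HandleSplitAsync("rid");

		// Assert the expectation: the cache lookup happened with forceRefresh: true.
		mock.Verify(c => c.TryGetOverlappingRangesAsync("rid", true), Times.Once);
	}
}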

@bchong95 (Contributor, author):

No amount of unit testing is going to solve this scenario. I did a manual integration test.

@ealsur (Member):

I disagree. You are adding an API that is expected to call cosmosQueryClient.TryGetOverlappingRangesAsync with the forceRefresh flag set to true. You can cover this with a UT that asserts the behavior and makes sure that expectation doesn't regress; that is the point of a unit test. And there is nothing blocking you from adding this to secure the behavior, since all involved types are mockable.

So if it can be done, and there is value in securing the behavior, why are we valuing engineering time over code quality/coverage?

@bchong95 (Contributor, author):

That's not what I said. The mindset that we can just keep adding unit tests to secure this code path is wrong. The fact that we are making this fix is evidence of that.

@ealsur (Member):

I'd argue that if, when the code was first introduced, we had validated the assumption that cache refreshes should happen on a split, we probably would have found the bug/missing path earlier, because we would have seen that no cache refreshes were happening. The goal of UTs is not to add them after the fact, but to set the expectations first and see if the scenario works.

This bug means that we either never set the expectations or didn't want to cover them.

Hence my ask: can we now add a UT to validate the expectation and avoid a future regression?

@rmandvikar commented Nov 20, 2020:

@bchong95 My 2c: I see that this PR is merged now. I definitely agree with @ealsur about a UT that asserts the behavior to prevent a future regression (especially since this bug caused a production issue on our side).

@bchong95 (Contributor, author) commented Nov 21, 2020:

@ealsur The problem with unit tests is that you have to know to add them during the testing phase. The problem with this code is that it has independently moving components with soft contracts; one will never know all the cases to cover. The standard approach is to first add an integration test to discover which soft contracts need to be tested for, and then go back and add unit tests, since they run faster and are easier to debug.

@rmandvikar Unit tests are not the way to go for this code path. Mocking out a bunch of assumptions for soft contracts is fragile, and those mocks are bound to break as the system evolves; maintaining them is basically a lifestyle. If I had a mock / unit test, I would have had to update it for this PR:

#2010

since the soft contract we are testing for is "only refresh when needed" rather than "refresh after every split". If an independent developer made that optimization and failed my random unit test, they would have to sit there and go through git history to figure out what the original soft contract was and whether it's okay for them to update the test to reflect the new soft contract.
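To make that brittleness argument concrete using the hypothetical SplitHandler sketch from earlier: a verification pinned to the old contract, such as

	mock.Verify(c => c.TryGetOverlappingRangesAsync(It.IsAny<string>(), true), Times.Once);

would start failing the moment the handler is optimized to refresh only when the cached ranges are actually stale, even though the externally observable behavior remains correct.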

Here is the PR with the proper way of stopping regressions:

#2026

It simulates the caching behavior that prod sees and asserts the soft contract, without the need for manual unit tests that will break in the future.

Again, an integration test would catch all these situations plus other soft-contract violations we have yet to come up with.

@sboshra (Contributor) left a comment:

:shipit:

@sboshra merged commit fce7045 into master on Nov 17, 2020
@sboshra deleted the users/brchon/ForceRefreshPartitionToplogy branch on Nov 17, 2020 at 03:24
@j82w changed the title from "SplitHandling : Adds refresh cache fix." to "Query: Fixes SplitHandling bug caused by caches not getting refreshed" on Nov 17, 2020
@ghost commented Dec 15, 2021:

Closing due to inactivity, please feel free to re-open.
