Query: Fixes SplitHandling bug caused by caches not getting refreshed #2004
Microsoft.Azure.Cosmos/tests/Microsoft.Azure.Cosmos.Tests/Pagination/InMemoryContainer.cs
Please add UTs that verify the new APIs are called with the expected flags on a split
try
{
    // We can refresh the cache by just getting all the ranges for this container using the force refresh flag
    _ = await this.cosmosQueryClient.TryGetOverlappingRangesAsync(
Since this is in the CosmosQueryClient API, can we add a UT that tests this method is called with forceRefresh: true as a Mock.Verify?
No amount of unit testing is going to solve this scenario. I did a manual integration test.
I disagree. You are adding an API that has the expectation to call the cosmosQueryClient.TryGetOverlappingRangesAsync with the forceRefresh flag as true. You can cover this with a UT that asserts the behavior and makes sure that expectation doesn't regress. That is the point of a unit test. And there is nothing blocking you from adding this to secure the behavior; all involved types are mockable.
So if it can be done, and there is value in securing the behavior, why are we valuing engineering time over code quality/coverage?
That's not what I said. The mindset that we can just keep adding unit tests to secure this code path is wrong. The fact that we are making this fix is evidence of that.
I'd argue that if, when the code was first introduced, we had validated the assumption that cache refreshes should happen on a split, then we would probably have found the bug/missing path earlier, because we would have seen that no cache refreshes were happening.
The goal of UTs is not to add them after the fact, but to set the expectations first and see if the scenario works.
This bug either means that we never set the expectations or we didn't want to cover them.
Hence my ask: can we now add a UT to validate the expectation and avoid a future regression?
@ealsur The problem with unit tests is that you have to know to add them during the testing phase. The problem with this code is that it has independently moving components with soft contracts. One will never know all the cases to cover. The standard approach is that you first add an integration test to discover which soft contracts need to be tested, and then you go back and add unit tests, since they run faster and are easier to debug.
@rmandvikar unit tests are not the way to go for this code path. Mocking out a bunch of assumptions for soft contracts is fragile and they are bound to break as the system evolves. Maintaining them is basically a lifestyle. If I had a mock / unit test I would have to update it for this PR:
Since the soft contract we are testing for is "only refresh when needed" instead of "refresh after every split". If an independent developer made the optimization and failed my random unit test, they would have to sit there and go through git history to figure out what the original soft contract was and whether it's okay for them to update the test to reflect the new soft contract.
Here is the PR for the proper way of stopping regressions:
It simulates the caching behavior that prod sees and asserts the soft contract without the need for manual unit tests that will break in the future.
Again, an integration test would catch all these situations plus other soft contract violations we have yet to come up with.
Microsoft.Azure.Cosmos/src/Pagination/NetworkAttachedDocumentContainer.cs
Outdated
Closing due to inactivity; please feel free to re-open.
SplitHandling : Adds refresh cache fix.
Description
The pagination library was calling into GetTargetPartitionKeyRangeByFeedRangeAsync, which does not refresh the partition topology cache by default, nor does it offer a way for the caller to request a refresh. As a result, the client goes into a loop whenever it encounters a split: it targets partition A, and if A splits into B and C, it keeps retrying on A instead of retrying on B and C until the cache is invalidated. By default the cache is invalidated in the background on a timer (something like every 10 or 20 minutes).
Instead of having a boolean flag (that defaults to false), we decided to expose a RefreshProvider method on IFeedRangeProvider to make it explicit that the cache needs to be refreshed.
Right now this PR has no way of being tested, since we don't have access to the service fabric emulator with partition splits. The InMemoryContainer can only validate the InMemory code path, but misses this case where the cache was not being invalidated.
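To make the failure mode concrete, here is a minimal, self-contained sketch of the retry loop described above. It is not the SDK's code: FakeService, CachedFeedRangeProvider, Refresh, and ReadAll are hypothetical names invented for illustration, and the real IFeedRangeProvider and RefreshProvider signatures may well differ. The sketch only assumes what the description states: reads are routed from a cached topology, and a split leaves that cache pointing at a range that no longer exists.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical stand-in for the service: range "A" has already split into "B" and "C".
public sealed class FakeService
{
    public IReadOnlyList<string> CurrentRanges { get; } = new[] { "B", "C" };

    // Returns false when the targeted range no longer exists (it was split).
    public bool TryRead(string range) => range != "A";
}

// Hypothetical provider that caches the partition topology.
public sealed class CachedFeedRangeProvider
{
    private readonly FakeService service;
    private List<string> cachedRanges = new List<string> { "A" }; // stale after the split

    public CachedFeedRangeProvider(FakeService service) => this.service = service;

    public int RefreshCount { get; private set; }

    // Served from the cache; never contacts the service.
    public IReadOnlyList<string> GetRanges() => this.cachedRanges;

    // Explicit refresh, in the spirit of the RefreshProvider method (assumed shape).
    public void Refresh()
    {
        this.cachedRanges = new List<string>(this.service.CurrentRanges);
        this.RefreshCount++;
    }
}

public static class SplitHandlingSketch
{
    // Returns the ranges that were read successfully, or an empty list after maxAttempts.
    public static List<string> ReadAll(
        FakeService service,
        CachedFeedRangeProvider provider,
        bool refreshOnSplit,
        int maxAttempts = 5)
    {
        for (int attempt = 0; attempt < maxAttempts; attempt++)
        {
            List<string> succeeded = new List<string>();
            bool sawSplit = false;
            foreach (string range in provider.GetRanges())
            {
                if (service.TryRead(range)) succeeded.Add(range);
                else sawSplit = true;
            }

            if (!sawSplit) return succeeded;

            // Without this call, the next attempt targets the stale range "A" again.
            if (refreshOnSplit) provider.Refresh();
        }

        return new List<string>(); // every attempt hit the stale range
    }

    public static void Main()
    {
        FakeService service = new FakeService();
        List<string> stale = ReadAll(service, new CachedFeedRangeProvider(service), refreshOnSplit: false);
        List<string> fresh = ReadAll(service, new CachedFeedRangeProvider(service), refreshOnSplit: true);
        Console.WriteLine($"without refresh: {stale.Count} ranges read");
        Console.WriteLine($"with refresh: {string.Join(",", fresh)}");
    }
}
```

In this toy model the run without a refresh exhausts its attempts on A, while the run with an explicit refresh picks up B and C on the second attempt. The RefreshCount property is there so that a test could also assert how often a refresh happened, which is the "only refresh when needed" contract debated in the review thread.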
For now we have manually validated this with the following: