
LeaseLostException still bubbled up #3379

Closed
levimatheri opened this issue Aug 9, 2022 · 3 comments · Fixed by #3401
Comments

@levimatheri
Contributor

levimatheri commented Aug 9, 2022

We are continuously addressing and improving the SDK. If possible, make sure the problem persists in the latest SDK version.

Describe the bug
We have implemented a ChangeFeedProcessor with multiple instances. Since we don't have enough data yet to distribute the load across the instances, we are occasionally getting LeaseLostExceptions. The error message just says "Lease was lost". From my understanding, and based on this PR, this exception should be caught and handled by the SDK and not bubbled up to user code.
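For reference, a minimal multi-instance Change Feed Processor setup looks roughly like this (a sketch only; the container names, item type, processor name, and handler are placeholders, not our actual code):

```csharp
using Microsoft.Azure.Cosmos;

// Sketch: "db", "monitored", "leases", MyItem, and HandleChangesAsync
// are placeholders.
CosmosClient client = new CosmosClient(connectionString);
Container monitoredContainer = client.GetContainer("db", "monitored");
Container leaseContainer = client.GetContainer("db", "leases");

ChangeFeedProcessor processor = monitoredContainer
    .GetChangeFeedProcessorBuilder<MyItem>(
        processorName: "myProcessor",
        onChangesDelegate: HandleChangesAsync)
    // Each instance needs a unique name; instances that share the same
    // processorName and lease container split the leases between them.
    .WithInstanceName(Environment.MachineName)
    .WithLeaseContainer(leaseContainer)
    .Build();

await processor.StartAsync();

static Task HandleChangesAsync(
    IReadOnlyCollection<MyItem> changes,
    CancellationToken cancellationToken)
{
    // Process the batch of changes here.
    return Task.CompletedTask;
}
```

With fewer partitions (leases) than instances, some instances sit idle, and lease stealing/rebalancing between instances is where we see the exceptions.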

To Reproduce
Run change feed processor with multiple instances.

Expected behavior
LeaseLostException should not be bubbled up to user code

Actual behavior
LeaseLostException is bubbled up to user code

Environment summary
SDK Version: 3.29.0
OS Version (e.g. Windows, Linux, MacOSX): Windows Server 2019

Additional context
Here's the stack trace:

Microsoft.Azure.Cosmos.ChangeFeed.Exceptions.LeaseLostException:
   at Microsoft.Azure.Cosmos.ChangeFeed.LeaseManagement.DocumentServiceLeaseManagerCosmos+<>c__DisplayClass8_0.<AcquireAsync>b__0 (Microsoft.Azure.Cosmos.Client, Version=3.29.0.0, Culture=neutral, PublicKeyToken=31bf3856ad364e35)
   at Microsoft.Azure.Cosmos.ChangeFeed.LeaseManagement.DocumentServiceLeaseUpdaterCosmos+<UpdateLeaseAsync>d__3.MoveNext (Microsoft.Azure.Cosmos.Client, Version=3.29.0.0, Culture=neutral, PublicKeyToken=31bf3856ad364e35)
   at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw (System.Private.CoreLib, Version=6.0.0.0, Culture=neutral, PublicKeyToken=7cec85d7bea7798e)
   at System.Runtime.CompilerServices.TaskAwaiter.ThrowForNonSuccess (System.Private.CoreLib, Version=6.0.0.0, Culture=neutral, PublicKeyToken=7cec85d7bea7798e)
   at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification (System.Private.CoreLib, Version=6.0.0.0, Culture=neutral, PublicKeyToken=7cec85d7bea7798e)
   at Microsoft.Azure.Cosmos.ChangeFeed.LeaseManagement.DocumentServiceLeaseManagerCosmos+<AcquireAsync>d__8.MoveNext (Microsoft.Azure.Cosmos.Client, Version=3.29.0.0, Culture=neutral, PublicKeyToken=31bf3856ad364e35)
   at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw (System.Private.CoreLib, Version=6.0.0.0, Culture=neutral, PublicKeyToken=7cec85d7bea7798e)
   at System.Runtime.CompilerServices.TaskAwaiter.ThrowForNonSuccess (System.Private.CoreLib, Version=6.0.0.0, Culture=neutral, PublicKeyToken=7cec85d7bea7798e)
   at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification (System.Private.CoreLib, Version=6.0.0.0, Culture=neutral, PublicKeyToken=7cec85d7bea7798e)
   at Microsoft.Azure.Cosmos.ChangeFeed.FeedManagement.PartitionControllerCore+<AddOrUpdateLeaseAsync>d__9.MoveNext (Microsoft.Azure.Cosmos.Client, Version=3.29.0.0, Culture=neutral, PublicKeyToken=31bf3856ad364e35)

cc. @ealsur

@ealsur
Member

ealsur commented Aug 15, 2022

There are still scenarios, related to this stack trace and not covered by the linked PR, where a LeaseLostException will be reported:

  1. If the lease is not found (the document was deleted, which would indicate an invalid state): https://github.com/Azure/azure-cosmos-dotnet-v3/blob/master/Microsoft.Azure.Cosmos/src/ChangeFeedProcessor/LeaseManagement/DocumentServiceLeaseUpdaterCosmos.cs#L66-L70
  2. If there are constant conflicts beyond the retry scope (basically getting 412, which would indicate very high worker concurrency): https://github.com/Azure/azure-cosmos-dotnet-v3/blob/master/Microsoft.Azure.Cosmos/src/ChangeFeedProcessor/LeaseManagement/DocumentServiceLeaseUpdaterCosmos.cs#L35
  3. When a worker processed a batch and was about to checkpoint, but the lease was stolen (https://github.com/Azure/azure-cosmos-dotnet-v3/blob/master/Microsoft.Azure.Cosmos/src/ChangeFeedProcessor/LeaseManagement/DocumentServiceLeaseManagerCosmos.cs#L92), which would indicate that the batch will probably be reprocessed by another instance.

In those cases it is relevant to report this as an error, and it should not be hidden. But I agree that the error type should be clearer about what it means when looking at error logs.

I'll keep this Issue open to track the improvement, thanks for reporting!

@ealsur
Member

ealsur commented Aug 22, 2022

@levimatheri The linked PR should take care of these notifications. As I mentioned, we don't want to lose the notifications for these potential cases, but the exception should not be a LeaseLostException; it should instead be a CosmosException (our public type) with the related status code.
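For anyone following along, these errors can be observed through the builder's error notification delegate (a minimal sketch; `monitoredContainer`, `leaseContainer`, `MyItem`, and `HandleChangesAsync` are assumed from your existing setup, and the delegate body is only illustrative):

```csharp
ChangeFeedProcessor processor = monitoredContainer
    .GetChangeFeedProcessorBuilder<MyItem>("myProcessor", HandleChangesAsync)
    .WithInstanceName(Environment.MachineName)
    .WithLeaseContainer(leaseContainer)
    // Receives failures per lease. After the linked PR, lease-related
    // failures should surface here as CosmosException with the related
    // status code (e.g. 404 NotFound or 412 PreconditionFailed) rather
    // than the internal LeaseLostException.
    .WithErrorNotification((string leaseToken, Exception exception) =>
    {
        if (exception is CosmosException cosmosException)
        {
            Console.WriteLine(
                $"Lease {leaseToken} failed with {cosmosException.StatusCode}");
        }

        return Task.CompletedTask;
    })
    .Build();
```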

@levimatheri
Contributor Author

Awesome, thanks @ealsur
