Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Potential deadlock related to CosmosDB write conflict retires #176

Open
chrnola opened this issue Aug 25, 2020 · 2 comments
Open

Potential deadlock related to CosmosDB write conflict retires #176

chrnola opened this issue Aug 25, 2020 · 2 comments
Labels
bug Something isn't working M-azure This issue is related to the Azure module M-core This issue is related to the core module P3 Priority 3

Comments

@chrnola
Copy link
Contributor

chrnola commented Aug 25, 2020

I have two recent examples of an application that was running @walmartlabs/cookie-cutter-core@1.3.0-beta.6 (in other words, inclusive of the patch introduced by #150) which seemingly stopped reading from its input sources and processing messages. This application runs in ParallelismMode.Concurrent and is configured as such:

import { EventSourced } from "@walmartlabs/cookie-cutter-azure";
import { cached } from "@walmartlabs/cookie-cutter-core";
import { RedisStreamSource } from "@walmartlabs/cookie-cutter-redis/dist/RedisStreamSource";

...

Application...
  .input()
    .add(new RedisStreamSource(redisStreamSourceConfig))
  .done()
    .state(cached(
      IdempotentState,
      EventSourced.cosmosState(
        cosmosConfig,
        IdempotentState,
        new IdempotentStateAggregator())))
  .dispatch(messageHandler)
  .output()
    .stored(EventSourced.cosmosSink(cosmosConfig))
  .done()
  .run(runtime);

In both cases, the application logs show sequence number conflict, retrying, and then the service processes a few more messages, but then stops consuming all new messages roughly 30-120 seconds later.

From the application logs:

{"time":"2020-08-21T20:50:14.424Z","level":"INFO","ctx":"my-service.1_0_557","fields":{"msg":["Buffering DomainEventX event to Cosmos"],"streamId":["1598042988127-0"],"streamName":["my-input-stream"],"consumerId":["my-consumer"],"id":["02bbd437-88c0-4e64-900a-354ddb1582d8"],"domainId":["2b8794839f01cb1e1bbebc73a0e62c42"],"domainSubId":["37"],"domainOutcomeFlag":[false]}}
{"time":"2020-08-21T20:50:14.426Z","level":"INFO","ctx":"my-service.1_0_557","fields":{"msg":["Buffering DomainEventX event to Cosmos"],"streamId":["1598042988364-0"],"streamName":["my-input-stream"],"consumerId":["my-consumer"],"id":["bed0a32e-600d-45cf-a3f5-b1870091152c"],"domainId":["f5c2602fb47d37cc6d8daccb1de09cd8"],"domainSubId":["51"],"domainOutcomeFlag":[false]}}
{"time":"2020-08-21T20:50:14.427Z","level":"INFO","ctx":"my-service.1_0_557","fields":{"msg":["Buffering DomainEventX event to Cosmos"],"streamId":["1598042988534-0"],"streamName":["my-input-stream"],"consumerId":["my-consumer"],"id":["8ae11440-1feb-4ab0-a475-05775435eff9"],"domainId":["59ce1fd5f5c5d22a25c08bba5cfd5ef7"],"domainSubId":["29"],"domainOutcomeFlag":[false]}}
{"time":"2020-08-21T20:50:14.427Z","level":"INFO","ctx":"my-service.1_0_557","fields":{"msg":["Buffering DomainEventX event to Cosmos"],"streamId":["1598042988776-0"],"streamName":["my-input-stream"],"consumerId":["my-consumer"],"id":["12be727f-b501-4e09-9039-30b1aefcebb0"],"domainId":["37c802c39c78853d065720b74ad4b22e"],"domainSubId":["50"],"domainOutcomeFlag":[false]}}
{"time":"2020-08-21T20:50:14.428Z","level":"INFO","ctx":"my-service.1_0_557","fields":{"msg":["Buffering DomainEventX event to Cosmos"],"streamId":["1598042989216-0"],"streamName":["my-input-stream"],"consumerId":["my-consumer"],"id":["23833bbc-a0c1-47fc-aa1e-5d48d4332df4"],"domainId":["596cb9e631d155e735b9a532dde615fa"],"domainSubId":["49"],"domainOutcomeFlag":[false]}}
{"time":"2020-08-21T20:50:15.227Z","level":"INFO","ctx":"my-service.1_0_557","fields":{"msg":["Buffering DomainEventX event to Cosmos"],"streamId":["1598042990686-0"],"streamName":["my-input-stream"],"consumerId":["my-consumer"],"id":["7b19e38e-eb33-4d92-9f5a-6fd3aadbbc24"],"domainId":["117d6fc11884fe25eeebfa86401913f1"],"domainSubId":["35"],"domainOutcomeFlag":[false]}}
{"time":"2020-08-21T20:50:55.627Z","level":"WARN","ctx":"my-service.1_0_557","fields":{"msg":["sequence number conflict, retrying"],"key":["4f93787c2b12edbe4ab05f90c50a8b0e"],"newSn":[52],"expectedSn":[51],"actualSn":[52]}}
{"time":"2020-08-21T20:51:03.015Z","level":"INFO","ctx":"my-service.1_0_557","fields":{"msg":["Buffering DomainEventY event to Cosmos"],"streamId":["1598042938430-0"],"streamName":["my-input-stream"],"consumerId":["my-consumer"],"domainId":["9c6208d8033d093bf1263b24a7417d7a"],"domainSubId":["46"],"domainOutcomeFlag":[false],"domainScore":[0]}}
{"time":"2020-08-21T20:51:03.017Z","level":"INFO","ctx":"my-service.1_0_557","fields":{"msg":["Buffering DomainEventX event to Cosmos"],"streamId":["1598042938649-0"],"streamName":["my-input-stream"],"consumerId":["my-consumer"],"id":["d225de3f-e6c9-41da-8b15-091d4d14e51d"],"domainId":["4f93787c2b12edbe4ab05f90c50a8b0e"],"domainSubId":["31"],"domainOutcomeFlag":[false]}}
{"time":"2020-08-21T20:51:03.018Z","level":"INFO","ctx":"my-service.1_0_557","fields":{"msg":["Buffering DomainEventX event to Cosmos"],"streamId":["1598042938651-0"],"streamName":["my-input-stream"],"consumerId":["my-consumer"],"id":["32de314a-e9b2-4096-a248-e45251199acf"],"domainId":["4f93787c2b12edbe4ab05f90c50a8b0e"],"domainSubId":["31"],"domainOutcomeFlag":[false]}}
{"time":"2020-08-21T20:51:13.474Z","level":"INFO","ctx":"my-service.1_0_557","fields":{"msg":["Buffering DomainEventX event to Cosmos"],"streamId":["1598042939184-0"],"streamName":["my-input-stream"],"consumerId":["my-consumer"],"id":["2dca514d-a83f-4e7d-8d82-13a19ef4b28a"],"domainId":["f5c2602fb47d37cc6d8daccb1de09cd8"],"domainSubId":["51"],"domainOutcomeFlag":[false]}}
{"time":"2020-08-21T20:51:13.994Z","level":"INFO","ctx":"my-service.1_0_557","fields":{"msg":["Buffering DomainEventX event to Cosmos"],"streamId":["1598042939369-0"],"streamName":["my-input-stream"],"consumerId":["my-consumer"],"id":["be982b1b-6461-4ac2-8e38-92213cd829ad"],"domainId":["0d5b57c7f890fd995a74a12a7b7bcf32"],"domainSubId":["32"],"domainOutcomeFlag":[false]}}
{"time":"2020-08-21T20:51:22.818Z","level":"INFO","ctx":"my-service.1_0_557","fields":{"msg":["Buffering DomainEventX event to Cosmos"],"streamId":["1598042939571-0"],"streamName":["my-input-stream"],"consumerId":["my-consumer"],"id":["131749a1-22ec-4c71-b020-e51c37b08c72"],"domainId":["9c6208d8033d093bf1263b24a7417d7a"],"domainSubId":["46"],"domainOutcomeFlag":[false]}}
{"time":"2020-08-21T20:51:22.819Z","level":"INFO","ctx":"my-service.1_0_557","fields":{"msg":["Buffering DomainEventY event to Cosmos"],"streamId":["1598042940484-0"],"streamName":["my-input-stream"],"consumerId":["my-consumer"],"domainId":["2b8794839f01cb1e1bbebc73a0e62c42"],"domainSubId":["37"],"domainOutcomeFlag":[false],"domainScore":[0]}}
{"time":"2020-08-21T20:51:22.821Z","level":"INFO","ctx":"my-service.1_0_557","fields":{"msg":["Buffering DomainEventX event to Cosmos"],"streamId":["1598042941209-0"],"streamName":["my-input-stream"],"consumerId":["my-consumer"],"id":["9662bc19-043b-4432-93ac-ff85fb7489c6"],"domainId":["2b8794839f01cb1e1bbebc73a0e62c42"],"domainSubId":["37"],"domainOutcomeFlag":[false]}}
{"time":"2020-08-21T20:51:27.216Z","level":"INFO","ctx":"my-service.1_0_557","fields":{"msg":["Buffering DomainEventX event to Cosmos"],"streamId":["1598042941784-0"],"streamName":["my-input-stream"],"consumerId":["my-consumer"],"id":["272e0a1d-76a3-4531-9192-8ed7456a34d3"],"domainId":["59ce1fd5f5c5d22a25c08bba5cfd5ef7"],"domainSubId":["29"],"domainOutcomeFlag":[false]}}
{"time":"2020-08-21T20:51:27.218Z","level":"INFO","ctx":"my-service.1_0_557","fields":{"msg":["Buffering DomainEventY event to Cosmos"],"streamId":["1598042942010-0"],"streamName":["my-input-stream"],"consumerId":["my-consumer"],"domainId":["2b8794839f01cb1e1bbebc73a0e62c42"],"domainSubId":["37"],"domainOutcomeFlag":[false],"domainScore":[0]}}
{"time":"2020-08-21T20:51:31.240Z","level":"INFO","ctx":"my-service.1_0_557","fields":{"msg":["Buffering DomainEventX event to Cosmos"],"streamId":["1598042942624-0"],"streamName":["my-input-stream"],"consumerId":["my-consumer"],"id":["2ccf8355-ee57-41cf-920e-a3f3fe2fb4ab"],"domainId":["2b8794839f01cb1e1bbebc73a0e62c42"],"domainSubId":["37"],"domainOutcomeFlag":[false]}}
{"time":"2020-08-21T20:51:31.920Z","level":"INFO","ctx":"my-service.1_0_557","fields":{"msg":["Buffering DomainEventX event to Cosmos"],"streamId":["1598042945230-0"],"streamName":["my-input-stream"],"consumerId":["my-consumer"],"id":["4bf31315-d149-429f-a2c6-c344939b9014"],"domainId":["f5c2602fb47d37cc6d8daccb1de09cd8"],"domainSubId":["51"],"domainOutcomeFlag":[false]}}
{"time":"2020-08-21T20:51:32.015Z","level":"INFO","ctx":"my-service.1_0_557","fields":{"msg":["Buffering DomainEventY event to Cosmos"],"streamId":["1598042946178-0"],"streamName":["my-input-stream"],"consumerId":["my-consumer"],"domainId":["0d5b57c7f890fd995a74a12a7b7bcf32"],"domainSubId":["32"],"domainOutcomeFlag":[false],"domainScore":[0]}}
{"time":"2020-08-21T20:51:32.018Z","level":"INFO","ctx":"my-service.1_0_557","fields":{"msg":["Buffering DomainEventX event to Cosmos"],"streamId":["1598042946838-0"],"streamName":["my-input-stream"],"consumerId":["my-consumer"],"id":["872e3b01-56c8-48a4-b953-7d91f54120a6"],"domainId":["0d5b57c7f890fd995a74a12a7b7bcf32"],"domainSubId":["32"],"domainOutcomeFlag":[false]}}
{"time":"2020-08-21T20:51:32.020Z","level":"INFO","ctx":"my-service.1_0_557","fields":{"msg":["Buffering DomainEventY event to Cosmos"],"streamId":["1598042947496-0"],"streamName":["my-input-stream"],"consumerId":["my-consumer"],"domainId":["2b8794839f01cb1e1bbebc73a0e62c42"],"domainSubId":["37"],"domainOutcomeFlag":[false],"domainScore":[0]}}
{"time":"2020-08-21T20:51:32.023Z","level":"INFO","ctx":"my-service.1_0_557","fields":{"msg":["Buffering DomainEventY event to Cosmos"],"streamId":["1598042948495-0"],"streamName":["my-input-stream"],"consumerId":["my-consumer"],"domainId":["354f91fb549568a9966d6da418e8a002"],"domainSubId":["36"],"domainOutcomeFlag":[false],"domainScore":[0]}}
{"time":"2020-08-21T20:51:32.024Z","level":"INFO","ctx":"my-service.1_0_557","fields":{"msg":["Buffering DomainEventY event to Cosmos"],"streamId":["1598042948695-0"],"streamName":["my-input-stream"],"consumerId":["my-consumer"],"domainId":["59ce1fd5f5c5d22a25c08bba5cfd5ef7"],"domainSubId":["29"],"domainOutcomeFlag":[false],"domainScore":[0]}}
{"time":"2020-08-21T20:51:32.027Z","level":"INFO","ctx":"my-service.1_0_557","fields":{"msg":["Buffering DomainEventX event to Cosmos"],"streamId":["1598042950215-0"],"streamName":["my-input-stream"],"consumerId":["my-consumer"],"id":["0b7c9638-9654-461d-90cc-133c42d79515"],"domainId":["117d6fc11884fe25eeebfa86401913f1"],"domainSubId":["35"],"domainOutcomeFlag":[false]}}

After that final log line there is no activity from the app for 3+ days despite plenty of messages being available for the input source to consume. Upon manually restarting the application it continues to consume from the input source as expected.

@cross311 also reported similar behavior on #150.

@cross311
Copy link
Collaborator

@chrnola The issue I ran into where I thought it was this cropping back up again after updating was actually my source stopped sending things. I put additional logging in my source and determined it went to sleep like it was supposed to but was hard to know. Since upgrading to the latest patch our service has not hit this issue again.

We were running a custom source that gathers a bunch of input messages every ten minutes and runs them through.

I don't know how many people are using the Redis source, maybe some additional logging in the source might be useful?

@chrnola
Copy link
Contributor Author

chrnola commented Aug 25, 2020

It's very likely that I'm attributing causality where there is none between the occurrence of the CosmosDB write conflict and the application halting. That was merely the only noticeable pattern across my two examples.

@plameniv plameniv added bug Something isn't working M-azure This issue is related to the Azure module M-core This issue is related to the core module labels Jun 10, 2021
@plameniv plameniv added the P3 Priority 3 label Jul 27, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working M-azure This issue is related to the Azure module M-core This issue is related to the core module P3 Priority 3
Development

No branches or pull requests

3 participants