
[Bug]: SolaceIO seems to lose messages during Dataflow scaling/rebalancing events #36991

@ppawel

Description


What happened?

For a few months now we've been dealing with an issue where messages are "disappearing" during scaling events. By that I mean they are simply not processed by the pipeline steps downstream of SolaceIO.

It always coincides with logs from the Dataflow service containing various warnings. These warnings claim this is generally "expected" during autoscaling events, but for us the outcome is data loss...

Example logs:

Work is no longer active on the backend, it already succeeded or will be retried. sharding_key=1251872cc9e8c6a8 status: INTERNAL: Windmill failed to commit the work item. CommitStatus: NOT_FOUND
=== Source Location Trace: === 
dist_proc/dax/workflow/worker/streaming/streaming_worker_client.cc:888
Error while processing a work item: INTERNAL: Windmill failed to commit the work item. CommitStatus: NOT_FOUND
=== Source Location Trace: === 
dist_proc/dax/workflow/worker/streaming/streaming_worker_client.cc:888
Finalize id rejected rather than committed
Not able to request finalization at SDK because the work item 52522 was not committed successfully. This may occur due to autoscaling or backend rebalancing.

I can provide more details on the logs if needed. Note that this is not necessarily related to #35304, because we don't see those "closed consumer" errors around the times when messages are lost. The only correlation we've found so far is with these Windmill/worker logs.

What we have tried so far:

  1. Locked max_workers to 1 so Dataflow cannot do horizontal autoscaling. With autoscaling enabled the problem is much worse: it is pretty much guaranteed that some messages are lost whenever the pipeline scales up or down. Thankfully this pipeline doesn't need many resources, but this alone doesn't solve the problem (see below).
  2. Disabled vertical autoscaling (memory adjustment), which seems to affect the pipeline in much the same way as horizontal scaling with respect to this issue.

Those two changes (roughly the configuration sketched below) reduced the issue a lot, but it still occurs, I guess because Dataflow does some kind of "rebalancing" internally even with a fixed worker count. We also tried reducing the number of parallel keys, but that had no effect.
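For reference, a minimal sketch of how we pin the worker count and switch off throughput-based autoscaling (this uses the standard DataflowPipelineOptions / DataflowPipelineWorkerPoolOptions API; the class name is made up, and the vertical-autoscaling change is done through Dataflow Prime settings, not shown here):

```java
import org.apache.beam.runners.dataflow.options.DataflowPipelineOptions;
import org.apache.beam.runners.dataflow.options.DataflowPipelineWorkerPoolOptions.AutoscalingAlgorithmType;
import org.apache.beam.sdk.options.PipelineOptionsFactory;

public class PinnedWorkerOptionsFactory {
  public static DataflowPipelineOptions create(String[] args) {
    DataflowPipelineOptions options =
        PipelineOptionsFactory.fromArgs(args).withValidation().as(DataflowPipelineOptions.class);
    // Keep the worker pool at exactly one worker so horizontal autoscaling never kicks in.
    options.setNumWorkers(1);
    options.setMaxNumWorkers(1);
    // Turn off throughput-based autoscaling explicitly.
    options.setAutoscalingAlgorithm(AutoscalingAlgorithmType.NONE);
    return options;
  }
}
```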

Could there be something in SolaceIO's handling of the general Beam I/O contract, such as bundle finalization or restarting/splitting the readers, that could have such an impact? @bzablocki @stankiewicz
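To make that question concrete, here is my (possibly wrong) reading of the checkpoint-finalization part of the contract, as a minimal sketch. This is not SolaceIO's actual code; apart from the Beam UnboundedSource.CheckpointMark interface, every name below is made up for illustration:

```java
import java.io.Serializable;
import java.util.ArrayList;
import java.util.List;
import org.apache.beam.sdk.io.UnboundedSource;

// Illustrative checkpoint mark: broker acks should only be sent from
// finalizeCheckpoint(), which the runner calls after the bundle's results
// have been durably committed.
class IllustrativeSolaceCheckpointMark implements UnboundedSource.CheckpointMark, Serializable {

  private final List<Long> ackIdsForThisBundle = new ArrayList<>();

  void track(long ackId) {
    ackIdsForThisBundle.add(ackId);
  }

  @Override
  public void finalizeCheckpoint() {
    // Only here is it safe to acknowledge messages back to the Solace broker.
    // If this is never called (e.g. the work item was not committed), the broker
    // should redeliver the messages instead of dropping them.
    for (long ackId : ackIdsForThisBundle) {
      ackOnBroker(ackId); // placeholder for the real Solace acknowledgement call
    }
  }

  private void ackOnBroker(long ackId) {
    // Illustration only.
  }
}
```

If acks were sent to the broker earlier than finalizeCheckpoint(), or if finalization could happen for a work item whose commit was later rejected by Windmill, that would match the behaviour we see.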

(Note: I picked P1 priority for this bug as it is indeed data loss, but feel free to adjust it in case you have a different understanding of P1.)

Issue Priority

Priority: 1 (data loss / total loss of function)

Issue Components

  • Component: Python SDK
  • Component: Java SDK
  • Component: Go SDK
  • Component: Typescript SDK
  • Component: IO connector
  • Component: Beam YAML
  • Component: Beam examples
  • Component: Beam playground
  • Component: Beam katas
  • Component: Website
  • Component: Infrastructure
  • Component: Spark Runner
  • Component: Flink Runner
  • Component: Samza Runner
  • Component: Twister2 Runner
  • Component: Hazelcast Jet Runner
  • Component: Google Cloud Dataflow Runner
