
[Bug]: SolaceIO seems to lose messages during Dataflow scaling/rebalancing events #36991

@ppawel

Description


What happened?

For a few months now we've been dealing with an issue where messages are "disappearing" during scaling events. By that I mean they are simply not processed by the pipeline steps downstream of SolaceIO.

It always coincides with logs from the Dataflow service containing various warnings. These warnings claim this is generally "expected" during autoscaling events, but for us the outcome is data loss...

Example logs:

Work is no longer active on the backend, it already succeeded or will be retried. sharding_key=1251872cc9e8c6a8 status: INTERNAL: Windmill failed to commit the work item. CommitStatus: NOT_FOUND
=== Source Location Trace: === 
dist_proc/dax/workflow/worker/streaming/streaming_worker_client.cc:888
Error while processing a work item: INTERNAL: Windmill failed to commit the work item. CommitStatus: NOT_FOUND
=== Source Location Trace: === 
dist_proc/dax/workflow/worker/streaming/streaming_worker_client.cc:888
Finalize id rejected rather than committed
Not able to request finalization at SDK because the work item 52522 was not committed successfully. This may occur due to autoscaling or backend rebalancing.

I can provide more details on the logs if needed. Note that this is not necessarily related to #35304, because we don't see those "closed consumer" errors around the times when messages are lost. The only correlation we've found so far is with these Windmill/worker logs.

What we have tried so far:

  1. Locked max_workers to 1 so Dataflow cannot do horizontal autoscaling. With autoscaling enabled the problem is much worse: it is pretty much guaranteed that some messages are lost whenever the pipeline scales up or down. Thankfully this pipeline doesn't need many resources, but this alone doesn't solve the problem (see below).
  2. Disabled vertical autoscaling (memory adjustment), which seems to affect the pipeline in much the same way as horizontal scaling with respect to this issue.

Those two changes (roughly the configuration sketched below) reduced the issue a lot, but it still occurs, I guess because Dataflow does some kind of "rebalancing" internally even with a fixed worker count. We also tried reducing the number of parallel keys, but that had no effect.
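For reference, a minimal sketch of how we pin the worker count and switch off throughput-based autoscaling (this uses the standard DataflowPipelineOptions / DataflowPipelineWorkerPoolOptions API; the class name is made up, and the vertical-autoscaling change is done through Dataflow Prime settings, not shown here):

```java
import org.apache.beam.runners.dataflow.options.DataflowPipelineOptions;
import org.apache.beam.runners.dataflow.options.DataflowPipelineWorkerPoolOptions.AutoscalingAlgorithmType;
import org.apache.beam.sdk.options.PipelineOptionsFactory;

public class PinnedWorkerOptionsFactory {
  public static DataflowPipelineOptions create(String[] args) {
    DataflowPipelineOptions options =
        PipelineOptionsFactory.fromArgs(args).withValidation().as(DataflowPipelineOptions.class);
    // Keep the worker pool at exactly one worker so horizontal autoscaling never kicks in.
    options.setNumWorkers(1);
    options.setMaxNumWorkers(1);
    // Turn off throughput-based autoscaling explicitly.
    options.setAutoscalingAlgorithm(AutoscalingAlgorithmType.NONE);
    return options;
  }
}
```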

Could there be something in SolaceIO's handling of the general Beam I/O contract, such as bundle finalization or restarting/splitting the readers, that could have such an impact? @bzablocki @stankiewicz
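To make that question concrete, here is my (possibly wrong) reading of the checkpoint-finalization part of the contract, as a minimal sketch. This is not SolaceIO's actual code; apart from the Beam UnboundedSource.CheckpointMark interface, every name below is made up for illustration:

```java
import java.io.Serializable;
import java.util.ArrayList;
import java.util.List;
import org.apache.beam.sdk.io.UnboundedSource;

// Illustrative checkpoint mark: broker acks should only be sent from
// finalizeCheckpoint(), which the runner calls after the bundle's results
// have been durably committed.
class IllustrativeSolaceCheckpointMark implements UnboundedSource.CheckpointMark, Serializable {

  private final List<Long> ackIdsForThisBundle = new ArrayList<>();

  void track(long ackId) {
    ackIdsForThisBundle.add(ackId);
  }

  @Override
  public void finalizeCheckpoint() {
    // Only here is it safe to acknowledge messages back to the Solace broker.
    // If this is never called (e.g. the work item was not committed), the broker
    // should redeliver the messages instead of dropping them.
    for (long ackId : ackIdsForThisBundle) {
      ackOnBroker(ackId); // placeholder for the real Solace acknowledgement call
    }
  }

  private void ackOnBroker(long ackId) {
    // Illustration only.
  }
}
```

If acks were sent to the broker earlier than finalizeCheckpoint(), or if finalization could happen for a work item whose commit was later rejected by Windmill, that would match the behaviour we see.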

(Note: I picked P1 priority for this bug as it is indeed data loss, but feel free to adjust it in case you have a different understanding of P1.)

Issue Priority

Priority: 1 (data loss / total loss of function)

Issue Components

  • Component: Python SDK
  • Component: Java SDK
  • Component: Go SDK
  • Component: Typescript SDK
  • Component: IO connector
  • Component: Beam YAML
  • Component: Beam examples
  • Component: Beam playground
  • Component: Beam katas
  • Component: Website
  • Component: Infrastructure
  • Component: Spark Runner
  • Component: Flink Runner
  • Component: Samza Runner
  • Component: Twister2 Runner
  • Component: Hazelcast Jet Runner
  • Component: Google Cloud Dataflow Runner
