Enable batch delivery over WebSockets #1447

Merged
merged 9 commits on Jan 19, 2024

Conversation

peterbroadhurst
Contributor

@peterbroadhurst peterbroadhurst commented Jan 15, 2024

New support of batch for WebSockets

In #1359 @EnriqueL8 added support for Webhooks (outbound HTTPS requests) to configure batch: true on the subscription, such that multiple events are delivered at a time.

This was deferred at the time for WebSockets (long-lived inbound connections) because they already had a read-ahead performance optimization, so it was lower priority.

However, readahead in its current form for WebSockets does not allow the consumer of events to optimize, for example by performing a single DB transaction in the application after processing a batch of events. You could implement a complex batch aggregation routine on the app side, but that seems much more effort than FireFly just doing the batching for you.

So this PR closes the gap and supports batch: true for WebSockets as well.

You can test it simply with a dev env:
ws://127.0.0.1:5000/ws?ephemeral=true&namespace=default&autoack&batch=true&batchtimeout=1s&readahead=50
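
For illustration, a minimal Go client sketch for trying this out (using github.com/gorilla/websocket; the assumption that a batch arrives as a single JSON array message, and the field handling, are illustrative only - with autoack there is no explicit ack to send):

package main

import (
    "encoding/json"
    "log"

    "github.com/gorilla/websocket"
)

func main() {
    // Connect with batch delivery enabled; autoack means no explicit acks are required.
    url := "ws://127.0.0.1:5000/ws?ephemeral=true&namespace=default&autoack&batch=true&batchtimeout=1s&readahead=50"
    conn, _, err := websocket.DefaultDialer.Dial(url, nil)
    if err != nil {
        log.Fatalf("dial failed: %v", err)
    }
    defer conn.Close()

    for {
        _, payload, err := conn.ReadMessage()
        if err != nil {
            log.Fatalf("read failed: %v", err)
        }
        // Assumption: with batch=true each message is a JSON array of event deliveries.
        var batch []json.RawMessage
        if err := json.Unmarshal(payload, &batch); err != nil {
            log.Printf("not a batch payload: %v", err)
            continue
        }
        // Process the whole batch together - e.g. inside a single DB transaction.
        log.Printf("received batch of %d events", len(batch))
    }
}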

Fix to batch-based delivery for WebHooks and WebSockets

While adding the WebSocket-specific code, I found the original code in #1359 was not actually working as designed. It added a layer of extra batch collation logic on top of the event poller in eventDelivery.deliverBatchedEvents, but because of the way the dispatching code in bufferedDelivery worked, the behavior was broken.

If 1 event arrived, then after a very short delay 2 more events arrived (simple to recreate with a /broadcast in a dev env), you ended up getting a batch of 1, then a wait of the whole batch delay, then a batch of 2 after a second wait, instead of a single batch of 3 after the batch delay.

... so the batch delay was doubled to deliver two incomplete batches 😬

This was quite a big problem, and after some wrestling I ended up:

  1. Addressing some debt in the batch logic in eventPoller, which had ended up disabled a very long time ago
  2. Setting the default batching in eventPoller to off (batchTimeout=0 by default in config), as this is actually the behavior you get today
  3. Removing the eventDelivery.deliverBatchedEvents function added in feat: batching events and webhook plugin support #1359
  4. Changing eventDispatcher to configure the eventPoller batching settings
  5. Updating eventDelivery.bufferedDelivery and eventDelivery.deliverEvents to handle batching (see the sketch after this list)
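
To illustrate points 4 and 5, a minimal sketch of the dispatcher-side shape (hypothetical types and function, not the actual eventDispatcher code): with batch enabled, the buffered events go out as one payload and are acknowledged together, rather than one at a time.

package dispatch

// Sketch only: "Event" and the send callback are hypothetical stand-ins,
// not the real FireFly eventDispatcher types.
type Event struct {
    ID   string
    Data []byte
}

// deliverEvents hands the buffered events to the transport either as a single
// batch payload (one ack for the whole set), or one at a time.
func deliverEvents(events []*Event, batch bool, send func(payload interface{}) error) error {
    if batch {
        return send(events) // one payload, one round-trip
    }
    for _, ev := range events {
        if err := send(ev); err != nil {
            return err
        }
    }
    return nil
}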

Signed-off-by: Peter Broadhurst <peter.broadhurst@kaleido.io>
@peterbroadhurst peterbroadhurst requested a review from a team as a code owner January 15, 2024 19:21
@peterbroadhurst peterbroadhurst marked this pull request as draft January 15, 2024 19:39
@peterbroadhurst peterbroadhurst self-assigned this Jan 15, 2024
…ve readahead set

Signed-off-by: Peter Broadhurst <peter.broadhurst@kaleido.io>
@peterbroadhurst
Contributor Author

Popped this back into draft, as I've found a pretty significant issue in how the batching assumptions work between eventDispatcher.bufferedDelivery() and eventDispatcher.deliverBatchedEvents().

eventDispatcher.bufferedDelivery() waits for acks for the whole array of dispatchable events after sending them down the ed.eventDelivery channel.

But the eventDispatcher.deliverBatchedEvents() processor, enabled when ed.batch is true, has a timer that waits for the batch to fill before it dispatches.

So in reality you get really bad delays.
Simple to recreate with ws://127.0.0.1:5000/ws?ephemeral=true&batch=true&namespace=default&autoack&batchtimeout=15s in a dev env

If you send a broadcast there are three events: one generated quickly (transaction submit), then two more after 1s (tx confirmed and then message confirmed). But what you get is:

  • 15 second delay
  • Batch of 1 event
  • 15 second delay
  • Batch of 2 events

What you should get is:

  • 15 second delay
  • Batch of 3 events

Signed-off-by: Peter Broadhurst <peter.broadhurst@kaleido.io>
@codecov-commenter

codecov-commenter commented Jan 16, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Comparison is base (2d92165) 99.99% compared to head (fee90bd) 99.99%.
Report is 2 commits behind head on main.

Additional details and impacted files
@@           Coverage Diff           @@
##             main    #1447   +/-   ##
=======================================
  Coverage   99.99%   99.99%           
=======================================
  Files         321      321           
  Lines       23108    23175   +67     
=======================================
+ Hits        23106    23173   +67     
  Misses          1        1           
  Partials        1        1           

☔ View full report in Codecov by Sentry.

…as previously inert

Signed-off-by: Peter Broadhurst <peter.broadhurst@kaleido.io>
Signed-off-by: Peter Broadhurst <peter.broadhurst@kaleido.io>
@peterbroadhurst peterbroadhurst marked this pull request as ready for review January 16, 2024 18:07
Signed-off-by: Peter Broadhurst <peter.broadhurst@kaleido.io>
@@ -224,7 +224,7 @@ nav_order: 2
|Key|Description|Type|Default Value|
|---|-----------|----|-------------|
|batchSize|The maximum number of records to read from the DB before performing an aggregation run|[`BytesSize`](https://pkg.go.dev/github.com/docker/go-units#BytesSize)|`200`
|batchTimeout|How long to wait for new events to arrive before performing aggregation on a page of events|[`time.Duration`](https://pkg.go.dev/time#Duration)|`250ms`
|batchTimeout|How long to wait for new events to arrive before performing aggregation on a page of events|[`time.Duration`](https://pkg.go.dev/time#Duration)|`0ms`
Contributor Author

This is actually the behavior you got before, because the code that (a very, very long time ago) had started to implement the batch timeout had been disabled. I'm making a separate comment to highlight where.

if ep.conf.eventBatchTimeout > 0 && lastEventCount > 0 {
    shortTimeout := time.NewTimer(ep.conf.eventBatchTimeout)
    select {
    case <-shortTimeout.C:
Contributor Author

See how in this branch we just fall through to the longTimeoutDuration wait; it basically means the eventBatchTimeout was never used for its stated purpose.

An incomplete implementation had existed previously, and was partially removed, leaving this code cruft.

So in this PR I've come back round and implemented it with a short re-poll cycle (just once) after the batch timeout, when the first poll returns an incomplete batch.

It is disabled by default, but you could turn it on along with readahead on subscriptions (as an optimization of the event loop), without needing to use batch-based delivery at all.
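
As a rough sketch of that re-poll cycle (illustrative only; the function, the readPage callback, and the event type are hypothetical stand-ins, not the eventPoller implementation):

package poller

import "time"

// Stand-in for the events read from the DB.
type event struct{ ID string }

// pollWithBatchTimeout reads up to batchSize events; if the first read returns an
// incomplete (but non-empty) batch and a batch timeout is configured, it waits once
// for the timeout and re-polls before returning - the "short re-poll cycle (just once)"
// described above.
func pollWithBatchTimeout(readPage func(max int) []*event, batchSize int, batchTimeout time.Duration) []*event {
    events := readPage(batchSize)
    if batchTimeout > 0 && len(events) > 0 && len(events) < batchSize {
        time.Sleep(batchTimeout)
        events = append(events, readPage(batchSize-len(events))...)
    }
    return events
}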

Signed-off-by: Peter Broadhurst <peter.broadhurst@kaleido.io>
Signed-off-by: Peter Broadhurst <peter.broadhurst@kaleido.io>
Comment on lines 427 to 429
if err != nil {
    ed.deliveryResponse(&core.EventDeliveryResponse{ID: e.Event.ID, Rejected: true})
}
Contributor

Maybe a nit, but it feels like this could be moved outside of the loop (i.e. down to line 431). It took me a few reads to convince myself that all errors were handled across all branches.

Contributor

But I'll leave it with you to decide what you think is ultimately clearer.

Contributor Author

Ah, you actually point out a problem here: the reason it's in the loop is that we still need to nack the other events. So I need to fix the logic.
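
To make the intent concrete, a minimal sketch (hypothetical types, not the actual code): every event in the batch needs its own rejection response, which is why the response has to be issued per event inside the loop.

package dispatch

// Sketch only: hypothetical stand-ins for the delivery and response types.
type delivery struct{ ID string }
type response struct {
    ID       string
    Rejected bool
}

// rejectBatch sends an individual rejection (nack) for every event in the batch,
// rather than a single response outside the loop, so no event is left unanswered.
func rejectBatch(events []*delivery, respond func(*response)) {
    for _, e := range events {
        respond(&response{ID: e.ID, Rejected: true})
    }
}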

Contributor Author

Done - I'd appreciate you taking a second look if you don't mind.

Signed-off-by: Peter Broadhurst <peter.broadhurst@kaleido.io>
Contributor

@awrichar awrichar left a comment

New change looks good. Thanks for the comments on the flow.
