Allow cancelling a batch that is stuck in dispatch #1487

Merged: 8 commits merged Apr 10, 2024

Conversation

awrichar (Contributor) commented Apr 1, 2024

Batch manager status can already be queried with "/status/batchmanager". The response includes a flag indicating whether a processor is blocked, along with the ID of the batch currently being flushed and the last error message.

The new API "/batches/{batchid}/cancel" can now be used to cancel a batch. This is currently valid only for batches with transaction type "contract_invoke_pin". It marks all messages in the batch as "cancelled" and, for private messages, sends a new batch of type "batch_pin" containing gap fill messages. These gap fill messages carry no data, but have a CID pointing to the original (failed) message, and consume nonces to allow the topic to become unblocked.

Fixes #1446

Note: test coverage not yet complete

@awrichar awrichar requested a review from a team as a code owner April 1, 2024 19:31

codecov-commenter commented Apr 1, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 100.00%. Comparing base (49410c5) to head (17a8380).
Report is 11 commits behind head on main.

❗ Current head 17a8380 differs from pull request most recent head 3cddcc9. Consider uploading reports for the commit 3cddcc9 to get more accurate results

Additional details and impacted files
@@           Coverage Diff            @@
##             main     #1487   +/-   ##
========================================
  Coverage   99.99%   100.00%           
========================================
  Files         322       323    +1     
  Lines       23406     23482   +76     
========================================
+ Hits        23404     23482   +78     
+ Misses          1         0    -1     
+ Partials        1         0    -1     


@@ -638,6 +647,9 @@ func (ag *aggregator) readyForDispatch(ctx context.Context, msg *core.Message, d
} else if valid {
action = core.ActionConfirm
}

default:
action = core.ActionConfirm

awrichar (author):

Unrelated find - we were rejecting messages with 0 data elements simply because the default action was 0, which means "reject". That didn't seem correct - if there were no special conditions to parse, I think the default should be "confirm" (which includes messages that don't carry data).

If we specifically want to reject messages that contain no data items, it should be a separate branch with an explicit reject reason.


Reviewer:


I think we might have a bug reference for this - would be good to mark that as resolved... will see if I can find it.


Reviewer:


Found this instead, which might bear you taking a quick look to see if now is the time to address it (in 1.3, but not this PR): #1270

... but I couldn't find the send-empty-message bug report. So maybe I imagined it.


awrichar (author):


You're likely thinking of #1127, which was resolved - but I think this crept back in as a different flavor of the same issue during one of the refactors in this file.

})
})
}

func (bp *batchProcessor) writeGapFill(ctx context.Context, msg *core.Message) error {
// Gap fill is only needed for private messages
if msg.Header.Type != core.MessageTypePrivate {

awrichar (author):


I went back and forth on whether to send gap fill messages for broadcasts. Ultimately they are needed for private messages for two reasons:

  1. This sender has "spent" a nonce, and may have calculated and used nonces after it. All recipients need to know about the spent nonce so they can process later messages and get back in sync.
  2. All recipients will have received the message payload via data exchange, and will have created a message in "pending" state. They need to know to move those messages to "cancelled" so that they don't stay pending forever.

Neither of these is true for broadcasts. Therefore it felt like cancelling a broadcast can be a "local only" operation, and broadcasting a gap fill to everyone might be more confusing.

return true, bp.conf.dispatch(ctx, payload)
err = bp.conf.dispatch(ctx, payload)
if err != nil {
if bp.isCancelled() {

awrichar (author):


I chose to check the cancellation flag only after an error, to avoid acquiring a mutex on every pass of the happy path. The tradeoff is that once you cancel, you have to wait for the next retry to trigger and fail again before the cancellation actually takes effect.

return i18n.NewError(ctx, coremsgs.MsgErrorLoadingBatch)
}
msg := batch.Payload.Messages[0]
processor, err := bm.getProcessor(msg.Header.TxType, msg.Header.Type, msg.Header.Group, msg.Header.SignerRef.Author, false)

Reviewer:


Just checking - what is it that means we're sure that a processor will be active in this case (so we don't need to create one)?


Reviewer:


... ah ok, I can see that cancelFlush is only going to do anything meaningful if it's busy trying to flush. So the processor == nil case means there's nothing to do.


awrichar (author):


We can only cancel a batch that is stuck in a processor. So if the relevant processor is not active, we simply return an error.

return err
}
if processor == nil {
return i18n.NewError(ctx, coremsgs.MsgCannotCancelBatchState)

Reviewer:


The message text is very clear, but the variable name MsgCannotCancelBatchState isn't - maybe instead:
MsgBatchProcessorNotActive ?

fs := &bp.flushStatus

if bp.conf.txType != core.TransactionTypeContractInvokePin {
return i18n.NewError(ctx, coremsgs.MsgCannotCancelBatchType)

Reviewer:


Maybe include the type as an insert?

return i18n.NewError(ctx, coremsgs.MsgCannotCancelBatchType)
}
if !id.Equals(fs.Flushing) {
return i18n.NewError(ctx, coremsgs.MsgCannotCancelBatchState)

Reviewer:


Maybe worth a separate error for this, including the ID of the one that is being flushed (so the user can try and cancel that one instead)?

gapFill.Header.TxType = core.TransactionTypeBatchPin
err := gapFill.Seal(ctx)
if err == nil {
err = bp.data.WriteNewMessage(ctx, &data.NewMessage{Message: gapFill})

Reviewer:


One question that comes to my mind, which I wonder if you tested - what if there are other messages queued with pins after the gap-fill already?

I think the batch processor will move onto those messages and batch+send them with next-pins after the gap fill, before the gap-fill arrives. So on the receiving side(s) they will receive the gap-fill out of order.

I think this just works itself out, and is all ok - but needs testing.


awrichar (author):


Yes, I assumed this will work itself out. But you're right I should test to be sure.


awrichar (author):


I went to test this, and it does not sort itself out automatically. You can manually sort it out with another "rewind" call to the first batch that was blocked on the gap fill.

I'll try to figure out if this rewind can happen automatically. The problem is that rewinds are only effective if you rewind to an undispatched pin (they are ignored if you request a rewind to a pin that has been dispatched). The gap filled pin gets marked dispatched as part of this new logic - so we can't rewind to that one. We need to know if there's another pin queued up after it that is now blocked and requires a rewind...


awrichar (author):


And you know... I think this is a problem even outside of the gap fill scenario. It's a problem whenever the ordering of the confirmed messages does not match the ordering of the nonces. If I receive nonces 1, 3, 4, and then receive nonce 2, there is no logic that causes us to rewind and confirm 3 & 4.

I guess we are relying on the fact that messages are confirmed on chain in the order nonces were assigned by the sender. I think this is always a shaky assumption, but it's particularly bad now that we have two private message dispatchers running on the same topic.


Reviewer:


that messages are confirmed on chain in the order nonces were assigned by the sender.

Yes, this is both a design assumption, and a requirement currently for a "well behaved" actor in a privacy group.


awrichar (author):


I confirmed with @peterbroadhurst that the following is expected to be a fundamental obligation of the transaction manager: transactions will always be confirmed on the chain in the order they were submitted as long as they come from a single signing key. This is critical for the above "well behaved" privacy group behavior to be reliable.

Therefore, all messages from a single key to a single privacy group need to be on the same thread. That means we have a problem in the current design, where two batch processors assign nonces from the same group (one for regular private messages, another for custom contract private messages). I think I need to combine these processors into one - I can't remember the original reason I split them; likely I thought it was solving a problem at the time. I don't see any reason they can't be a single processor.


awrichar (author):


Updated so that "batch_pin" and "contract_invoke_pin" messages share a batch processor. Also altered so that the "gap fill" batch is sent out immediately during the dispatch phase, instead of trying to queue it for the next assembly loop (which had a lot of problems with reordering).


@peterbroadhurst (Contributor) left a review:


A few minor questions as I went through @awrichar - but this looks absolutely great

@@ -0,0 +1,47 @@
// Copyright © 2022 Kaleido, Inc.

Reviewer:


I guess the linter isn't linting tests...


awrichar (author):


Nope, it doesn't. Most of our test files have dates of 2021-2022.

"batch_pin" and "contract_invoke_pin" pull from the same pool of nonces
for private messages, and thus need to share a single dispatch thread (per
message type). This ensures that the order in which nonces are assigned is
the order in which nonces are actually used (which is critical for ordering).

"unpinned" private messages can continue to be a separate dispatcher, as
they don't require nonces.

Signed-off-by: Andrew Richardson <andrew.richardson@kaleido.io>
To preserve the correct ordering of nonces, gap fill batches cannot be queued
on the normal assembly loop (which might already have other messages queued).
Instead, the special batch must be created, sealed, and dispatched immediately,
in place of the cancelled batch. Messages in both batches must be updated
accordingly to move them to "cancelled" or "sent".
Signed-off-by: Andrew Richardson <andrew.richardson@kaleido.io>

@peterbroadhurst (Contributor) left a review:


Thanks @awrichar for the extra time on this one, and great to see the conclusion.

@peterbroadhurst peterbroadhurst merged commit 64bcaa9 into hyperledger:main Apr 10, 2024
14 checks passed
@peterbroadhurst peterbroadhurst deleted the batchcancel branch April 10, 2024 20:27
Successfully merging this pull request may close these issues: Close FIR-17

4 participants