Allow cancelling a batch that is stuck in dispatch #1487

Merged: 8 commits merged Apr 10, 2024

Conversation

awrichar (Contributor) commented Apr 1, 2024

Batch manager status can already be queried with "/status/batchmanager". The response includes a flag indicating whether a processor is blocked, along with the ID of the batch currently being flushed and the last error message.

The new API "/batches/{batchid}/cancel" can now be used to cancel a batch. This is currently valid only for batches with transaction type "contract_invoke_pin". It marks all messages in the batch as "cancelled" and, for private messages, sends a new batch of type "batch_pin" containing gap fill messages. These gap fill messages carry no data, but have a CID pointing to the original (failed) message, and consume nonces to allow the topic to become unblocked.

Fixes #1446

Note: test coverage not yet complete

@awrichar awrichar requested a review from a team as a code owner April 1, 2024 19:31

codecov-commenter commented Apr 1, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 100.00%. Comparing base (49410c5) to head (17a8380).
Report is 11 commits behind head on main.

❗ Current head 17a8380 differs from pull request most recent head 3cddcc9. Consider uploading reports for the commit 3cddcc9 to get more accurate results

Additional details and impacted files
@@           Coverage Diff            @@
##             main     #1487   +/-   ##
========================================
  Coverage   99.99%   100.00%           
========================================
  Files         322       323    +1     
  Lines       23406     23482   +76     
========================================
+ Hits        23404     23482   +78     
+ Misses          1         0    -1     
+ Partials        1         0    -1     


@@ -638,6 +647,9 @@ func (ag *aggregator) readyForDispatch(ctx context.Context, msg *core.Message, d
} else if valid {
action = core.ActionConfirm
}

default:
action = core.ActionConfirm

awrichar (author):

Unrelated find - we were rejecting messages with 0 data elements simply because the default action was 0, which means "reject". That didn't seem correct - if there were no special conditions to parse, I think the default should be "confirm" (which includes messages that don't carry data).

If we specifically want to reject messages that contain no data items, it should be a separate branch with an explicit reject reason.


Reviewer:


I think we might have a bug reference for this - would be good to mark that as resolved... will see if I can find it.


Reviewer:


Found this instead, which might bear you taking a quick look to see if now is the time to address it (in 1.3, but not this PR): #1270

... but I couldn't find the send-empty-message bug report. So maybe I imagined it.


awrichar (author):


You're likely thinking of #1127, which was resolved - but I think this crept back in as a different flavor of the same issue during one of the refactors in this file.

})
})
}

func (bp *batchProcessor) writeGapFill(ctx context.Context, msg *core.Message) error {
// Gap fill is only needed for private messages
if msg.Header.Type != core.MessageTypePrivate {

awrichar (author):


I went back and forth on whether to send gap fill messages for broadcasts. Ultimately they are needed for private messages for two reasons:

  1. This sender has "spent" a nonce, and may have calculated and used nonces after it. All recipients need to know about the spent nonce so they can process later messages and get back in sync.
  2. All recipients will have received the message payload via data exchange, and will have created a message in "pending" state. They need to know to move those messages to "cancelled" so that they don't stay pending forever.

Neither of these is true for broadcasts. Therefore it felt like cancelling a broadcast can be a "local only" operation, and broadcasting a gap fill to everyone might be more confusing.

return true, bp.conf.dispatch(ctx, payload)
err = bp.conf.dispatch(ctx, payload)
if err != nil {
if bp.isCancelled() {

awrichar (author):


I chose to check the cancellation flag only after an error, to avoid acquiring a mutex on every pass of the happy path. The tradeoff is that once you cancel, you have to wait for the next retry to trigger and fail again before the cancellation actually takes effect.

return i18n.NewError(ctx, coremsgs.MsgErrorLoadingBatch)
}
msg := batch.Payload.Messages[0]
processor, err := bm.getProcessor(msg.Header.TxType, msg.Header.Type, msg.Header.Group, msg.Header.SignerRef.Author, false)

Reviewer:


Just checking - what is it that means we're sure that a processor will be active in this case (so we don't need to create one)?


Reviewer:


... ah ok, I can see that cancelFlush is only going to do anything meaningful if it's busy trying to flush. So the processor == nil case means there's nothing to do.


awrichar (author):


We can only cancel a batch that is stuck in a processor. So if the relevant processor is not active, we simply return an error.

return err
}
if processor == nil {
return i18n.NewError(ctx, coremsgs.MsgCannotCancelBatchState)

Reviewer:


The message text is very clear, but the variable name MsgCannotCancelBatchState isn't - maybe instead:
MsgBatchProcessorNotActive ?

fs := &bp.flushStatus

if bp.conf.txType != core.TransactionTypeContractInvokePin {
return i18n.NewError(ctx, coremsgs.MsgCannotCancelBatchType)

Reviewer:


Maybe include the type as an insert?

return i18n.NewError(ctx, coremsgs.MsgCannotCancelBatchType)
}
if !id.Equals(fs.Flushing) {
return i18n.NewError(ctx, coremsgs.MsgCannotCancelBatchState)

Reviewer:


Maybe worth a separate error for this, including the ID of the one that is being flushed (so the user can try and cancel that one instead)?

gapFill.Header.TxType = core.TransactionTypeBatchPin
err := gapFill.Seal(ctx)
if err == nil {
err = bp.data.WriteNewMessage(ctx, &data.NewMessage{Message: gapFill})

Reviewer:


One question that comes to my mind, which I wonder if you tested - what if there are other messages queued with pins after the gap-fill already?

I think the batch processor will move onto those messages and batch+send them with next-pins after the gap fill, before the gap-fill arrives. So on the receiving side(s) they will receive the gap-fill out of order.

I think this just works itself out, and is all ok - but needs testing.


awrichar (author):


Yes, I assumed this will work itself out. But you're right I should test to be sure.


awrichar (author):


I went to test this, and it does not sort itself out automatically. You can manually sort it out with another "rewind" call to the first batch that was blocked on the gap fill.

I'll try to figure out if this rewind can happen automatically. The problem is that rewinds are only effective if you rewind to an undispatched pin (they are ignored if you request a rewind to a pin that has been dispatched). The gap filled pin gets marked dispatched as part of this new logic - so we can't rewind to that one. We need to know if there's another pin queued up after it that is now blocked and requires a rewind...


awrichar (author):


And you know... I think this is a problem even outside of the gap fill scenario. It's a problem whenever the ordering of the confirmed messages does not match the ordering of the nonces. If I receive nonces 1, 3, 4, and then receive nonce 2, there is no logic that causes us to rewind and confirm 3 & 4.

I guess we are relying on the fact that messages are confirmed on chain in the order nonces were assigned by the sender. I think this is always a shaky assumption, but it's particularly bad now that we have two private message dispatchers running on the same topic.


Reviewer:


that messages are confirmed on chain in the order nonces were assigned by the sender.

Yes, this is both a design assumption, and a requirement currently for a "well behaved" actor in a privacy group.


awrichar (author):


I confirmed with @peterbroadhurst that the following is expected to be a fundamental obligation of the transaction manager: transactions will always be confirmed on the chain in the order they were submitted as long as they come from a single signing key. This is critical for the above "well behaved" privacy group behavior to be reliable.

Therefore, all messages from a single key to a single privacy group need to be on the same thread. That means we have a problem in the current design, where two batch processors assign nonces from the same group (one for regular private messages, another for custom contract private messages). I think I need to combine these processors into one - I can't remember the original reason I split them; likely I thought it was solving a problem at the time. I don't see any reason they can't be a single processor.


awrichar (author):


Updated so that "batch_pin" and "contract_invoke_pin" messages share a batch processor. Also altered so that the "gap fill" batch is sent out immediately during the dispatch phase, instead of trying to queue it for the next assembly loop (which had a lot of problems with reordering).


@peterbroadhurst (Contributor) left a review:


A few minor questions as I went through @awrichar - but this looks absolutely great

@@ -0,0 +1,47 @@
// Copyright © 2022 Kaleido, Inc.

Reviewer:


I guess the linter isn't linting tests...


awrichar (author):


Nope, it doesn't. Most of our test files have dates of 2021-2022.

"batch_pin" and "contract_invoke_pin" pull from the same pool of nonces
for private messages, and thus need to share a single dispatch thread (per
message type). This ensures that the order in which nonces are assigned is
the order in which nonces are actually used (which is critical for ordering).

"unpinned" private messages can continue to be a separate dispatcher, as
they don't require nonces.

Signed-off-by: Andrew Richardson <andrew.richardson@kaleido.io>
To preserve the correct ordering of nonces, gap fill batches cannot be queued
on the normal assembly loop (which might already have other messages queued).
Instead, the special batch must be created, sealed, and dispatched immediately,
in place of the cancelled batch. Messages in both batches must be updated
accordingly to move them to "cancelled" or "sent".
Signed-off-by: Andrew Richardson <andrew.richardson@kaleido.io>

@peterbroadhurst (Contributor) left a review:


Thanks @awrichar for the extra time on this one, and great to see the conclusion.

@peterbroadhurst peterbroadhurst merged commit 64bcaa9 into hyperledger:main Apr 10, 2024
14 checks passed
@peterbroadhurst peterbroadhurst deleted the batchcancel branch April 10, 2024 20:27
Successfully merging this pull request may close these issues: Close FIR-17

4 participants