Allow cancelling a batch that is stuck in dispatch #1487
Conversation
Codecov Report: All modified and coverable lines are covered by tests ✅

@@ Coverage Diff @@
##              main    #1487    +/-  ##
=========================================
  Coverage    99.99%  100.00%
=========================================
  Files          322      323       +1
  Lines        23406    23482      +76
=========================================
+ Hits         23404    23482      +78
+ Misses           1        0       -1
+ Partials         1        0       -1
@@ -638,6 +647,9 @@ func (ag *aggregator) readyForDispatch(ctx context.Context, msg *core.Message, d
 		} else if valid {
 			action = core.ActionConfirm
 		}
+	default:
+		action = core.ActionConfirm
Unrelated find - we were rejecting messages with 0 data elements simply because the default action
was 0, which means "reject". That didn't seem correct - if there were no special conditions to parse, I think the default should be "confirm" (which includes messages that don't carry data).
If we specifically want to reject messages that contain no data items, it should be a separate branch with an explicit reject reason.
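A minimal, self-contained sketch of the zero-value trap described above (the constant names here are illustrative, not the actual FireFly aggregator code):

```go
package main

import "fmt"

type msgAction int

const (
	actionReject  msgAction = iota // the zero value - an implicit "reject"
	actionConfirm
	actionWait
)

func resolveAction(hasSpecialConditions, valid bool) msgAction {
	var action msgAction // zero value == actionReject
	switch {
	case hasSpecialConditions:
		if valid {
			action = actionConfirm
		}
		// invalid special-condition messages keep the zero value (reject)
	default:
		// Without this explicit branch, messages with no data to parse were
		// rejected purely because the enum's zero value happened to mean "reject".
		action = actionConfirm
	}
	return action
}

func main() {
	fmt.Println(resolveAction(false, false) == actionConfirm) // true: confirmed now, previously rejected
}
```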
I think we might have a bug reference for this - would be good to mark that as resolved... will see if I can find it.
Found this instead, which might be worth a quick look to see if now is the time to address it (in 1.3, but not this PR): #1270
... but I couldn't find the send-empty-message bug report. So maybe I imagined it.
You're likely thinking of #1127, which was resolved - but I think this crept back in as a different flavor of the same issue during one of the refactors in this file.
Batch manager status can already be queried with "/status/batchmanager". The response includes a flag to say if a processor is blocked, along with the ID of the flushing batch, and the last error message.

The new API "/batches/{batchid}/cancel" can now be used to cancel a batch. This is currently only valid for batches with transaction type "contract_invoke_pin". It will mark all messages in the batch as "cancelled", and for private messages, will send a new batch of type "batch_pin" containing gap fill messages. These messages have no data, but will have a CID pointing to the original (failed) message, and will consume nonces to allow the topic to become unblocked.

Signed-off-by: Andrew Richardson <andrew.richardson@kaleido.io>
internal/batch/batch_processor.go
Outdated
 		})
 	})
 }

+func (bp *batchProcessor) writeGapFill(ctx context.Context, msg *core.Message) error {
+	// Gap fill is only needed for private messages
+	if msg.Header.Type != core.MessageTypePrivate {
I went back and forth on whether to send gap fill messages for broadcasts. Ultimately they are needed for private messages for two reasons:
- This sender has "spent" a nonce, and may have calculated and used nonces after it. All recipients need to know about the spent nonce so they can process later messages and get back in sync.
- All recipients will have received the message payload via data exchange, and will have created a message in "pending" state. They need to know to move those messages to "cancelled" so that they don't stay pending forever.
Neither of these is true for broadcasts. Therefore it felt like cancelling a broadcast can be a "local only" operation, and broadcasting a gap fill to everyone might be more confusing.
Signed-off-by: Andrew Richardson <andrew.richardson@kaleido.io>
-	return true, bp.conf.dispatch(ctx, payload)
+	err = bp.conf.dispatch(ctx, payload)
+	if err != nil {
+		if bp.isCancelled() {
I chose to only check the cancellation flag after an error to prevent having to constantly acquire a mutex on the happy path. But this means once you cancel, you have to wait for the next retry to trigger and fail again before the cancellation will actually happen.
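As a rough illustration of that trade-off (the types and method names below are stand-ins, not the real batch processor internals), the flag is only consulted once a dispatch attempt has already failed:

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"sync"
)

// Illustrative sketch only: the cancelled flag is mutex-guarded, but the
// happy path never takes the lock - only the error path checks it.
type processor struct {
	mux       sync.Mutex
	cancelled bool
}

func (p *processor) cancel() {
	p.mux.Lock()
	defer p.mux.Unlock()
	p.cancelled = true
}

func (p *processor) isCancelled() bool {
	p.mux.Lock()
	defer p.mux.Unlock()
	return p.cancelled
}

func (p *processor) dispatch(ctx context.Context, send func(context.Context) error) error {
	if err := send(ctx); err != nil {
		// Only pay for the mutex after a failure. A cancel requested while a
		// dispatch attempt is in flight therefore takes effect on the next
		// retry's failure, not immediately.
		if p.isCancelled() {
			return nil // stop retrying; the batch is being cancelled
		}
		return err
	}
	return nil
}

func main() {
	p := &processor{}
	p.cancel()
	err := p.dispatch(context.Background(), func(context.Context) error {
		return errors.New("stuck in dispatch")
	})
	fmt.Println(err) // <nil> - the failure is swallowed because the batch was cancelled
}
```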
 		return i18n.NewError(ctx, coremsgs.MsgErrorLoadingBatch)
 	}
 	msg := batch.Payload.Messages[0]
 	processor, err := bm.getProcessor(msg.Header.TxType, msg.Header.Type, msg.Header.Group, msg.Header.SignerRef.Author, false)
Just checking - what is it that means we're sure that a processor will be active in this case (so we don't need to create one)?
... ah ok, I can see that `cancelFlush` is only going to do anything meaningful if it's busy trying to flush. So the `processor == nil` case means there's nothing to do.
We can only cancel a batch that is stuck in a processor. So if the relevant processor is not active, we simply return an error.
internal/batch/batch_manager.go
Outdated
 		return err
 	}
 	if processor == nil {
 		return i18n.NewError(ctx, coremsgs.MsgCannotCancelBatchState)
The message text is very clear, but the variable name `MsgCannotCancelBatchState` isn't - maybe instead `MsgBatchProcessorNotActive`?
internal/batch/batch_processor.go
Outdated
 	fs := &bp.flushStatus

 	if bp.conf.txType != core.TransactionTypeContractInvokePin {
 		return i18n.NewError(ctx, coremsgs.MsgCannotCancelBatchType)
Maybe include the type as an insert?
internal/batch/batch_processor.go
Outdated
 		return i18n.NewError(ctx, coremsgs.MsgCannotCancelBatchType)
 	}
 	if !id.Equals(fs.Flushing) {
 		return i18n.NewError(ctx, coremsgs.MsgCannotCancelBatchState)
Maybe worth a separate error for this, including the ID of the one that is being flushed (so the user can try and cancel that one instead)?
internal/batch/batch_processor.go
Outdated
 	gapFill.Header.TxType = core.TransactionTypeBatchPin
 	err := gapFill.Seal(ctx)
 	if err == nil {
 		err = bp.data.WriteNewMessage(ctx, &data.NewMessage{Message: gapFill})
One question that comes to my mind, which I wonder if you tested - what if there are other messages queued with pins after the gap-fill already?
I think the batch processor will move onto those messages and batch+send them with next-pins after the gap fill, before the gap-fill arrives. So on the receiving side(s) they will receive the gap-fill out of order.
I think this just works itself out, and is all ok - but needs testing.
Yes, I assumed this will work itself out. But you're right I should test to be sure.
I went to test this, and it does not sort itself out automatically. You can manually sort it out with another "rewind" call to the first batch that was blocked on the gap fill.
I'll try to figure out if this rewind can happen automatically. The problem is that rewinds are only effective if you rewind to an undispatched pin (they are ignored if you request a rewind to a pin that has been dispatched). The gap filled pin gets marked dispatched as part of this new logic - so we can't rewind to that one. We need to know if there's another pin queued up after it that is now blocked and requires a rewind...
And you know... I think this is a problem even outside of the gap fill scenario. It's a problem whenever the ordering of the confirmed messages does not match the ordering of the nonces. If I receive nonces 1, 3, 4, and then receive nonce 2, there is no logic that causes us to rewind and confirm 3 & 4.
I guess we are relying on the fact that messages are confirmed on chain in the order nonces were assigned by the sender. I think this is always a shaky assumption, but it's particularly bad now that we have two private message dispatchers running on the same topic.
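A tiny, self-contained illustration of that gap (not FireFly code, just the shape of the problem): if nothing re-checks the topic when the late nonce finally arrives, the nonces that arrived early stay unconfirmed.

```go
package main

import "fmt"

// Messages can only be confirmed once every earlier nonce on the topic has
// been seen; without a rewind, nonces that arrived early stay stuck.
func main() {
	arrived := map[int64]bool{}
	nextToConfirm := int64(1)

	confirmReady := func() {
		for arrived[nextToConfirm] {
			fmt.Println("confirmed nonce", nextToConfirm)
			nextToConfirm++
		}
	}

	for _, nonce := range []int64{1, 3, 4} {
		arrived[nonce] = true
		confirmReady() // confirms 1, then stalls waiting for 2
	}

	arrived[2] = true
	// If nothing re-runs the check when 2 finally lands (a "rewind" to it),
	// nonces 3 and 4 never get confirmed even though they have arrived.
	confirmReady() // confirms 2, 3, 4 - this is the rewind that was missing
}
```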
> that messages are confirmed on chain in the order nonces were assigned by the sender.
Yes, this is both a design assumption, and a requirement currently for a "well behaved" actor in a privacy group.
I confirmed with @peterbroadhurst that the following is expected to be a fundamental obligation of the transaction manager: transactions will always be confirmed on the chain in the order they were submitted as long as they come from a single signing key. This is critical for the above "well behaved" privacy group behavior to be reliable.
Therefore, all messages from a single key to a single privacy group need to be on the same thread, and therefore we have a problem in the current design where we have 2 batch processors assigning nonces from the same group (one for regular private messages, another for custom contract private messages). I think I need to combine these processors into one - can't remember the original reason I split them... likely I thought it was actually solving a problem at the time. I don't see any reason they can't be a single processor though.
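Roughly, the idea is that the processor key drops the transaction type, so both pinned flavours from the same signer and group share one nonce-assigning thread. A self-contained sketch (the struct and field names are assumptions, not the actual batch manager code):

```go
package main

import (
	"fmt"
	"sync"
)

// Illustrative only - not the actual FireFly batch manager code.
// The point: the processor key excludes the transaction type, so pinned
// private messages of either type ("batch_pin" or "contract_invoke_pin")
// from the same signer and group land on the same nonce-assigning thread.
type processorKey struct {
	msgType string // e.g. "private"
	group   string // privacy group hash
	author  string // signing identity
}

type processor struct{ key processorKey }

type manager struct {
	mux        sync.Mutex
	processors map[processorKey]*processor
}

func (bm *manager) getProcessor(msgType, group, author string) *processor {
	key := processorKey{msgType: msgType, group: group, author: author}
	bm.mux.Lock()
	defer bm.mux.Unlock()
	if p, ok := bm.processors[key]; ok {
		return p
	}
	p := &processor{key: key}
	bm.processors[key] = p
	return p
}

func main() {
	bm := &manager{processors: map[processorKey]*processor{}}
	a := bm.getProcessor("private", "group1", "org1") // e.g. a batch_pin message
	b := bm.getProcessor("private", "group1", "org1") // e.g. a contract_invoke_pin message
	fmt.Println(a == b) // true - same processor, so nonce order matches dispatch order
}
```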
Updated so that "batch_pin" and "contract_invoke_pin" messages share a batch processor. Also altered so that the "gap fill" batch is sent out immediately during the dispatch phase, instead of trying to queue it for the next assembly loop (which had a lot of problems with reordering).
A few minor questions as I went through @awrichar - but this looks absolutely great
@@ -0,0 +1,47 @@
// Copyright © 2022 Kaleido, Inc. |
I guess the linter isn't linting tests...
Nope, it doesn't. Most of our test files have dates of 2021-2022.
Signed-off-by: Andrew Richardson <andrew.richardson@kaleido.io>
Signed-off-by: Andrew Richardson <andrew.richardson@kaleido.io>
Signed-off-by: Andrew Richardson <andrew.richardson@kaleido.io>
"batch_pin" and "contract_invoke_pin" pull from the same pool of nonces for private messages, and thus need to share a single dispatch thread (per message type). This ensures that the order in which nonces are assigned is the order in which nonces are actually used (which is critical for ordering). "unpinned" private messages can continue to be a separate dispatcher, as they don't require nonces. Signed-off-by: Andrew Richardson <andrew.richardson@kaleido.io>
To preserve the correct ordering of nonces, gap fill batches cannot be queued on the normal assembly loop (which might already have other messages queued). The special batch must be created, sealed, and dispatched immediately in place of the cancelled batch. Messages in both batches must be updated accordingly to move them to "cancelled" or "sent".

Signed-off-by: Andrew Richardson <andrew.richardson@kaleido.io>
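A sketch of that flow with purely illustrative types (not the actual FireFly messages or batch processor API):

```go
package main

import "fmt"

// Illustrative shape of the flow described in the commit message above.
// When a stuck "contract_invoke_pin" batch is cancelled, each private message
// is replaced by a gap fill that carries no data but points back at the
// original via CID, and the gap fill batch is dispatched immediately rather
// than queued on the assembly loop.
type message struct {
	ID    string
	CID   string // set on gap fills to reference the cancelled original
	State string // "pending" -> "cancelled" (original) or "sent" (gap fill)
}

func cancelBatch(stuck []*message) (gapFillBatch []*message) {
	for _, msg := range stuck {
		msg.State = "cancelled" // the original message will never be dispatched

		gapFillBatch = append(gapFillBatch, &message{
			ID:    msg.ID + "-gapfill",
			CID:   msg.ID, // points at the original (failed) message
			State: "sent",
		})
	}
	// In the real implementation the gap fill batch is sealed as "batch_pin"
	// and dispatched right here, so no other queued message can claim a
	// nonce in between.
	return gapFillBatch
}

func main() {
	gapFills := cancelBatch([]*message{{ID: "msg1", State: "pending"}})
	fmt.Printf("%+v\n", *gapFills[0])
}
```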
Signed-off-by: Andrew Richardson <andrew.richardson@kaleido.io>
Thanks @awrichar for the extra time on this one, and great to see the conclusion.
Batch manager status can already be queried with "/status/batchmanager". The response includes a flag to say if a processor is blocked, along with the ID of the flushing batch, and the last error message.
The new API "/batches/{batchid}/cancel" can now be used to cancel a batch. This is currently only valid for batches with transaction type "contract_invoke_pin". It will mark all messages in the batch as "cancelled", and for private messages, will send a new batch of type "batch_pin" containing gap fill messages. These messages have no data, but will have a CID pointing to the original (failed) message, and will consume nonces to allow the topic to become unblocked.
Fixes #1446
Note: test coverage not yet complete
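For reference, a minimal sketch of driving the two endpoints described above from Go. The base URL, any path prefix, and the use of POST for the cancel call are assumptions about the deployment, not confirmed by this PR:

```go
package main

import (
	"fmt"
	"io"
	"net/http"
)

func main() {
	// Hypothetical API base - adjust for your deployment.
	base := "http://localhost:5000/api/v1"

	// Check whether a batch processor is blocked, and which batch is flushing.
	resp, err := http.Get(base + "/status/batchmanager")
	if err != nil {
		panic(err)
	}
	body, _ := io.ReadAll(resp.Body)
	resp.Body.Close()
	fmt.Println(string(body))

	// Cancel the stuck batch (only valid for "contract_invoke_pin" batches).
	batchID := "<batch-id-from-status>" // placeholder
	req, _ := http.NewRequest(http.MethodPost, base+"/batches/"+batchID+"/cancel", nil)
	resp, err = http.DefaultClient.Do(req)
	if err != nil {
		panic(err)
	}
	fmt.Println("cancel status:", resp.Status)
	resp.Body.Close()
}
```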