This document is a record of all the bugs or issues we uncovered while specifying & formally verifying the IBC protocols.
The algorithm for relaying connection handshake datagrams of type ConnOpenTry
does not handle the situation when both chains are in state INIT
.
The current relayer algorithm in ICS018 specifies that the ConnOpenTry
datagram should be relayed only if one of the chains is in state INIT
and the other chain is uninitialized (see the snippet below); this is not enough for guaranteeing liveness of the connection handshake protocol (ICS04).
if (localEnd.state === INIT && remoteEnd === null)
The correct code should include both the cases when a single chain is in state INIT
, as well as the case when both chains are in state INIT
, as specified here: Relayer.tla
This fix only concerns the relayer algorithm ICS018.
The same issue (and fix) seems to exist for the channel handshake datagrams.
This problem is not specific to the connection handshake protocol (ICS3) itself, but is a bug in the way the relayers use the UpdateClient
action.
We classify this under ICS3, however, since this is the context where we discovered the problem.
The TLA+ spec we have for depicting this liveness problem is for the ICS3 protocol.
Related issues: #71 and #61. The problem is more thoroughly described in #61, but for the sake of completeness we restated it here in a compact form.
The liveness property that a correct relayer should provide is eventual delivery.
Assuming some source chain A
, destination chain B
, and an IBC item X
(e.g., connection, channel, or packet) on chain A
, we can define this property as follows:
For any IBC item
X
on chainA
destined for chainB
, eventually, a correct relayer will submit itemX
to chainB
.
This is difficult to guarantee in practice, however. Intuitively, the difficulty arises because of a combination of two factors:
-
Proof requirement: For chain
B
to accept itemX
, it must verify the authenticity of this item; this is done via the light client that chainB
maintains. Given an itemX
and a commitment proof forX
constructed at heighth-1
, the light client requires the consensus state at heighth
that includes that commitment root required for verification. -
Concurrency: Different relayers may update the same light client. Suppose a relayer
r1
wants to submit a consensus state at heighth
. In the meantime, however, another relayerr2
may update this same light client to heighth'
. Assumeh'
is bigger thanh
. If the light client disallows updates with heights smaller than the current heighth'
thenr1
's update fails. Consequently, the relayer will be unable to submit consensus state at heighth
.
To ensure eventual delivery, relayer r1
would need to retry submitting item X
, that is: resubmit the consensus state at a larger height (e.g., at h'
) followed by the message that includes the proof for X
(e.g., at h'-1
).
This retry mechanism was adopted as a solution for the current relayer implementation.
Note that it is also possible for relayer r2
to have submitted the same item X
successfully; in this case, the liveness problem does not actually surface.
Note that the TLA+ spec below may change in time. Here we refer to the spec as existing at this commit (unchanged up to release 0.0.2).
To obtain an execution in TLA+ that depicts the above liveness problem, it is sufficient to enable the Concurrency
flag in the L2 default TLA+ spec for ICS3.
This spec is located in spec/connection-handshake/L2-tla/.
In this spec we make a few simplifications compared to the real system, most importantly: to verify an item at height h
, a light client can use the consensus state at the same height h
(no need for smaller height h-1
).
Below we summarize the parameters as well as the sequence of actions that lead to the liveness problem.
MaxBufLen <- 2
MaxHeight <- 8
Concurrency <- TRUE
- Behavior spec: Temporal formula
Spec
- Check for
Deadlock
, InvariantsTypeInvariant
andConsistencyProperty
, as well as PropertyTermination
Both chains A
and B
start at height 1
, and the light client on each chain has consensus state for height 1
.
-
The environment submits a
ICS3MsgInit
message to chainA
. -
Chain
A
processes theICS3MsgInit
, advances to height2
, and prepares aICS3MsgTry
message destined for chainB
. The proof in this message is for height2
. -
The environment triggers the
AdvanceChainHeight
action of chainB
, so this chain transitions from height1
to height2
. -
The environment triggers the
AdvanceChainHeight
action of chainA
, so this chain transitions from height2
to height3
. -
The environment triggers the
AdvanceChainHeight
action of chainA
, so this chain transitions from height3
to height4
. -
Concurrency: The environment triggers the
UpdateClient
action on chainB
: the light client on this chain is updated with height4
(that is, the latest height of chainA
), and chainB
also transitions from height2
to height3
. -
The environment passes (i.e., relays) the
ICS3MsgTry
message to chainB
. Recall that this message has proofs for height2
; consequently, the environment also attempts to triggerUpdateClient
action on chainB
for consensus state at height2
. This action does not enable because the light client onB
has a more recent consensus state for height4
. -
Chain
B
attempts to process theICS3MsgTry
but is unable to verify its authenticity, since the light client on this chain does not have the required consensus state at height2
. ChainB
drops this message.
From this point on, the model stutters, i.e., is unable to progress further in the connection handshake protocol.
Context. The original issue triggering this discussion is here: cosmos/ics/#459. Briefly, version negotiation in the ICS3 handshake can interfere in various ways, breaking either the safety or liveness of this protocol. Several solution candidates exist, which we classify by their "mode", i.e., a strategy for picking the version at some point or another in the protocol. For a full description of the modes, please consult L2-tla/readme.md#version-negotiation-modes.
Overview. Below we use TLA+ traces to explore and report on the exact problems that can occur. We also show how the solution candidates fare. The table below summarizes our results for the four cases we consider:
Case | Property violation |
---|---|
(a) Empty version intersection | liveness |
(b) Mode overwrite |
safety |
(c) Mode onTryNonDet |
liveness |
(d) Mode onAckNonDet |
safety |
These are the main takeaways from this discussion:
- The set of compatible versions that chains start off with (return values of
getCompatibleVersions()
in ICS3) have to intersect, otherwise a liveness issue occurs. This assumption is independent of the version negotiation mode. We report this in case (a) below. - Modes "overwrite", "onTryNonDet", and "onAckNonDet" all result in breaking the handshake protocol. See cases (b), (c), and (d) below for traces.
- The deterministic modes "onTryDet" and "onAckDet" pass model checking, so a solution should be chosen among these two candidates (see the original issue for follow-up on the solution).
Model checking details in TLA+:
- Model parameters:
Concurrency <- FALSE
MaxBufLen <- 2
MaxHeight <- 7
MaxVersionNr <- 2
VersionPickMode <- "overwrite"
- Check for Deadlock and property Termination.
Outcome:
- Model checking halts with exception "Temporal properties were violated."
The two chains start off with different versions (1
for A, and 2
for B).
So the compatible version sets on these chains do not intersect.
-
The environment submits a
ICS3MsgInit
message chainA
. -
The environment triggers the
AdvanceChainHeight
action of chainB
, so this chain transitions from height1
to height2
. -
The environment triggers the
AdvanceChainHeight
action of chainB
, so this chain transitions from height2
to height3
. -
The environment triggers the
AdvanceChainHeight
action of chainA
, so this chain transitions from height1
to height2
. -
Chain
A
processes theICS3MsgInit
, advances to height3
, and prepares aICS3MsgTry
for chainB
. The version in this message is<<1>>
, the same as the version field that chainA
started with. -
The environment relays the
ICS3MsgTry
message to the input buffer of chainB
. This message has proofs for height3
so chainB
gets updated with consensus state for height4
. With this update, chainB
also advances to height4
. -
Chain
B
drops theICS3MsgTry
message because the version field does not match any of the compatible versions of this chain. Therefore, the model cannot progress.
To fix this issue, the model requires an explicit assumption that the compatible versions on the two chains must have a non-empty intersection.
We capture this assumption in the Init
action, via the ChainVersionsOverlap
predicate:
Init ==
/\ chmA!Init
/\ chmB!Init
/\ ChainVersionsOverlap(storeChainA, storeChainB)
/\ InitEnv
Once we add the ChainVersionsOverlap
assumptions, this model no longer has liveness issues.
But the "overwrite" mode can lead to safety problems, however, which we document below.
Model checking details in TLA+:
- Model parameters:
Concurrency <- FALSE
MaxBufLen <- 2
MaxHeight <- 7
MaxVersionNr <- 2
VersionPickMode <- "overwrite"
- Check for invariant _VersionInvariant, as well as Deadlock and property Termination.
- Make sure the
Init
action includes theChainVersionsOverlap
predicate.
Outcome:
- Model checking halts with exception "Invariant VersionInvariant is violated."
Both chains A
and B
start with the compatible versions <<1, 2>>
.
-
The environment submits a
ICS3MsgInit
message to both chains. -
Chain
A
processes theICS3MsgInit
, advances to height2
, and prepares aICS3MsgTry
for chainB
. The versions in this message are<<1, 2>>
, the same as the version field in chainA
. The connection on this chain goes from stateUNINIT
to stateINIT
. -
Chain
B
processes theICS3MsgInit
, advances to height2
, and prepares aICS3MsgTry
for chainA
. The versions in this message are<<1, 2>>
, the same as the version field in chainB
. The connection on this chain goes from stateUNINIT
to stateINIT
. -
The environment relays the
ICS3MsgTry
message to the input buffer of chainB
. This message has proofs for height2
, so the environment triggersUpdateClient
on chainB
for consensus state at height2
. With this update, chainB
advances to height3
. -
The environment relays the
ICS3MsgTry
message to the input buffer of chainA
. This message has proofs for height2
, so the environment triggersUpdateClient
on chainA
for consensus state at height2
. With this update, chainA
advances to height3
. -
Chain
A
processes theICS3MsgTry
message, advances to height4
, and prepares aICS3MsgAck
message forB
. The version in this message is1
. The connection in this chain goes into stateTRYOPEN
, with version chosen to be1
. -
Chain
B
processes theICS3MsgTry
message, advances to height4
, and prepares aICS3MsgAck
message forA
. The version in this message is2
. The connection in this chain goes into stateTRYOPEN
, with version chosen to be2
. -
The environment relays the
ICS3MsgAck
message to the input buffer of chainB
. This message has proofs for height4
, so the environment triggersUpdateClient
on chainB
for consensus state at height4
. With this update, chainB
advances to height5
. -
Chain
B
processes theICS3MsgAck
message (which had version1
-- see step 6 above), advances to height6
, and prepares aICS3MsgConfirm
message forA
. ChainB
overwrites its local version (namely,2
) with the version in theICS3MsgAck
message (that is,1
). The connection in this chain goes into stateOPEN
, with version chosen to be1
. TheICS3MsgConfirm
that chainB
creates contains version1
. -
The environment relays the
ICS3MsgAck
message to the input buffer of chainA
. This message has proofs for height4
, so the environment triggersUpdateClient
on chainA
for consensus state at height4
. With this update, chainA
also advances to height5
. -
Chain
A
processes theICS3MsgAck
message; recall that the version in this message is2
(see step 7 above). Upon processing this message, chainA
overwrites its local version (which was1
) with the version in theICS3MsgAck
message (concretely,2
). The connection in this chain goes into stateOPEN
, with version chosen to be2
. ChainA
also advances to height6
and prepares aICS3MsgConfirm
message forB
; theICS3MsgConfirm
contains version2
.
At this point, the connection is OPEN
at both chains, but the version numbers do not match.
Hence, the invariant VersionInvariant
is violated.
Setup:
- Model parameters:
Concurrency <- FALSE
MaxBufLen <- 2
MaxHeight <- 7
MaxVersionNr <- 2
VersionPickMode <- "onTryNonDet"
- Check for Deadlock and property Termination.
Outcome:
- Model checking halts with exception "Temporal properties were violated."
- The issue is that the version in the two chains diverges and can never reconcile
Both chains A
and B
start with the field <<1, 2>>
, that is, with two compatible versions.
-
The environment submits a
ICS3MsgInit
message to both chains. -
The environment triggers the
AdvanceChainHeight
action of chainA
, so this chain transitions from height1
to height2
. -
The environment triggers the
AdvanceChainHeight
action of chainA
, so this chain transitions from height2
to height3
. -
Chain
B
processes theICS3MsgInit
, advances to height2
, and prepares aICS3MsgTry
message destined for chainA
. The versions in this message are<<1, 2>>
, the same as the version field in chainB
. -
The environment triggers the
AdvanceChainHeight
action of chainB
, so this chain transitions from height2
to height3
. -
Chain
A
processes theICS3MsgInit
, advances to height4
, and prepares aICS3MsgTry
message destined for chainB
. The versions in this message are<<1, 2>>
, the same as the version field in chainA
. -
The environment passes (i.e., relays) the
ICS3MsgTry
message to the input buffer of chainB
. This message has proofs for height4
; consequently, the environment also triggersUpdateClient
on chainB
for consensus state at height4
, preparing this chain to process the message in the input buffer. With this update, chainB
advances to height4
. -
The environment passes (i.e., relays) the
ICS3MsgTry
message to the input buffer of chainA
. This message has proofs for height2
, so the environment also does aUpdateClient
on chainA
for consensus state at height2
. With this update, chainA
advances to height5
. -
Chain
A
processes theICS3MsgTry
, advances to height6
, and prepares aICS3MsgAck
for chainB
. The version in this message is<<2>>
, which is the version which chainA
choose non-deterministically for this connection. The connection on chainA
is now in stateTRYOPEN
. -
Chain
B
processes theICS3MsgTry
, advances to height5
, and prepares aICS3MsgAck
for chainA
. The version in this message is<<1>>
, which chainB
choose non-deterministically for this connection. The connection on chainB
is now in stateTRYOPEN
.
From this point on, the two chains can not make further progress in the handshake, since they chose different versions.
Neither of the two chains can process the ICS3MsgAck
message because the version in this message does not match with the version the chain stores locally.
(A chain should not overwrite its local version either, otherwise the safety issue from case (b) can appear.)
Therefore, the model stutters (cannot progress anymore).
Model checking details in TLA+:
- Model parameters:
Concurrency <- FALSE
MaxBufLen <- 2
MaxHeight <- 7
MaxVersionNr <- 2
VersionPickMode <- "onAckNonDet"
- Check for invariant _VersionInvariant, as well as Deadlock and property Termination.
Outcome:
- Model checking halts with exception "Invariant VersionInvariant is violated."
Both chains A
and B
start with the compatible versions <<1, 2>>
.
-
The environment submits a
ICS3MsgInit
message chainA
. -
Chain
A
processes theICS3MsgInit
, advances to height2
, and prepares aICS3MsgTry
for chainB
. The versions in this message are<<1, 2>>
, the same as the version field in chainA
. The connection on this chain goes from stateUNINIT
to stateINIT
. -
The environment relays the
ICS3MsgTry
message to the input buffer of chainB
. This message has proofs for height2
, so the environment triggersUpdateClient
on chainB
for consensus state at height2
. With this update, chainB
advances to height2
. -
Chain
B
processes theICS3MsgTry
message, advances to height3
, and prepares aICS3MsgAck
message forA
. The version in this message is<<1, 2>>
. The connection in this chain goes into stateTRYOPEN
; chainB
does not choose a specific version yet, so the connection onB
still has versions<<1, 2>>
. -
The environment relays the
ICS3MsgAck
message to the input buffer of chainA
. This message has proofs for height3
, so the environment triggersUpdateClient
on chainA
for consensus state at height3
. With this update, chainA
advances to height3
. -
Chain
A
processes theICS3MsgAck
message (which has versions<<1, 2>>
), advances to height4
and prepares aICS3MsgConfirm
message forB
. ChainA
locks on version1
(non-deterministic choice between<<1, 2>>
), which it also reports in theICS3MsgConfirm
message. The connection in this chain goes into stateOPEN
, with version chosen to1
. -
The environment relays the
ICS3MsgConfirm
message to the input buffer of chainB
. This message has proofs for height4
, so the environment triggersUpdateClient
on chainB
for consensus state at height4
. With this update, chainB
also advances to height4
. -
Chain
B
processes theICS3MsgConfirm
message (which contains version1
). ChainB
locks on version2
(non-deterministic choice between its local versions<<1, 2>>
). The connection in this chain goes into stateOPEN
.
At this point, the connection is OPEN
at both chains, but the version numbers do not match.
Hence, the invariant VersionInvariant
is violated.