Re-enable joining from old snapshot test #3573

Merged on May 4, 2022

47 commits
908294c
Fir and re-enable test
Feb 18, 2022
23f49a4
Changelog
Feb 18, 2022
179f489
Fix build
Feb 18, 2022
40c1839
LTS compat, further nodes should start from snapshots too
Feb 18, 2022
56a65d8
Merge branch 'main' into old_snapshot_test_enable
jumaffre Feb 18, 2022
3dd9c75
Migration
Feb 18, 2022
c83e4d3
Merge branch 'old_snapshot_test_enable' of github.com:jumaffre/CCF in…
Feb 18, 2022
f565977
Fix
Feb 18, 2022
5a8670a
Merge branch 'main' into old_snapshot_test_enable
jumaffre Feb 18, 2022
afb694f
from snapshot by default
Feb 18, 2022
247fa04
Fix
Feb 21, 2022
3a16e09
fmt
Feb 21, 2022
1f82a31
Oops
Feb 21, 2022
db36db3
Merge branch 'main' of github.com:microsoft/CCF into old_snapshot_tes…
Feb 22, 2022
bb90bfb
.
Feb 22, 2022
dd4da07
Fix
Feb 22, 2022
7f34b4a
Merge branch 'main' of github.com:microsoft/CCF into old_snapshot_tes…
Feb 24, 2022
1288bd7
now?
Feb 24, 2022
b466edc
Merge branch 'main' into old_snapshot_test_enable
achamayou Mar 9, 2022
b1dfdfd
Merge branch 'main' of github.com:microsoft/CCF into old_snapshot_tes…
Mar 14, 2022
c0319ee
Merge branch 'old_snapshot_test_enable' of github.com:jumaffre/CCF in…
Mar 14, 2022
7d99eca
Fix for `reconfiguration_test_cft`
Mar 14, 2022
9e99099
.
Mar 14, 2022
4cba225
Merge branch 'main' of github.com:microsoft/CCF into old_snapshot_tes…
Apr 14, 2022
35488d2
Merge branch 'main' of github.com:microsoft/CCF into old_snapshot_tes…
Apr 21, 2022
d0f5f1f
Merge branch 'main' into old_snapshot_test_enable
jumaffre Apr 22, 2022
58e80ee
Merge branch 'old_snapshot_test_enable' of github.com:jumaffre/CCF in…
Apr 26, 2022
448e6f4
Merge branch 'main' of github.com:microsoft/CCF into old_snapshot_tes…
Apr 26, 2022
cd0c8ac
Fix LTS compatibility test issue
Apr 26, 2022
f5f8fae
Merge branch 'main' into old_snapshot_test_enable
jumaffre Apr 26, 2022
e567c32
.
Apr 27, 2022
8d4a824
Merge branch 'old_snapshot_test_enable' of github.com:jumaffre/CCF in…
Apr 27, 2022
4bfd3fb
Merge branch 'main' into old_snapshot_test_enable
jumaffre Apr 27, 2022
f0c0908
Fix snapshot ledger rekey issue
Apr 27, 2022
7a3d0e2
Recovery from snapshot by default
Apr 27, 2022
cfe785f
Docs
Apr 27, 2022
e6890a6
Merge branch 'old_snapshot_test_enable' of github.com:jumaffre/CCF in…
Apr 27, 2022
2132e17
fmt
Apr 27, 2022
8db9b76
Cleanup
Apr 27, 2022
21c6dbb
Be more resilient to errors during snapshots copy
Apr 28, 2022
645b5a1
fmt
Apr 28, 2022
37d89aa
Merge branch 'main' of github.com:microsoft/CCF into old_snapshot_tes…
May 3, 2022
814535f
Merge branch 'main' into old_snapshot_test_enable
jumaffre May 3, 2022
4e53245
Merge branch 'main' into old_snapshot_test_enable
jumaffre May 4, 2022
664d68d
Merge branch 'main' into old_snapshot_test_enable
jumaffre May 4, 2022
edf5d46
Merge branch 'main' into old_snapshot_test_enable
jumaffre May 4, 2022
5090402
Merge branch 'main' into old_snapshot_test_enable
jumaffre May 4, 2022
2 changes: 1 addition & 1 deletion .daily_canary
@@ -1 +1 @@
Piou
Tweet.
6 changes: 6 additions & 0 deletions CHANGELOG.md
@@ -5,6 +5,12 @@ All notable changes to this project will be documented in this file.
The format is based on [Keep a Changelog](http://keepachangelog.com/en/1.0.0/)
and this project adheres to [Semantic Versioning](http://semver.org/spec/v2.0.0.html).

## Unreleased

### Fixed

- Fixed an issue where a new node started without a snapshot could join from a node that started with a snapshot (#3573).

## [2.0.0-rc8]

### Fixed
4 changes: 3 additions & 1 deletion doc/operations/ledger_snapshot.rst
@@ -70,7 +70,9 @@ Join/Recover From Snapshot

Once a snapshot has been generated by the primary, operators can copy or mount the snapshot directory to the new node directory before it is started. On start-up, the new node will automatically resume from the latest committed snapshot file in the ``snapshots.directory`` directory. If no snapshot file is found, all historical transactions will be replicated to that node.

From 2.x releases (specifically, from `-dev5`), committed snapshot files embed the receipt of the evidence transaction. As such, nodes can join or recover a service from a standalone snapshot file. For 1.x releases, it is expected that operators also copy the ledger suffix containing the proof of commit of the evidence transaction to the node's ledger directory.
From 2.x releases, committed snapshot files embed the receipt of the evidence transaction. As such, nodes can join or recover a service from a standalone snapshot file. For 1.x releases, it is expected that operators also copy the ledger suffix containing the proof of commit of the evidence transaction to the node's ledger directory.

It is important to note that a new node cannot join a service via a node that started from a more recent snapshot than its own. For example, if a new node resumes from a snapshot generated at ``seqno 50`` and joins from a (primary) node that originally resumed from a snapshot at ``seqno 100``, the new node will throw a ``StartupSeqnoIsOld`` error shortly after starting up. It is expected that operators copy the *latest* committed snapshot file to new nodes before start up.

.. note:: Snapshots emitted by 1.x nodes can be used by 2.x nodes to join or recover a service.

2 changes: 1 addition & 1 deletion doc/operations/start_network.rst
@@ -21,7 +21,7 @@ To create a new CCF network, the first node of the network should be started wit
Uninitialized-- config -->Initialized;
Initialized-- start -->PartOfNetwork;

The unique identifier of a CCF node is the hex-encoded string of the SHA-256 digest the public key contained in its identity certificate (e.g. ``50211327a77fc16dd2fba8fae5fffac3df909fceeb307cf804a4125ae2679007``). This unique identifier should be used by operators and members to refer to this node with CCF (for example, when :ref:`governance/common_member_operations:Trusting a New Node`).
The unique identifier of a CCF node is the hex-encoded string of the SHA-256 digest of the public key contained in its identity certificate (e.g. ``50211327a77fc16dd2fba8fae5fffac3df909fceeb307cf804a4125ae2679007``). This unique identifier should be used by operators and members to refer to this node with CCF (for example, when :ref:`governance/common_member_operations:Trusting a New Node`).

CCF nodes can be started by using IP addresses (both IPv4 and IPv6 are supported) or by specifying a fully qualified domain name. If an FQDN is used then a ``dNSName`` subject alternative name should be specified as part of the ``node_certificate.subject_alt_names`` configuration entry. Once DNS has been set up, it will be possible to connect to the node over TLS by using the node's domain name.

2 changes: 1 addition & 1 deletion include/ccf/odata_error.h
@@ -80,7 +80,7 @@ namespace ccf
ERROR(InvalidQuote)
ERROR(InvalidNodeState)
ERROR(NodeAlreadyExists)
ERROR(StartupSnapshotIsOld)
ERROR(StartupSeqnoIsOld)
ERROR(CSRPublicKeyInvalid)

ERROR(ResharingAlreadyCompleted)
6 changes: 3 additions & 3 deletions python/ccf/ledger.py
@@ -939,7 +939,7 @@ def try_add_chunk(path):
raise ValueError(
f"Ledger cannot parse committed chunk {file_b} following uncommitted chunk {file_a}"
)
if range_a[1] is not None and range_a[1] + 1 != range_b[0]:
if validator and range_a[1] is not None and range_a[1] + 1 != range_b[0]:
raise ValueError(
f"Ledger cannot parse non-contiguous chunks {file_a} and {file_b}"
)
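The hunk above makes the contiguity check conditional on a validator being supplied. A minimal sketch of that behaviour, assuming chunk ranges are `(start, end)` tuples of seqnos where `end` may be `None` for an uncommitted chunk. `check_contiguous` and its `validate` flag are hypothetical names used for illustration, not part of `ccf.ledger`:

```python
from typing import List, Optional, Tuple

# Hypothetical (start, end) seqno range; end is None for an uncommitted chunk
ChunkRange = Tuple[int, Optional[int]]


def check_contiguous(ranges: List[ChunkRange], validate: bool = True) -> None:
    """Raise ValueError if consecutive chunk ranges leave a gap or overlap.

    With validate=False the check is skipped, mirroring the
    `if validator and ...` guard in the diff.
    """
    for (_, end_a), (start_b, _) in zip(ranges, ranges[1:]):
        if validate and end_a is not None and end_a + 1 != start_b:
            raise ValueError(
                f"non-contiguous chunks: {end_a} is not followed by {start_b}"
            )
```

With validation disabled, overlapping or duplicated chunks are tolerated, which is what the `insecure` option in the test infrastructure relies on.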
@@ -974,9 +974,9 @@ def get_transaction(self, seqno: int) -> Transaction:
transaction = None
for chunk in self:
_, chunk_end = chunk.get_seqnos()
if chunk_end and chunk_end < seqno:
continue
for tx in chunk:
if chunk_end and chunk_end < seqno:
continue
public_transaction = tx.get_public_domain()
if public_transaction.get_seqno() == seqno:
return tx
6 changes: 3 additions & 3 deletions src/node/node_state.h
@@ -194,7 +194,7 @@ namespace ccf
std::unique_ptr<StartupSnapshotInfo> startup_snapshot_info = nullptr;
// Set to the snapshot seqno when a node starts from one and remembered for
// the lifetime of the node
std::optional<kv::Version> startup_seqno = std::nullopt;
kv::Version startup_seqno = 0;

std::shared_ptr<kv::AbstractTxEncryptor> make_encryptor()
{
@@ -242,7 +242,7 @@ namespace ccf
config.recover.previous_service_identity);

startup_seqno = startup_snapshot_info->seqno;
last_recovered_idx = startup_seqno.value();
last_recovered_idx = startup_seqno;
last_recovered_signed_idx = last_recovered_idx;

return !startup_snapshot_info->requires_ledger_verification();
@@ -1664,7 +1664,7 @@ namespace ccf
return self;
}

std::optional<kv::Version> get_startup_snapshot_seqno() override
kv::Version get_startup_snapshot_seqno() override
{
std::lock_guard<std::mutex> guard(lock);
return startup_seqno;
20 changes: 10 additions & 10 deletions src/node/rpc/node_frontend.h
@@ -390,23 +390,23 @@ namespace ccf
this->network.consensus_type));
}

// If the joiner and this node both started from a snapshot, make sure
// that the joiner's snapshot is more recent than this node's snapshot
// Make sure that the joiner's snapshot is more recent than this node's
// snapshot. Otherwise, the joiner may not be given all the ledger
// secrets required to replay historical transactions.
auto this_startup_seqno =
this->node_operation.get_startup_snapshot_seqno();
if (
this_startup_seqno.has_value() && in.startup_seqno.has_value() &&
this_startup_seqno.value() > in.startup_seqno.value())
in.startup_seqno.has_value() &&
this_startup_seqno > in.startup_seqno.value())
{
return make_error(
HTTP_STATUS_BAD_REQUEST,
ccf::errors::StartupSnapshotIsOld,
ccf::errors::StartupSeqnoIsOld,
fmt::format(
"Node requested to join from snapshot at seqno {} which is "
"older "
"than this node startup seqno {}",
"Node requested to join from seqno {} which is "
"older than this node startup seqno {}",
in.startup_seqno.value(),
this_startup_seqno.value()));
this_startup_seqno));
}

auto nodes = args.tx.rw(this->network.nodes);
@@ -644,7 +644,7 @@ namespace ccf
result.recovery_target_seqno = rts;
result.last_recovered_seqno = lrs;
result.startup_seqno =
this->node_operation.get_startup_snapshot_seqno().value_or(0);
this->node_operation.get_startup_snapshot_seqno();

auto signatures = args.tx.template ro<Signatures>(Tables::SIGNATURES);
auto sig = signatures->get();
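The effect of replacing the optional startup seqno with a plain value defaulting to 0 can be sketched as follows. This is a simplified Python model of the join check in this file, not CCF code: `join_rejected_old` and `join_rejected_new` are hypothetical names, and `None` stands in for `std::nullopt`.

```python
from typing import Optional


def join_rejected_old(target: Optional[int], joiner: Optional[int]) -> bool:
    # Before: the check only applied when both sides reported a startup seqno
    return target is not None and joiner is not None and target > joiner


def join_rejected_new(target: int, joiner: Optional[int]) -> bool:
    # After: the target always has a startup seqno (0 if it did not start
    # from a snapshot), so only the joiner's value may be missing
    return joiner is not None and target > joiner
```

Under this model, a joiner that resumed at seqno 50 is rejected by a target that started at seqno 100, while a joiner at seqno 100 is accepted by a target at seqno 50. Assuming a joiner that did not start from a snapshot reports 0 (an assumption, the join request side is not shown in this diff), `join_rejected_new(100, 0)` is true, which matches the changelog entry.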
2 changes: 1 addition & 1 deletion src/node/rpc/node_interface.h
@@ -41,7 +41,7 @@ namespace ccf
const QuoteInfo& quote_info,
const std::vector<uint8_t>& expected_node_public_key_der,
CodeDigest& code_digest) = 0;
virtual std::optional<kv::Version> get_startup_snapshot_seqno() = 0;
virtual kv::Version get_startup_snapshot_seqno() = 0;
virtual SessionMetrics get_session_metrics() = 0;
virtual size_t get_jwt_attempts() = 0;
virtual crypto::Pem get_self_signed_certificate() = 0;
2 changes: 1 addition & 1 deletion src/node/rpc/node_operation.h
@@ -55,7 +55,7 @@ namespace ccf
return impl.get_last_recovered_signed_idx();
}

std::optional<kv::Version> get_startup_snapshot_seqno() override
kv::Version get_startup_snapshot_seqno() override
{
return impl.get_startup_snapshot_seqno();
}
2 changes: 1 addition & 1 deletion src/node/rpc/node_operation_interface.h
@@ -39,7 +39,7 @@ namespace ccf
virtual bool can_replicate() = 0;

virtual kv::Version get_last_recovered_signed_idx() = 0;
virtual std::optional<kv::Version> get_startup_snapshot_seqno() = 0;
virtual kv::Version get_startup_snapshot_seqno() = 0;

virtual SessionMetrics get_session_metrics() = 0;
virtual size_t get_jwt_attempts() = 0;
10 changes: 5 additions & 5 deletions src/node/rpc/test/node_stub.h
@@ -57,11 +57,6 @@ namespace ccf
return kv::NoVersion;
}

std::optional<kv::Version> get_startup_snapshot_seqno() override
{
return std::nullopt;
}

SessionMetrics get_session_metrics() override
{
return {};
@@ -81,6 +76,11 @@ namespace ccf
return QuoteVerificationResult::Verified;
}

kv::Version get_startup_snapshot_seqno() override
{
return 0;
}

void initiate_private_recovery(kv::Tx& tx) override
{
throw std::logic_error("Unimplemented");
33 changes: 16 additions & 17 deletions tests/infra/network.py
@@ -71,7 +71,7 @@ class CodeIdNotFound(Exception):
pass


class StartupSnapshotIsOld(Exception):
class StartupSeqnoIsOld(Exception):
pass


@@ -231,15 +231,15 @@ def _add_node(
ledger_dir=None,
copy_ledger_read_only=False,
read_only_ledger_dirs=None,
from_snapshot=False,
from_snapshot=True,
snapshots_dir=None,
**kwargs,
):
# Contact primary if no target node is set
if target_node is None:
target_node, _ = self.find_primary(
timeout=args.ledger_recovery_timeout if recovery else 3
)
primary, _ = self.find_primary(
timeout=args.ledger_recovery_timeout if recovery else 3
)
target_node = target_node or primary
LOG.info(f"Joining from target node {target_node.local_node_id}")

committed_ledger_dirs = read_only_ledger_dirs or []
@@ -248,9 +248,8 @@ def _add_node(
# Note: Copy snapshot before ledger as retrieving the latest snapshot may require
# producing more ledger entries
if from_snapshot:
# Only retrieve snapshot from target node if the snapshot directory is not
# specified
snapshots_dir = snapshots_dir or self.get_committed_snapshots(target_node)
# Only retrieve snapshot from primary if the snapshot directory is not specified
snapshots_dir = snapshots_dir or self.get_committed_snapshots(primary)
if os.listdir(snapshots_dir):
LOG.info(f"Joining from snapshot directory: {snapshots_dir}")
else:
@@ -723,8 +722,8 @@ def join_node(
for error in errors:
if "Quote does not contain known enclave measurement" in error:
raise CodeIdNotFound from e
if "StartupSnapshotIsOld" in error:
raise StartupSnapshotIsOld from e
if "StartupSeqnoIsOld" in error:
raise StartupSeqnoIsOld from e
raise

def trust_node(
@@ -1244,11 +1243,11 @@ def wait_for_snapshots_to_be_committed(src_dir, list_src_dir_func, timeout=20):

return node.get_committed_snapshots(wait_for_snapshots_to_be_committed)

def _get_ledger_public_view_at(self, node, call, seqno, timeout):
def _get_ledger_public_view_at(self, node, call, seqno, timeout, insecure=False):
end_time = time.time() + timeout
while time.time() < end_time:
try:
return call(seqno)
return call(seqno, insecure=insecure)
except Exception as ex:
LOG.info(f"Exception: {ex}")
self.consortium.create_and_withdraw_large_proposal(node)
@@ -1257,20 +1256,20 @@ def _get_ledger_public_view_at(self, node, call, seqno, timeout):
f"Could not read transaction at seqno {seqno} from ledger {node.remote.ledger_paths()} after {timeout}s"
)

def get_ledger_public_state_at(self, seqno, timeout=5):
def get_ledger_public_state_at(self, seqno, timeout=5, insecure=False):
primary, _ = self.find_primary()
return self._get_ledger_public_view_at(
primary, primary.get_ledger_public_tables_at, seqno, timeout
primary, primary.get_ledger_public_tables_at, seqno, timeout, insecure
)

def get_latest_ledger_public_state(self, timeout=5):
def get_latest_ledger_public_state(self, insecure=False, timeout=5):
primary, _ = self.find_primary()
with primary.client() as nc:
resp = nc.get("/node/commit")
body = resp.body.json()
tx_id = TxID.from_str(body["transaction_id"])
return self._get_ledger_public_view_at(
primary, primary.get_ledger_public_state_at, tx_id.seqno, timeout
primary, primary.get_ledger_public_state_at, tx_id.seqno, timeout, insecure
)

@functools.cached_property
10 changes: 6 additions & 4 deletions tests/infra/node.py
@@ -451,14 +451,16 @@ def wait_for_node_to_join(self, timeout=3):
f"Node {self.local_node_id} failed to join the network"
) from e

def get_ledger_public_tables_at(self, seqno):
ledger = ccf.ledger.Ledger(self.remote.ledger_paths())
def get_ledger_public_tables_at(self, seqno, insecure=False):
validator = ccf.ledger.LedgerValidator() if not insecure else None
ledger = ccf.ledger.Ledger(self.remote.ledger_paths(), validator=validator)
assert ledger.last_committed_chunk_range[1] >= seqno
tx = ledger.get_transaction(seqno)
return tx.get_public_domain().get_tables()

def get_ledger_public_state_at(self, seqno):
ledger = ccf.ledger.Ledger(self.remote.ledger_paths())
def get_ledger_public_state_at(self, seqno, insecure=False):
validator = ccf.ledger.LedgerValidator() if not insecure else None
ledger = ccf.ledger.Ledger(self.remote.ledger_paths(), validator=validator)
assert ledger.last_committed_chunk_range[1] >= seqno
return ledger.get_latest_public_state()

24 changes: 19 additions & 5 deletions tests/infra/remote.py
@@ -935,11 +935,25 @@ def get_snapshots(self):
return os.path.join(self.common_dir, self.snapshot_dir_name)

def get_committed_snapshots(self, pre_condition_func=lambda src_dir, _: True):
self.remote.get(
self.snapshot_dir_name,
self.common_dir,
pre_condition_func=pre_condition_func,
)
# It is possible that snapshots are committed while the copy is happening
# so retry a reasonable number of times.
max_retry_count = 5
retry_count = 0
while retry_count < max_retry_count:
try:
self.remote.get(
self.snapshot_dir_name,
self.common_dir,
pre_condition_func=pre_condition_func,
)
break
except Exception as e:
LOG.warning(
f"Error copying committed snapshots from {self.snapshot_dir_name}: {e}. Retrying..."
)
retry_count += 1
time.sleep(0.1)

return os.path.join(self.common_dir, self.snapshot_dir_name)

def log_path(self):
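The retry loop added to `get_committed_snapshots` could be factored into a small helper. A minimal sketch, assuming a hypothetical `retry` function that is not part of the CCF test infrastructure. Unlike the loop in the diff, which carries on after the last failed attempt, this variant re-raises the final error so that callers notice a copy that never succeeded:

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def retry(func: Callable[[], T], max_attempts: int = 5, delay_s: float = 0.1) -> T:
    """Call func until it succeeds, at most max_attempts times."""
    for attempt in range(1, max_attempts + 1):
        try:
            return func()
        except Exception:
            if attempt == max_attempts:
                raise  # Out of attempts: surface the last error
            time.sleep(delay_s)
    raise ValueError("max_attempts must be at least 1")
```

Whether a failed copy should abort the test or be tolerated is a design choice; the diff opts for tolerance and only logs a warning.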
15 changes: 13 additions & 2 deletions tests/lts_compatibility.py
@@ -210,6 +210,7 @@ def run_code_upgrade_from(

old_nodes = network.get_joined_nodes()
primary, _ = network.find_primary()
from_major_version = primary.major_version

LOG.info("Apply transactions to old service")
issue_activity_on_live_service(network, args)
@@ -324,13 +325,13 @@ def run_code_upgrade_from(
node.stop()

LOG.info("Service is now made of new nodes only")
primary, _ = network.find_nodes()

# Rollover JWKS so that new primary must read historical CA bundle table
# and retrieve new keys via auto refresh
if not os.getenv("CONTAINER_NODES"):
jwt_issuer.refresh_keys()
# Note: /gov/jwt_keys/all endpoint was added in 2.x
primary, _ = network.find_nodes()
if not primary.major_version or primary.major_version > 1:
jwt_issuer.wait_for_refresh(network)
else:
@@ -354,7 +355,17 @@ def run_code_upgrade_from(
)

# Check that the ledger can be parsed
network.get_latest_ledger_public_state()
# Note: When upgrading from 1.x to 2.x, it is possible that ledger chunks are not
# in sync between nodes, which may cause some chunks to differ when starting
# from a snapshot. See https://github.com/microsoft/ccf/issues/3613. In such a case,
# we only verify that the ledger can be parsed, even if some chunks are duplicated.
# This can go once 2.0 is released.
insecure_ledger_verification = (
from_major_version == 1 and primary.version_after("ccf-2.0.0-rc7")
)
network.get_latest_ledger_public_state(
insecure=insecure_ledger_verification
)


@reqs.description("Run live compatibility with latest LTS")