Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

KVStore Model Upgrade & Chunk.ServiceEventCount Model Upgrade Logic #6796

Merged

Conversation

jordanschalm
Copy link
Member

@jordanschalm jordanschalm commented Dec 10, 2024

This PR addresses #6777. It adds a new v2 model to the KVStore, which has no new fields, but is necessary for upgrade coordination. It also implements version-aware logic for the EN and SN to construct and require the appropriate data model, based on the reference block protocol version. Details of the upgrade are in Notion.

A lot of the upgrade logic implemented here is intended to be removed prior to the next spork. These blocks are annotated with comments // TODO(mainnet27, #6773): ... and marked as deprecated where possible.

Changes

EN Upgrade Logic (Block Computer)

  • ENs produce chunks with nil ServiceEventCount for execution results referencing blocks with protocol version <2
  • ENs produce chunks with non-nil ServiceEventCount for execution results referencing blocks with protocol version >=2

SN Upgrade Logic (Receipt Validator)

  • SNs accept only chunks with nil ServiceEventCount for execution results referencing blocks with protocol version <2
  • SNs accept only chunks with non-nil ServiceEventCount for execution results referencing blocks with protocol version >=2
    • SNs additionally validate that the service event count provided is consistent with the number of service events in the result

Protobuf Definition

  • Add chunk service event count field flow#1531 adds the ServiceEventCount field to our Protobuf models
  • Protobuf does not seem to natively support optional/nil-able types, so I used the extra bits in the field to encode nil values. I'm not sure whether this backward compatibility is required here: it depends on how the API retrieving Execution Results is being used. I don't love the way it is implemented here, but it has the following benefits:
    • No further protobuf changes required (when making ServiceEventCount non-optional)
    • The awkward part handling nil values can be completely removed
    • Clients using our conversion logic will get Execution Results consistent with what the protocol observes internally

TODOs before merging

  • Merge Add chunk service event count field flow#1531 and pin a tagged version
  • Upgrade Integration Tests CI job is failing, because the Protocol HCU test we have expects us to halt when entering version 2, but now we have a real version 2... Not sure how to address this in a way that doesn't require changes every time we do a protocol upgrade 🤔

@jordanschalm jordanschalm changed the title Jord/6777 service event count upgrade KVStore Model Upgrade & Chunk.ServiceEventCount Model Upgrade Logic Dec 11, 2024
Protobuf does not easily support pointer types, and does not support
explicit uint16 values (although smaller numeric values do use fewer
bytes on the wire).

This commit uses the extra high-order bits in the uint32 value to encode
whether the value is nil.
Base automatically changed from jord/6622-chunk-service-events to feature/efm-recovery December 11, 2024 20:35
@codecov-commenter
Copy link

codecov-commenter commented Dec 11, 2024

Codecov Report

Attention: Patch coverage is 63.00578% with 64 lines in your changes missing coverage. Please review.

Project coverage is 41.74%. Comparing base (cbb9053) to head (efb4b4e).

Files with missing lines Patch % Lines
utils/unittest/fixtures.go 0.00% 15 Missing ⚠️
model/flow/chunk.go 0.00% 14 Missing ⚠️
utils/unittest/chain_suite.go 0.00% 14 Missing ⚠️
state/protocol/protocol_state/kvstore/models.go 72.97% 6 Missing and 4 partials ⚠️
engine/execution/computation/computer/computer.go 66.66% 4 Missing and 2 partials ⚠️
module/validation/receipt_validator.go 84.84% 4 Missing and 1 partial ⚠️
Additional details and impacted files
@@                   Coverage Diff                    @@
##           feature/efm-recovery    #6796      +/-   ##
========================================================
+ Coverage                 41.71%   41.74%   +0.03%     
========================================================
  Files                      2033     2033              
  Lines                    181062   181209     +147     
========================================================
+ Hits                      75529    75648     +119     
- Misses                    99311    99338      +27     
- Partials                   6222     6223       +1     
Flag Coverage Δ
unittests 41.74% <63.00%> (+0.03%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@jordanschalm jordanschalm marked this pull request as ready for review December 11, 2024 22:25
// This version adds the following changes:
// - Non-system-chunk service event validation support (adds ChunkBody.ServiceEventCount field)
// - EFM Recovery (adds EpochCommit.DKGIndexMap field)
type Modelv2 struct {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🤔 This time we need to introduce a new model but for future upgrades I think it makes sense to separate protocol version from KV store version. If we had introduced an integer back in the day, we would have been able to perform this upgrade without cloning the module but simple upgrading one value.

My proposal is to include a protocol version as separate field and use view based upgrader for it. This way we will be able to update protocol without making changes to the KV store schema or introducing new versions to the models. This is similar approach to what Alex was explaining in his last talk in working group for upgrading different parts of the software.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

After discussing this live, we decided to continue using the protocol state version to also stand in for behavioural-only changes.

  • the overhead is relatively small
  • we know we will need to introduce a distinct version for the execution stack, which will give us experience of dealing with multiple versions
  • we will revisit this as we use the versioning system more going forward

Copy link
Member

@durkmurder durkmurder left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks nice and clean, appreciate those extra tests 🏅

Comment on lines 86 to 87

// perform actual replication to the next version
// perform actual replication to the next currentVersion
Copy link
Member

@AlexHentschel AlexHentschel Dec 17, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this change is incorrect. currentVersion should always be 0, because it is the version of the parent model:

currentVersion := model.GetProtocolStateVersion()

I think if you accepted my previous comment, we could just skip this here

Suggested change
// perform actual replication to the next version
// perform actual replication to the next currentVersion

and the code would read

// version change: Modelv0 only supports upgrade to protocolVersion = 1
if protocolVersion != 1 {
  return nil, fmt.Errorf("protocol state's current version %d only supports replication into version %d but requested was version %d: %w",
    currentVersion, 1, protocolVersion, ErrIncompatibleVersionChange)
}
v1 := &Modelv1{
  Modelv0: clone.Clone(*model),
}

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Like the above, the currentVersion in the error message is just a refactor error (fixed now).

Comment on lines 178 to 185
nextVersion := currentVersion + 1
if protocolVersion != nextVersion {
// can only Replicate into model with numerically consecutive currentVersion
return nil, fmt.Errorf("unsupported replication currentVersion %d, expect %d: %w",
protocolVersion, 1, ErrIncompatibleVersionChange)
}

// perform actual replication to the next currentVersion
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️
I am concerned about this, because it in principle permits arbitrary increments. We have a hard-coded type into which we are replicating: Modelv2. Hence, I would be inclined to also pin the protocolVersion in the implementation.

To be fair, in the current implementation, your sanity check in lines 189-191 would probably catch an unsupported protocolVersion -- but that sanity check is based on the implementation of the model that we are replicating into (Modelv2), while the logic governing the replication is part of with Modelv1. Therefore, I would be inclined to not depend on implementation details of a different struct.

Suggested change
nextVersion := currentVersion + 1
if protocolVersion != nextVersion {
// can only Replicate into model with numerically consecutive currentVersion
return nil, fmt.Errorf("unsupported replication currentVersion %d, expect %d: %w",
protocolVersion, 1, ErrIncompatibleVersionChange)
}
// perform actual replication to the next currentVersion
// version change: Modelv1 only supports upgrade to protocolVersion = 2
if protocolVersion != 2 {
return nil, fmt.Errorf("protocol state's current version %d only supports replication into version %d but requested was version %d: %w",
currentVersion, 2, protocolVersion, ErrIncompatibleVersionChange)
}

Copy link
Member Author

@jordanschalm jordanschalm Dec 18, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure I understand your concern...

The current implementation does:

currentVersion := model.GetProtocolStateVersion()
// ...
nextVersion := currentVersion + 1
if protocolVersion != nextVersion { /* error */ }

It enforces that protocolVersion is exactly equal to currentVersion+1, and currentVersion is a property of ModelV1 (the struct we are defining the method on).

The reason I implemented it this way, rather than hard-coding the numeric 2, is that this approach is general and will work without modification for the next model version. (when this is inevitably copy-pasted, less needs to be changed)

// NewDefaultKVStore constructs a default Key-Value Store of the *latest* protocol version for bootstrapping.
// Currently, the KV store is largely empty.
// TODO: Shortcut in bootstrapping; we will probably have to start with a non-empty KV store in the future;
// TODO(efm-recovery): we need to bootstrap with v1 in order to test the upgrade to v2. Afterward, we should bootstrap with v2 by default for new networks.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

bootstrap with v2 by default for new networks.

only for new networks? After the change, can we bootstrap new nodes with version v2?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This function is only used for network-scale bootstrapping, hence why the comment is talking about networks rather than nodes. A new node would bootstrap with whatever version the network is currently running, based on the root protocol state snapshot it boots from.

Comment on lines +322 to +326
if version < 2 {
return flow.NewChunk_ProtocolVersion1, nil
} else {
return flow.NewChunk, nil
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

would prefer to make this as strict as possible. This code might be living for multiple months until we spork into mainnet 27. There might be more updates in between now and the spork. I would like to avoid making assumptions about the nature of future updates (e.g. that they do not change the chunk format).

Suggested change
if version < 2 {
return flow.NewChunk_ProtocolVersion1, nil
} else {
return flow.NewChunk, nil
}
switch version {
case 1:
return flow.NewChunk_ProtocolVersion1, nil
case 2:
return flow.NewChunk, nil
default:
return nil, fmt.Errorf("unsupported chunk version: %d", version)
}

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There might be more updates in between now and the spork. I would like to avoid making assumptions about the nature of future updates

Every other part of our codebase already implicitly makes assumptions about the nature of future updates, by virtue of not having every component, model, etc. tied to a protocol version number. It will always ultimately be the responsibility of the upgrade implementor to make sure it is safe and compatible with the existing logic and data structures.

I don't have a problem with the suggestion here on its own, but I don't think it represents a sustainable strategy for managing version-linked code changes. Effectively all of the time, when we fix a bug or add a feature, we intend for that logical change to persist in perpetuity, until it is actively changed again. I expect the same to be true for bug fixes and features introduced through a protocol version upgrade.

If in the future an engineer is changing the Chunk data structure for a protocol upgrade, they already need to carefully investigate and update logic where the Chunk is being used. This error will help them find one of the codepaths they need to change, but there are many more. Had they failed to notice this codepath without the error, they very likely would also have failed to notice other codepaths that don't have a version switch statement. Conclusion: sometimes helpful, but far from sufficient as a safety measure.

By contrast, if a future protocol upgrade changes something completely unrelated to the Chunk, they will need to update a growing number of these switch statements (minor inconvenience). Now suppose our version-based switch statement is in an infrequently accessed codepath (say a slashing challenge). The engineer tests their changes on the new protocol version, but their testing doesn't include our infrequently accessed codepath on the new version. Now we have introduced a severe bug that might only appear in production, after the version upgrade is complete. Conclusion: usually only a minor annoyance, but potential for severe bugs.

Copy link
Member

@AlexHentschel AlexHentschel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great work. This PR had a lot of change surface. Thanks for grinding through all the details.

Mostly added suggestions regarding documentation.

@jordanschalm jordanschalm merged commit a35632f into feature/efm-recovery Dec 19, 2024
55 checks passed
@jordanschalm jordanschalm deleted the jord/6777-service-event-count-upgrade branch December 19, 2024 00:11
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants