KVStore Model Upgrade & Chunk.ServiceEventCount Model Upgrade Logic #6796
Protobuf does not easily support pointer types, and does not support explicit uint16 values (although smaller numeric values do use fewer bytes on the wire). This commit uses the extra high-order bits in the uint32 value to encode whether the value is nil.
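A minimal sketch of the encoding idea described in the commit message above. The names `nilFlag`, `encodeCount`, and `decodeCount` are hypothetical, and the choice of bit 16 as the nil marker is an assumption; the commit may use a different bit or layout.

```go
package main

import "fmt"

// nilFlag is a hypothetical sentinel. Any bit above the low 16 would work,
// because a uint16 value never occupies the high-order half of a uint32.
const nilFlag uint32 = 1 << 16

// encodeCount packs an optional uint16 into a uint32 for the wire.
func encodeCount(count *uint16) uint32 {
	if count == nil {
		return nilFlag
	}
	return uint32(*count)
}

// decodeCount reverses encodeCount. It rejects values that are neither the
// nil sentinel nor representable as a uint16.
func decodeCount(wire uint32) (*uint16, error) {
	if wire == nilFlag {
		return nil, nil
	}
	if wire > 0xFFFF {
		return nil, fmt.Errorf("invalid encoded count: %#x", wire)
	}
	v := uint16(wire)
	return &v, nil
}

func main() {
	n := uint16(3)
	fmt.Println(encodeCount(&n), encodeCount(nil))

	decoded, err := decodeCount(encodeCount(&n))
	fmt.Println(*decoded, err)
}
```

Small counts still serialize compactly as varints under this scheme; only the nil case pays for the extra high-order bit.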
Codecov Report
Attention: Patch coverage is
Additional details and impacted files

@@              Coverage Diff               @@
##   feature/efm-recovery   #6796    +/-   ##
=============================================
+ Coverage     41.71%    41.74%   +0.03%
=============================================
  Files          2033      2033
  Lines        181062    181209     +147
=============================================
+ Hits          75529     75648     +119
- Misses        99311     99338      +27
- Partials       6222      6223       +1
…flow/flow-go into jord/6777-service-event-count-upgrade
// This version adds the following changes:
// - Non-system-chunk service event validation support (adds ChunkBody.ServiceEventCount field)
// - EFM Recovery (adds EpochCommit.DKGIndexMap field)
type Modelv2 struct {
🤔 This time we need to introduce a new model, but for future upgrades I think it makes sense to separate the protocol version from the KV store version. If we had introduced an integer back in the day, we would have been able to perform this upgrade without cloning the model, by simply upgrading one value.
My proposal is to include the protocol version as a separate field and use a view-based upgrader for it. This way we will be able to update the protocol without making changes to the KV store schema or introducing new versions of the models. This is a similar approach to what Alex explained in his last working group talk on upgrading different parts of the software.
After discussing this live, we decided to continue using the protocol state version to also stand in for behavioural-only changes.
- the overhead is relatively small
- we know we will need to introduce a distinct version for the execution stack, which will give us experience of dealing with multiple versions
- we will revisit this as we use the versioning system more going forward
Looks nice and clean, appreciate those extra tests 🏅
// perform actual replication to the next version
// perform actual replication to the next currentVersion
I think this change is incorrect. currentVersion should always be 0, because it is the version of the parent model:
currentVersion := model.GetProtocolStateVersion()
I think if you accepted my previous comment, we could just skip this here
// perform actual replication to the next version
// perform actual replication to the next currentVersion
and the code would read
// version change: Modelv0 only supports upgrade to protocolVersion = 1
if protocolVersion != 1 {
	return nil, fmt.Errorf("protocol state's current version %d only supports replication into version %d but requested was version %d: %w",
		currentVersion, 1, protocolVersion, ErrIncompatibleVersionChange)
}
v1 := &Modelv1{
	Modelv0: clone.Clone(*model),
}
Like the above, the currentVersion in the error message is just a refactor error (fixed now).
nextVersion := currentVersion + 1
if protocolVersion != nextVersion {
	// can only Replicate into model with numerically consecutive currentVersion
	return nil, fmt.Errorf("unsupported replication currentVersion %d, expect %d: %w",
		protocolVersion, 1, ErrIncompatibleVersionChange)
}

// perform actual replication to the next currentVersion
I am concerned about this, because it in principle permits arbitrary increments. We have a hard-coded type into which we are replicating: Modelv2. Hence, I would be inclined to also pin the protocolVersion in the implementation.
To be fair, in the current implementation, your sanity check in lines 189-191 would probably catch an unsupported protocolVersion. However, that sanity check is based on the implementation of the model that we are replicating into (Modelv2), while the logic governing the replication is part of Modelv1. Therefore, I would be inclined to not depend on implementation details of a different struct.
- nextVersion := currentVersion + 1
- if protocolVersion != nextVersion {
- 	// can only Replicate into model with numerically consecutive currentVersion
- 	return nil, fmt.Errorf("unsupported replication currentVersion %d, expect %d: %w",
- 		protocolVersion, 1, ErrIncompatibleVersionChange)
- }
- // perform actual replication to the next currentVersion
+ // version change: Modelv1 only supports upgrade to protocolVersion = 2
+ if protocolVersion != 2 {
+ 	return nil, fmt.Errorf("protocol state's current version %d only supports replication into version %d but requested was version %d: %w",
+ 		currentVersion, 2, protocolVersion, ErrIncompatibleVersionChange)
+ }
Not sure I understand your concern...
The current implementation does:
currentVersion := model.GetProtocolStateVersion()
// ...
nextVersion := currentVersion + 1
if protocolVersion != nextVersion { /* error */ }
It enforces that protocolVersion is exactly equal to currentVersion+1, and currentVersion is a property of ModelV1 (the struct we are defining the method on).
The reason I implemented it this way, rather than hard-coding the numeric 2, is that this approach is general and will work without modification for the next model version. (when this is inevitably copy-pasted, less needs to be changed)
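The two positions in this thread can be compared side by side in a small sketch. The types below are simplified stand-ins rather than the real KV store models, and the method names `replicateRelative` and `replicatePinned` are hypothetical; only `GetProtocolStateVersion` and `ErrIncompatibleVersionChange` are named in the snippets above.

```go
package main

import (
	"errors"
	"fmt"
)

// ErrIncompatibleVersionChange mirrors the sentinel error named in the review snippets.
var ErrIncompatibleVersionChange = errors.New("incompatible version change")

// Modelv1 and Modelv2 are simplified stand-ins, not the real KV store models.
type Modelv1 struct{ Payload string }
type Modelv2 struct{ Modelv1 }

func (m *Modelv1) GetProtocolStateVersion() uint64 { return 1 }
func (m *Modelv2) GetProtocolStateVersion() uint64 { return 2 }

// replicateRelative accepts exactly currentVersion+1, so the check itself
// needs no edits when it is copied to a later model.
func (m *Modelv1) replicateRelative(protocolVersion uint64) (*Modelv2, error) {
	currentVersion := m.GetProtocolStateVersion()
	nextVersion := currentVersion + 1
	if protocolVersion != nextVersion {
		return nil, fmt.Errorf("unsupported replication version %d, expect %d: %w",
			protocolVersion, nextVersion, ErrIncompatibleVersionChange)
	}
	return &Modelv2{Modelv1: *m}, nil
}

// replicatePinned hard-codes the target version next to the hard-coded target type.
func (m *Modelv1) replicatePinned(protocolVersion uint64) (*Modelv2, error) {
	currentVersion := m.GetProtocolStateVersion()
	if protocolVersion != 2 {
		return nil, fmt.Errorf("current version %d only supports replication into version %d but requested was version %d: %w",
			currentVersion, 2, protocolVersion, ErrIncompatibleVersionChange)
	}
	return &Modelv2{Modelv1: *m}, nil
}

func main() {
	m := &Modelv1{Payload: "kv"}
	for _, target := range []uint64{2, 3} {
		_, errRelative := m.replicateRelative(target)
		_, errPinned := m.replicatePinned(target)
		fmt.Printf("target %d: relative err=%v, pinned err=%v\n", target, errRelative, errPinned)
	}
}
```

As long as Modelv1 reports version 1, both variants accept only version 2. They diverge only if the value returned by `GetProtocolStateVersion` changes while the target type stays Modelv2.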
// NewDefaultKVStore constructs a default Key-Value Store of the *latest* protocol version for bootstrapping.
// Currently, the KV store is largely empty.
// TODO: Shortcut in bootstrapping; we will probably have to start with a non-empty KV store in the future;
// TODO(efm-recovery): we need to bootstrap with v1 in order to test the upgrade to v2. Afterward, we should bootstrap with v2 by default for new networks.
"bootstrap with v2 by default for new networks."
Only for new networks? After the change, can we bootstrap new nodes with version v2?
This function is only used for network-scale bootstrapping, which is why the comment talks about networks rather than nodes. A new node would bootstrap with whatever version the network is currently running, based on the root protocol state snapshot it boots from.
if version < 2 {
	return flow.NewChunk_ProtocolVersion1, nil
} else {
	return flow.NewChunk, nil
}
I would prefer to make this as strict as possible. This code might be living for multiple months until we spork into mainnet 27. There might be more updates in between now and the spork. I would like to avoid making assumptions about the nature of future updates (e.g. that they do not change the chunk format).
- if version < 2 {
- 	return flow.NewChunk_ProtocolVersion1, nil
- } else {
- 	return flow.NewChunk, nil
- }
+ switch version {
+ case 1:
+ 	return flow.NewChunk_ProtocolVersion1, nil
+ case 2:
+ 	return flow.NewChunk, nil
+ default:
+ 	return nil, fmt.Errorf("unsupported chunk version: %d", version)
+ }
"There might be more updates in between now and the spork. I would like to avoid making assumptions about the nature of future updates"
Every other part of our codebase already implicitly makes assumptions about the nature of future updates, by virtue of not having every component, model, etc. tied to a protocol version number. It will always ultimately be the responsibility of the upgrade implementor to make sure it is safe and compatible with the existing logic and data structures.
I don't have a problem with the suggestion here on its own, but I don't think it represents a sustainable strategy for managing version-linked code changes. Effectively all of the time, when we fix a bug or add a feature, we intend for that logical change to persist in perpetuity, until it is actively changed again. I expect the same to be true for bug fixes and features introduced through a protocol version upgrade.
If in the future an engineer is changing the Chunk data structure for a protocol upgrade, they already need to carefully investigate and update logic where the Chunk is being used. This error will help them find one of the codepaths they need to change, but there are many more. Had they failed to notice this codepath without the error, they very likely would also have failed to notice other codepaths that don't have a version switch statement. Conclusion: sometimes helpful, but far from sufficient as a safety measure.
By contrast, if a future protocol upgrade changes something completely unrelated to the Chunk, they will need to update a growing number of these switch statements (minor inconvenience). Now suppose our version-based switch statement is in an infrequently accessed codepath (say a slashing challenge). The engineer tests their changes on the new protocol version, but their testing doesn't include our infrequently accessed codepath on the new version. Now we have introduced a severe bug that might only appear in production, after the version upgrade is complete. Conclusion: usually only a minor annoyance, but potential for severe bugs.
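The trade-off discussed in this thread can be made concrete with a small sketch. `Chunk`, `chunkConstructor`, `newChunkV1`, and `newChunkV2` are simplified, hypothetical stand-ins for the real `flow` types and constructors, and version 3 is an imagined future upgrade that does not touch the chunk format.

```go
package main

import "fmt"

// Chunk is a simplified stand-in for the real chunk model.
type Chunk struct {
	ServiceEventCount *uint16 // left nil by the version-1 constructor in this sketch
}

type chunkConstructor func(serviceEvents uint16) *Chunk

func newChunkV1(_ uint16) *Chunk { return &Chunk{} }
func newChunkV2(n uint16) *Chunk { return &Chunk{ServiceEventCount: &n} }

// constructorLenient treats every version from 2 onward as using the new format.
func constructorLenient(version uint64) (chunkConstructor, error) {
	if version < 2 {
		return newChunkV1, nil
	}
	return newChunkV2, nil
}

// constructorStrict accepts only explicitly listed versions.
func constructorStrict(version uint64) (chunkConstructor, error) {
	switch version {
	case 1:
		return newChunkV1, nil
	case 2:
		return newChunkV2, nil
	default:
		return nil, fmt.Errorf("unsupported chunk version: %d", version)
	}
}

func main() {
	for _, v := range []uint64{1, 2, 3} {
		_, errLenient := constructorLenient(v)
		_, errStrict := constructorStrict(v)
		fmt.Printf("version %d: lenient err=%v, strict err=%v\n", v, errLenient, errStrict)
	}
}
```

For the imagined version 3, the lenient variant keeps producing version-2 chunks, while the strict variant fails until someone adds a case. That is the early warning the reviewer wants, and also the failure in a rarely exercised codepath that the author is worried about.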
Great work. This PR had a lot of change surface. Thanks for grinding through all the details.
Mostly added suggestions regarding documentation.
Co-authored-by: Alexander Hentschel <alex.hentschel@flowfoundation.org>
This PR addresses #6777. It adds a new v2 model to the KVStore, which has no new fields, but is necessary for upgrade coordination. It also implements version-aware logic for the EN and SN to construct and require the appropriate data model, based on the reference block protocol version. Details of the upgrade are in Notion.
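A rough sketch of what requiring the appropriate data model based on the reference block's protocol version could look like on the validating side. The function name `checkServiceEventCount`, the error `ErrInvalidServiceEventCount`, and the simplified `Chunk` type are hypothetical. The rule itself (nil before version 2, set from version 2 on) is an assumption made for illustration; the PR text shown here does not spell it out.

```go
package main

import (
	"errors"
	"fmt"
)

// ErrInvalidServiceEventCount is a hypothetical sentinel for this sketch.
var ErrInvalidServiceEventCount = errors.New("invalid ServiceEventCount")

// Chunk is a simplified stand-in for the real chunk model.
type Chunk struct {
	ServiceEventCount *uint16
}

// checkServiceEventCount assumes (not confirmed by the PR text) that chunks
// referencing blocks with protocol version <2 must leave the field nil, and
// chunks referencing blocks with version >=2 must set it.
func checkServiceEventCount(refBlockVersion uint64, chunk Chunk) error {
	if refBlockVersion < 2 {
		if chunk.ServiceEventCount != nil {
			return fmt.Errorf("version %d expects a nil count: %w",
				refBlockVersion, ErrInvalidServiceEventCount)
		}
		return nil
	}
	if chunk.ServiceEventCount == nil {
		return fmt.Errorf("version %d expects a non-nil count: %w",
			refBlockVersion, ErrInvalidServiceEventCount)
	}
	return nil
}

func main() {
	n := uint16(0)
	fmt.Println(checkServiceEventCount(1, Chunk{}))
	fmt.Println(checkServiceEventCount(2, Chunk{}))
	fmt.Println(checkServiceEventCount(2, Chunk{ServiceEventCount: &n}))
}
```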
A lot of the upgrade logic implemented here is intended to be removed prior to the next spork. These blocks are annotated with comments
// TODO(mainnet27, #6773): ...
and marked as deprecated where possible.

Changes

EN Upgrade Logic (Block Computer)
- … ServiceEventCount for execution results referencing blocks with protocol version <2
- … ServiceEventCount for execution results referencing blocks with protocol version >=2

SN Upgrade Logic (Receipt Validator)
- … ServiceEventCount for execution results referencing blocks with protocol version <2
- … ServiceEventCount for execution results referencing blocks with protocol version >=2

Protobuf Definition
- … ServiceEventCount field to our Protobuf models
- … ServiceEventCount non-optional)

TODOs before merging
Upgrade Integration Tests
CI job is failing, because the Protocol HCU test we have expects us to halt when entering version 2, but now we have a real version 2... Not sure how to address this in a way that doesn't require changes every time we do a protocol upgrade 🤔