After runtime upgrade, Runtime API calls use new code with unmigrated storage #64

Open
JoshOrndorff opened this issue Oct 11, 2021 · 25 comments · May be fixed by #6029
Labels
I2-bug The node fails to follow expected behavior. T1-FRAME This PR/Issue is related to core FRAME, the framework.

Comments

@JoshOrndorff
Contributor

If the Runtime is upgraded in block N (by a system::set_code transaction, for example), the new code is immediately written to storage. However, the storage migrations contained in that upgrade are not executed until the beginning of block N + 1. This is fine for extrinsics and other on-chain execution, because there is no on-chain execution between the end of block N and the beginning of block N + 1.

However, runtime APIs can be called from off-chain contexts at any time. For example, maybe an RPC call allows users to fetch data that must be gathered from a Runtime API. Or maybe the node runs a maintenance task that calls a runtime API at a regular interval.

Any runtime API call that is made against the state associated with block N will use the new code with the old, unmigrated storage. In general the storage schema between these two runtime versions will not match, which leads to misinterpreting stored data.
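To make the failure mode concrete, here is a hypothetical illustration (made-up types, using parity-scale-codec with its derive feature, not taken from any real pallet): if the upgrade reorders the fields of a stored struct, the new runtime will SCALE-decode the old, unmigrated bytes into the new layout without any error and simply return wrong values.

```rust
use parity_scale_codec::{Decode, Encode};

// Layout written by the old runtime (this is what block N's state contains).
#[derive(Encode, Decode)]
struct OldAccountInfo {
    nonce: u32,
    free_balance: u128,
}

// Layout the new runtime expects; the migration that would rewrite the stored
// value only runs at the beginning of block N + 1.
#[derive(Encode, Decode)]
struct NewAccountInfo {
    free_balance: u128,
    nonce: u32,
}

fn main() {
    let stored = OldAccountInfo { nonce: 7, free_balance: 1_000 }.encode();
    // A runtime API served by the new code reads the old bytes: no decode
    // error, just silently wrong values.
    let misread = NewAccountInfo::decode(&mut &stored[..]).expect("lengths match");
    assert_ne!(misread.nonce, 7);
    assert_ne!(misread.free_balance, 1_000);
}
```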

@bkchr bkchr added the I3-bug label Oct 11, 2021
@bkchr
Member

bkchr commented Oct 11, 2021

Fuck...

Preventing runtime migrations from being executed over and over again was part of the problem we tried to fix by skipping on_initialize, but yeah, we clearly did not think about the described problem...

There are multiple solutions:

  1. Make users aware of this situation and that they may need to trigger the runtime migration inside the runtime api implementation before executing the code.
  2. Prevent calling the runtime api when the state isn't migrated yet.
  3. Make the runtime api always use the runtime blob from the block "before". This means when we want to execute a runtime api on the state of block X, we will use the runtime blob from X - 1. In case there was a runtime upgrade, this means that we will use the correct runtime that is still able to interpret the non-migrated state. However, we would still need to support the case where you want to call the runtime associated with the block itself. This is, for example, required for block authoring: if we author a block, we call initialize_block, and that triggers the runtime migration, so we are able/required to use the new runtime code.

I'm in favor of solution 3.
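For reference, solution 1 in the list above would look roughly like the sketch below inside impl_runtime_apis!. All pallet, migration and method names are made up, and note that any state changes made here only live in the call's storage overlay and are discarded afterwards:

```rust
// Inside `impl_runtime_apis!` — every name below is illustrative only.
impl some_pallet::SomeApi<Block> for Runtime {
    fn pending_value() -> u32 {
        use frame_support::traits::{GetStorageVersion, OnRuntimeUpgrade, StorageVersion};
        // If this block's state was written by the previous runtime, run the
        // migration before reading anything. The writes only affect the
        // overlay of this runtime API call, not the chain state.
        if some_pallet::Pallet::<Runtime>::on_chain_storage_version() < StorageVersion::new(2) {
            let _weight = some_pallet::migrations::MigrateToV2::<Runtime>::on_runtime_upgrade();
        }
        some_pallet::Pallet::<Runtime>::pending_value()
    }
}
```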

@JoshOrndorff
Contributor Author

JoshOrndorff commented Oct 11, 2021

Solution 3 makes sense to me as well.

@apopiak
Contributor

apopiak commented Oct 11, 2021

Another potential solution:

  4. Set the code we're upgrading to as another storage item (e.g. :pending_code, similar to Polkadot's pending PVF IIRC) and have block authoring check that storage item when authoring blocks. Then move :pending_code to :code as the last part of a runtime migration.

I think I would tend to prefer 3, though.
(Edit: Tomek's point of maybe being able to recover from an invalid migration has me excited for 4 :-D)

@tomusdrw
Contributor

(4) has the advantage of being able to recover from a potentially invalid migration (catch the failure and clear :pending_code).

Also, making the possible migration explicit for block authorship allows us to potentially recover from broken runtimes (i.e. block authorship could decide to use the old code in case the new one is not working, during some after-upgrade grace period). Obviously it's a potential runtime-upgrade censorship possibility by block authors, but that's similar to refusing to accept the runtime upgrade extrinsics.

I like both 4 & 3 in that particular order.

@bkchr
Member

bkchr commented Oct 11, 2021

obviously it's a potential runtime-upgrade censorship possibility by block authors

We could panic inside the runtime if someone tries to build a block without using the pending code.

@librelois
Contributor

Is there any progress on this?

@bkchr
Member

bkchr commented Mar 11, 2022

@RGafiyatullin will start looking into this.

@RGafiyatullin RGafiyatullin self-assigned this Mar 16, 2022
@skunert
Contributor

skunert commented Mar 18, 2022

Our team had a discussion about this issue today. A visualization was created; I'll leave it here for reference: [visualization attached to the original comment]

@bkchr
Member

bkchr commented Mar 18, 2022

We had some lengthy discussion about the topic today. We mainly spoke about solutions 3 & 4.

We assume that if there is a multi-block migration, the first step of the migration fixes all the important storage items and sets some value that informs external tooling that we are currently in a migration phase and that the data should not be read. We can also assume that if there are several migrations, they are pushed back so that only one "complicated" migration is applied at a time. Aka we don't try to migrate the entire world at once.

Solution 3

One of the main problems with this solution is pruning. Currently, when we want to execute something on block X, we only need the state of block X. However, with this solution we would also need the state of block X - 1. The probability of that being a problem is relatively low, as we only prune states a certain number of blocks behind the last finalized block.

We would basically change any runtime execution to use the runtime of block X - 1, and only when we build or import a block would we use the runtime of X, so that we can run the migrations etc. This would solve the problem on the node side; however, external tooling would also need to be changed to do this. Otherwise it may still read the runtime of block X, extract the wrong metadata and then interpret a storage entry in the wrong way.

Solution 4

As with solution 3, we need to tell the block execution that we want to use :pending_code if it exists in the storage. This would be done for block production and block import. The runtime would set :pending_code instead of :code directly. :pending_code would only ever be set for at most one block. To prevent any censorship we need to ensure that :pending_code is moved to :code in the next block at the end of block production (aka in finalize_block). :pending_code would need to consist of the actual runtime code plus the block number at which it was set. Then we can add a check in finalize_block that determines whether :pending_code should be moved into :code or whether that should happen in the next block. We need to do this check in finalize_block, because someone could execute initialize_block while :pending_code is already set, and we should not panic in that case.

External tooling would also not be influenced by this, because it can just use :code, which is still the correct runtime code for the un-migrated state.
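A rough sketch of what the runtime side of this could look like, ignoring weight accounting; the keys other than :code and the bare u32 block number are assumptions for illustration, not actual Substrate definitions:

```rust
use parity_scale_codec::{Decode, Encode};

// Hypothetical well-known keys; only `:code` exists in Substrate today.
const CODE: &[u8] = b":code";
const PENDING_CODE: &[u8] = b":pending_code";
const PENDING_CODE_SET_AT: &[u8] = b":pending_code_set_at";

/// What `set_code` would do instead of writing `:code` directly.
fn stage_code_upgrade(new_code: &[u8], now: u32) {
    sp_io::storage::set(PENDING_CODE, new_code);
    sp_io::storage::set(PENDING_CODE_SET_AT, &now.encode());
}

/// Called from `finalize_block`: promote `:pending_code` to `:code` in the
/// block *after* the one that staged it, so it never lives longer than one block.
fn promote_pending_code(now: u32) {
    let (Some(code), Some(raw_at)) = (
        sp_io::storage::get(PENDING_CODE),
        sp_io::storage::get(PENDING_CODE_SET_AT),
    ) else {
        return;
    };
    if u32::decode(&mut &raw_at[..]).map_or(false, |set_at| set_at < now) {
        sp_io::storage::set(CODE, &code);
        sp_io::storage::clear(PENDING_CODE);
        sp_io::storage::clear(PENDING_CODE_SET_AT);
    }
}
```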

@RGafiyatullin

Comparing solutions 3 and 4

Requires changes in Substrate/Client?

  • 3: yes;
  • 4: yes.

For both solutions, in the API for Runtime-invocation the argument block_id: BlockID would be replaced by something like block_id: BuildOnTopOf(BlockID) | QueryExisting(BlockID).
Or at least it would behave as

  • BuildOnTopOf(BlockID) in CallExecutor::contextual_call(...) as it constructs the block;
  • and as QueryExisting(BlockID) in CallExecutor::call(...) and CallExecutor::prove_execution(...).
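Roughly, the proposed argument change amounts to something like the following (the variant names are the ones above; the enum itself is only an illustrative sketch):

```rust
/// Which runtime should serve a call made against the state of this block.
enum RuntimeCallContext<BlockId> {
    /// We are constructing (or importing) a child block: use the runtime in
    /// this block's own state, since `initialize_block` will run the
    /// migrations before anything else executes.
    BuildOnTopOf(BlockId),
    /// We are only querying existing state: use the runtime that produced
    /// this block, so unmigrated storage is read by the code that wrote it.
    QueryExisting(BlockId),
}
```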

In the case of solution 4, during block construction the client should also enforce a max age of 1 block for the :pending_code state value (probably by comparing the current value with the corresponding entry in the previous block's state snapshot). A restriction on the minimal pruning length should probably be introduced: block production would be impossible without the state snapshots of the two preceding blocks.

In the case of solution 3, a pruning length of N would render the Nth-to-last block unqueryable (as querying it would require the state snapshot of a block that has already been pruned). That can be fixed by keeping one block more than required by the pruning settings, if indeed needed.

Requires changes in Substrate/Runtime?

  • 3: no;
  • 4: yes (the support for :pending_code in the system pallet).

Solution 4 would require changes to the system.set_code and system.set_code_without_checks implementations.

Solution 4 would not be tolerant of setting state[:code] directly (the runtime-upgrade technique described in https://spec.polkadot.network/develop/#sect-loading-runtime-code).

Requires changes outside of Substrate?

  • 3: yes;
  • 4: no.

For solution 3, two methods for proof-of-execution would need to be provided in the API:

  • the existing Prove { data_from: BlockID };
  • the new ProveWithCode { data_from: BlockID, code_from: BlockID }.
    Apparently the former is just the degenerate case of the latter — where the data_from == code_from.
    It would be up to the requester to choose which way of proof-generation to trust.
    This could be used to gradually phase out the old proof-method.

For solution 3, the proof of execution would need to be changed:

  • instead of providing the path BlockHash -> StateDB-Hash -> ... -> :code
  • provide the path BlockHash -> ParentHash -> StateDB-Hash -> ... -> :code.
    It seems the proof's size won't change terribly.
    I could use a hand in evaluating how disruptive this is from the point of view of the projects that already use Substrate's proofs of execution.
    Is THAT the spec for the latter?

Works retroactively?

  • 3: yes;
  • 4: no.

The value of that criterion seems relatively small: if there were hundreds of code updates during several millions of blocks in the history, the probability of hitting an ill-behaving block is small.

The outcomes of queries to the blocks at which the runtime was upgraded fall into three cases:

  • works correctly:
    the data used by the query DID NOT change its encoding during the upgrade — so it just happens to work alright;
  • returns an error:
    the data used by the query DID change its encoding during the upgrade in a way
    that it cannot be decoded, e.g. the size of the record changed;
  • works incorrectly:
    the data used by the query DID change its encoding during the upgrade, but in a way
    that still does not cause a failure during the decoding attempt, e.g. the fields' order changed.
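A tiny illustration of the "returns an error" case (hypothetical types, nothing from a real pallet):

```rust
use parity_scale_codec::{Decode, Encode};

fn main() {
    // The old runtime stored a bare u32 under some key; the upgraded runtime
    // reads the same key as (u32, u64). The record grew, so decoding the old,
    // unmigrated value fails outright.
    let old_bytes = 42u32.encode();
    assert!(<(u32, u64)>::decode(&mut &old_bytes[..]).is_err());
}
```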

@RGafiyatullin

I might be rooting for solution 3 all too clearly when I'm supposed to dispassionately evaluate each solution... :\

@tomaka
Contributor

tomaka commented Apr 28, 2022

I'm in favor of 4 due to the surprise factor.

If you approach Substrate and learn about the fact that :code contains the Wasm runtime, how are you supposed to guess that it's not actually this value that is used but the one at the block before? This is extremely surprising and confusing, and strongly feels like a big hack to me.

While :pending_code also raises an eyebrow (the fact that the block production uses :pending_code instead of :code), at least it makes things explicit. As a Substrate newcomer, you raise an eyebrow because you don't know what :pending_code is for, but at least there's no possibility of guessing things wrong.

@cheme
Contributor

cheme commented Apr 29, 2022

Just wanted to add that solution 4 has a cost:
at block N + 1, when moving :pending_code to :code, the pending code will end up in the PoV.

@pepyakin
Contributor

pepyakin commented Apr 29, 2022

Not really a formed idea, but what if there was a no. 5, i.e. execute the migration eagerly? Say, a runtime upgrade would first set :code and then perform the migrations before the block is finished. This would make calls to the Runtime API at N + 1 see the migrations.

  • One way this could be achieved if the current runtime instantiates the new runtime at N and then calls a dedicated entrypoint in the new runtime which runs the migrations and then returns control back.
  • Alternatively, the current runtime stops executing and gives up control to the new runtime. A special runtime function finishes the migration and finalizes the block, potentially using a completely different finalization routine.

Note that this approach does not change the current model too much. It does not require alteration of the client behavior either.

We would, of course, need a host function to pull off those shenanigans with calling the new runtime. The logic of this thing is not a problem. The problem is the implication of needing to compile the code before execution; wasmtime may spend quite some time on that. Another option is to use wasmi for that, or to bring onboard an additional wasm executor that does not require expensive compilation.
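To make the shape of the idea concrete, here is a sketch from the old runtime's point of view; run_migrations_of is a purely hypothetical host function, nothing like it exists in sp-io today:

```rust
/// Hypothetical host function: instantiate `new_code`, call a dedicated
/// `migrate_storage` entrypoint in it against the current storage overlay, and
/// return the consumed weight. Would have to be provided via sp-runtime-interface.
fn run_migrations_of(_new_code: &[u8]) -> u64 {
    unimplemented!("host-side implementation")
}

/// What `set_code` would do under this "solution 5": upgrade and migrate
/// eagerly within block N, so runtime API calls against block N's state
/// already see the new storage layout.
fn apply_code_upgrade(new_code: &[u8]) {
    sp_io::storage::set(b":code", new_code);
    let _weight = run_migrations_of(new_code);
}
```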

@tomaka
Contributor

tomaka commented Jun 30, 2022

I'm really strongly in favor of solution 4. I think that solution 3 is a horrible idea whose added hidden complexity will cause at least one major bug in the future, and I'm writing this comment so that if solution 3 gets picked I can be quoted on that.
I think that picking a (subjectively) worse solution over a (subjectively) better one just because it is easier to implement is also a bad idea.

@RGafiyatullin

RGafiyatullin commented Jun 30, 2022

@tomaka , I understand your concerns, but nonetheless respectfully disagree.

As I see it, the "runtime arguments" are not limited to :code only; there is also :heappages.
Should the same equilibristics be done with :heappages (i.e. should we introduce :pending_heappages)?

I think that solution 3 is a horrible idea whose added hidden complexity will cause at least one major bug in the future

Would you please elaborate as to which exactly one major bug in the future will be added by this implementation?

@RGafiyatullin

I also believe that "to query a block, use the same code that was used to produce that block" is much easier to express in the specification than:

  • the Runtime's responsibility is to store :pending_code;
  • the Client's responsibility is to perform the upgrade procedure if it finds :pending_code;
  • the same value of :pending_code must not live for more than one block;

@tomaka
Contributor

tomaka commented Jun 30, 2022

Should the same equilibristics be done with :heappages (i.e. should we introduce :pending_heappages)?

Yes!

Would you please elaborate as to which exactly one major bug in the future will be added by this implementation?

My fear is precisely that our system has so many corner cases that it is impossible to detect bugs because they become extremely subtle.

For example, one possible bug is that system_dryRun would, I guess, use the parent's runtime, but including the transaction in the chain will use the child's runtime. This can lead to system_dryRun succeeding but including the transaction failing, or vice versa. Similarly, querying fees and generating a nonce will not use the same runtime as actually including the transaction on chain.
I guess you could make these specific RPC functions call the new runtime, while other RPC functions would call the old runtime. Again, this is surprising.

Another source of bugs might come from the fact that calling Core_version will now return the version of the parent block's runtime, which to me is very surprising. This means that on the RPC level, state_getRuntimeVersion will now return a different value than state_call("Core_version", &[]).
And then if you use state_getRuntimeVersion to determine the list of runtime entry points that are supported (like you're supposed to), well you're out of luck because the list will not match what you can actually call.
So state_getRuntimeVersion would need to return the parent's version, or we should add a state_getParentRuntimeVersion.
And if you use state_getRuntimeVersion to determine the transaction_version to put in the transactions that you submit, well again you're out of luck because your transaction will be invalid.

Similarly, how do you handle getting the metadata? When the runtime changes, so does the metadata. If a JSON-RPC client queries the metadata, do you return the new metadata or the old metadata? If you return the new metadata, then the client would need to get the metadata of the parent, which is surprising. If you return the old metadata, then there's simply no way for the client to obtain the metadata of the new runtime until a child block has been generated.
If any JSON-RPC client had a mapping of hash(:code) <=> metadata, this is now broken.

@pepyakin
Contributor

Just a small wasm note: :heappages is more of a historic artifact and there was interest in migrating off it. Here are some options:

  • Not limit the number of linear memory pages and let it grow dynamically up to the maximum 4 GiB limit. That option was dubbed "elastic heap".
  • Just rely on the maximum number of wasm pages specified by the module itself. That has slightly different semantics from the current mechanism though.
  • In case that's too much burden for migration, we could replicate the very same mechanism by supplying the number of heappages as a custom section embedded within wasm.

@apopiak
Contributor

apopiak commented Jul 1, 2022

Not really a formed idea, but what if there was a no. 5, i.e. execute the migration eagerly? Say, a runtime upgrade would first set :code and then perform the migrations before the block is finished. This would make calls to the Runtime API at N + 1 see the migrations.

👍 for thinking outside of the box

One thing that pops to mind as a problem with this is that the weight cost of the migration would be due in the block that sets the code, so we would probably want to avoid including other transactions in that block, and it might still be overweight.
It would also change the assumption for runtime migrations (that they run at the start of the block, whereas now they would run at the end), but that should be manageable.

@RGafiyatullin

@tomaka, thanks for the very apt examples.

I don't think there is supposed to be a 1:1 mapping between the JSON-RPC calls and the Runtime API calls.

If we can clearly state that the JSON-RPC API has no intention to mirror/mimic/resemble Runtime API, we can look closer and find that there are in fact two problems:

  • The JSON-RPC API calls stubbornly try to have a slightly ambiguous at: Option<Block::Hash> argument;
  • The way the Runtime's traits are exposed to the Client (two sets of methods: one with and one without ExecutionContext).

JSON-RPC: the at: Option<Block::Hash> argument

I am about to resort to the "feelings, emotions and impressions" that a developer may have when exposed to the API-documentation and method signatures, i.e. the following is bound to be somewhat subjective/biased/opinionated.

If I had a look at the call state_getRuntimeVersion(block_id: Option<BlockHash>) -> RuntimeVersion, I would assume that it returns the version of the Runtime that will handle my read-only queries.

If I had an intention to find out what Runtime version would be used to construct a block on top of given BlockHash, I would be searching for something like state_getRuntimeVersionForNextBlock(parent_block: Option<Block::Hash>).

If I keep "Runtime API != JSON-RPC API" in my mental cache, the result of invoking state_call("Core_version", &[], existing_block_id: Option<BlockHash>) is no surprise at all.
Although I think that state_call is part of the "plumbing API" (as opposed to the "porcelain API"), and therefore it most probably begs for some sort of Context: Construction | OffchainCall argument.

Runtime traits called by the Client, decl_runtime_apis!/impl_runtime_apis!

When implementing another JSON-RPC method, one has to interact with the Runtime via the codegen'ed methods that do the heavy lifting of actually invoking the CallExecutor.

The more compelling choice in this case is to use the generated method version that does not require the ExecutionContext.

If developers had to specify the ExecutionContext with each invocation of the Runtime, there would be more awareness and the confusion could be avoided.

Summary

I strongly believe that this is an API-design problem: it should not be solved by changing the way blocks are constructed.

@tomaka
Contributor

tomaka commented Jul 5, 2022

The argument I've been making in this issue is that the design of Substrate is already extremely complicated.
I think that you and probably everyone reading this issue, as people who are very familiar with how Substrate works, are extremely biased. Our design is freaking over-complicated and desperately begs for simplifications. If there is a reason why Substrate can fail as a product, this is it: it is over-complicated.

The issue obviously needs to be fixed somehow, and whatever the solution, the fix will necessarily introduce some complexity. But to me, solution 3 introduces way more complexity than solution 4 from the point of view of the runtime builders and the frontend developers, who are the people that matter.
We want runtime builders and frontend developers to write code that works, and especially in rare situations such as a runtime upgrade that are hard to test, you want their same naively-written code that always works to continue working.

With solution 3 you fundamentally introduce complexity in the API. Yes you can design an API that lets you explicitly choose whether you want to call the "reading runtime" or the "building runtime", but that doesn't remove the fact that builders and frontend developers then need to understand the difference between the "reading runtime" and the "building runtime".

Solution 4 introduces more complexity on the client side, but that's ok, because the client already has tons of checks to perform at each block and needs special handling for runtime upgrades anyway. Implementing solution 4 on the client side is normally just a few more ifs here and there.

@the-right-joyce the-right-joyce added T1-FRAME This PR/Issue is related to core FRAME, the framework. and removed T1-runtime labels Aug 25, 2023
serban300 pushed commits to serban300/polkadot-sdk that referenced this issue Apr 8–10, 2024
jonathanudd pushed a commit to jonathanudd/polkadot-sdk that referenced this issue Apr 10, 2024
@Polkadot-Forum

This issue has been mentioned on Polkadot Forum. There might be relevant details there:

https://forum.polkadot.network/t/2024-09-17-polkadot-finality-lag-slow-parachain-production-immediately-after-runtime-upgrade-post-mortem/10057/1

sandreim added a commit that referenced this issue Sep 20, 2024
Signed-off-by: Andrei Sandu <andrei-mihail@parity.io>
@alindima alindima mentioned this issue Sep 24, 2024
alindima added a commit that referenced this issue Sep 24, 2024
alindima added a commit that referenced this issue Sep 24, 2024
@ordian ordian self-assigned this Sep 30, 2024
This was referenced Oct 11, 2024
github-merge-queue bot pushed a commit that referenced this issue Oct 21, 2024
Resolves #4776

This will enable proper core-sharing between paras, even if one of them
is not producing blocks.

TODO:
- [x] duplicate first entry in the claim queue if the queue used to be
empty
- [x] don't back anything if at the end of the block there'll be a
session change
- [x] write migration for removing the availability core storage
- [x] update and write unit tests
- [x] prdoc
- [x] add zombienet test for synchronous backing
- [x] add zombienet test for core-sharing paras where one of them is not
producing any blocks

_Important note:_
The `ttl` and `max_availability_timeouts` fields of the
HostConfiguration are not removed in this PR, due to #64.
Adding the workaround with the storage version check for every use of
the active HostConfiguration in all runtime APIs would be insane, as
it's used in almost all runtime APIs.

So even though the ttl and max_availability_timeouts fields will now be
unused, they will remain part of the host configuration.

These will be removed in a separate PR once #64 is fixed. Tracked by
#6067

---------

Signed-off-by: Andrei Sandu <andrei-mihail@parity.io>
Co-authored-by: Andrei Sandu <andrei-mihail@parity.io>
Co-authored-by: Andrei Sandu <54316454+sandreim@users.noreply.github.com>
Co-authored-by: command-bot <>
Overkillus added a commit that referenced this issue Oct 30, 2024
Overkillus added a commit that referenced this issue Oct 31, 2024