After runtime upgrade, Runtime API calls use new code with unmigrated storage #64
Comments
Fuck... Preventing runtime migrations from being executed over and over again was part of the problem that we tried to fix with skipping …. There are multiple solutions:
I'm in favor of solution 3.
Solution 3 makes sense to me as well.
Another potential solution:
(4) has the advantage of being able to recover from a potentially invalid migration (catch the failure and clear the pending code). Also, making the possible migration explicit for block authorship allows us to potentially recover from broken runtimes (i.e. block authorship could decide to use the old code in case the new one is not working during some after-upgrade grace period). Obviously this is a potential runtime-upgrade censorship possibility by block authors, but that's similar to refusing to accept the runtime upgrade extrinsics. I like both 4 & 3, in that particular order.
We could panic inside the runtime if someone tries to build a block without using the pending code.
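A minimal sketch of how such a check could look, assuming a hypothetical `PendingCodeHash` entry written when an upgrade is scheduled; none of the names below are existing Substrate APIs:

```rust
// Sketch only: enforce inside the runtime that a block built on top of a
// scheduled upgrade actually uses the pending code.

/// Stand-in: hash of the code scheduled by the parent block, if any.
fn pending_code_hash() -> Option<[u8; 32]> {
    unimplemented!("read the hypothetical PendingCodeHash storage entry")
}

/// Stand-in: hash of the wasm blob that is currently executing.
fn executing_code_hash() -> [u8; 32] {
    unimplemented!("provided by the executor / a host function")
}

/// Called at the very start of block execution (e.g. from `on_initialize`).
fn assert_pending_code_is_used() {
    if let Some(expected) = pending_code_hash() {
        // A block author that ignores the scheduled upgrade produces an
        // invalid block: every honest node re-executing it hits this panic.
        assert_eq!(
            executing_code_hash(),
            expected,
            "block was built without the pending runtime code"
        );
    }
}
```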
Is there any progress on this?
@RGafiyatullin will start looking into this.
We had some lengthy discussion about the topic today. We mainly spoke about solutions 3 & 4. We assume that if there is a multi-block migration, the first step of the migration fixes all the important storage items and sets some value that informs external tooling that we are currently in a migration phase and that the data should not be read. We can also assume that if there are several different migrations, one migration is pushed back so that only one "complicated" migration is applied at a time. Aka we don't try to migrate the entire world at once.

Solution 3
One of the main problems with this solution is pruning. Currently, when we want to execute something on block X, we only need the state of block X. However, with this solution we would also need the state of block X - 1. The possibility of that happening is relatively low, as we only prune X blocks behind the last finalized block. We would basically change any runtime execution to use the runtime of block X - 1, and only when we build or import a block would we use the runtime of X, so that we can run the migrations etc. This would solve the problem on the node side; however, external tooling would also need to be changed to do this. Otherwise it may still read the runtime of block X, extract the wrong metadata and then interpret a storage entry in a wrong way.

Solution 4
As for solution 3, we need to tell the block execution that we want to use the … External tooling would also not be influenced by this, because they can just use …
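To make the solution 3 behaviour concrete, here is a minimal sketch of the client-side runtime selection it implies; the `StateBackend` trait and the raw `:code` lookup are simplifications for illustration, not the actual Substrate client APIs:

```rust
// Solution 3 in a nutshell: read-only Runtime API calls at block X use the
// code found in the state of the parent X - 1 (the code that produced X),
// while block building/import on top of X uses the code stored at X.

trait StateBackend {
    /// Read a raw storage value from this block's state.
    fn storage(&self, key: &[u8]) -> Option<Vec<u8>>;
}

enum CallContext {
    /// Off-chain query against the state of block X (e.g. via RPC).
    Query,
    /// Building or importing a block on top of X.
    BlockExecution,
}

fn runtime_code_for_call(
    state_of_x: &dyn StateBackend,
    state_of_parent: &dyn StateBackend,
    context: CallContext,
) -> Option<Vec<u8>> {
    match context {
        // Queries never see new code over unmigrated storage, because they
        // use the code that actually produced block X.
        CallContext::Query => state_of_parent.storage(b":code"),
        // Block execution uses the code stored at X, so migrations still run
        // at the beginning of block X + 1 as they do today.
        CallContext::BlockExecution => state_of_x.storage(b":code"),
    }
}
```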
Comparing the solution 3 and solution 4 approaches

Requires changes in Substrate/Client?
For both solutions, in the API for runtime invocation, the argument …
In case of solution 4, during block construction, the client should also enforce the …

In case of solution 3, the pruning set into …

Requires changes in Substrate/Runtime?
Solution 4 would require changes in the …

Solution 4 would not be tolerant to the attempt to set the …

Requires changes outside of Substrate?
For solution 3, the two methods for proof-of-execution would need to be provided in the API:

For solution 3, the proof of execution would need to be changed:
Works retroactively?
The value of that criterion seems to be relatively small: if there were hundreds of code updates during several million blocks in the history, the probability of hitting an ill-behaving block is small. The outcomes of queries to the blocks at which the runtime was upgraded may fall into three cases:
I might be rooting for #3 all too clearly when I'm supposed to dispassionately evaluate each solution... :\
I'm in favor of 4 due to the surprise factor. If you approach Substrate and learn about the fact that … While …
Just wanted to add that solution 4 has a cost:
Not really a formed idea, but what if there was a no. 5, i.e. execute the migration eagerly? Say, a runtime upgrade would first set …
Note that this approach does not change the current model too much. It does not require alteration of the client behavior either. We would, of course, need a host function to pull off those shenanigans with calling the new runtime. The logic of this thing is not a problem. The problem is the implications of needing to compile the code before execution: wasmtime may spend quite some time on that. Another option is to use wasmi for that, or to bring onboard an additional wasm executor that does not require expensive compilation.
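A rough sketch of how this eager-migration flow could be wired up, purely for illustration and with made-up names (the real host function and executor plumbing would look different):

```rust
// Eager migration (the "no. 5" idea): run the *new* runtime's migrations via
// a hypothetical host function inside the block that applies the upgrade, and
// only then persist the new code. None of these names are existing Substrate APIs.

/// Stand-in host function: compile/instantiate `new_code` (wasmtime, or wasmi
/// to avoid the compilation cost) and call its migration entry point against
/// the current overlayed storage, returning the consumed weight.
fn host_run_migrations_with(new_code: &[u8]) -> Result<u64, ()> {
    let _ = new_code;
    unimplemented!("provided by the client/executor")
}

/// Stand-in for a raw storage write (e.g. `sp_io::storage::set`).
fn storage_set(key: &[u8], value: &[u8]) {
    let _ = (key, value);
    unimplemented!()
}

/// What an eager `set_code` could look like.
fn set_code_eagerly(new_code: Vec<u8>) -> Result<(), ()> {
    // 1. Migrate the state *now*, inside the block that applies the upgrade,
    //    so the new schema and the new code become visible together.
    let _weight = host_run_migrations_with(&new_code)?;

    // 2. Only then write the new code under the well-known `:code` key. Any
    //    Runtime API call against this block's state now sees new code over
    //    already-migrated storage.
    storage_set(b":code", &new_code);
    Ok(())
}
```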
I'm really strongly in favor of solution 4. I think that solution 3 is a horrible idea whose added hidden complexity will cause at least one major bug in the future, and I'm writing this comment so that if solution 3 gets picked I can be quoted on that.
@tomaka, I understand your concerns, but nonetheless respectfully disagree. As I see it, the "runtime arguments" are not limited to the …
Would you please elaborate on exactly which major bug in the future would be added by this implementation?
I also believe that expressing "To query a block, use the same code that was used to produce that block" in the specification is way easier than:
Yes!
My fear is precisely that our system has so many corner cases that it is impossible to detect bugs, because they become extremely subtle. For example, one possible bug is that …

Another source of bugs might come from the fact that calling …

Similarly, how do you handle getting the metadata? When the runtime changes, so does the metadata. If a JSON-RPC client queries the metadata, do you return the new metadata or the old metadata? If you return the new metadata, then the client would need to get the metadata of the parent, which is surprising. If you return the old metadata, then there's simply no way for the client to obtain the metadata of the new runtime until a child block has been generated.
Just a small wasm note:
👍 for thinking outside of the box. One thing that pops to mind as a problem with this is that the weight cost of the migration would be due when setting it, so we would probably want to avoid including other transactions in that block, and it might still be overweight.
@tomaka, thanks for the very apt examples. I don't think there is supposed to be a 1:1 mapping between the JSON-RPC calls and the Runtime API calls. If we can clearly state that the JSON-RPC API has no intention to mirror/mimic/resemble the Runtime API, we can look closer and find that there are in fact two problems:
JSON-RPC: the …
The argument I've been making in this issue is that the design of Substrate is already extremely complicated. The issue obviously needs to be fixed somehow, and no matter the solution, that fix will necessarily introduce some complexity, but solution 3 to me introduces way more complexity than solution 4 from the point of view of the runtime builders and the frontend developers, which are the people that matter.

With solution 3 you fundamentally introduce complexity in the API. Yes, you can design an API that lets you explicitly choose whether you want to call the "reading runtime" or the "building runtime", but that doesn't remove the fact that builders and frontend developers then need to understand the difference between the "reading runtime" and the "building runtime".

Solution 4 introduces more complexity on the client side, but that's ok, because the client already has tons of checks to perform at each block and needs special handling for runtime upgrades. Implementing solution 4 on the client side is normally just a few more …
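For illustration, the "few more" client-side checks that solution 4 implies might look roughly like this; the `:pending_code` key and the `StateBackend` trait are assumptions for the sketch, not existing Substrate APIs:

```rust
// Solution 4, client side: when authoring/importing on top of a block whose
// state scheduled an upgrade, prefer the pending code over the current one.

trait StateBackend {
    fn storage(&self, key: &[u8]) -> Option<Vec<u8>>;
}

/// Pick the wasm blob to execute when building/importing a child of `parent`.
fn code_for_next_block(parent_state: &dyn StateBackend) -> Vec<u8> {
    // The extra check: use the code scheduled by the parent block, if any.
    if let Some(pending) = parent_state.storage(b":pending_code") {
        return pending;
    }
    parent_state
        .storage(b":code")
        .expect("every chain state contains the runtime code")
}
```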
This issue has been mentioned on Polkadot Forum. There might be relevant details there: |
Resolves #4776. This will enable proper core-sharing between paras, even if one of them is not producing blocks.

TODO:
- [x] duplicate first entry in the claim queue if the queue used to be empty
- [x] don't back anything if at the end of the block there'll be a session change
- [x] write migration for removing the availability core storage
- [x] update and write unit tests
- [x] prdoc
- [x] add zombienet test for synchronous backing
- [x] add zombienet test for core-sharing paras where one of them is not producing any blocks

_Important note:_ The `ttl` and `max_availability_timeouts` fields of the HostConfiguration are not removed in this PR, due to #64. Adding the workaround with the storage version check for every use of the active HostConfiguration in all runtime APIs would be insane, as it's used in almost all runtime APIs. So even though the `ttl` and `max_availability_timeouts` fields will now be unused, they will remain part of the host configuration. These will be removed in a separate PR once #64 is fixed. Tracked by #6067.

Signed-off-by: Andrei Sandu <andrei-mihail@parity.io>
Co-authored-by: Andrei Sandu <andrei-mihail@parity.io>
Co-authored-by: Andrei Sandu <54316454+sandreim@users.noreply.github.com>
Co-authored-by: command-bot <>
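For context, a hedged sketch of the kind of storage-version guard the note above refers to; all names below are illustrative stand-ins rather than actual polkadot-sdk items:

```rust
// Guard a runtime API read of the active HostConfiguration against the
// "new code, unmigrated storage" window described in #64.

/// Storage layout version the new runtime code was compiled against (made up).
const EXPECTED_STORAGE_VERSION: u16 = 12;

struct HostConfiguration {
    // ... fields as understood by the *new* code ...
    max_code_size: u32,
}

/// Stand-in for reading the pallet's on-chain storage version.
fn on_chain_storage_version() -> u16 {
    unimplemented!("read the configuration pallet's storage version")
}

/// Stand-in for decoding the active configuration with the new layout.
fn decode_active_config() -> HostConfiguration {
    unimplemented!("decode ActiveConfig with the new type definition")
}

/// Runtime-API-side guard: refuse to decode rather than misinterpret old data.
fn active_config_if_migrated() -> Option<HostConfiguration> {
    if on_chain_storage_version() == EXPECTED_STORAGE_VERSION {
        Some(decode_active_config())
    } else {
        // New code over unmigrated storage: bail out instead of misreading it.
        None
    }
}
```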
If the Runtime is upgraded in block N (by a `system::set_code` transaction, for example), the new code is immediately written to storage. However, the storage migrations contained in that upgrade are not executed until the beginning of block N + 1. This is fine for extrinsics and other on-chain execution, because there is no on-chain execution between the end of block N and the beginning of block N + 1.

However, runtime APIs can be called from off-chain contexts at any time. For example, maybe an RPC call allows users to fetch data that must be gathered from a Runtime API. Or maybe the node runs a maintenance task that calls a runtime API at a regular interval.

Any runtime API call that is made against the state associated with block N will use the new code with the old, unmigrated storage. In general the storage schema between these two runtime versions will not match, which leads to misinterpreting stored data.
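To illustrate the failure mode, here is a small, self-contained example (not actual chain code) of decoding an old value with a new schema, assuming the old runtime stored a bare `u32` where the new runtime expects a struct and the migration only runs in block N + 1:

```rust
// Requires the `parity-scale-codec` crate with the `derive` feature.
use parity_scale_codec::{Decode, Encode};

#[derive(Encode, Decode, Debug)]
struct NewValue {
    amount: u32,
    frozen: bool,
}

fn main() {
    // What block N's state still contains: the *old* encoding (a plain u32).
    let old_bytes = 7u32.encode();

    // A Runtime API called against block N's state, but executed with the new
    // code, decodes those bytes with the new type: depending on the layouts
    // involved it either fails (as here) or silently produces a bogus value.
    let misread = NewValue::decode(&mut &old_bytes[..]);
    println!("decoded with new schema: {:?}", misread);
}
```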