Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: rework contract runtime error processing #4181

Merged
merged 19 commits into from
Apr 14, 2021
Merged

feat: rework contract runtime error processing #4181

merged 19 commits into from
Apr 14, 2021

Conversation

olonho
Copy link
Contributor

@olonho olonho commented Mar 31, 2021

Make sure errors processed in the similar manner between Wasmer 0.x and Wasmer 1.x.
Introduced explicit non-deterministic error.

Test plan

Reworked error cases tests to make them uniform, then cargo test.

@@ -389,6 +383,9 @@ impl fmt::Display for WasmTrap {
WasmTrap::GenericTrap => write!(f, "Generic trap."),
WasmTrap::BreakpointTrap => write!(f, "Breakpoint trap."),
WasmTrap::StackOverflow => write!(f, "Stack overflow."),
WasmTrap::NondeterministicTrap(message) => {
write!(f, "Non-deterministic trap: {}", message)
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Undelated to PR, but

The error message given by the Display representation of an error type should be lowercase without trailing punctuation, and typically concise.

https://rust-lang.github.io/api-guidelines/interoperability.html?highlight=Error#error-types-are-meaningful-and-well-behaved-c-good-err

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reworked, but requires looking on that in separate PR.

runtime/near-vm-runner/src/wasmer1_runner.rs Outdated Show resolved Hide resolved
@olonho olonho changed the title feat: rework Wasmer1 error processing feat: rework contract runtime error processing Mar 31, 2021
/// It can be convert back by wasmer_vm::TrapCode::from_str
Wasmer1Trap(String),
/// Non-deterministic error.
Nondeterministic(String),
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this error deterministic, i.e, is it guaranteed that a nondeterministic error will be wrapped in this variant?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, now it is.

VMError::FunctionCallError(FunctionCallError::WasmerRuntimeError(error_msg))
}
}
Ok(e) => return (&e).into(),
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can probably just do an if let here

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

runtime/near-vm-runner/Cargo.toml Show resolved Hide resolved
runtime/near-vm-runner/src/wasmer1_runner.rs Outdated Show resolved Hide resolved
Comment on lines 116 to 137
InvokeError::TrapCode { code, srcloc } => {
VMError::FunctionCallError(FunctionCallError::Nondeterministic(format!(
"Wasmer trap {} at {}",
*code as u32, srcloc
)))
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hm, from the light reading of the code, I am not sure that this should always be nondeterministic. It seems that, eg, division by zero generates this trap, no? Do we have a test for division by zero?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Correct, reworked that piece.

Comment on lines 198 to 200
FunctionCallError::CompilationError(_)
| FunctionCallError::LinkError { msg: _ }
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unrelated to the PR, but:

  • LinkError should probably be a tuple -- it's the only one with msg field, which feels inconsistent
  • I am not entirely sure we want to distinguish between CompilationError and LinkError, they feel sufficiently similar to me in the context of wasm execution.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Guess so, but this PR already become bigger than I want it to be.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Interesting, I, actually, prefer it to be structured (and update the rest of variants as well if necessary) since String does not capture the meaning of the text value 😐

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The intention here is to have error separated between contract executor and the rest of logic. We can make it a enum, but the only purpose of this error is to show error to humans, that's why I made this change.

/// It can be convert back by wasmer_vm::TrapCode::from_str
Wasmer1Trap(String),
/// Non-deterministic error.
Nondeterministic(String),
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I doubt of still keep it WasmUnkownError is a good for wasmer 1. The internal of it is really unknown in wasmer 0.x (UnknownTrap etc.) but it's indeed known in wasmer 1. I suggest name it as WasmerRuntimeError or WasmerInternalError

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reworked, so that Nondeterministic is properly returned.

Some(TrapCode::VMOutOfMemory) => {
FunctionCallError::Nondeterministic("Wasmer out of memory".to_string())
}
None => FunctionCallError::Nondeterministic(error_msg),
Copy link
Member

@ailisp ailisp Apr 1, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is incorrect, None only happens when wasmer returns a Generic Error or User Error that failed to downcast to VMLogicError: https://github.com/wasmerio/wasmer/blob/9e0122e1f7bdb6fedf5fc17cc9a3a0fda30031a0/lib/engine/src/trap/error.rs#L18
And Generic Error is constructed by wasmer RuntimeError::new: https://github.com/wasmerio/wasmer/blob/9e0122e1f7bdb6fedf5fc17cc9a3a0fda30031a0/lib/engine/src/trap/error.rs#L61
User Error is any struct that has Error trait.

But both of them must come from constructing such structs in wasm file manually. We only consturct VMLogicError in near-vm-logic/src/logic.rs so other User Error and GenericError should not happen. So it is not an Nondeterministic, should be debug_assert!(false) or unreachable!()

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reworked code around this place.

@ailisp
Copy link
Member

ailisp commented Apr 1, 2021

Overall looks good, this specific one looks questionable: #4181 (comment)

runtime/near-vm-runner/src/wasmer1_runner.rs Show resolved Hide resolved
runtime/near-vm-runner/src/wasmer1_runner.rs Outdated Show resolved Hide resolved
runtime/near-vm-runner/src/wasmer_runner.rs Show resolved Hide resolved
runtime/near-vm-runner/src/wasmer_runner.rs Outdated Show resolved Hide resolved
runtime/near-vm-runner/src/wasmer_runner.rs Outdated Show resolved Hide resolved
@evgenykuzyakov
Copy link
Collaborator

Adding @volovyk-s to check on error reporting, since it affects near-api-js

// TODO: shall we abort on unknown errors also?
| FunctionCallError::WasmUnknownError(_)
| FunctionCallError::HostError(_)
| FunctionCallError::EvmError(_) => {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@artob @frankbraun @joshuajbouw Given that EVM execution now does not differ from any other Wasm contract execution, why do we still have a separate EvmError?

result.result = Err(ActionErrorKind::FunctionCallError(err).into());
false
}
Some(VMError::FunctionCallError(err)) => match err {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is unexpected that we have structured FunctionCallError information, but then we convert it to string here. What is the reason for doing it? This structure propagates structured errors all the way into RPC (CC @frol ) and we prefer to have error in RPC as structured as possible, so that we avoid bad JS code that parses error strings to understand whether there is a compilation error (which is something devtools would want to differentiate from execution errors).

We would need to cut a release for near-api-js when Wasm error structure changes due to Wasmer upgrade independently on whether we are returning it as string or strictly-typed structure. With strictly typed structure we just going to have less glue-code.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's intentional, and for that particular distinction we can introduce CompilationError and ExecutionError, if this difference is indeed important. Currently, the only use of the error is showing it to developers in logs.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Currently, the only use of the error is showing it to developers in logs.

This is exactly what we want to improve by introducing sane granularity.

That's intentional, and for that particular distinction we can introduce CompilationError and ExecutionError, if this difference is indeed important.

Yes, that would be ideal. I think end-users care about three types of errors: UnknownMethod, CompilationError, ExecutionError.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sounds good, please introduce these 3 levels of errors.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cool, let's do that!

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@volovyk-s BadArgument is a contract-specific error (because each contract decides to interpret it differently). We cannot differentiate it on a blockchain-level.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@olonho It seems like we are still combining all three kinds of errors into ActionErrorKind::FunctionCallError with string. Did you forget to push the commit?

Copy link
Contributor

@matklad matklad Apr 7, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

and we prefer to have error in RPC as structured as possible

I want to push back a little bit against "as structured as possible errors". This is not strictly related to the PR, just the pattern I observer in Rust's error handling.

The issue with maximally stuctured errors is that they increase coupling, and sometimes it's better to stop at the goldilocks errors, which are structured just enough.

In the context of this PR, if we were to make error maximally structured, than we would need to expose wasmer 0.17 and wasmer 1.0 as is. And when we remove support for wasmer 0.17 or upgrade to (hypothetical) wasmer 2.0, we'd have to change the definition of the error type as well, and update the call-sites. This PR in some sense is the consequence of the fact that we did change error definition when upgrading to wasmer 1.0. This is a code smell, pointing out that the original error definition probably wasn't encapsulated enough.

But, clearly, exposing just the error message goes too far in the opposite direction, as it forces the clients to parse out the messages.

There's a goldilocks error pattern -- ErrorKind. The best example of it is std::io::Error:

image

The Error type itself is an opaque struct, which can be converted to string via Display. There's a .kind() method, which exposes non-exhaustive, field-less enum. This structure allows the client to behave differently for different sorts of errors, but allows the implementor to modify the errors without changes to the client.

To re-iterate, this is definitely not something we should do in the scope of this PR, and it's not necessary something we want to do even eventually (serializability of errors add an extra design dimension here which might change the tradeoffs). But it's useful to keep this bit of info in mind -- just exposing errors as is is the path of least resistance, but often not something you want to have at the API boundary.

References:

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Exactly, it was intentional "right" coercion.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@matklad thanks for the detailed overview. I believe after some discussions we already arrived at the middle ground of introducing exactly the necessary variants that the consumer cares about 👍

Comment on lines 387 to 388
/// An error occurred during a `FunctionCall` Action, parameter is debug message.
FunctionCallError(String),
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@olonho We try to bring more structured errors into the codebase, and this change goes in an opposite direction, so I wonder about the motivation behind this change 🤔

@bowenwang1996 I assume changes to this enum require a protocol upgrade (since the errors are part of the proof), a db migration (since the values are serialized and stored as borsh values), and a network upgrade (?)

P.S. If we still want to change it, I suggest capturing the logic in the code instead of comments:

Suggested change
/// An error occurred during a `FunctionCall` Action, parameter is debug message.
FunctionCallError(String),
/// An error occurred during a `FunctionCall` Action.
FunctionCallError { debug_message: String },

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do not think, that protocol changes in any way, as for protocol it's important presence of FunctionCallError error, not exact type of it.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think for proof it is important to have an identical hash for the execution outcome, so it will require a protocol change unless I misunderstand something.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IIRC, when computing hash we explicitly ignore error messages, @matklad shown me the place where it's done.

Copy link
Contributor

@matklad matklad Apr 6, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@matklad shown me the place where it's done.

Over here:

ExecutionStatus::Failure(_) => PartialExecutionStatus::Failure,

Naturally, I might be missing something, but it seems resonable that, from protocol's point of view, we remeber only a single bit: was there some error?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great, we don't need a protocol upgrade!

What about database migration?

I also see various ExecutionOutcome types used in chain/network. (ExecutionOutcome uses TxExecutionError which uses ActionError which uses ActionErrorKind and all of that gets borsh-serialized and should be compatible with the existing running nodes, right?)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also do not think we shall store anything related to exact error kind in DB. If we do - guess it's a bug.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@olonho we definitely store the errors in the storage since we receive them from RPC: https://explorer.near.org/transactions/GcRiQb5PFwMAogtu8te6wDKMRg9pefbWV4nJx3iqijny

image

http post https://archival-rpc.mainnet.near.org/ jsonrpc=2.0 method=tx params:='["GcRiQb5PFwMAogtu8te6wDKMRg9pefbWV4nJx3iqijny", "tom.zest.near"]' id=dontcare

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We do store them in database and if we've changed how the error serializes to borsh we should do a migration

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right, so could we do migration in the followup CL? And would be amazing, if we could cooperate on the migration effort.

@@ -31,15 +29,11 @@ pub enum FunctionCallError {
MethodResolveError(MethodResolveError),
/// A trap happened during execution of a binary
WasmTrap(WasmTrap),
WasmUnknownError,
WasmUnknownError(String),
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Consider adding a description the the string value:

Suggested change
WasmUnknownError(String),
WasmUnknownError { debug_message: String },

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

Comment on lines 198 to 200
FunctionCallError::CompilationError(_)
| FunctionCallError::LinkError { msg: _ }
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Interesting, I, actually, prefer it to be structured (and update the rest of variants as well if necessary) since String does not capture the meaning of the text value 😐

@olonho
Copy link
Contributor Author

olonho commented Apr 6, 2021

PTAL, merged suggestions.

runtime/near-vm-runner/src/wasmer1_runner.rs Show resolved Hide resolved
runtime/runtime/src/actions.rs Outdated Show resolved Hide resolved
result.result = Err(ActionErrorKind::FunctionCallError(err).into());
false
}
Some(VMError::FunctionCallError(err)) => match err {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@olonho It seems like we are still combining all three kinds of errors into ActionErrorKind::FunctionCallError with string. Did you forget to push the commit?

@olonho
Copy link
Contributor Author

olonho commented Apr 14, 2021

Addresses Max's request.

@olonho olonho dismissed MaksymZavershynskyi’s stale review April 14, 2021 10:10

Max request was fixed.

@olonho olonho merged commit a6bf293 into master Apr 14, 2021
@olonho olonho deleted the wasmer1_errors branch April 14, 2021 10:10
matklad added a commit to matklad/nearcore that referenced this pull request May 6, 2021
chefsale pushed a commit that referenced this pull request May 6, 2021
…tic (#4278)

* fix: revert part of #4181 to classify FailedWithNoError as deterministic

* bump version
near-bulldozer bot pushed a commit that referenced this pull request May 13, 2021
Initially, we treated this error case as
deterministic (so, we stored this error in our state,
etc.)

Then, in
#4181 (comment)
we reasoned that this error actually happens
non-deterministically, so it's better to panic in this
case.

However, when rolling this out, we noticed that this
error happens deterministically for at least one
contract. So here we roll this back to a previous
behavior and emit some deterministic error, which won't
cause the node to panic.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

8 participants