[sc-allocator] Handle some errors gracefully #14042

serban300 · 2023-04-28T12:19:13Z

High level

I was wondering if we could handle some allocator errors (for example OOM) gracefully or if there is a strong reason why we panic in all error case and mark the allocator as poisoned. I couldn't find any documentation on this, sorry if I missed it.

We could do this by simply returning a NULL pointer instead of an error when we encounter a non-critical problem.

Context

The size of a structure in memory is different from its encoded size. It can happen at times that the difference between these 2 representations to be significant. Take for example the XCM Instruction structure which takes ~1400 bytes in memory but has some variants that can be encoded in just 1 byte (for example Instruction::ClearOrigin)

In such cases we could store encoded structures in a relatively small space, compared to the real memory representation. If we end up in such a corner case (having for example a large encoded Vec of Instruction<Call>), when trying to decode it the execution will end up here, where Vec::push will eventually try to allocate too much memory and the wasm VM will panic here.

I was wondering if we could avoid panicking on such errors (OOM) in order to give the wasm application the possibility to handle such corner cases gracefully. In the scenario above, for example we could adjust the decoding logic by replacing r.push() with:

		r.try_reserve(1)
			.map_err(|_| {
				parity_scale_codec::Error::from("Could not allocate enough memory")
			})?;
		r.push(T::decode(input)?);

and we would get a decoding error instead of a wasm trap, which could be better.

Related to: paritytech/parity-bridges-common#2015

bkchr · 2023-04-29T09:34:31Z

We could do this by simply returning a NULL pointer instead of an error when we encounter a non-critical problem.

Generally yes, but nothing we should support with the current implementation. It can be done as part of this: #11883

and we would get a decoding error instead of a wasm trap, which could be better.

Could be better, could not be better. Generally the first step should be to open a pr to box a lot of these big things in the xcm instructions. Long term we really need to take into account the allocation size as well, as this isn't the first time we hit this issue with big enums. It already has happened with the Call enum as well.

Even if decoding would fail with an error, you would still not be able to deliver the messages? Without having looked into your code, why don't you decode the messages one after another?

koute · 2023-04-29T09:43:59Z

Generally the first step should be to open a pr to box a lot of these big things in the xcm instructions.

This. Currently the XCM enum is way too big; we need to cut it down by strategically adding some Boxes plus a static assert to make sure it doesn't accidentally regress in the future.

serban300 · 2023-05-01T07:39:26Z

Thank you for the answers and the suggestions !

It can be done as part of this: #11883

Thanks ! I didn't know about the issue. I'll follow it.

Generally the first step should be to open a pr to box a lot of these big things in the xcm instructions.

Ok, makes sense. I'll look into it an start a discussion with the XCM team about this.

Even if decoding would fail with an error, you would still not be able to deliver the messages? Without having looked into your code, why don't you decode the messages one after another?

In our case, we ran into this while working on an integration test meant to check that big messages arrive at the destination. So we were crafting big message payloads on purpose. Probably this use case is not very relevant for real world scenarios. And generally this shouldn't happen. But the idea was to have this more as a failsafe mechanism. In case this happens, it felt like it would be safer to have such a mechanism, and get a decoding error instead of a wasm trap. We wouldn't be able to deliver the message, but at least the chain wouldn't get stuck. Would that make sense ?

bkchr · 2023-05-01T20:09:16Z

Ok, makes sense. I'll look into it an start a discussion with the XCM team about this.

Should be a non-controversial pr and you can just open it. The faster we decrease the size, the better.

We wouldn't be able to deliver the message, but at least the chain wouldn't get stuck. Would that make sense ?

Aren't relayers delivering messages via normal transactions? If yes, they will not be added to a block and thus, nothing bad should happen.

bkchr · 2023-05-01T20:10:30Z

BTW, I would propose that you just deliver the messages as Vec<u8> and then decode message by message and then we should also not see this issue happening.

acatangiu · 2023-05-02T08:56:38Z

@bkchr @koute there are many ways to fix it on our side (not get in this situation in the first place) - we found this in a testing environment where most other checks and guardrails are off - so this PR is not a prerequisite or blocker for the bridges usecase.

Even so, I think it is still useful, I see it as defense in depth. Some bug or bad design can panic the runtime and break the chain doing an operation (even if it does it out of stupidity) that could be otherwise be recoverable. I might be missing some context or nuance but from high-level it looks more robust for an alocator to throw an error that the caller can choose to gracefully handle or panic depending on their logic.

bkchr · 2023-05-02T09:02:43Z

I might be missing some context or nuance but from high-level it looks more robust for an alocator to throw an error that the caller can choose to gracefully handle or panic depending on their logic.

As I had already commented:

We could do this by simply returning a NULL pointer instead of an error when we encounter a non-critical problem.

Generally yes, but nothing we should support with the current implementation. It can be done as part of this: #11883

The approach is something we can think about and can implement. However, we should not support this with the current allocator. I mean the general and more fine grained errors can be merged, but not returning NULL by the host function.

However, FRAME is also not build in a way to bubble up these errors. When you for example read something from the storage that would be too big to decode and it would fail. FRAME would return you the default value for the storage entry.
In general we would need to write code with these issues in mind, which is totally reasonable, but will require quite some refactoring etc.

acatangiu · 2023-05-02T09:09:06Z

Ok, I see. Thanks for the explanation and the context!

serban300 · 2023-05-02T09:48:18Z

Ok, Thank you for the details ! Closing this in favor of #11883 then.

[sc-allocator] Handle some errors gracefully

54f874b

serban300 added A0-please_review Pull request needs code review. B0-silent Changes should not be mentioned in any release notes C1-low PR touches the given topic and has a low impact on builders. labels Apr 28, 2023

serban300 requested review from svyatonik and acatangiu April 28, 2023 12:19

serban300 self-assigned this Apr 28, 2023

serban300 requested a review from koute as a code owner April 28, 2023 12:19

acatangiu requested a review from bkchr April 28, 2023 16:22

serban300 closed this May 2, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[sc-allocator] Handle some errors gracefully #14042

[sc-allocator] Handle some errors gracefully #14042

serban300 commented Apr 28, 2023

bkchr commented Apr 29, 2023

koute commented Apr 29, 2023

serban300 commented May 1, 2023

bkchr commented May 1, 2023

bkchr commented May 1, 2023

acatangiu commented May 2, 2023

bkchr commented May 2, 2023

acatangiu commented May 2, 2023

serban300 commented May 2, 2023

[sc-allocator] Handle some errors gracefully #14042

[sc-allocator] Handle some errors gracefully #14042

Conversation

serban300 commented Apr 28, 2023

High level

Context

bkchr commented Apr 29, 2023

koute commented Apr 29, 2023

serban300 commented May 1, 2023

bkchr commented May 1, 2023

bkchr commented May 1, 2023

acatangiu commented May 2, 2023

bkchr commented May 2, 2023

acatangiu commented May 2, 2023

serban300 commented May 2, 2023