This repository has been archived by the owner on Nov 15, 2023. It is now read-only.

Avoid missing a slot on slow proposer #3819

Closed

jeremymj opened this issue Oct 15, 2019 · 29 comments
Labels
I9-optimisation An enhancement to provide better overall performance in terms of time-to-completion for a task.

Comments

@jeremymj

I want to upload a contract file to the node. When the contract file is about 100kB, it uploads successfully; but when I upload a 240kB contract, the node can't produce a new block. The log message is as follows:
2019-10-15 17:50:56 Discarding proposal for slot 261855508; block production took too long
I compiled Substrate at commit 1ae7a901d9b2a7f4fefc7b5f5846c2d13ef6d8c7 with cargo build --release and started the node with ./target/release/substrate --dev.
The test contract wasm file I am using is attached: reproduce_wasm_file.zip

@gui1117
Contributor

gui1117 commented Oct 15, 2019

Is this a transaction weight issue?

If an extrinsic takes too much time to execute, then it shouldn't be put into the block; the weight mechanism should prevent blocks from being too big. I think the weight for this extrinsic is wrong.

I'm not sure how to solve it, though.

cc @kianenigma

@gui1117
Contributor

gui1117 commented Oct 15, 2019

Sidenote:

Maybe some code could be optimised as well so we can handle contracts of this size, but this contract may also just be quite big, in which case the issue should be solved by reducing its size with features like dynamic linking (which aren't there yet).

@kianenigma
Contributor

From the looks of it, the block simply took too long and the weight system did not catch it. But the transaction is one of the dynamic ones that we currently don't really handle in the weight system. This could also be fixed by making the authorship code multi-threaded.

I must try to reproduce this first.

cc @pepyakin @andresilva

@jeremymj
Author

Thank you for your follow-up on this issue! I am now trying to optimize the contract business logic, reduce the contract size, and look forward to your updates.

@tomusdrw
Contributor

I believe we should mark the extrinsic as invalid if we time out during its execution and it's the first one in the block.

@kianenigma I don't see how we could make authorship multi-threaded; do you mean to somehow verify correctness separately?

@jeremymj as a workaround you may try --execution-block-construction=Native

@gui1117
Contributor

gui1117 commented Oct 16, 2019

I believe we should mark the extrinsic as invalid if we time out during its execution and it's the first one in the block.

Like a backup solution if the weight system doesn't catch the overly long transaction?

@tomusdrw
Contributor

Exactly. It allows us to prevent a situation where an underpriced (underweighted) extrinsic can simply stall the network, as we see in this issue.

@kianenigma
Contributor

kianenigma commented Oct 16, 2019

@kianenigma I don't see how we could make authorship multi-threaded; do you mean to somehow verify correctness separately?

It looks like the granularity of the time check in the current code is one transaction, meaning that if a single transaction takes too long, we still miss the slot. I roughly imagine that a parent thread which just moves on at some point if the proposal work takes too long should fix this. The slot worker should not wait for the proposer to finish; if the proposer fails to return in time for any reason, it should not miss the slot and should just submit a half-full or even an empty block.

It looks like we already have this, and maybe the only problem is changing remaining_duration? The second argument of future::select() is actually what I was talking about:

proposer.propose(
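
A minimal sketch of racing the proposal against the slot deadline with future::select; Block, propose, and the timings below are stand-ins, not the actual sc-consensus-slots code:

```rust
use futures::future::{self, Either};
use std::time::Duration;

// Hypothetical stand-in for the real proposer output.
struct Block;

async fn propose() -> Block {
    // Pretend the block builder takes one second to assemble a block.
    tokio::time::sleep(Duration::from_secs(1)).await;
    Block
}

#[tokio::main]
async fn main() {
    let proposal = Box::pin(propose());
    let deadline = Box::pin(tokio::time::sleep(Duration::from_secs(2)));

    // The slot worker moves on if the proposer fails to return in time,
    // instead of missing the slot entirely.
    match future::select(proposal, deadline).await {
        Either::Left((_block, _)) => println!("proposal finished in time"),
        Either::Right(_) => println!("deadline hit; discarding the proposal"),
    }
}
```

Note the caveat raised later in the thread: select only helps if the proposal future actually yields back to the executor; a poll that blocks the thread starves the deadline timer too.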

@tomusdrw
Contributor

tomusdrw commented Oct 16, 2019

Ah, I see; indeed it makes sense. I don't think it requires threads, though. You can imagine the proposer returning a Stream of proposals (instead of a Future), and when the deadline is reached you just take the last element returned by that stream.

@kianenigma
Contributor

As for reproduction: adding a thread::sleep() to any module's dispatchable and running a dev chain always leads to the same error (with --execution native, of course). Additionally, I am trying to add it as a test case as well.

Ah, I see; indeed it makes sense. I don't think it requires threads, though. You can imagine the proposer returning a Stream of proposals (instead of a Future), and when the deadline is reached you just take the last element returned by that stream.

Seems very interesting. I am quite a novice with futures but can give it a try.
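
A rough sketch of that reproduction, using the 2019-era decl_module! macro; module paths and macro details varied between Substrate versions, and the sleepy call is a made-up name:

```rust
// Assumes the SRML crates of the time (`support` = srml-support,
// `system` = srml-system).
use support::{decl_module, dispatch::Result};
use system::ensure_signed;

pub trait Trait: system::Trait {}

decl_module! {
    pub struct Module<T: Trait> for enum Call where origin: T::Origin {
        // Only stalls the node under `--execution native`: Wasm execution
        // has no access to `std::thread::sleep`.
        pub fn sleepy(origin) -> Result {
            let _who = ensure_signed(origin)?;
            // Simulate an extrinsic whose execution far exceeds the slot time.
            std::thread::sleep(std::time::Duration::from_secs(10));
            Ok(())
        }
    }
}
```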

@bkchr
Member

bkchr commented Oct 16, 2019

Ah, I see; indeed it makes sense. I don't think it requires threads, though. You can imagine the proposer returning a Stream of proposals (instead of a Future), and when the deadline is reached you just take the last element returned by that stream.

I don't understand this. What would be the last element? What kind of elements would the Proposer return?

@tomusdrw
Contributor

Currently it produces a Block, but it would have to be a Stream of (unfinalized) blocks. I guess it would require copying the entire overlay every time, though. I admit I haven't really looked into the code yet to check how feasible it is, but my reasoning was that after every extrinsic we would emit a copy of the intermediate state.

@kianenigma
Contributor

I don't understand this. What would be the last element? What kind of elements would the Proposer return?

I suppose Proposer.propose could either:

  • Instead of a block, return a stream of extrinsics, with the caller responsible for making a block out of them.
  • Return an unfinished block inside the stream as new transactions are added; the slot worker can at some point call it a day and move on.

But I have a question about this polling and futures: from my understanding and @tomusdrw's explanations, it looks like the proposer's .poll() can block the check for the deadline. Won't a stream suffer from more or less the same problem, due to the underlying runtime?

@tomusdrw
Contributor

Well, we do have select already, so we can ignore the slow future/stream that is blocked on .poll() and resolve the Delay part instead. It just requires memoizing the previous value of the stream in the top-level future, afaiu.
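
A sketch of that consumption pattern, assuming a hypothetical proposer that yields an intermediate block after each extrinsic (propose_stream and Block are illustrative, not real APIs):

```rust
use futures::{pin_mut, select, FutureExt, StreamExt};
use std::time::Duration;

struct Block { extrinsics: usize }

// Hypothetical: yields a fresh (unfinalized) block after each extrinsic.
fn propose_stream() -> impl futures::Stream<Item = Block> {
    futures::stream::iter(1..=100).then(|n| async move {
        tokio::time::sleep(Duration::from_millis(50)).await; // apply one extrinsic
        Block { extrinsics: n }
    })
}

#[tokio::main]
async fn main() {
    let proposals = propose_stream().fuse();
    let deadline = tokio::time::sleep(Duration::from_secs(2)).fuse();
    pin_mut!(proposals, deadline);

    // Memoize the latest intermediate block; on deadline, seal what we have.
    let mut best: Option<Block> = None;
    loop {
        select! {
            next = proposals.next() => match next {
                Some(block) => best = Some(block),
                None => break, // proposer applied every extrinsic
            },
            _ = deadline => break, // deadline reached: take the last element
        }
    }
    println!("sealing a block with {} extrinsics", best.map_or(0, |b| b.extrinsics));
}
```

The same caveat applies: if applying one extrinsic blocks inside poll instead of awaiting, the deadline branch cannot fire until that extrinsic completes.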

@bkchr
Member

bkchr commented Oct 16, 2019

Currently it produces a Block, but it would have to be a Stream of (unfinalized) blocks. I guess it would require copying the entire overlay every time, though. I admit I haven't really looked into the code yet to check how feasible it is, but my reasoning was that after every extrinsic we would emit a copy of the intermediate state.

Ahh, that sounds nice, but we should probably always return a Block. It would require some changes to the runtime api/block builder to support this kind of stuff, but it should be doable.

But I'm also not sure whether all of this is so easily doable with Futures. Futures don't expect that you sleep internally, and calling into the runtime is the same as sleeping. We would need to build the blocks in a separate thread, so this will require quite some amount of work.

This would also allow us to remove the max_duration parameter from Proposer.

@tomusdrw
Contributor

Futures don't expect that you sleep internally, and calling into the runtime is the same as sleeping.

I think you meant blocking. poll should complete fast, or otherwise you block the runtime thread (so other futures can't be polled), and calling into the runtime is one fat blocking operation.

@kianenigma kianenigma changed the title Block production stalls when upload contract wasm file Avoid missing a slot on slow proposer Oct 16, 2019
@kianenigma kianenigma added the I9-optimisation An enhancement to provide better overall performance in terms of time-to-completion for a task. label Oct 16, 2019
@bkchr
Member

bkchr commented Oct 16, 2019

Yeah, sleeping/blocking, not such a big difference :P
Nevertheless, the current code works because we check the timer internally, but the futures logic does not work with the current implementation.

@kianenigma the issue is still about something different; here we are just talking about avoiding missed slots. Nevertheless, there is probably a bug in contracts: checking a contract should probably not take that much time.

@kianenigma
Contributor

I assume the behaviour of the contract code is actually rather normal? The code is big and just takes too long to process? But yeah, maybe they should be two separate issues. You can revert the title if you prefer it so.

@bkchr
Member

bkchr commented Oct 16, 2019

I would wait for @pepyakin to comment on this issue.

@jeremymj
Author

@tomusdrw The problem is solved after adding the startup parameter, but there are additional optional values for this parameter. Where can I find out more about them?

@bkchr
Member

bkchr commented Oct 17, 2019

@jeremymj which startup parameters?

@kianenigma
Contributor

kianenigma commented Oct 17, 2019

I think @jeremymj is referring to --execution-block-construction=Native, which was recommended earlier.

Nonetheless, I would not close this issue. The core problem still holds: future::select cannot poll the timeout and bail out if a single runtime call (which behaves like a sleep) takes too long, and as a consequence we miss a slot.
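
A self-contained illustration of that failure mode, with a hand-rolled future standing in for the long synchronous runtime call:

```rust
use futures::future::{self, Either};
use std::future::Future;
use std::pin::Pin;
use std::task::{Context, Poll};
use std::time::{Duration, Instant};

// A future whose poll() blocks the thread, like a slow runtime call
// inside the proposer.
struct BlockingCall;

impl Future for BlockingCall {
    type Output = ();
    fn poll(self: Pin<&mut Self>, _cx: &mut Context<'_>) -> Poll<()> {
        std::thread::sleep(Duration::from_secs(3)); // the "runtime call"
        Poll::Ready(())
    }
}

#[tokio::main]
async fn main() {
    let start = Instant::now();
    let call = Box::pin(BlockingCall);
    let timeout = Box::pin(tokio::time::sleep(Duration::from_secs(1)));

    // We would like the 1s timeout to win, but poll() never yields, so
    // the executor cannot check the timer until the 3s call returns.
    match future::select(call, timeout).await {
        Either::Left(_) => println!("call won after {:?}", start.elapsed()),
        Either::Right(_) => println!("timeout won after {:?}", start.elapsed()),
    }
    // Prints "call won after ~3s": the slot deadline was effectively ignored.
}
```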

@jeremymj
Author

@bkchr I use the way @kianenigma recommended to start the node:
./target/release/substrate --dev --execution-block-construction=Native

@tomusdrw
Contributor

@jeremymj This affects how we execute the block (either Wasm or Native (if available)):

// Execute with native build (if available, WebAssembly otherwise).

Note that this is merely a workaround: if you have an incompatible version of the client that can't run natively, it will fall back to Wasm anyway.

Imho we need to address the following:

  1. Figure out why verification takes so long and whether we can speed it up.
  2. Figure out whether the weight is properly applied (maybe it should be non-linear?).
  3. Fix block production missing the slot, without marking the extrinsic as invalid (a DoS vector).
  4. Consider producing an "incomplete" block if the deadline is close (either this or 3 needs to be implemented; see the sketch below).
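
A sketch of option 4, with hypothetical BlockBuilder and extrinsic types standing in for the real block builder and transaction pool:

```rust
use std::time::{Duration, Instant};

// Hypothetical stand-in for the real block builder.
struct BlockBuilder { extrinsics: Vec<u32> }

impl BlockBuilder {
    fn push(&mut self, xt: u32) { self.extrinsics.push(xt) }
    fn bake(self) -> Vec<u32> { self.extrinsics }
}

// Stop including extrinsics once the slot deadline is close, and seal
// whatever we have instead of missing the slot entirely.
fn build_block(pending: Vec<u32>, deadline: Instant) -> Vec<u32> {
    const SOFT_MARGIN: Duration = Duration::from_millis(500);
    let mut builder = BlockBuilder { extrinsics: Vec::new() };
    for xt in pending {
        if deadline.saturating_duration_since(Instant::now()) < SOFT_MARGIN {
            break; // deadline close: produce an "incomplete" block
        }
        builder.push(xt);
    }
    builder.bake()
}
```

This still cannot guard against a single extrinsic that blows past the deadline on its own, which is why it needs to be combined with point 3 or with preemption of the proposer.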

@jeremymj
Author

@tomusdrw I see, thank you!

@pepyakin
Contributor

pepyakin commented Oct 22, 2019

Sorry for the belated reply. I think the problem is prosaic: put_code has an arbitrary cost in terms of gas (specifically, 1), and I think the weight is not applied in this case (besides the default one).

OTOH, while a 240k contract sounds big, it is not enormous, and I think it would be desirable for at least one contract of that size to fit into the processing of an otherwise empty block. Hence I'd like to see if and how we can speed up the contract code verification process, and how much of an improvement #3869 would give us. I filed #3885 for this.

@kianenigma
Contributor

kianenigma commented Oct 24, 2019

For anyone wanting to tackle this from the perspective of non-blocking futures: I've just written a test for my own understanding that simulates what goes wrong in the slot worker when the author decides to sleep.

https://github.com/paritytech/substrate/compare/kiz-author-blocking?expand=1

@kianenigma
Contributor

Any updates on this that have not been recorded here? Does the author still fail to return in time if a single extrinsic takes too much time?

@bkchr
Member

bkchr commented Aug 18, 2020

I think this should be fixed now. You can still miss a slot, but we will back off if we cannot build the block in time. This means the allowed block time increases with the number of missed slots.

Given that, I will close this issue ;)
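
Illustratively, the backoff could look like the following sketch (not the actual implementation; proposing_deadline and the cap are made up for illustration):

```rust
use std::time::Duration;

// Sketch: scale the proposing deadline up as consecutive slots are
// missed, so a slow proposer eventually gets enough time to finish.
fn proposing_deadline(base: Duration, missed_slots: u32) -> Duration {
    // Cap the multiplier so the deadline cannot grow without bound.
    base * (1 + missed_slots.min(10))
}

fn main() {
    let base = Duration::from_secs(2);
    for missed in 0..4 {
        println!("missed {}: deadline {:?}", missed, proposing_deadline(base, missed));
    }
}
```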
