Channel needs to be re-established if task is not ack'ed #342

Closed
insipx opened this issue Sep 2, 2021 · 3 comments · Fixed by #472

insipx commented Sep 2, 2021

Minor bug I discovered while running Archive overnight to test stability. If the work queue encounters a task (in this case a block being executed) that exceeds RabbitMQ's default acknowledgement timeout, RabbitMQ drops the channel in order to avoid filling up disk space indefinitely.

This behavior is described here: https://www.rabbitmq.com/consumers.html#acknowledgement-timeout

The solution to protect against this from the work-queue side is to wrap tasks in a callback/timeout that matches the RabbitMQ timeout. If a task times out, we need to check whether the channel has been dropped and, if so, drop the task and re-establish the channel.
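A minimal sketch of that approach, assuming a tokio runtime and the lapin client the work queue already uses; `run_task`, `ACK_TIMEOUT`, and the function signature are illustrative rather than the actual sa-work-queue API:

```rust
use std::time::Duration;

use lapin::{options::BasicAckOptions, Channel, Connection};

// RabbitMQ's default delivery acknowledgement timeout (1 800 000 ms = 30 min).
const ACK_TIMEOUT: Duration = Duration::from_millis(1_800_000);

/// Run one task under the broker's ack timeout. If the task finishes in time,
/// ack it as usual; if it overruns, drop it, check whether RabbitMQ closed the
/// channel, and open a fresh channel so the queue keeps working.
async fn run_with_timeout(
    conn: &Connection,
    channel: &mut Channel,
    delivery_tag: u64,
) -> Result<(), lapin::Error> {
    match tokio::time::timeout(ACK_TIMEOUT, run_task()).await {
        Ok(()) => {
            // Finished within the timeout: acknowledge the delivery.
            channel
                .basic_ack(delivery_tag, BasicAckOptions::default())
                .await?;
        }
        Err(_elapsed) => {
            // The task overran the ack timeout, so the broker has most likely
            // closed the channel with PRECONDITION_FAILED. Drop the task and
            // re-establish the channel if it is no longer connected.
            if !channel.status().connected() {
                *channel = conn.create_channel().await?;
            }
        }
    }
    Ok(())
}

// Placeholder for the real work, e.g. executing a block.
async fn run_task() {}
```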

As for why a task is taking >30 min, that would need further investigation, but I don't believe it to be a serious issue, since this has only happened to me once so far while syncing multiple chains. The problem occurred for me on Kusama. Finding the problematic block and reproducing the issue would be needed to solve that.

25394824:[2021-09-02][07:58:52] [lapin::channel][ERROR] Channel closed on channel 15 by 0:0 => AMQPError { kind: Soft(PRECONDITIONFAILED), message: ShortString("PRECONDITION_FAILED - delivery acknowledgement on channel 15 timed out. Timeout value used: 1800000 ms. This timeout value can be configured, see consumers doc guide to learn more") } => PRECONDITION_FAILED - delivery acknowledgement on channel 15 timed out. Timeout value used: 1800000 ms. This timeout value can be configured, see consumers doc guide to learn more::/home/insipx/.cargo/registry/src/github.com-1ecc6299db9ec823/lapin-1.8.0/src/channel.rs;855
25394825:[2021-09-02][07:58:52] [sa_work_queue::threadpool][ERROR] protocol error: AMQP soft error: PRECONDITION-FAILED: PRECONDITION_FAILED - delivery acknowledgement on channel 15 timed out. Timeout value used: 1800000 ms. This timeout value can be configured, see consumers doc guide to learn more::/home/insipx/projects/parity/substrate-archive/master/work-queue/sa-work-queue/src/threadpool.rs;128
25395027:[2021-09-02][07:58:53] [sa_work_queue::threadpool][ERROR] protocol error: AMQP soft error: PRECONDITION-FAILED: PRECONDITION_FAILED - delivery acknowledgement on channel 15 timed out. Timeout value used: 1800000 ms. This timeout value can be configured, see consumers doc guide to learn more::/home/insipx/projects/parity/substrate-archive/master/work-queue/sa-work-queue/src/threadpool.rs;128
25395183:[2021-09-02][07:58:53] [sa_work_queue::threadpool][ERROR] protocol error: AMQP soft error: PRECONDITION-FAILED: PRECONDITION_FAILED - delivery acknowledgement on channel 15 timed out. Timeout value used: 1800000 ms. This timeout value can be configured, see consumers doc guide to learn more::/home/insipx/projects/parity/substrate-archive/master/work-queue/sa-work-queue/src/threadpool.rs;128
25395326:[2021-09-02][07:58:53] [sa_work_queue::threadpool][ERROR] protocol error: AMQP soft error: PRECONDITION-FAILED: PRECONDITION_FAILED - delivery acknowledgement on channel 15 timed out. Timeout value used: 1800000 ms. This timeout value can be configured, see consumers doc guide to learn more::/home/insipx/projects/parity/substrate-archive/master/work-queue/sa-work-queue/src/threadpool.rs;128
25395474:[2021-09-02][07:58:53] [sa_work_queue::threadpool][ERROR] protocol error: AMQP soft error: PRECONDITION-FAILED: PRECONDITION_FAILED - delivery acknowledgement on channel 15 timed out. Timeout value used: 1800000 ms. This timeout value can be configured, see consumers doc guide to learn more::/home/insipx/projects/parity/substrate-archive/master/work-queue/sa-work-queue/src/threadpool.rs;128
25395625:[2021-09-02][07:58:53] [sa_work_queue::threadpool][ERROR] protocol error: AMQP soft error: PRECONDITION-FAILED: PRECONDITION_FAILED - delivery acknowledgement on channel 15 timed out. Timeout value used: 1800000 ms. This timeout value can be configured, see consumers doc guide to learn more::/home/insipx/projects/parity/substrate-archive/master/work-queue/sa-work-queue/src/threadpool.rs;128

dvdplm commented Sep 3, 2021

As for why a task is taking >30min, that would need further investigation

At the very least, the PR for this issue should log all the relevant info about the troublesome block (and perhaps some general system metrics, in case the root cause is external to the block data, e.g. some pesky locking issue or similar).


insipx commented Oct 8, 2021

I revisited this a little bit last night after looking over the Wasm issues in the Substrate repo related to PoV blocks, and I have a suspicion it might actually be a locking issue related to the RuntimeCache in the Substrate WasmExecutor, which has a hard-coded limit specified a few lines above the lock.

Archive spawns a bunch of threads to execute blocks in order to speed up storage processing, so it would make sense that there is some kind of conflict happening around this lock that is worth examining. It might also be worth running Archive with a modified Substrate that increases that limit, to see if that fixes anything.
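Purely as an illustration of the suspected failure mode (not the actual Substrate code): if a small, shared cache is guarded by a Mutex and a cache miss does expensive work while holding the lock, every other executor thread stalls behind it.

```rust
use std::collections::HashMap;
use std::sync::{Arc, Mutex};
use std::thread;
use std::time::Duration;

fn main() {
    // Toy stand-in for a runtime cache shared by all executor threads.
    let cache: Arc<Mutex<HashMap<u32, String>>> = Arc::new(Mutex::new(HashMap::new()));

    let handles: Vec<_> = (0..8u32)
        .map(|block| {
            let cache = Arc::clone(&cache);
            thread::spawn(move || {
                // Every thread needs the lock before it can execute its block.
                let mut guard = cache.lock().unwrap();
                guard.entry(block % 2).or_insert_with(|| {
                    // Stand-in for instantiating a Wasm runtime on a cache miss:
                    // the expensive work happens while the lock is held, so all
                    // other threads are blocked on `cache.lock()` until it ends.
                    thread::sleep(Duration::from_secs(2));
                    format!("runtime for spec {}", block % 2)
                });
            })
        })
        .collect();

    for handle in handles {
        handle.join().unwrap();
    }
}
```

With enough threads and a slow enough miss, the wait behind the lock can add up to far more than the task itself takes, which is the kind of stall that could trip the 30-minute ack timeout.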


xcaptain commented May 7, 2022

I also came across this issue. The reason is that prefetch_count is too large: its current value is 5000, and execute_block takes about 2 s, so a consumer would need nearly three hours to work through its locally cached messages. Messages therefore sit unacknowledged for longer than the 1800 s timeout, which causes this error.
After changing the value to 100, the error is gone.
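For reference, a minimal sketch of lowering the prefetch window at the lapin level (in substrate-archive the value presumably comes from the work-queue configuration; the connection setup here is illustrative):

```rust
use lapin::{options::BasicQosOptions, Channel, Connection, ConnectionProperties};

/// Open a channel with a small prefetch window. With execute_block taking
/// roughly 2 s, a prefetch of 100 keeps the local backlog around 200 s,
/// comfortably inside the 1800 s acknowledgement timeout, whereas 5000
/// buffers hours of work on the consumer.
async fn setup_channel(addr: &str) -> Result<Channel, lapin::Error> {
    let conn = Connection::connect(addr, ConnectionProperties::default()).await?;
    let channel = conn.create_channel().await?;
    channel.basic_qos(100, BasicQosOptions::default()).await?;
    Ok(channel)
}
```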
