This repository has been archived by the owner on Aug 2, 2022. It is now read-only.

"producer double-confirming known range" error when testing failover #3442

Closed
jchung00 opened this issue May 26, 2018 · 10 comments

Comments

@jchung00

Tag: dawn-v4.2.0.
We have two nodes, same producer key and config, and one of them starts with production paused. They both run on the same machine.
Block producing node 1 starts producing when the nodes start. Block producing node 2 syncs fine. We run curl -sL http://127.0.0.1:<node-1-port>/v1/producer/pause and curl -sL http://127.0.0.1:<node-2-port>/v1/producer/resume.
Block producing node 2 begins producing with no problems.
Then, we run curl -sL http://127.0.0.1:<node-2-port>/v1/producer/pause and curl -sL http://127.0.0.1:<node-1-port>/v1/producer/resume.
This causes the "producer double-confirming known range" assertion exception. We checked that this is where the assertion is defined in EOS.IO:

FC_ASSERT( itr->second < result.block_num - h.confirmed, "producer double-confirming known range" );

We also noticed that block producing node 2 can't resync with the chain even when we run it with the --hard-replay-blockchain option.
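The failure can be illustrated with a toy model of the watermark check (a sketch under my own simplifying assumptions, not EOS.IO source): the chain records, per producer key, the highest block that key has confirmed, and a header carrying a confirmed count claims the confirmed blocks immediately preceding it. A node that resumes with a stale in-memory watermark then claims a range the chain has already seen under that key:

```python
# Toy model of the quoted assert, NOT actual EOS.IO code. The chain tracks,
# per producer key, the highest block that key has already confirmed; a new
# header with block_num and `confirmed` claims the `confirmed` blocks before
# it and must not overlap the recorded range.

class DoubleConfirm(Exception):
    pass

class Chain:
    def __init__(self):
        self.last_confirmed = {}  # producer key -> highest block it confirmed

    def apply_header(self, producer, block_num, confirmed):
        prior = self.last_confirmed.get(producer)
        if prior is not None and not (prior < block_num - confirmed):
            raise DoubleConfirm("producer double-confirming known range")
        # simplification: a header confirms up through block_num - 1
        self.last_confirmed[producer] = block_num - 1

chain = Chain()
for b in range(2, 11):   # node 1 produces blocks 2..10, confirming 1 each
    chain.apply_header("prod", b, 1)
for b in range(11, 21):  # live handoff: node 2 (same key) produces 11..20
    chain.apply_header("prod", b, 1)

# Hand back to node 1, whose in-memory watermark still says "last signed 10":
# it emits block 21 claiming to confirm the 10 blocks since, which overlaps
# the range already confirmed under this key (19 < 21 - 10 is false).
try:
    chain.apply_header("prod", 21, 10)
    outcome = "accepted"
except DoubleConfirm as e:
    outcome = str(e)
print(outcome)  # producer double-confirming known range
```

The exact accounting in nodeos differs, but the shape of the overlap is the same: each live node only knows about the blocks it signed itself.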

@gleehokie
Contributor

Thanks for the report, I'll bring this to the devs' attention.

@gleehokie gleehokie added this to the Version 1.0 milestone May 26, 2018
@wanderingbort
Contributor

wanderingbort commented May 26, 2018 via email

@noprom
Contributor

noprom commented May 29, 2018

Looking forward to the fix.

@bytemaster bytemaster modified the milestones: Version 1.0, Version 1.1 May 29, 2018
@wanderingbort
Contributor

We moved this to 1.1 for a few reasons:

  1. While this feature is intended to eventually grow into support for failover, it was not originally designed with that in mind, and there may be several other, harder-to-find issues with running a hot-standby producer for availability.
  2. The watermark calculation is conservative when a process starts. So, if we were simulating a dead node by restarting the original node, this would not have occurred. It is only during live-to-live-to-live hand-offs, with no process restarting, that we see this bug.

This is not to say it isn't something we need to fix; it will get better in short order. However, I wanted to enumerate some of the reasons why we have slipped it to version 1.1.
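A minimal sketch of point 2, with hypothetical numbers rather than nodeos internals: on restart, a producer can initialize its watermark conservatively from the chain head it sees, so its next header claims no already-confirmed blocks; only a node that stayed alive across the handoff carries a stale watermark.

```python
def confirmed_count(next_block, watermark):
    # Blocks the producer would claim to confirm in its next header:
    # everything between its watermark and the block it is about to sign.
    return next_block - 1 - watermark

chain_head = 20          # the standby produced up through block 20
stale_watermark = 10     # node 1 last signed block 10 and kept running
fresh_watermark = 20     # a restarted node adopts the head it sees on boot

live_resume = confirmed_count(21, stale_watermark)     # claims blocks 11..20
restart_resume = confirmed_count(21, fresh_watermark)  # claims nothing new
print(live_resume, restart_resume)  # 10 0
```

The restarted node claims zero already-known blocks, which is why a simulated dead-node failover does not trip the assert.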

@abourget
Contributor

Bart, I have a suggestion: resume should require a "proof of resignation" from the other node that was signing with the same key, and resume wouldn't kick in if that proof wasn't valid.

Ex:

  • node no. 1 produces blocks
  • the operator wants to hand off to node no. 2
  • the operator sends a pause signal to node no. 1, which returns some data about its last state (watermark or whatnot); this could be base64-encoded JSON, binary, or whatever
  • the operator sends a resume signal to node no. 2, along with the payload from pause; the resume operation wouldn't unlock without a message from the previous producer saying it is safe: the last block signed, and a notice that production has stopped over there with the key used
  • same thing if we want to come back, and the watermark or whatever data could be passed back if we want live-to-live-to-live handoff

Can you imagine something like this? It would lower the risk of double-signing blocks.

Some special payload could unconditionally resume a chain, for the case where you didn't previously stop production on another node. That would be explicit, and you'd need to check that you didn't have a node running previously.

What do you think?
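One way to sketch the proposed handshake (all names and the HMAC scheme are hypothetical stand-ins, not an existing nodeos API): pause returns a signed payload carrying the node's last signed block, and resume refuses to start without a valid payload, adopting the watermark it carries.

```python
# Hypothetical proof-of-resignation handoff, sketched with an HMAC in place
# of real producer-key signatures. None of these functions exist in nodeos.
import base64
import hashlib
import hmac
import json

SHARED_KEY = b"producer-signing-key"  # stands in for the producer key pair

def pause(node):
    # Stop producing and return a signed, base64-encoded resignation payload.
    node["producing"] = False
    body = json.dumps({"last_signed": node["last_signed"]})
    sig = hmac.new(SHARED_KEY, body.encode(), hashlib.sha256).hexdigest()
    return base64.b64encode(json.dumps({"body": body, "sig": sig}).encode()).decode()

def resume(node, proof):
    # Refuse to produce unless the payload verifies; carry over the watermark.
    payload = json.loads(base64.b64decode(proof))
    expected = hmac.new(SHARED_KEY, payload["body"].encode(),
                        hashlib.sha256).hexdigest()
    if not hmac.compare_digest(expected, payload["sig"]):
        raise ValueError("invalid proof of resignation; refusing to resume")
    node["watermark"] = json.loads(payload["body"])["last_signed"]
    node["producing"] = True

node1 = {"producing": True, "last_signed": 10, "watermark": 10}
node2 = {"producing": False, "last_signed": 0, "watermark": 0}

proof = pause(node1)  # node 1 stops and hands over its watermark
resume(node2, proof)  # node 2 starts only because the proof verifies
```

The "unconditional resume" escape hatch from the comment above would simply be a resume variant that skips the verification step, used only when the operator knows no other node holds the key.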

@abourget
Contributor

sort of an out-of-band sync'ing of node production :)

@abourget
Contributor

Also, I'm seeing that many nodes with the same keys loaded will all "counter-sign" all blocks. It doesn't fork or anything, and they're probably signing the same digest everywhere, but I would have expected all signing to stop when you pause production. Would that make sense?

@EOSBIXIN

When I test on 1.0.6, I hit the same bug after calling the resume and pause APIs a second time to switch between the master and standby nodes.

The error message:
on_incoming_block ] 10 assert_exception: Assert Exception
nodeosd_1 | 2018-06-27T09:44:39.503449026Z prior != by_id_idx.end(): unlinkable block

@jchung00
Author

Any update on this issue?

@abourget
Contributor

Was it solved?
