This repository has been archived by the owner on Aug 2, 2022. It is now read-only.

"producer double-confirming known range" error when testing failover #3442

Closed
jchung00 opened this issue May 26, 2018 · 10 comments

Comments

@jchung00

Tag: dawn-v4.2.0.
We have two nodes, same producer key and config, and one of them starts with production paused. They both run on the same machine.
Block producing node 1 starts producing when the nodes start. Block producing node 2 syncs fine. We run curl -sL http://127.0.0.1:<node-1-port>/v1/producer/pause and curl -sL http://127.0.0.1:<node-2-port>/v1/producer/resume.
Block producing node 2 begins producing with no problems.
Then, we run curl -sL http://127.0.0.1:<node-2-port>/v1/producer/pause and curl -sL http://127.0.0.1:<node-1-port>/v1/producer/resume.
This causes the "producer double-confirming known range" assertion exception. We checked that this is where the assertion is defined in EOS.IO:

FC_ASSERT( itr->second < result.block_num - h.confirmed, "producer double-confirming known range" );

We also noticed that block producing node 2 can't resync with the chain even when we run it with the --hard-replay-blockchain option.
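The failure can be illustrated with a toy model of the watermark check (a sketch under my own simplifying assumptions, not EOS.IO source): the chain records, per producer key, the highest block that key has confirmed, and a header carrying a confirmed count claims the confirmed blocks immediately preceding it. A node that resumes with a stale in-memory watermark then claims a range the chain has already seen under that key:

```python
# Toy model of the quoted assert, NOT actual EOS.IO code. The chain tracks,
# per producer key, the highest block that key has already confirmed; a new
# header with block_num and `confirmed` claims the `confirmed` blocks before
# it and must not overlap the recorded range.

class DoubleConfirm(Exception):
    pass

class Chain:
    def __init__(self):
        self.last_confirmed = {}  # producer key -> highest block it confirmed

    def apply_header(self, producer, block_num, confirmed):
        prior = self.last_confirmed.get(producer)
        if prior is not None and not (prior < block_num - confirmed):
            raise DoubleConfirm("producer double-confirming known range")
        # simplification: a header confirms up through block_num - 1
        self.last_confirmed[producer] = block_num - 1

chain = Chain()
for b in range(2, 11):   # node 1 produces blocks 2..10, confirming 1 each
    chain.apply_header("prod", b, 1)
for b in range(11, 21):  # live handoff: node 2 (same key) produces 11..20
    chain.apply_header("prod", b, 1)

# Hand back to node 1, whose in-memory watermark still says "last signed 10":
# it emits block 21 claiming to confirm the 10 blocks since, which overlaps
# the range already confirmed under this key (19 < 21 - 10 is false).
try:
    chain.apply_header("prod", 21, 10)
    outcome = "accepted"
except DoubleConfirm as e:
    outcome = str(e)
print(outcome)  # producer double-confirming known range
```

The exact accounting in nodeos differs, but the shape of the overlap is the same: each live node only knows about the blocks it signed itself.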

@gleehokie
Contributor

Thanks for the report, I'll bring this to the devs' attention.

@gleehokie gleehokie added this to the Version 1.0 milestone May 26, 2018
@wanderingbort
Contributor

wanderingbort commented May 26, 2018 via email

@noprom
Contributor

noprom commented May 29, 2018

Looking forward to the fix.

@bytemaster bytemaster modified the milestones: Version 1.0, Version 1.1 May 29, 2018
@wanderingbort
Contributor

We moved this to 1.1 for a few reasons:

  1. While this feature is intended to eventually grow into support for failover, it was not originally designed with that in mind, and there may be several other, harder-to-find issues with running a hot-standby producer for availability.
  2. The watermark calculation is conservative when a process starts. So, if we were simulating a dead node by restarting the original node, this would not have occurred. It is only during live-to-live-to-live hand-offs, with no process restarting, that we see this bug.

This is not to say it isn't something we need to fix; it will get better in short order. However, I wanted to enumerate some of the reasons why we have slipped it to version 1.1.
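A minimal sketch of point 2, with hypothetical numbers rather than nodeos internals: on restart, a producer can initialize its watermark conservatively from the chain head it sees, so its next header claims no already-confirmed blocks; only a node that stayed alive across the handoff carries a stale watermark.

```python
def confirmed_count(next_block, watermark):
    # Blocks the producer would claim to confirm in its next header:
    # everything between its watermark and the block it is about to sign.
    return next_block - 1 - watermark

chain_head = 20          # the standby produced up through block 20
stale_watermark = 10     # node 1 last signed block 10 and kept running
fresh_watermark = 20     # a restarted node adopts the head it sees on boot

live_resume = confirmed_count(21, stale_watermark)     # claims blocks 11..20
restart_resume = confirmed_count(21, fresh_watermark)  # claims nothing new
print(live_resume, restart_resume)  # 10 0
```

The restarted node claims zero already-known blocks, which is why a simulated dead-node failover does not trip the assert.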

@abourget
Contributor

Bart, I have a suggestion: resume should require a "proof of resignation" from the other node that was signing with the same key, and resume wouldn't kick in if that proof wasn't valid.

Ex:

  • node no. 1 produces blocks
  • the operator wants to hand off to node no. 2
  • the operator sends a pause signal to node no. 1, which returns some data about its last state (watermark or whatnot); this could be base64-encoded JSON, binary, or whatever
  • the operator sends a resume signal to node no. 2, along with the payload from pause; the resume operation wouldn't unlock without a message from the previous producer saying it is safe: the last block signed, and a notice that production has stopped over there with the key used
  • same thing if we want to come back, and the watermark or whatever data could be passed back if we want live-to-live-to-live handoff

Can you imagine something like this? It would lower the risk of double-signing blocks.

Some special payload could unconditionally resume a chain, for the case where you didn't previously stop production on another node. That would be explicit, and you'd need to check that you didn't have a node running previously.

What do you think?
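One way to sketch the proposed handshake (all names and the HMAC scheme are hypothetical stand-ins, not an existing nodeos API): pause returns a signed payload carrying the node's last signed block, and resume refuses to start without a valid payload, adopting the watermark it carries.

```python
# Hypothetical proof-of-resignation handoff, sketched with an HMAC in place
# of real producer-key signatures. None of these functions exist in nodeos.
import base64
import hashlib
import hmac
import json

SHARED_KEY = b"producer-signing-key"  # stands in for the producer key pair

def pause(node):
    # Stop producing and return a signed, base64-encoded resignation payload.
    node["producing"] = False
    body = json.dumps({"last_signed": node["last_signed"]})
    sig = hmac.new(SHARED_KEY, body.encode(), hashlib.sha256).hexdigest()
    return base64.b64encode(json.dumps({"body": body, "sig": sig}).encode()).decode()

def resume(node, proof):
    # Refuse to produce unless the payload verifies; carry over the watermark.
    payload = json.loads(base64.b64decode(proof))
    expected = hmac.new(SHARED_KEY, payload["body"].encode(),
                        hashlib.sha256).hexdigest()
    if not hmac.compare_digest(expected, payload["sig"]):
        raise ValueError("invalid proof of resignation; refusing to resume")
    node["watermark"] = json.loads(payload["body"])["last_signed"]
    node["producing"] = True

node1 = {"producing": True, "last_signed": 10, "watermark": 10}
node2 = {"producing": False, "last_signed": 0, "watermark": 0}

proof = pause(node1)  # node 1 stops and hands over its watermark
resume(node2, proof)  # node 2 starts only because the proof verifies
```

The "unconditional resume" escape hatch from the comment above would simply be a resume variant that skips the verification step, used only when the operator knows no other node holds the key.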

@abourget
Contributor

sort of an out-of-band sync'ing of node production :)

@abourget
Contributor

Also, I'm seeing that many nodes with the same keys loaded will all "counter-sign" all blocks. It doesn't fork or anything, and they're probably signing the same digest everywhere, but I would have expected all signing to stop when you pause production. Would that make sense?

@EOSBIXIN

When I test on 1.0.6, I hit the same bug after calling the resume and pause APIs a second time to switch between the master and standby nodes.

The error message:
on_incoming_block ] 10 assert_exception: Assert Exception
nodeosd_1 | 2018-06-27T09:44:39.503449026Z prior != by_id_idx.end(): unlinkable block

@jchung00
Author

Any update on this issue?

@abourget
Contributor

Was it solved?
