
synchronizer: check l1blocks #3546

Merged

merged 38 commits into release/v0.6.6 from feature/3540-check_l1blocks on Apr 16, 2024
Conversation

joanestebanr
Contributor

@joanestebanr joanestebanr commented Apr 8, 2024

Closes #3540 #3561

What does this PR do?

Reviewers

Main reviewers:

Codeowner reviewers:

  • @-Alice
  • @-Bob

@joanestebanr joanestebanr changed the base branch from release/v0.6.5 to release/v0.6.6 April 9, 2024 07:47
@joanestebanr joanestebanr modified the milestones: v0.6.5, v0.6.6 Apr 9, 2024
@joanestebanr joanestebanr linked an issue Apr 10, 2024 that may be closed by this pull request
@joanestebanr joanestebanr marked this pull request as ready for review April 11, 2024 15:01
Contributor

@tclemos tclemos left a comment


I don't know the scope of this task, but the PR changes seem like overkill.

I expected it to be just a single method running concurrently to check for blocks related to the consolidated point on Ethereum.

I have added many comments, but I confess I got exhausted while reviewing it because I always thought, "Is this necessary?"

I understand this PR can achieve the goal, but considering the price we would pay to maintain this implementation in the future, I would prefer to drop everything and come up with a more straightforward solution that gets the consolidated point from Ethereum and checks where we stand from there.

Proposal:

Assuming the Synchronizer must be able to recover by itself when a reorg is detected, and that this mechanism is a secondary protection to help the Synchronizer identify the reorg with a different strategy, I would suggest implementing it in the simplest way possible:

  • at the start of the application, get the current consolidated point block from Ethereum
  • get all the blocks we have from the consolidated point and check all of them for a reorg
    • if a reorg is detected, flag it and wait for the synchronizer to fix it; once fixed, resume checking
  • store the last block we checked in memory
  • keep monitoring the consolidated point on Ethereum until it catches up with the last block we checked
  • repeat

Integration with the Synchronizer can be done via a simple channel, which can be appended to the current synchronization process.
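The steps above can be sketched as a small background checker. This is a minimal illustration of the reviewer's proposal, not the PR's actual API: `blockFetcher`, `checkRange`, and the channel wiring are all hypothetical names, and real code would fetch hashes from the local state DB and an L1 client.

```go
package main

import "fmt"

// blockFetcher abstracts the two sources being compared: our local state
// and the L1 node. This signature is illustrative only.
type blockFetcher func(number uint64) (hash string, err error)

// checkRange compares local block hashes against L1 from the consolidated
// (finalized) point up to the latest synced block. On the first mismatch it
// flags the block number on a channel so the synchronizer can run its
// regular reorg handling, and returns the last block that matched.
func checkRange(from, to uint64, local, l1 blockFetcher, reorgCh chan<- uint64) (uint64, error) {
	var lastChecked uint64
	for n := from; n <= to; n++ {
		localHash, err := local(n)
		if err != nil {
			return lastChecked, err
		}
		l1Hash, err := l1(n)
		if err != nil {
			return lastChecked, err
		}
		if localHash != l1Hash {
			// Flag the reorg and stop; fixing it is the synchronizer's job.
			reorgCh <- n
			return lastChecked, fmt.Errorf("reorg detected at block %d", n)
		}
		lastChecked = n
	}
	return lastChecked, nil
}
```

The returned `lastChecked` is the in-memory cursor from the list above: the next round starts from it and runs again once the consolidated point on Ethereum has moved past it.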

Advantages of this approach:

  • no changes in the DB
  • faster, runs only in memory
  • takes advantage of the network consolidation point instead of verifying everything
  • with fewer checks, it can load all the blocks in a single shot instead of one by one and check them concurrently
  • far less code to maintain
  • far fewer changes in the real code due to this extra protection

Reasoning:

We assume the consolidation point on Ethereum is the point beyond which we trust a reorg will never happen. From there, we do our own check to make sure all the blocks not yet consolidated match Ethereum. Once we guarantee this, the synchronizer takes over and continues its job, synchronizing block by block until all the blocks we have checked become part of the consolidated part of Ethereum; then we start over. If, for some reason, the synchronizer is not able to detect a reorg in the regular synchronization process, our next check will cover the range from the last point we checked up to the latest synchronized block, and we will find it, flagging it to the synchronizer so the reorg process can be executed in its next reorg check.

Conclusion:

I don't feel comfortable merging this whole PR as is, and I'm open to discussing it if you consider it worth it. Otherwise, you can check my comments in the PR and proceed with this implementation.

Review threads:

  • etherman/etherman.go
  • synchronizer/l1_check_block/async.go
  • synchronizer/l1_check_block/check_l1block.go
  • synchronizer/l1_check_block/common.go
  • synchronizer/l1_check_block/integration.go
  • synchronizer/synchronizer.go
@joanestebanr
Contributor Author


I double-checked the scope of the task, and this is the expected behaviour.

Comment on lines 37 to 39
// dontDoReorgCheckBeforeL2Sync: if true, the reorg check is skipped before the L2 sync.
// This is a private field and cannot be configured.
dontDoReorgCheckBeforeL2Sync bool
Contributor


why is this param needed?

Contributor Author


The unit test was failing because there is an extra call to CheckReorg before starting the L2 sync. With this flag we skip it.

synchronizer/synchronizer.go
for {
	if err != nil {
		log.Errorf("error resetting the state to a discrepancy block. Retrying... Err: %v", err)
		continue
	}
Contributor


Is it good to block here forever? Maybe cap it at 10 retries or something like that.

@ARR552 ARR552 merged commit 9f7361d into release/v0.6.6 Apr 16, 2024
16 checks passed
@ARR552 ARR552 deleted the feature/3540-check_l1blocks branch April 16, 2024 09:01
ARR552 added a commit that referenced this pull request Apr 24, 2024
* wip

* run on background L1block checker

* fix lint and documentation

* fix conflict

* add unittest

* more unittest

* fix lint

* increase timeout for async unittest

* fix unittest

* rename GetResponse for GetResult and fix uniitest

* add a second gorutines for check the newest blocks

* more unittest

* add unittest and run also preCheck on launch

* by default Precheck from FINALIZED and SAFE

* fix unittest, apply PR comments

* changes suggested by ARR552 in integration method

* fix documentation

* import new network-l1-mock from PR#3553

* import new network-l1-mock from PR#3553

* import new network-l1-mock from PR#3553

* import new network-l1-mock from PR#3553

* fix unittest

* fix PR comments

* fix error

* checkReorgAndExecuteReset can't be call with lastEthBlockSynced=nil

* add parentHash to error

* fix error

* merge 3553 fix unittest

* fix unittest

* fix wrong merge

* adapt parallel reorg detection to flow

* fix unit tests

* fix log

* allow use sync parallel mode

---------

Co-authored-by: Alonso <ARR551@protonmail.com>
ARR552 added a commit that referenced this pull request Apr 25, 2024
* change number migration

* add column checked on state.block

* if no unchecked blocks  return ErrNotFound

* migration set to checked all but the block with number below max-1000

* add column checked on state.block (#3543)

* add column checked on state.block

* if no unchecked blocks  return ErrNotFound

* migration set to checked all but the block with number below max-1000

* Feature/#3549 reorgs improvement (#3553)

* New reorg function

* mocks

* linter

* Synchronizer tests

* new elderberry smc docker image

* new image

* logs

* fix json rpc

* fix

* Test sync from empty block

* Regular reorg case tested

* linter

* remove empty block + fix LatestSyncedBlockEmpty

* Improve check reorgs when no block is received during the call

* fix RPC error code for eth_estimateGas and eth_call for reverted tx and no return value; fix e2e test;

* fix test

* Extra unit test

* fix reorg until genesis

* disable parallel synchronization

---------

Co-authored-by: tclemos <thiago@polygon.technology>

* migrations

* Fix + remove empty blocks

* unit test

* linter

* Fix + remove empty blocks (#3564)

* Fix + remove empty blocks

* unit test

* linter

* Fix/#3565 reorg (#3566)

* fix + logs

* fix loop

* Revert "fix + logs"

This reverts commit 39ced69.

* fix L1InfoRoot when an error happens during the process of the L1 information (#3576)

* fix

* Comments + mock

* avoid error from some L1providers when fromBlock is higher than toBlock

* Revert some changes

* comments

* add L2BlockModulus to L1check

* doc

* fix dbTx = nil

* fix unit tests

* config

* fix sync unit test

* linter

* fix config param typo

* synchronizer:  check l1blocks (#3546)


* linter

* comment check

---------

Co-authored-by: tclemos <thiago@polygon.technology>
Stefan-Ethernal pushed a commit to 0xPolygon/cdk-validium-node that referenced this pull request Apr 25, 2024
Stefan-Ethernal pushed a commit to 0xPolygon/cdk-validium-node that referenced this pull request May 21, 2024
Stefan-Ethernal added a commit to 0xPolygon/cdk-validium-node that referenced this pull request May 22, 2024
* check GER and index of synced L1InfoRoot matches with sc values (0xPolygonHermez#3551)

* apply txIndex fix to StoreTransactions; add migration to fix wrong txIndexes (0xPolygonHermez#3556)

* Feature/0xPolygonHermez#3549 reorgs improvement (0xPolygonHermez#3553)


* Fix adding tx that matches with tx that is being processed (0xPolygonHermez#3559)

* fix adding  tx that matches (same addr and nonce) tx that is being processing

* fix generate mocks

* fix updateCurrentNonceBalance

* synchronizer:  check l1blocks (0xPolygonHermez#3546)


* Fix + remove empty blocks (0xPolygonHermez#3564)

* Fix + remove empty blocks

* unit test

* linter

* Fix/0xPolygonHermez#3565 reorg (0xPolygonHermez#3566)

* fix + logs

* fix loop

* Revert "fix + logs"

This reverts commit 39ced69.

* fix L1InfoRoot when an error happens during the process of the L1 information (0xPolygonHermez#3576)

* fix

* Comments + mock

* avoid error from some L1providers when fromBlock is higher than toBlock

* Revert some changes

* comments

* add L2BlockModulus to L1check

* doc

* fix dbTx = nil

* fix unit tests

* added logs to analyze blocking issue when storing L2 block

* add debug logs for datastreamer

* fix 0xPolygonHermez#3581 synchronizer panic synchronizing from trusted node (0xPolygonHermez#3582)

* synchronized: 0xPolygonHermez#3583  stop sync from l2 after no closed batch (0xPolygonHermez#3584)

* stop processing trusted Node after first open batch

* Update datastream lib to the latest version with additional debug info

* update dslib client interface

* Update the diff

* Fix non-e2e tests

* Update the docker image for the mock L1 network

* Update the diff

* Fix typo in the comment

* Use the Geth v1.13.11 Docker image and update the genesis spec

* Update the diff

---------

Co-authored-by: agnusmor <100322135+agnusmor@users.noreply.github.com>
Co-authored-by: Thiago Coimbra Lemos <tclemos@users.noreply.github.com>
Co-authored-by: Alonso Rodriguez <ARR552@users.noreply.github.com>
Co-authored-by: tclemos <thiago@polygon.technology>
Co-authored-by: Joan Esteban <129153821+joanestebanr@users.noreply.github.com>
Co-authored-by: Alonso <ARR551@protonmail.com>
Co-authored-by: agnusmor <agnusmor@gmail.com>
Co-authored-by: dPunisher <dpunish3r@users.noreply.github.com>

Successfully merging this pull request may close these issues.

synchronizer: double-check old L1 block to detect reorgs
3 participants