
bug(l1): connection attempt on already connected peer due to revalidation #1684

Closed
fmoletta opened this issue Jan 10, 2025 · 0 comments · Fixed by #1809
Assignees: Arkenan
Labels: L1, network (Issues related to network communication)

Comments

@fmoletta
Contributor

During peer revalidation we ping the least recently pinged peers and expect them to reply with a pong message. When a peer replies with a pong, we also try to initiate a connection with it. If a peer we are already connected to falls under this revalidation, we attempt to initiate a second connection with it, and the peer responds by disconnecting from us.
Here are some logs to illustrate the problem:

2025-01-09T19:24:05.888572Z  INFO ethrex_net::kademlia: Snap Peers: 2 / Active Peers 2 / Total Peers: 101
2025-01-09T19:24:05.888674Z  INFO ethrex_net::kademlia: Active Peers ID: 0xa343…5072, 0xac90…5e2b
2025-01-09T19:24:05.888748Z  INFO ethrex_net::sync: Requesting Block Headers from 0xde8f…79e7
2025-01-09T19:24:05.889382Z  INFO ethrex_net: Running peer revalidation
2025-01-09T19:24:05.891256Z  INFO ethrex_net: Pinging peer 0xa343…5072 to re-validate!
2025-01-09T19:24:05.892743Z  INFO ethrex_net: Pinging peer 0x0010…e3e7 to re-validate!
2025-01-09T19:24:05.894219Z  INFO ethrex_net: Pinging peer 0x743d…7122 to re-validate!
2025-01-09T19:24:05.894262Z  INFO ethrex_net: Peer revalidation finished
2025-01-09T19:24:05.926046Z  INFO ethrex_net: Peer 0xa343…5072 answered ping with pong
2025-01-09T19:24:05.929520Z  INFO ethrex_net: Peer 0x743d…7122 answered ping with pong
2025-01-09T19:24:05.958563Z  INFO ethrex_net: Starting Peer as Initiator
2025-01-09T19:24:05.970228Z  INFO ethrex_net: Starting Peer as Initiator
2025-01-09T19:24:06.006510Z ERROR ethrex_net::rlpx::connection: Handshake failed: (Peer disconnected due to: Already connected), discarding peer 0xa343…5072

We should read the spec in order to handle this case accordingly. (A toy sketch of the current flow follows below.)
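To make the reported flow concrete, the following is a minimal, self-contained sketch of a pong handler that opens a connection unconditionally, which is the behaviour described above. Every name in it (`PeerTable`, `NodeId`, `on_pong`, `start_connection`) is an illustrative placeholder, not ethrex's actual API.

```rust
use std::collections::HashSet;

// Toy model only: none of these names correspond to ethrex's real code.
type NodeId = [u8; 32];

#[derive(Default)]
struct PeerTable {
    /// Peers we already hold an active RLPx connection with.
    active_connections: HashSet<NodeId>,
}

impl PeerTable {
    /// Called when a revalidation ping is answered with a pong.
    fn on_pong(&mut self, node_id: NodeId) {
        // The behaviour reported in this issue: the connection attempt is made
        // unconditionally. If `node_id` is already in `active_connections`,
        // the remote peer answers the second handshake with a disconnect
        // ("Already connected"), exactly as the logs above show.
        self.start_connection(node_id);
    }

    fn start_connection(&mut self, node_id: NodeId) {
        println!("Starting Peer as Initiator");
        self.active_connections.insert(node_id);
    }
}

fn main() {
    let mut table = PeerTable::default();
    let peer = [0u8; 32];
    table.on_pong(peer); // first pong: a fresh connection is started
    table.on_pong(peer); // peer already connected: a duplicate attempt is still made
}
```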

@fmoletta added the bug (Something isn't working) and L1 labels on Jan 10, 2025
@mpaulucci added the network (Issues related to network communication) label on Jan 10, 2025
@mpaulucci added this to the [L1] 4 - P2P Network milestone on Jan 10, 2025
@Arkenan self-assigned this on Jan 10, 2025
@mpaulucci removed the bug (Something isn't working) label on Jan 21, 2025
github-merge-queue bot pushed a commit that referenced this issue Jan 27, 2025
**Description**

Changes:

- Adds an `is_connected` bool to the peer data in the kademlia table, which
starts as false.
- Sets `is_connected` to true when the handshake completes.
- It never needs to be set back to false, since the record is deleted when the
connection fails.
- Checks `is_connected` directly in the kademlia table before starting a
connection on Pong (a hedged sketch of this follows below).

Closes #1684
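A minimal sketch of the mechanism this commit describes, assuming hypothetical type and method names (`KademliaTable`, `PeerData`, `insert_peer`, `mark_connected`, `should_connect_on_pong`); only the `is_connected` flag itself comes from the description above:

```rust
use std::collections::HashMap;

// Illustrative names only; ethrex's actual table and peer types may differ.
type NodeId = [u8; 32];

struct PeerData {
    /// Starts as false and is flipped to true once the RLPx handshake completes.
    is_connected: bool,
}

#[derive(Default)]
struct KademliaTable {
    peers: HashMap<NodeId, PeerData>,
}

impl KademliaTable {
    /// Record a newly discovered peer (not yet connected).
    fn insert_peer(&mut self, node_id: NodeId) {
        self.peers.insert(node_id, PeerData { is_connected: false });
    }

    /// Called once the handshake with a peer completes successfully.
    fn mark_connected(&mut self, node_id: NodeId) {
        if let Some(peer) = self.peers.get_mut(&node_id) {
            peer.is_connected = true;
        }
        // No need to ever reset the flag: the record is deleted when the
        // connection fails, so a stale `true` cannot survive.
    }

    /// Pong handler check: only start a connection if not already connected.
    fn should_connect_on_pong(&self, node_id: NodeId) -> bool {
        matches!(self.peers.get(&node_id), Some(peer) if !peer.is_connected)
    }
}

fn main() {
    let mut table = KademliaTable::default();
    let peer = [0u8; 32];
    table.insert_peer(peer);
    assert!(table.should_connect_on_pong(peer)); // not connected yet: connect
    table.mark_connected(peer);
    assert!(!table.should_connect_on_pong(peer)); // already connected: skip
}
```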
github-merge-queue bot pushed a commit that referenced this issue Jan 29, 2025
**Motivation**
This PR introduces the following upgrades for snap-sync:
- Use DB-persisted checkpoints so that sync progress survives restarts & cycles
- Stop ForkChoiceUpdates & NewPayloads from being applied while syncing
- Improved handling of stale pivot during sub-processes
- Improved handling of pending requests when aborting due to stale pivot
- Fetching of large storage tries (that don't fit in a single range
request)
- Safer (but a bit slower) healing that can be restarted
- Faster storage fetching (multiple parallel fetches)
- Periodically shows state sync progress

And also simplifies it by removing the following logic:
- No longer downloads bodies and receipts for blocks before the pivot
during snap sync (WARNING: this goes against the spec but shouldn't be a
problem for the time being)
- Removes restart from latest block when latest - 64 becomes stale. (By
this point it is more effective to wait for the next fork choice update)

**Description**
- Stores the last downloaded block's hash in the DB during snap sync to
serve as a checkpoint if the sync is aborted halfway (a common case when
syncing from genesis). This checkpoint is cleared upon successful snap
sync.
- No longer fetches receipts or block bodies past the pivot block during
snap sync
- Adds a `sync_status` method which returns an enum with the current sync
status (Inactive, Active or Pending) and uses it in the ForkChoiceUpdate &
NewPayload engine RPC endpoints so that their logic is not applied during an
active or pending sync (see the sketch at the end of this description).
- Fetcher processes now identify stale pivots and remain passive until
they receive the end signal
- Fetcher processes now return their current queue when they finish so that
it can be persisted into the next cycle
- Stores the latest state root during state sync and healing as a
checkpoint
- Stores the last fetched key during state sync as a checkpoint
- Healing no longer stores the nodes received via p2p; it instead
inserts the leaf values and rebuilds the trie, to avoid trie corruption
between restarts.
- The current progress percentage and estimated time to finish are
periodically reported during state sync
- Disables the following Paris & Cancun engine hive tests that
previously yielded false positives due to new payloads being accepted on
top of a syncing chain:

   * Invalid NewPayload (family)
   * Re-Org Back to Canonical Chain From Syncing Chain
   * Unknown HeadBlockHash
   * In-Order Consecutive Payload Execution (Flaky)
   * Valid NewPayload->ForkchoiceUpdated on Syncing Client
   * Invalid Missing Ancestor ReOrg
   * Payload Build after New Invalid Payload (only Cancun)

- Also disables the following tests that fail with the flag Syncing=true for
the same reason:

   * Bad Hash on NewPayload
   * ParentHash equals BlockHash on NewPayload (only for Paris)
   * Invalid PayloadAttributes (family)

Misc:
- Replaces some noisy unwraps in the networking module with errors
- Applies annotated hacky fixes for the problems reported in #1684, #1685 &
#1686
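To illustrate the `sync_status` gating mentioned in the description, here is a minimal sketch. Only the method name `sync_status` and the Inactive/Active/Pending variants come from the text above; `SyncManager` and `should_apply_engine_request` are hypothetical names used for illustration.

```rust
// Illustrative sketch; ethrex's actual sync manager and engine handlers differ.
#[derive(Clone, Copy, PartialEq, Eq, Debug)]
enum SyncStatus {
    Inactive,
    Active,
    Pending,
}

struct SyncManager {
    status: SyncStatus,
}

impl SyncManager {
    /// The method described above: reports the current sync status.
    fn sync_status(&self) -> SyncStatus {
        self.status
    }
}

/// The ForkChoiceUpdate & NewPayload engine RPC handlers would consult this
/// before applying their logic, returning early (e.g. with a SYNCING-style
/// response) while a sync is active or pending.
fn should_apply_engine_request(sync: &SyncManager) -> bool {
    matches!(sync.sync_status(), SyncStatus::Inactive)
}

fn main() {
    let syncing = SyncManager { status: SyncStatus::Active };
    assert!(!should_apply_engine_request(&syncing));

    let idle = SyncManager { status: SyncStatus::Inactive };
    assert!(should_apply_engine_request(&idle));
}
```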
Projects
Status: Done
3 participants