fix local-cluster-flakey tests #29221

Closed
HaoranYi opened this issue Dec 12, 2022 · 2 comments · Fixed by #35356

@HaoranYi
Contributor

Problem

Investigate and fix local_cluster_flakey tests

An example failure log

[2022-12-12T21:19:22.949402625Z INFO  local_cluster_flakey] Create validator C's ledger
[2022-12-12T21:19:22.992800408Z INFO  local_cluster_flakey] Create validator A's ledger
[2022-12-12T21:19:23.079169152Z INFO  local_cluster_flakey] Checking A's tower for a vote on slot descended from slot `next_slot_on_a`
[2022-12-12T21:19:23.098472257Z INFO  local_cluster_flakey] Removing tower!
[2022-12-12T21:19:23.099375667Z INFO  local_cluster_flakey] Restart validator C again!!!
[2022-12-12T21:19:26.123293290Z INFO  local_cluster_flakey] collected validator C's votes: {29, 30, 31, 32}
[2022-12-12T21:19:26.123306395Z INFO  local_cluster_flakey] Restart validator A again!!!
[2022-12-12T21:19:26.633276484Z ERROR solana_core::validator] Rebuilding a new tower from the latest vote account due to failed tower restore: IO Error: No such file or directory (os error 2)
[2022-12-12T21:19:38.035273429Z INFO  local_cluster_flakey] Observed A's votes on: [26, 26, 26, 26, 27, 27, 27, 27,
27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27,
27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27,
27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27,
27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27,
27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27, 27]
thread 'main' panicked at 'Violation expected because of removed persisted tower!', local-cluster/tests/local_cluster_flakey.rs:351:13
stack backtrace:
   0: rust_begin_unwind
             at /rustc/897e37553bba8b42751c67658967889d11ecd120/library/std/src/panicking.rs:584:5
   1: core::panicking::panic_fmt
             at /rustc/897e37553bba8b42751c67658967889d11ecd120/library/core/src/panicking.rs:142:14
   2: local_cluster_flakey::do_test_optimistic_confirmation_violation_with_or_without_tower
   3: serial_test::serial_code_lock::local_serial_core
   4: core::ops::function::FnOnce::call_once
   5: core::ops::function::FnOnce::call_once
             at /rustc/897e37553bba8b42751c67658967889d11ecd120/library/core/src/ops/function.rs:248:5
note: Some details are omitted, run with `RUST_BACKTRACE=full` for a verbose backtrace.
FAILED

failures:

failures:
    test_optimistic_confirmation_violation_without_tower

A rerun with a success log

[2022-12-12T21:52:58.443640037Z INFO  local_cluster_flakey] Waiting on both validators A and B to vote on fork at slot 27
[2022-12-12T21:53:13.269944257Z INFO  local_cluster_flakey] Create validator C's ledger
[2022-12-12T21:53:13.390689919Z INFO  local_cluster_flakey] Create validator A's ledger
[2022-12-12T21:53:13.647698343Z INFO  local_cluster_flakey] Checking A's tower for a vote on slot descended from slot `next_slot_on_a`
[2022-12-12T21:53:13.697183806Z INFO  local_cluster_flakey] Removing tower!
[2022-12-12T21:53:13.701354097Z INFO  local_cluster_flakey] Restart validator C again!!!
[2022-12-12T21:53:18.901249687Z INFO  local_cluster_flakey] collected validator C's votes: {29, 30, 31, 32}
[2022-12-12T21:53:18.901300033Z INFO  local_cluster_flakey] Restart validator A again!!!
[2022-12-12T21:53:21.553643995Z ERROR solana_core::validator] Rebuilding a new tower from the latest vote account due to failed tower restore: IO Error: No such file or directory (os error 2)
[2022-12-12T21:53:22.179317214Z INFO  local_cluster_flakey] Observed A's votes on: [26, 26, 26, 26, 30]
[2022-12-12T21:53:22.179366507Z INFO  local_cluster_flakey] THIS TEST expected violations. And indeed, there was some, because of removed persisted tower.
test test_optimistic_confirmation_violation_without_tower ... ok

test result: ok. 2 passed; 0 failed; 0 ignored; 0 measured; 0 filtered out; finished in 111.31s

Proposed Solution

Debug and fix the failure to make the test robust

@HaoranYi HaoranYi self-assigned this Dec 12, 2022
@AshwinSekar AshwinSekar added the consensus Issues related to consensus bugs in the validator label Dec 14, 2022
@AshwinSekar
Contributor

AshwinSekar commented Dec 23, 2022

Based on the logs above, it seems the issue is that when A restarts, it is able to repair slot 27 through D. I will confirm once I'm able to reproduce it. The easiest solution seems to be to just kill D during A's restart, although there might be a gossip discovery issue.
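
To make the difference between the two logs concrete, here is a minimal Rust sketch that compares A's observed votes with the slots C voted on. It is an illustration only: `slots_shared_with_c` is a hypothetical helper, not code from local_cluster_flakey.rs, and treating "A voted on a slot that C also voted on" as the sign of a violation is an assumption read off the two logs, not the test's actual assertion. The failing vote list is abbreviated.

```rust
use std::collections::BTreeSet;

/// Hypothetical helper (not part of local_cluster_flakey.rs): returns the
/// distinct slots, in ascending order, that appear both in A's observed
/// votes and in C's collected votes.
fn slots_shared_with_c(a_votes: &[u64], c_votes: &BTreeSet<u64>) -> Vec<u64> {
    // Collecting into a BTreeSet removes duplicates and sorts the slots.
    let shared: BTreeSet<u64> = a_votes
        .iter()
        .copied()
        .filter(|slot| c_votes.contains(slot))
        .collect();
    shared.into_iter().collect()
}

fn main() {
    // "collected validator C's votes: {29, 30, 31, 32}" in both logs.
    let c_votes: BTreeSet<u64> = [29u64, 30, 31, 32].iter().copied().collect();

    // Failing run: A only ever votes on 26 and 27 (the long run of 27s is
    // abbreviated here).
    let failing_run: Vec<u64> = vec![26, 26, 26, 26, 27, 27, 27, 27];
    // Passing run: A votes on 30, a slot C also voted on.
    let passing_run: Vec<u64> = vec![26, 26, 26, 26, 30];

    println!("failing run shares {:?} with C", slots_shared_with_c(&failing_run, &c_votes));
    println!("passing run shares {:?} with C", slots_shared_with_c(&passing_run, &c_votes));
}
```

Under that assumption, the failing run shares no slots with C because A stays on slot 27, which fits the hypothesis that A repaired 27 through D, while the passing run shares slot 30.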

@AshwinSekar
Contributor

I was unable to reproduce this locally after running the test for a couple of days, so I added more logging to hopefully catch it in a future CI failure.
