fix: determine F3 participants relative to current network name #12597
Merged
masih force-pushed the masih/repeat-f3-tests-on-ci branch 6 times, most recently from f957084 to 610be8c (October 15, 2024 14:10).
Repeat F3 itests on CI to investigate intermittent failures.
masih force-pushed the masih/repeat-f3-tests-on-ci branch 2 times, most recently from 1a2cd45 to ba33546 (October 16, 2024 14:58).
When the manifest changes, depending on the timing, newly generated valid leases can get removed if the sign message loop attempts to sign messages that result from progressing the previous network.

Here is an example scenario, in a specific order, that was causing itests to fail:

* participants get a lease for network A up to instance 5
* network A progresses to instance 6
* manifest changes the network name to B
* participants get a new lease for network B up to instance 5
* sign loop receives a message from network A, instance 6
* `getParticipantsByInstance` lazily removes leases since it only checks the instance
* the node ends up with no participants and is stuck

To fix this:

1) check if the participants asked for are within the current network, and if not, refuse to participate.
2) check the network name, as well as the instance, to lazily remove expired leases.
To aid debugging failing tests, add an option to print the progress of all nodes at every eventual assertion, disabled by default.
masih force-pushed the masih/repeat-f3-tests-on-ci branch from ba33546 to 0a15c68 (October 16, 2024 15:51).
Defaults are based on epoch of 30s and real RTT. Shorten Delta and rebroadcast times.
masih force-pushed the masih/repeat-f3-tests-on-ci branch from 0a15c68 to 0d3cb66 (October 16, 2024 15:59).
masih changed the title from "Investigate intermittent F3 itest failures on CI" to "fix: determine F3 participants relative to current network name" on Oct 16, 2024.
Stebalien approved these changes on Oct 16, 2024.
Stebalien reviewed on Oct 16, 2024.
Stebalien approved these changes on Oct 16, 2024.
Kubuxu approved these changes on Oct 17, 2024.
Kubuxu pushed a commit that referenced this pull request on Oct 21, 2024:

* Investigate intermittent F3 itest failures on CI

  Repeat F3 itests on CI to investigate intermittent failures.

* Fix participation lease removal for wrong network

  When manifest changes, depending on the timing it is possible for newly generated valid leases to get removed if the sign message loop attempts to sign messages that are as a result of progressing previous network.

  Here is an example scenario in a specific order that was causing itests to fail:

  * participants get a lease for network A up to instance 5
  * network A progresses to instance 6
  * manifest changes the network name to B
  * participants get a new lease for network B up to instance 5
  * sign loop receives a message from network A, instance 6
  * `getParticipantsByInstance` lazily removes leases since it only checks the instance.
  * the node ends up with no participants, and stuck.

  To fix this:

  1) check if participants asked for are within the current network, and if not refuse to participate.
  2) check network name, as well as instance, to lazily remove expired leases.

* Add debug capability to F3 itests to print current progress

  To aid debugging failing tests add option to print progress of all nodes at every eventual assertion, disabled by default.

* Shorten GPBFT settings for a more responsive timing

  Defaults are based on epoch of 30s and real RTT. Shorten Delta and rebroadcast times.

* Remove F3 itest repetitions on CI now that saul goodman

  See proof of the pudding:

  * https://github.com/filecoin-project/lotus/actions/runs/11369403828/job/31626763159?pr=12597

* Update the changelog

* Address review comments

* Remove the sanity check that all nodes use the same initial manifest

Signed-off-by: Jakub Sztandera <oss@kubuxu.com>
rjan90 pushed a commit that referenced this pull request on Oct 24, 2024.
rjan90 pushed a commit that referenced this pull request on Oct 28, 2024.
Related Issues
Fixes #12519
Proposed Changes
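When the manifest changes, depending on the timing, newly generated valid leases can get removed if the sign message loop attempts to sign messages that result from progressing the previous network.

Here is an example scenario, in a specific order, that was causing itests to fail:

* participants get a lease for network A up to instance 5
* network A progresses to instance 6
* manifest changes the network name to B
* participants get a new lease for network B up to instance 5
* sign loop receives a message from network A, instance 6
* `getParticipantsByInstance` lazily removes leases since it only checks the instance
* the node ends up with no participants and is stuck

To fix this:

1) check if the participants asked for are within the current network, and if not, refuse to participate.
2) check the network name, as well as the instance, to lazily remove expired leases.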
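
The two checks in this fix (refuse requests for a network other than the current one, and compare the network name as well as the instance before dropping a lease) can be sketched roughly as below. This is a minimal illustration and not the code in this PR: `lease`, `leaser`, `participate`, `setNetwork` and `participantsFor` are hypothetical names, and treating "up to instance 5" as inclusive is an assumption.

```go
package main

import (
	"errors"
	"sort"
	"sync"
)

// lease is a hypothetical participation lease: the holder may sign for
// instances up to and including validTo (assumed inclusive), on one network.
type lease struct {
	network string
	validTo uint64
}

// errNotCurrentNetwork is returned when participants are requested for a
// network other than the one the node is currently on.
var errNotCurrentNetwork = errors.New("participants requested for a non-current network")

// leaser tracks one lease per participant ID. All names are illustrative.
type leaser struct {
	mu             sync.Mutex
	currentNetwork string
	leases         map[uint64]lease
}

func newLeaser(network string) *leaser {
	return &leaser{currentNetwork: network, leases: make(map[uint64]lease)}
}

// setNetwork models a manifest change that renames the network.
func (l *leaser) setNetwork(network string) {
	l.mu.Lock()
	defer l.mu.Unlock()
	l.currentNetwork = network
}

// participate grants or renews a lease on the current network.
func (l *leaser) participate(id, validTo uint64) {
	l.mu.Lock()
	defer l.mu.Unlock()
	l.leases[id] = lease{network: l.currentNetwork, validTo: validTo}
}

// participantsFor returns the sorted IDs holding a valid lease for the given
// network and instance, lazily removing expired leases along the way.
func (l *leaser) participantsFor(network string, instance uint64) ([]uint64, error) {
	l.mu.Lock()
	defer l.mu.Unlock()
	// Check 1: a request for a previous network is refused outright, so it
	// never gets the chance to remove leases that belong to the current one.
	if network != l.currentNetwork {
		return nil, errNotCurrentNetwork
	}
	var ids []uint64
	for id, ls := range l.leases {
		// Check 2: a lease is expired only if it belongs to another network
		// or the instance is past its validity, not on the instance alone.
		if ls.network != network || instance > ls.validTo {
			delete(l.leases, id) // deleting during range is safe in Go
			continue
		}
		ids = append(ids, id)
	}
	sort.Slice(ids, func(i, j int) bool { return ids[i] < ids[j] })
	return ids, nil
}
```

Replaying the failing scenario against this sketch: after `setNetwork("B")` and a fresh `participate(1, 5)`, a late `participantsFor("A", 6)` returns `errNotCurrentNetwork` and leaves the new lease in place, so a following `participantsFor("B", 3)` still returns participant 1.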
To aid debugging failing tests in the future, add an option to print the progress of all nodes at every eventual assertion, disabled by default.
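
As a rough idea of what printing progress at every eventual assertion could look like, here is a small polling helper with an optional log sink. It is a sketch under assumptions, not the itest code from this PR: `eventually`, `logf` and `progress` are hypothetical names, and passing a nil `logf` stands in for "disabled by default".

```go
package main

import "time"

// eventually polls cond every tick until it returns true or timeout elapses.
// If logf is non-nil, the string returned by progress is logged before each
// check, so a stuck test shows where every node was when it gave up.
func eventually(
	cond func() bool,
	timeout, tick time.Duration,
	logf func(format string, args ...any),
	progress func() string,
) bool {
	deadline := time.Now().Add(timeout)
	for {
		if logf != nil {
			logf("progress: %s", progress())
		}
		if cond() {
			return true
		}
		if time.Now().After(deadline) {
			return false
		}
		time.Sleep(tick)
	}
}
```

In a test, `logf` could be `t.Logf`, and `progress` could format the current instance of each node; both are left to the caller so the helper stays independent of any particular test kit.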
Additional Info
Look closely at the commits in this PR. The commits introduce a dedicated CI job that repeatedly runs the flaky tests (50 times) to assert that they are indeed fixed. The job is then removed in later commits.
These commits are left here for the benefit of the reviewer as proof of the pudding.
Checklist
Before you mark the PR ready for review, please make sure that: