Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix [MQB]: mqbc::StorageMgr: Transition to available only when all primary active #416

Open
wants to merge 5 commits into
base: main
Choose a base branch
from

Conversation

kaikulimu
Copy link
Collaborator

@kaikulimu kaikulimu commented Sep 4, 2024

Fixes two flaky integration tests in FSM mode:

  1. In test_basic of test_restart.py, there was an issue where a replica could advertise availability before all primaries are active; then, a proxy could repoen queue and post message with no avail. The fix is to transition to available only when all primaries are active.

  2. In test_kill_post_start of test_strong_consistency.py, there was an issue where replicas are not issuing receipts to the primary after restart. This was because healing replicas in FSM mode were not buffering primary status advisories to process later, and thus not setting the correct primary in the FileStore. After I added the buffering logic, the tests pass.

@kaikulimu kaikulimu force-pushed the fsm-test--available-only-active-primary branch 3 times, most recently from 80f71b6 to e5d59b2 Compare September 10, 2024 21:54
@kaikulimu kaikulimu marked this pull request as ready for review September 11, 2024 13:59
@kaikulimu kaikulimu requested a review from a team as a code owner September 11, 2024 13:59
Copy link
Collaborator

@dorjesinpo dorjesinpo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Couple questions...


pinfo.setPrimaryStatus(value);
if (bmqp_ctrlmsg::PrimaryStatus::E_ACTIVE == value) {
d_fileStores[partitionId]->setPrimary(pinfo.primary(),
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why is this in setPrimaryStatusForPartition and not in setPrimaryForPartition?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

FileStore should only set primary after the primary becomes active. The FileStore is unable to work with a passive primary. I will rename the function as FileStore::setActivePrimary to make the point clear.

<< " primary, this advisory could "
<< "be from the true one. Will"
<< " buffer the advisory for now.";
d_storageManager_p->bufferPrimaryStatusAdvisory(primaryAdv,
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Question, does pinfo.primaryNode() get assigned upon PrimaryStatusAdvisory or in the FSM flow there is another trigger? We receive PrimaryStatusAdvisory and we do not have partition primary, why not assign the primary then?

Copy link
Collaborator Author

@kaikulimu kaikulimu Sep 11, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In FSM mode, the source of truth for partition assignments are in the cluster state snapshot of the CSL file. As part of healing, a new leader assigns partitions and then applies the assignments in its first CSL advisory. Primary status adviosries can be stale; that's why we have a lot of checks in this function in the first place. My original idea was to simply ignore all primary status advisories and purely rely upon FSM for partition assignments. However, FSM can heal a replica but neglect to set a primary as active. Thus, I came up with the idea of buffering primary status advisories. If an advisory is not stale (i.e. matching primary node and leaseId), then we trust the availability advisory.

Copy link

@bmq-oss-ci bmq-oss-ci bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Build 250 of commit e5d59b2 has completed with FAILURE

@kaikulimu kaikulimu force-pushed the fsm-test--available-only-active-primary branch from 835ba43 to 0116bdf Compare September 11, 2024 22:22
Copy link

@bmq-oss-ci bmq-oss-ci bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Build 251 of commit 0116bdf has completed with FAILURE

@kaikulimu
Copy link
Collaborator Author

@dorjesinpo Back to you

@kaikulimu kaikulimu force-pushed the fsm-test--available-only-active-primary branch 2 times, most recently from 06c92b2 to 337db61 Compare September 17, 2024 10:29
Copy link

@bmq-oss-ci bmq-oss-ci bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Build 256 of commit 337db61 has completed with FAILURE

Signed-off-by: Yuan Jing Vincent Yan <yyan82@bloomberg.net>
Signed-off-by: Yuan Jing Vincent Yan <yyan82@bloomberg.net>
Signed-off-by: Yuan Jing Vincent Yan <yyan82@bloomberg.net>
Signed-off-by: Yuan Jing Vincent Yan <yyan82@bloomberg.net>
Signed-off-by: Yuan Jing Vincent Yan <yyan82@bloomberg.net>
@kaikulimu kaikulimu force-pushed the fsm-test--available-only-active-primary branch from 337db61 to 8059ba6 Compare September 19, 2024 14:31
Copy link

@bmq-oss-ci bmq-oss-ci bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Build 266 of commit 8059ba6 has completed with FAILURE

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants