Feat[BMQ, MQB]: shutdown v2, optimizing shutdown logic #399

Merged
merged 3 commits into from
Sep 19, 2024

dorjesinpo
Collaborator

Changing shutdown logic to minimize control plane traffic.

  1. StopRequest is now per App (not Cluster)
  2. Downstreams of the shutting-down broker do not deconfigure queues; upstreams do.
  3. The shutting-down broker itself waits for unconfirmed messages (the downstreams do not).
DownStream              Shutting down broker:           Upstream

                            InitiateShutdown:
                
                            ClientSessions:
                                changes the state
                    
                            Broker sessions:
                    <--         StopRequest V2          -->

Clusters                                                    Clusters
and                                                             The ClusterNodeSession:
ClusterProxies (CQH):                                               QueueHandles:
    Queues:                                                             deconfigureAll
        buffer PUTs
                                                            ClusterProxies:
                                                                The ClientSession:
                                                                    QueueHandles:
                                                                        deconfigureAll
                    -->         StopResponse V2         <--
            
                            Clusters:
                                stopPushing
                                checkUnconfirmedV2
                        
                                continueShutdownDispatched
                                    change the state(s)
                                    cancelAllRequests
                                    delete (dirty) queues
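
The sequence in the diagram can also be read as code. What follows is a minimal, self-contained C++ model of it, not the broker implementation: `Peer`, `PeerRole`, `ShutdownResult`, `handleStopRequestV2`, and `shutdownV2` are hypothetical names made up for illustration. The model assumes one StopRequest per peer broker session and treats waiting for unconfirmed messages as draining a counter.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical model types; none of these exist in the broker code.
enum class PeerRole { Downstream, Upstream };

struct Peer {
    std::string name;
    PeerRole    role;
    int         openHandles;    // handles an upstream holds for the stopping broker
    bool        bufferingPuts;  // set once a downstream starts buffering PUTs
};

struct ShutdownResult {
    int                      stopRequestsSent;
    int                      stopResponsesReceived;
    std::vector<std::string> steps;  // order in which the stopping broker acted
};

// How a peer reacts to StopRequest V2 in this model.  Returning true stands
// in for sending StopResponse V2.
bool handleStopRequestV2(Peer& peer)
{
    if (peer.role == PeerRole::Downstream) {
        // Downstreams keep their queues configured and only buffer PUTs.
        peer.bufferingPuts = true;
    }
    else {
        // Upstreams deconfigure every handle opened by the stopping broker.
        peer.openHandles = 0;
    }
    return true;
}

// The shutting-down broker's side of the exchange.
ShutdownResult shutdownV2(std::vector<Peer>& peers, int& unconfirmed)
{
    ShutdownResult result{0, 0, {}};

    result.steps.push_back("ClientSessions: change state");

    // One request per peer broker session, not one per cluster (assumption).
    for (Peer& peer : peers) {
        result.steps.push_back("StopRequest V2 -> " + peer.name);
        ++result.stopRequestsSent;
        if (handleStopRequestV2(peer)) {
            ++result.stopResponsesReceived;
        }
    }
    assert(result.stopRequestsSent == result.stopResponsesReceived);

    result.steps.push_back("stopPushing");

    // The stopping broker, not its downstreams, waits for unconfirmed
    // messages.  Here each loop iteration stands in for one CONFIRM arriving.
    while (unconfirmed > 0) {
        --unconfirmed;
    }
    result.steps.push_back("checkUnconfirmedV2");

    result.steps.push_back("continueShutdownDispatched");
    return result;
}
```

Calling `shutdownV2` with one downstream and one upstream peer sends two requests in this model. The downstream ends up buffering PUTs with its queues still configured, and the upstream ends up with zero open handles.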

@dorjesinpo dorjesinpo requested a review from a team as a code owner August 15, 2024 14:03
@dorjesinpo dorjesinpo changed the title Shutdown V2 WIP Shutdown V2 Aug 15, 2024
@dorjesinpo dorjesinpo changed the title WIP Shutdown V2 Shutdown V2 Aug 15, 2024

@bmq-oss-ci bmq-oss-ci bot left a comment

Build 180 of commit e81caf3 has completed with FAILURE

@678098 678098 changed the title Shutdown V2 Feat[BMQ, MQB]: faster shutdown Aug 27, 2024
@pniedzielski pniedzielski self-requested a review August 27, 2024 18:03
@pniedzielski pniedzielski self-assigned this Aug 27, 2024
Collaborator

@pniedzielski pniedzielski left a comment

Nothing big. A few questions inline, a pervasive typo, and some comment/docs changes.

Overall comments

  • Once we remove the v1 shutdown code, this will be a bit simpler. I think then it might be worth taking a little time to look at simplifying the structure of some of the parts that we've touched here (the separation of labor between initializeShutdown vs stop in mqba_application won't really make sense at that stage).

  • Now that we're open-source, I think we need to be a bit more careful about which broker versions users may try to run. I already noted that we might want to turn off support for ungraceful shutdown, since all OSS brokers support graceful shutdown. But I think we should be intentional about turning off graceful shutdown v1, document when we do it, and provide an upgrade process for it (in the manner of https://kafka.apache.org/documentation/#upgrade).

  • I found this comment in mqbblp_clusterproxy's ClusterProxy::stopDispatched:

    // TBD:  In the future, we should review the entire shutdown logic, and
    //       eventually implement a request/response based mechanism in
    //       order to provide true graceful shutdown.
    

    A search revealed that this indeed was from before graceful shutdown logic v1, and from what I understand about this part of the code, this comment can safely be removed. I think changing some of the // temporary comments to something greppable as I noted below will mitigate things like this being left behind from this PR.

Merging

I think we'll need to unrevert Luke's admin API changes before we merge this in. So either you make the changes I suggest here and I approve before we unrevert, then approve again after you rebase; or you hold off on making the changes until after we unrevert and I approve then.

src/groups/mqb/mqbblp/mqbblp_routers.h (outdated, resolved)
src/groups/mqb/mqbi/mqbi_cluster.h (outdated, resolved)
src/groups/mqb/mqbi/mqbi_cluster.h (outdated, resolved)
src/groups/mqb/mqbi/mqbi_queue.h (resolved)
src/groups/mqb/mqba/mqba_adminsession.h (outdated, resolved)
{
    for (Apps::iterator it = d_apps.begin(); it != d_apps.end(); ++it) {
        it->value()->reset();
Collaborator

The post-conditions of QueueEngineUtil::AppState::reset are a little too intricate for my tastes; reset doesn't actually reset everything about the AppState... These two lines look like they should be reversed based on their names, but no, they are correct. I skimmed through the uses of reset and it's almost always followed by modifying d_routing_sp in some way.

Nothing to change in your PR about this, but I've written this down as something to revisit.

Collaborator Author

Should we change it? How about undoRouting?

Collaborator

@pniedzielski pniedzielski Sep 9, 2024

undoRouting is a good name for this. If you want to make that change in this PR or later, either way is okay by me. Low priority.

EDIT: I see you made the change; looks good.

src/groups/mqb/mqbblp/mqbblp_relayqueueengine.cpp (outdated, resolved)
src/groups/mqb/mqba/mqba_clientsession.h (outdated, resolved)
src/groups/bmq/bmqp/bmqp_requestmanager.h (resolved)
@dorjesinpo dorjesinpo changed the title Feat[BMQ, MQB]: faster shutdown Feat[BMQ, MQB]: shutdown v2, optimizing shutdown logic Aug 29, 2024
@dorjesinpo dorjesinpo force-pushed the dev/shutdown-v2 branch 3 times, most recently from 1312b28 to 711cad0 Compare September 3, 2024 17:53

@bmq-oss-ci bmq-oss-ci bot left a comment

Build 230 of commit 711cad0 has completed with FAILURE

@pniedzielski pniedzielski self-requested a review September 9, 2024 16:25
pniedzielski previously approved these changes Sep 9, 2024
Collaborator

@pniedzielski pniedzielski left a comment

👍🏻

Let me know if you need a re-approval after fixing merge conflicts for getting this into main.

pniedzielski previously approved these changes Sep 10, 2024
Collaborator

@pniedzielski pniedzielski left a comment

Reapproving after rebase to fix merge conflicts. Spot checked that no accidental changes were made.


@bmq-oss-ci bmq-oss-ci bot left a comment

Build 247 of commit b28cb4f has completed with FAILURE


@bmq-oss-ci bmq-oss-ci bot left a comment

Build 260 of commit c5b7c31 has completed with FAILURE

Signed-off-by: dorjesinpo <129227380+dorjesinpo@users.noreply.github.com>
Signed-off-by: dorjesinpo <129227380+dorjesinpo@users.noreply.github.com>
Signed-off-by: dorjesinpo <129227380+dorjesinpo@users.noreply.github.com>

@bmq-oss-ci bmq-oss-ci bot left a comment

Build 267 of commit 670dde5 has completed with FAILURE

@dorjesinpo dorjesinpo merged commit 9075a4b into main Sep 19, 2024
36 checks passed