Handling Deploys without Rebalancing (Proposal) #72

Open

wedamija (Member) opened this issue Jun 26, 2024 · 0 comments

One problem with our system is that a lot of rebalancing can happen during deploys. If our consumers take a while to start, that can mean downtime and missed checks during deploys. Ideally, we want to avoid rebalancing during a deploy as much as possible, and the method proposed below might allow us to do that.

One major characteristic of our system is that we never commit offsets. Instead, each consumer reads every partition it is assigned from the beginning to build up its config state. One side effect of this is that the consumer group doesn't need to stay the same between deploys, since the main reason to keep a consistent consumer group across deploys is to continue reading from the last committed offset.
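
To make that concrete, here's a minimal sketch of a consumer that never commits offsets and rebuilds its config state from the full partition on assignment. This is Python with confluent-kafka; the topic name, group name, and message shape are made up for illustration, not our actual implementation:

```python
# Sketch: a consumer that never commits offsets and instead rebuilds its
# config state by reading each assigned partition from the start.
import json
from confluent_kafka import Consumer, OFFSET_BEGINNING

consumer = Consumer({
    "bootstrap.servers": "localhost:9092",
    "group.id": "uptime-config-consumers",  # hypothetical group name
    "enable.auto.commit": False,            # we never commit offsets
    "auto.offset.reset": "earliest",
})

def on_assign(consumer, partitions):
    # Force every newly assigned partition back to the beginning so the
    # full config state for that partition is rebuilt in memory.
    for p in partitions:
        p.offset = OFFSET_BEGINNING
    consumer.assign(partitions)

consumer.subscribe(["uptime-configs"], on_assign=on_assign)  # hypothetical topic

config_state = {}  # subscription_id -> latest config, built from the full log
while True:
    msg = consumer.poll(1.0)
    if msg is None or msg.error():
        continue
    config = json.loads(msg.value())
    config_state[config["subscription_id"]] = config
```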

As well as this, each check result we produce has a deterministic id: a hash of the subscription_id and the time the check was intended to run (not the time it actually ran!). That lets us de-duplicate results on the consumer, so we don't have to worry about double processing: during a deploy it's fine to run duplicate checks, since the duplicate results will be discarded.
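
A minimal sketch of that deterministic id; the field names and the specific hash (SHA-1 here) are assumptions, not the actual implementation:

```python
import hashlib
from datetime import datetime, timezone

def check_result_id(subscription_id: str, scheduled_at: datetime) -> str:
    # Uses the time the check was *intended* to run, never the actual run
    # time, so duplicate runs of the same check produce the same id.
    raw = f"{subscription_id}:{scheduled_at.isoformat()}"
    return hashlib.sha1(raw.encode("utf-8")).hexdigest()

# Two runs of the same scheduled check (e.g. one from each deploy) collide:
scheduled = datetime(2024, 6, 26, 12, 0, tzinfo=timezone.utc)
assert check_result_id("sub-123", scheduled) == check_result_id("sub-123", scheduled)
```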

Given all of this, we can avoid rebalancing during deploys using the following method:

  • Whenever we deploy, we generate a new random consumer group that is specific to that deploy (the commit id? the deploy id?). All consumers being brought up in that deploy start up using the new consumer group. Since this group is not shared with the current deploy, no rebalancing happens. Instead, the new deploy starts up, gets the last fully processed tick from our shared data store, and starts processing (a sketch of this follows the list). At this point we are duplicating checks, but this won't affect our results since we can de-duplicate them.
  • Once we've determined the new deploy is healthy, we kill the old deploy. Since it belongs to a separate consumer group, again no rebalance happens.
  • Boom, blue-green deploy.
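
Here's a sketch of the per-deploy group id, assuming the deploy exposes a commit/deploy id via an environment variable; the variable name, group prefix, and topic are all hypothetical:

```python
import os
from confluent_kafka import Consumer

deploy_id = os.environ.get("DEPLOY_ID", "local-dev")  # e.g. the commit sha

consumer = Consumer({
    "bootstrap.servers": "localhost:9092",
    # Unique per deploy: the new deploy never joins the old deploy's group,
    # so bringing it up (or tearing the old one down) triggers no rebalance.
    "group.id": f"uptime-checker-{deploy_id}",
    "enable.auto.commit": False,  # we still never commit offsets
})
consumer.subscribe(["uptime-configs"])

# On startup the new deploy reads the last fully processed tick from the
# shared data store and resumes from there (see the sketch further down).
```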

The only part of this that might be a little tricky is the shared store with the ticks. It could be a bit weird if both deploys are committing to the same store, but I think it's actually fine. Ticks always increase monotonically, and it's safe to repeat a tick, so the worst case is that one deploy overwrites the tick of another, and if for some reason we restart we replay that tick. We also won't be constantly reading the tick from the data store: it is tracked in memory and only really read on a deploy or rebalance. So even if the two deploys do overwrite each other, eventually one is killed, the other keeps writing ticks, and the store is consistent.
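
A minimal sketch of that tick store, assuming Redis purely for illustration (the key name is made up). Because ticks increase monotonically and repeating a tick is safe, a plain last-writer-wins write is acceptable even while two deploys overlap:

```python
import redis

r = redis.Redis(host="localhost", port=6379)

def load_last_tick() -> int:
    # Read once on deploy/rebalance; afterwards the tick lives in memory.
    raw = r.get("uptime:last_processed_tick")
    return int(raw) if raw is not None else 0

def commit_tick(tick: int) -> None:
    # Overwrites from the other deploy are harmless: worst case a restart
    # replays one tick, and replayed results are de-duplicated by id.
    r.set("uptime:last_processed_tick", tick)
```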
