Handling Deploys without Rebalancing (Proposal) #72

Open

wedamija (Member) opened this issue Jun 26, 2024 · 0 comments

One problem with our system is that a lot of rebalancing can happen during deploys. If our consumers take a while to start, that can mean downtime and missed checks during deploys. Ideally, we want to avoid rebalancing during a deploy as much as possible, and the method proposed below might allow us to do that.

One major characteristic of our system is that we never commit offsets. Instead, each consumer reads every partition it is assigned from the beginning to build up its config state. One side effect of this is that the consumer group doesn't need to stay the same between deploys, since the main reason to keep a consistent consumer group across deploys is to continue reading from the last committed offset.
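
To make that concrete, here's a minimal sketch of a consumer that never commits offsets and rebuilds its config state from the full partition on assignment. This is Python with confluent-kafka; the topic name, group name, and message shape are made up for illustration, not our actual implementation:

```python
# Sketch: a consumer that never commits offsets and instead rebuilds its
# config state by reading each assigned partition from the start.
import json
from confluent_kafka import Consumer, OFFSET_BEGINNING

consumer = Consumer({
    "bootstrap.servers": "localhost:9092",
    "group.id": "uptime-config-consumers",  # hypothetical group name
    "enable.auto.commit": False,            # we never commit offsets
    "auto.offset.reset": "earliest",
})

def on_assign(consumer, partitions):
    # Force every newly assigned partition back to the beginning so the
    # full config state for that partition is rebuilt in memory.
    for p in partitions:
        p.offset = OFFSET_BEGINNING
    consumer.assign(partitions)

consumer.subscribe(["uptime-configs"], on_assign=on_assign)  # hypothetical topic

config_state = {}  # subscription_id -> latest config, built from the full log
while True:
    msg = consumer.poll(1.0)
    if msg is None or msg.error():
        continue
    config = json.loads(msg.value())
    config_state[config["subscription_id"]] = config
```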

As well as this, each check result we produce has a deterministic id: a hash of the subscription_id and the time the check was intended to run (not the time it actually ran!). That lets us de-duplicate results on the consumer, so we don't have to worry about double processing: during a deploy it's fine to run duplicate checks, since the duplicate results will be discarded.
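
A minimal sketch of that deterministic id; the field names and the specific hash (SHA-1 here) are assumptions, not the actual implementation:

```python
import hashlib
from datetime import datetime, timezone

def check_result_id(subscription_id: str, scheduled_at: datetime) -> str:
    # Uses the time the check was *intended* to run, never the actual run
    # time, so duplicate runs of the same check produce the same id.
    raw = f"{subscription_id}:{scheduled_at.isoformat()}"
    return hashlib.sha1(raw.encode("utf-8")).hexdigest()

# Two runs of the same scheduled check (e.g. one from each deploy) collide:
scheduled = datetime(2024, 6, 26, 12, 0, tzinfo=timezone.utc)
assert check_result_id("sub-123", scheduled) == check_result_id("sub-123", scheduled)
```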

Given all of this, we can avoid rebalancing during deploys using the following method:

  • Whenever we deploy, we generate a new random consumer group that is specific to that deploy (the commit id? the deploy id?). All consumers being brought up in that deploy start up using the new consumer group. Since this group is not shared with the current deploy, no rebalancing happens. Instead, the new deploy starts up, gets the last fully processed tick from our shared data store, and starts processing (a sketch of this follows the list). At this point we are duplicating checks, but this won't affect our results since we can de-duplicate them.
  • Once we've determined the new deploy is healthy, we kill the old deploy. Since it belongs to a separate consumer group, again no rebalance happens.
  • Boom, blue-green deploy.
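
Here's a sketch of the per-deploy group id, assuming the deploy exposes a commit/deploy id via an environment variable; the variable name, group prefix, and topic are all hypothetical:

```python
import os
from confluent_kafka import Consumer

deploy_id = os.environ.get("DEPLOY_ID", "local-dev")  # e.g. the commit sha

consumer = Consumer({
    "bootstrap.servers": "localhost:9092",
    # Unique per deploy: the new deploy never joins the old deploy's group,
    # so bringing it up (or tearing the old one down) triggers no rebalance.
    "group.id": f"uptime-checker-{deploy_id}",
    "enable.auto.commit": False,  # we still never commit offsets
})
consumer.subscribe(["uptime-configs"])

# On startup the new deploy reads the last fully processed tick from the
# shared data store and resumes from there (see the sketch further down).
```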

The only part of this that might be a little tricky is the shared store with the ticks. It could be a bit weird if both deploys are committing to the same store, but I think it's actually fine. Ticks always increase monotonically, and it's safe to repeat a tick, so the worst case is that one deploy overwrites the tick of another, and if for some reason we restart we replay that tick. We also won't be constantly reading the tick from the data store: it is tracked in memory and only really read on a deploy or rebalance. So even if the two deploys do overwrite each other, eventually one is killed, the other keeps writing ticks, and the store is consistent.
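
A minimal sketch of that tick store, assuming Redis purely for illustration (the key name is made up). Because ticks increase monotonically and repeating a tick is safe, a plain last-writer-wins write is acceptable even while two deploys overlap:

```python
import redis

r = redis.Redis(host="localhost", port=6379)

def load_last_tick() -> int:
    # Read once on deploy/rebalance; afterwards the tick lives in memory.
    raw = r.get("uptime:last_processed_tick")
    return int(raw) if raw is not None else 0

def commit_tick(tick: int) -> None:
    # Overwrites from the other deploy are harmless: worst case a restart
    # replays one tick, and replayed results are de-duplicated by id.
    r.set("uptime:last_processed_tick", tick)
```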
