
Only Allow Ready Hashring Replicas in Generated ConfigMap #89

Conversation

@michael-burt commented May 23, 2022

This PR adds a CLI flag (allow-only-ready-replicas) that filters pods which are not in the Ready condition out of the hashring configuration. I have been testing this at scale for about six weeks, and it does reduce the frequency of replication and forwarding errors in the hashring.
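(For illustration only, a minimal sketch of what filtering on the Ready condition looks like with the Kubernetes API types; this is not the PR's actual code, and the function names are hypothetical.)

```go
package hashring

import (
	corev1 "k8s.io/api/core/v1"
)

// podReady reports whether the pod's PodReady condition is True.
func podReady(pod *corev1.Pod) bool {
	for _, cond := range pod.Status.Conditions {
		if cond.Type == corev1.PodReady {
			return cond.Status == corev1.ConditionTrue
		}
	}
	return false
}

// filterReady keeps only Ready pods, e.g. before rendering their
// endpoints into the generated hashring ConfigMap.
func filterReady(pods []*corev1.Pod) []*corev1.Pod {
	ready := make([]*corev1.Pod, 0, len(pods))
	for _, p := range pods {
		if podReady(p) {
			ready = append(ready, p)
		}
	}
	return ready
}
```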

@matej-g (Contributor) commented Aug 22, 2022

Hey @michael-burt, sorry this did not get proper attention. I'd actually like to add this change. Do you have any objections?

@michael-burt (Author) commented Aug 23, 2022

Hi @matej-g, I have no objections to merging this; however, there is a big caveat. In high-churn environments where pods frequently get rescheduled, this PR causes the hashring configuration to change frequently. Each change triggers a redistribution of MTS across the hashring pods, which can lead to a large increase in memory consumption. This contributes to cascading failures: hashring pods get OOMKilled and removed from the hashring configuration, which triggers another redistribution of MTS, which puts further load on the remaining healthy pods and eventually causes them to be OOMKilled as well.

In the end, I found it was more stable to just run a static hashring config and remove this operator from our environment completely. I think we should include a caveat in the documentation describing these failure modes if we are going to merge this.

The failures I describe above were observed in an environment with ~50M MTS and 30-40 ingestor pods with a replication factor of 3.
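(For context, the static hashring configuration mentioned above is simply a hand-maintained hashrings.json for Thanos Receive rather than one generated by the controller. A minimal example with placeholder endpoint names:)

```json
[
  {
    "hashring": "default",
    "tenants": [],
    "endpoints": [
      "thanos-receive-0.thanos-receive.monitoring.svc.cluster.local:10901",
      "thanos-receive-1.thanos-receive.monitoring.svc.cluster.local:10901",
      "thanos-receive-2.thanos-receive.monitoring.svc.cluster.local:10901"
    ]
  }
]
```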

@matej-g (Contributor) commented Aug 24, 2022

Hey @michael-burt, thanks for your response. I have now looked at the code and I get what you're saying. My initial impression was that this change only affects cases where we scale up and add new replicas to the hashring, in which case I think it makes sense to wait for the replicas to be ready before they are added.

However, for the second case, where we check replica readiness on each sync, as you said I don't believe this is a good idea in general. As you pointed out, any intermittent issue where one or more receiver replicas go down would change the hashring, and frequent hashring changes lead to the problems you mentioned, which is exactly what we want to avoid. Even though there are some changes upstream (a more uniform algorithm for redistributing series in the hashring, as well as not flushing the TSDB), I still don't think removing replicas on every un-readiness is a good idea.

WDYT about only including the first change (on scale-up, wait for new replicas to be ready) but keeping the current behavior for existing replicas? I'd be happy to take over this PR, and possibly document this behavior a bit better as well.
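(A rough sketch of the compromise described here, reusing the podReady helper from the earlier sketch. endpointFor is an assumed helper that renders a pod's endpoint address; all names are illustrative only.)

```go
// nextEndpoints keeps every endpoint already present in the current
// hashring, even if its pod is momentarily not Ready, but admits a
// newly added pod only once it reports Ready.
func nextEndpoints(current map[string]bool, pods []*corev1.Pod) []string {
	endpoints := make([]string, 0, len(pods))
	for _, p := range pods {
		ep := endpointFor(p) // assumed helper, e.g. "<pod>.<service>.<ns>.svc:10901"
		if current[ep] || podReady(p) {
			endpoints = append(endpoints, ep)
		}
	}
	return endpoints
}
```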

@matej-g (Contributor) commented Aug 26, 2022

@michael-burt I'll take this over. I'll probably open a new PR but reuse some bits from this one, thank you!

@michael-burt (Author) commented
Sounds good, thanks @matej-g
