Options for geo-redundancy #3313
Mimir is designed for high availability and requires a low-latency connection within the cluster. High availability is typically achieved by running with the default replication factor of 3 and deploying Mimir across multiple availability zones within the same region. In my experience, cross-region redundancy is not a very common use case. That being said, assuming you really need cross-region redundancy (and you want full replication, including the object storage)...
I would consider starting with your first option (dual-writing from the agents to two independent clusters), which is the only battle-tested solution among the ones you listed. If a Mimir cluster is down, you can route queries to the other one. When the unhealthy cluster comes back, the agents will resume writing to it, so metrics are generally expected to reconcile (in that case, I would recommend enabling out-of-order ingestion with a time window configured to the maximum outage you want to handle, e.g. 12h).
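To make that concrete, here is a minimal sketch of the dual-write side, assuming Grafana Agent in static mode; the hostnames and the 12h window are illustrative placeholders, not tested values:

```yaml
# Grafana Agent (static mode): ship every scraped sample to both Mimir clusters.
metrics:
  wal_directory: /var/lib/agent/wal
  configs:
    - name: default
      remote_write:
        - url: https://mimir-region-a.example.com/api/v1/push
        - url: https://mimir-region-b.example.com/api/v1/push
```

And the matching out-of-order window on the Mimir side (a per-tenant limit), sized to the longest outage you want to be able to reconcile:

```yaml
# Mimir limits: accept samples up to 12h old so a recovered cluster can catch up.
limits:
  out_of_order_time_window: 12h
```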
Regarding your second option: the most recent data is kept in the ingesters, so the most recent metrics will not be available for querying from the "secondary" cluster. Also, with this approach you will not replicate the storage bucket to a different region, so you're not going to achieve full cross-region redundancy.
Regarding your third option (both clusters writing to the same bucket): this is uncharted territory. As far as I know, no one has ever battle tested it, so there may be unknowns I can't think of right now. For sure you would have to run the compactor in only one of the two clusters, but there may be more issues. Also, in this case you're not replicating the bucket.
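If you deploy with the mimir-distributed Helm chart, a minimal sketch of enforcing that single-compactor constraint (assuming the chart's per-component `replicas` value; verify against the chart version you run) is to scale the compactor to zero in the secondary cluster:

```yaml
# values-secondary.yaml (sketch): run no compactor in this cluster, so only the
# primary cluster's compactor touches the shared bucket.
compactor:
  replicas: 0
```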
---
I have a bit of a better understanding of what was going wrong now. The approaches I listed earlier resulted in multiple hash rings, causing all kinds of problems for the components that use the hash ring. For multi-region to work properly there needs to be a single, global hash ring.

What we are exploring now, and it looks promising, is a combination of a multi-k8s-cluster service mesh for networking and tweaking the zone awareness configuration. Thankfully the Helm chart was updated in the past month to simplify zone awareness configuration, but it still required some customizing.
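Roughly the shape of what we are experimenting with, assuming the mimir-distributed chart's `zoneAwareReplication` values and its `structuredConfig` passthrough; every name, zone, and address below is illustrative rather than a working configuration:

```yaml
# mimir-distributed values (sketch): treat each region as a "zone" and join a
# single memberlist gossip ring across both k8s clusters through the mesh.
ingester:
  zoneAwareReplication:
    enabled: true
    zones:
      - name: region-a                     # pods scheduled in cluster/region A
        nodeSelector:
          topology.kubernetes.io/region: us-east-1
      - name: region-b                     # pods scheduled in cluster/region B
        nodeSelector:
          topology.kubernetes.io/region: us-west-2
      # note: with the default replication factor of 3 you would normally want
      # at least three zones; two are shown here only to illustrate the shape.
mimir:
  structuredConfig:
    memberlist:
      join_members:
        # gossip endpoints in both clusters, reachable through the service mesh
        - mimir-gossip-ring.mimir.svc.cluster-a.example:7946
        - mimir-gossip-ring.mimir.svc.cluster-b.example:7946
```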
---
I'm looking for some guidance. I have a requirement for geo-redundancy and am wondering if there are any deployment options that would work. I have two k8s clusters in different regions and want queries to either cluster to return the same (or as close to the same as possible) results.
The options I have explored:

1. Two fully independent Mimir clusters, one per region, each with its own object storage, with the agents remote writing to both.
2. A "secondary" cluster that only reads the primary cluster's storage bucket, lowering `block_ranges_period` so blocks reach the bucket sooner. That causes increased write amplification, so I don't think this will scale well.
3. Pointing both clusters at the same object storage bucket. This results in `err-mimir-store-consistency-check-failed` errors, caused by what looks like missing blocks. From my understanding, this is to do with having two independent sets of compactors, but I'm not sure. Will disabling the compactors in one cluster solve all the problems here? Will the redundant data written by the two clusters be detected and de-duplicated?

I think option 3 seems the most promising, but I'm not sure it plays nicely with the various components and there are a lot of unanswered questions about its behaviour. Does anyone have any advice?
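For clarity, by "option 3" I mean both clusters configured with the same blocks storage, something along these lines (backend, endpoint, and bucket name are placeholders):

```yaml
# Mimir config shared by both clusters in option 3 (sketch).
blocks_storage:
  backend: s3
  s3:
    endpoint: s3.us-east-1.amazonaws.com
    bucket_name: mimir-blocks-shared
```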