Different RedisFailover's sentinels join together #550

Closed
tparsa opened this issue Jan 4, 2023 · 53 comments

@tparsa

tparsa commented Jan 4, 2023

Environment

How are the pieces configured?

  • Redis Operator version v1.2.2
  • Kubernetes version v1.23.13
  • Kubernetes configuration used (eg: Is RBAC active?)
affinity: {}
annotations: {}
container:
  port: 9710
fullnameOverride: ""
image:
  pullPolicy: IfNotPresent
  repository: quay.io/spotahome/redis-operator
  tag: v1.2.2
imageCredentials:
  create: false
  email: someone@example.com
  existsSecrets:
  - registrysecret
  password: somepassword
  registry: url.private.registry
  username: someone
monitoring:
  enabled: false
  prometheus:
    name: unknown
  serviceAnnotations: {}
  serviceMonitor: false
nameOverride: ""
nodeSelector: {}
replicas: 3
resources:
  limits:
    cpu: 100m
    memory: 128Mi
  requests:
    cpu: 100m
    memory: 128Mi
securityContext:
  readOnlyRootFilesystem: true
  runAsNonRoot: true
  runAsUser: 1000
service:
  port: 9710
  type: ClusterIP
serviceAccount:
  annotations: {}
  create: true
  name: ""
tolerations: []
updateStrategy:
  type: RollingUpdate
@tparsa tparsa changed the title Different RedisFailover sentinels join together Different RedisFailover's sentinels join together Jan 4, 2023
@samof76
Contributor

samof76 commented Jan 5, 2023

Do you mean that the sentinels from redis-failover A are joining sentinels from redis-failover B? In that case, can you please answer the following?

  1. Are the failovers running in different namespaces?
  2. Are the Redises configured with different ports?

@tparsa
Author

tparsa commented Jan 5, 2023

Yes, for example RedisFailover A's sentinels are joining RedisFailover B's sentinels. In some cases, because one Redis has a password and the other does not, the actual Redis pods cannot join together, so the data does not get combined.

  1. We witnessed it both within the same namespace and between namespaces.
  2. No, all of our Redises use the default port 6379.

@tparsa
Author

tparsa commented Jan 15, 2023

When this happens the operator doesn't have any way to fix it.
I checked the metric exposed on this line and the sentinel is unhealthy.
Furthermore, I checked the metric on this line as well and found out that the operator is sending a reset to the sentinels. So I tried to replicate the process manually and sent a SENTINEL RESET * to all 3 of my sentinels (the sentinels of RedisFailover A). Still, nothing happened, since there were 9 sentinels in total (3 for RedisFailover B and 3 for RedisFailover C, all 9 joined together).

To give you more context, we have over 90 RedisFailovers in our production cluster, and the cluster's nodes' CPUs are under heavy pressure. Sentinels from different RedisFailovers joining together is a huge problem for us; in our previous cluster we fixed this kind of problem by using NetworkPolicies to prevent different Redises and Sentinels from connecting to each other.

@raghu-nandan-bs
Contributor

@tparsa is there a deterministic way to reproduce the issue?
Also, do they all monitor the same master when this happens?

@tparsa
Author

tparsa commented Jan 15, 2023

I can't provide a way to reproduce the issue. But here is the sentinel.conf of one of the following sentinels:

rfs-click-redis-64cb975df9-sjc8h 2/2     Running                0               7d8h    10.0.38.85    c28-s17   <none>           <none>
rfs-click-redis-64cb975df9-vlnwc 2/2     Running                0               30h     10.0.26.26    c28-n15   <none>           <none>
rfs-click-redis-64cb975df9-wxvmz 2/2     Running                0               4d5h    10.0.42.144   c28-n22   <none>           <none>
> kubectl exec -it rfs-click-redis-64cb975df9-sjc8h -- sh
Defaulted container "sentinel" out of: sentinel, sentinel-exporter, sentinel-config-copy (init)
/data $ cat /redis/sentinel.conf
sentinel myid 5e8a4531d95ac54825fea8bd4babac46ff6f9429
sentinel deny-scripts-reconfig yes
sentinel monitor mymaster 10.0.60.6 6379 2

# Generated by CONFIG REWRITE
port 26379
user default on nopass ~* +@all
dir "/data"

sentinel down-after-milliseconds mymaster 5000
sentinel failover-timeout mymaster 10000
sentinel config-epoch mymaster 25
sentinel leader-epoch mymaster 0

sentinel known-replica mymaster 10.0.19.221 6379
sentinel known-replica mymaster 10.0.46.29 6379
sentinel known-replica mymaster 10.0.59.123 6379
sentinel known-replica mymaster 10.0.55.45 6379
sentinel known-replica mymaster 10.0.42.137 6379
sentinel known-replica mymaster 10.0.56.109 6379
sentinel known-replica mymaster 10.0.35.189 6379
sentinel known-replica mymaster 10.0.42.195 6379
sentinel known-replica mymaster 10.0.39.186 6379
sentinel known-replica mymaster 10.0.38.253 6379
sentinel known-replica mymaster 10.0.29.215 6379
sentinel known-replica mymaster 10.0.8.220 6379
sentinel known-sentinel mymaster 10.0.33.88 26379 10037ad576a264e5ae6eb2c45efe6c0ad2704296
sentinel known-sentinel mymaster 10.0.55.146 26379 ade9feaff56c96dc7a8a6e3dda9e3a34420d41db
sentinel known-sentinel mymaster 10.0.40.77 26379 3c97521013d35fb69d5291f993e28d3b23e67da1
sentinel known-sentinel mymaster 10.0.42.144 26379 0752b9327a04e18dc72dbb65a34f3263265b5774
sentinel known-sentinel mymaster 10.0.35.4 26379 1bd1d0e46cb85a8c39db17cecee2303a65f41ae2
sentinel known-sentinel mymaster 10.0.9.51 26379 1825f95bf43d033613f990930012c177533d8dc9
sentinel known-sentinel mymaster 10.0.45.62 26379 55bf4c72c16f9829dbf1adfd1abf36f283bb7517
sentinel known-sentinel mymaster 10.0.26.26 26379 a157b90702bdaf027423b2c096602c35e1196d36
sentinel current-epoch 26

The replica count of this RedisFailover is 3 sentinels and 5 Redises, but as you can see there are 12 (4 × 3) replicas and 8 other sentinels, which indicates that 3 different RedisFailovers have joined together.

@raghu-nandan-bs
Contributor

@tparsa do all the sentinels point to the same master in their config?

@tparsa
Author

tparsa commented Jan 16, 2023

I assume they do, because what I see in the monitoring is that only one of the three RedisFailovers has a master, though sometimes there is a glitch where they do have masters.
The following images show which Redis is the master. Dark green is master and light green is slave.
[three screenshots: per-RedisFailover master/replica status panels]

@zekena2

zekena2 commented Jan 16, 2023

I'm trying to upgrade to a redis-operator version newer than 1.0 so that we can upgrade Kubernetes to 1.22, and something similar happened. I tried the latest version first; the new operator chooses Redis 6.2 by default, so all Redis replicas and sentinels were changed at the same time. After they settled down, some replicas didn't get synced into any cluster, and I ended up with two big clusters with 36 sentinels and 27 replicas. When I downgraded the operator to 1.1.1, the replicas that weren't ready became ready, but I still have the same situation: two big clusters with 36 sentinels and even more replicas.
Note: the operator lives in a different namespace than the failovers, and all failover objects live in one namespace.

@zekena2

zekena2 commented Jan 16, 2023

I still have this situation going on in a test env, and I have no idea how to fix it except by removing the RedisFailovers and applying them again one by one, but I'm not sure that will help. I'm now looking for a better way to do the migration; if anyone can suggest something I'd be thankful.

@raghu-nandan-bs
Contributor

@zekena2 In your test env, could you please confirm whether all the unruly sentinels point to the same master? Also, could you please share the operator logs for the same?

kubectl logs -f <operator-pod> | grep "<affected redis failover object name>"

@raghu-nandan-bs
Contributor

Also, which version of the operator are you using currently?

@zekena2

zekena2 commented Jan 16, 2023

Currently I'm using v1.2.4. Using the command below, I can confirm that almost all sentinels point to the same master, except for three from the same failover and some random ones that point to localhost or to an IP that doesn't belong to any pod.

kubectl get pods | awk '$1 ~ /^rfs/ { print $1 }' | xargs -I {} -n 1 sh -c "echo \"{}\" && kubectl exec -it pod/{} -c sentinel -- redis-cli -p 26379 sentinel get-master-addr-by-name mymaster"

These are the logs for one of the failovers. I also get the same logs mentioned in this issue.
redisoperator.log

@raghu-nandan-bs
Contributor

Thanks for the details @zekena2

@raghu-nandan-bs
Contributor

@zekena2 are the failovers running in bootstrap mode?

@raghu-nandan-bs
Contributor

Also, can you please confirm whether you also see the issue in v1.2.3?

@raghu-nandan-bs
Contributor

@zekena2 could you please share logs with debug mode enabled?

@zekena2

zekena2 commented Jan 16, 2023

We don't use any bootstrapping.
I'm trying v1.2.3 but I can't see any improvements.
How can I enable debug mode?

@raghu-nandan-bs
Contributor

@zekena2 thanks for checking; please try --log-level=debug
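
In case it helps, the flag goes on the operator container. The manifest below is only a trimmed, illustrative sketch of an operator Deployment (name, labels, service account, and image tag are assumptions, not taken from your setup); if you installed via the Helm chart, the equivalent is whatever values key the chart maps onto the container args:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: redisoperator
spec:
  replicas: 1
  selector:
    matchLabels:
      app: redisoperator
  template:
    metadata:
      labels:
        app: redisoperator
    spec:
      serviceAccountName: redisoperator
      containers:
        - name: redisoperator
          image: quay.io/spotahome/redis-operator:v1.2.4
          # Enable verbose logging for the operator.
          args:
            - --log-level=debug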

@tparsa
Author

tparsa commented Jan 16, 2023

What I saw in the operator metrics indicates that the operator does realize there is a problem with the number of sentinels. But the only fix the operator attempts is sending a SENTINEL RESET *, which doesn't fix anything.

@raghu-nandan-bs
Contributor

@tparsa please do share debug logs if you have the setup running.

@zekena2

zekena2 commented Jan 17, 2023

Here are the debug logs.

redisoperator.log

@tparsa
Author

tparsa commented Jan 17, 2023

@raghu-nandan-bs I'm afraid of restarting the operator :)) It might cause more sentinels to join together. It can be catastrophic for us.

@zekena2

zekena2 commented Jan 17, 2023

I can confirm that deleting and re-creating the failovers fixes the problem, but this isn't really a good solution or an upgrade strategy.

@tparsa
Author

tparsa commented Jan 17, 2023

@zekena2 Deleting all sentinel pods will fix the problem as well.

@zekena2

zekena2 commented Jan 17, 2023

@tparsa I can confirm that it helps. Thanks for the tip.

@EladDolev

EladDolev commented Feb 7, 2023

we suffer from the exact same issue
one Redis cluster may be working just fine, while other clusters are joining together to form one big cluster

using K8s 1.23 and redis-operator v1.2.4

on some K8s clusters we have ~10 RedisFailovers which are working perfectly fine
all of our instances are always in the same namespace

@tparsa
Author

tparsa commented Feb 7, 2023

We solved our problem by using NetworkPolicies that deny all connections except the ones we need.

But there is a problem somewhere; this shouldn't have to be the ultimate solution.
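
For anyone who wants to apply the same mitigation, here is a rough sketch of the kind of NetworkPolicy we mean, for a hypothetical RedisFailover named click-redis. The label keys and values are assumptions based on the operator's defaults, so verify them against your rfr-/rfs- pods (kubectl get pods --show-labels), and note that a real deployment also needs rules for application clients and monitoring:

apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: isolate-click-redis
spec:
  # Applies to the redis and sentinel pods of this failover only
  # (assumed labels; adjust to what your pods actually carry).
  podSelector:
    matchLabels:
      app.kubernetes.io/name: click-redis
      app.kubernetes.io/part-of: redis-failover
  policyTypes:
    - Ingress
  ingress:
    # Allow only pods of the same failover (and the operator) to reach
    # the redis (6379) and sentinel (26379) ports; everything else is dropped.
    - from:
        - podSelector:
            matchLabels:
              app.kubernetes.io/name: click-redis
              app.kubernetes.io/part-of: redis-failover
        - podSelector:
            matchLabels:
              app.kubernetes.io/name: redis-operator
      ports:
        - port: 6379
        - port: 26379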

@raghu-nandan-bs
Contributor

raghu-nandan-bs commented Feb 10, 2023

@tparsa @EladDolev I checked this issue again today and tried to forcefully reproduce it by setting all redis instances and sentinel instances to consider a different failover's instance as master. That way, the redis operator ends up in a state where it won't fix the issue. While we can try updating the operator code to mitigate this, it would be really nice to understand how the RedisFailovers ended up in this state in the first place...

@EladDolev if you are still having this issue, could you please help by giving some info:

  • are there any custom redis/sentinel configurations used?
  • when did you start seeing this issue? Did you restart the pods, following which the issue occurred?

@zekena2

zekena2 commented Feb 10, 2023

I'm still facing this issue whenever we do a node rotation for upgrades; we've put an alert in place so we can mitigate it manually for now. In my case the custom config is only these two:

customConfig:
  - "down-after-milliseconds 2000"
  - "failover-timeout 3000"

So I guess deleting multiple different sentinels at the same time causes it?

@zekena2

zekena2 commented Feb 11, 2023

1. I have 16 failovers, each with 3 sentinels and 3 replicas, for a total of 96 pods, all running in the same namespace.
2. The address pool has 1140 free IP addresses, and it's mostly like that all the time, but the aws-vpc-cni reuses an IP address that was recently freed by another deleted pod in order to save time and API calls.
3. It can happen that a Redis and a sentinel land on the same node, but I didn't investigate whether that was the case when we faced the issue of sentinels joining each other.

@zekena2

zekena2 commented Feb 11, 2023

Regarding sentinels and replicas being on the same node: I checked in KSM, and one of the clusters that recently joined together did have a sentinel and a replica on the same node before the node rotation, but the other cluster didn't.

@EladDolev

hi @raghu-nandan-bs

are there any custom redis/sentinel configurations used?

sure. we're using some custom configurations
for redis

    - appendonly no
    - hz 100
    - save ""
    - maxmemory 0
    - maxmemory-policy noeviction

for sentinel

    - down-after-milliseconds 3000
    - failover-timeout 120000

we're running on GKE, and we're currently seeing this issue happen on a cluster that runs on spot instances
the way we spread the workloads, each cluster's pods end up on different nodes, but one worker node is likely to contain both a replica and a sentinel of the same cluster, plus pods that belong to a different RedisFailover

@raghu-nandan-bs
Contributor

Thanks for sharing the details @EladDolev, it will take some time to test and come back with code changes, will update here...

@zekena2

zekena2 commented Feb 13, 2023

I think the operator should move away from using IP addresses and utilize the Redis 7.0 feature of using hostnames instead; that would eliminate a lot of the possible bugs, but it will probably require a huge rewrite.

@samof76
Contributor

samof76 commented Feb 18, 2023

The best thing to do in this case is to use custom ports for individual deployments, and keep them unique per failover deployment. We do this consciously in our redis failover deployments.

The other way is, instead of using mymaster for every sentinel, to use a unique master name per deployment. This would require a code change.

As I see it, this issue cannot be fixed by any code change on the operator side, since it is an inherent property of sentinels to remember their Redises. Your best shot currently is to use a custom, unique port. But if you are using the default port, there is the second option, where a code change is required to replace mymaster with a unique name per RedisFailover deployment.
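
To illustrate the first option, a per-failover spec could look roughly like the following. This assumes the installed RedisFailover CRD exposes a port field under spec.redis (check your CRD version; the field placement is an assumption based on the description above), and the name and port value are just examples:

apiVersion: databases.spotahome.com/v1
kind: RedisFailover
metadata:
  name: click-redis
spec:
  redis:
    # Assumed field: a port unique to this failover, so a pod IP reused by
    # another failover will not be answering on the port this one expects.
    port: 6380
    replicas: 3
  sentinel:
    replicas: 3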

@samof76
Contributor

samof76 commented Feb 18, 2023

On second thought, we should fix the mymaster thing in any case.

@tparsa
Author

tparsa commented Feb 18, 2023

This does not help in our case.
Pods were not restarted, so a reused IP was not the problem for us.

@tparsa
Author

tparsa commented Feb 18, 2023

Different ports will fix it, I'm not saying they won't, but we're still not finding the root cause.
In our case, at least the RedisFailover that lost its master (and joined another RedisFailover) didn't have any pod IP changes.

@zekena2

zekena2 commented Feb 18, 2023

I see this issue cannot be fixed by any code change on the operator side as the sentinels inherent property is to remember it's redis.

I'm pretty sure I never saw this issue in v1.0; a regression definitely happened somewhere.

@tparsa
Author

tparsa commented Feb 19, 2023

One way that fixes the issue, though not a good one, is to delete all sentinel pods and simultaneously set a master in the RedisFailover. Then, once the sentinels are running again, the operator will point them at the right pod to monitor.

@mpihlak
Contributor

mpihlak commented Mar 4, 2023

We just recently had a case where two sentinels "joined together".

There was some event on Kubernetes that caused a bunch of Redis and Sentinel pods to reschedule (ie. a node was scaled down). During rescheduling one of the Redis pod IP addresses was reused by a Redis pod in another cluster.

In the sentinel logs for cluster A, there was this:

1:X 01 Mar 2023 10:24:39.931 * +fix-slave-config slave 10.178.10.102:6379 10.178.10.102 6379 @ mymaster 10.178.13.24 6379

Here the master 10.178.13.24 belongs to cluster A, but the redis pod IP 10.178.10.102 belongs to cluster B. It had belonged to cluster A just a minute ago, but the pod was rescheduled and got a new IP. At the same time, a pod in another cluster was also rescheduled and was assigned the IP 10.178.10.102.

There was also a corresponding log entry in the Redis Operator logs that points this to SetMasterOnAll:

time="2023-03-01T10:24:39Z" level=info msg="Making pod rfr-foo-redis-2 slave of 10.178.13.24" namespace=default redisfailover=foo-redis service=redis.healer src="checker.go:198"

(note: should really log the IP there as well).

Looking at SetMasterOnAll, it looks like it is open to race conditions -- if the pod IP addresses change between fetching them and calling MakeSlaveOfWithPort, it could do the wrong thing there. I can imagine this happening more often when Kubernetes API calls take a long time to complete (i.e. an overloaded API server).

It seems that setting different auth credentials for different clusters would mitigate this. Possibly also using a different master name instead of mymaster everywhere.
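
For the first mitigation, giving every RedisFailover its own credentials can be sketched like this using the operator's auth.secretPath field (the secret name and password value here are illustrative); a replica or operator pointed at the wrong pod would then fail to authenticate instead of silently re-slaving it:

apiVersion: v1
kind: Secret
metadata:
  name: click-redis-auth
type: Opaque
stringData:
  password: a-unique-password-per-failover  # illustrative; generate one per failover
---
apiVersion: databases.spotahome.com/v1
kind: RedisFailover
metadata:
  name: click-redis
spec:
  auth:
    # Each failover references its own secret, so cross-failover
    # replication attempts fail to authenticate.
    secretPath: click-redis-auth
  redis:
    replicas: 3
  sentinel:
    replicas: 3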

@zekena2

zekena2 commented Mar 6, 2023

It became easier to reproduce thanks to @mpihlak's discovery. We need the Calico CNI in order to reuse the IP addresses consistently, via the IPPool feature.

Start the cluster with minikube using this command
minikube start --network-plugin=cni --cni=calico --kubernetes-version=v1.24.11 --driver=docker

Install the operator whichever way you want. I simply ran helm install redis-operator redis-operator/redis-operator after adding the repo.

Apply the failovers and the IPPools.

apiVersion: databases.spotahome.com/v1
kind: RedisFailover
metadata:
  name: foo
spec:
  sentinel:
    replicas: 3
    resources:
      requests:
        cpu: 100m
      limits:
        memory: 100Mi
  redis:
    podAnnotations:
      'cni.projectcalico.org/ipv4pools': '["new-pool1"]'
    replicas: 3
    resources:
      requests:
        cpu: 100m
        memory: 100Mi
      limits:
        cpu: 400m
        memory: 500Mi
---
apiVersion: databases.spotahome.com/v1
kind: RedisFailover
metadata:
  name: bar
spec:
  sentinel:
    replicas: 3
    resources:
      requests:
        cpu: 100m
      limits:
        memory: 100Mi
  redis:
    podAnnotations:
      'cni.projectcalico.org/ipv4pools': '["new-pool2"]'
    replicas: 3
    resources:
      requests:
        cpu: 100m
        memory: 100Mi
      limits:
        cpu: 400m
        memory: 500Mi
---
apiVersion: crd.projectcalico.org/v1
kind: IPPool
metadata:
  name: new-pool1
spec:
  blockSize: 30
  cidr: 10.0.0.0/30
  ipipMode: Never
  natOutgoing: true
---
apiVersion: crd.projectcalico.org/v1
kind: IPPool
metadata:
  name: new-pool2
spec:
  blockSize: 30
  cidr: 10.0.0.4/30
  ipipMode: Never
  natOutgoing: true

Now change the annotations in the failovers to point to the other IPPool

apiVersion: databases.spotahome.com/v1
kind: RedisFailover
metadata:
  name: foo
spec:
  sentinel:
    replicas: 3
    resources:
      requests:
        cpu: 100m
      limits:
        memory: 100Mi
  redis:
    podAnnotations:
      'cni.projectcalico.org/ipv4pools': '["new-pool2"]'
    replicas: 3
    resources:
      requests:
        cpu: 100m
        memory: 100Mi
      limits:
        cpu: 400m
        memory: 500Mi
---
apiVersion: databases.spotahome.com/v1
kind: RedisFailover
metadata:
  name: bar
spec:
  sentinel:
    replicas: 3
    resources:
      requests:
        cpu: 100m
      limits:
        memory: 100Mi
  redis:
    podAnnotations:
      'cni.projectcalico.org/ipv4pools': '["new-pool1"]'
    replicas: 3
    resources:
      requests:
        cpu: 100m
        memory: 100Mi
      limits:
        cpu: 400m
        memory: 500Mi
---
apiVersion: crd.projectcalico.org/v1
kind: IPPool
metadata:
  name: new-pool1
spec:
  blockSize: 30
  cidr: 10.0.0.0/30
  ipipMode: Never
  natOutgoing: true
---
apiVersion: crd.projectcalico.org/v1
kind: IPPool
metadata:
  name: new-pool2
spec:
  blockSize: 30
  cidr: 10.0.0.4/30
  ipipMode: Never
  natOutgoing: true

The operator will start restarting the Redis replicas, and the sentinels will start to join each other as the IP addresses are switched between the two clusters. Note that it happens most of the time but not every time (for me it happened 4 out of 5 times), which further suggests that there is a race condition.

@githubixx

Hi! As I was thinking about using this operator I came across this issue, and I was immediately reminded of the problems I had getting this "fixed" on Google Cloud while running Redis Sentinel clusters on VMs. We have around 20 of these running, and from time to time you of course want to upgrade or install OS updates. Since we do immutable infrastructure, we don't update VMs in place (via Ansible, Chef, Puppet, ...); we completely replace them.

All our Redis clusters have three instances. So when we update such a cluster, we shut down one Redis node (which always contains one Redis and one Sentinel process). We build our OS image with HashiCorp Packer. Our automation picks up the latest OS image, then a new VM starts and re-joins the cluster. The same then happens with the other two remaining nodes: we first replace the two replicas and finally the primary. Before shutting down the primary we issue a primary failover.

As long as you do this with only one Redis Sentinel cluster, it normally works fine. But since the update process runs in parallel, all 20 Redis Sentinel clusters are recreated at the same time. All of that works just fine too, but in the beginning we also discovered that some nodes suddenly tried to join other clusters during that process. All clusters have a different redisMasterName configured, and therefore the join of course failed. To move those nodes back to the cluster they belong to, we had to manually clean the configuration and rejoin the node.

We tried a lot of workarounds, but nothing really worked reliably, as Redis/Sentinel always "remembered" the IP addresses of the old (now gone) nodes. And that's actually the problem: during the recreation of the Redis Sentinel clusters it was likely that one node got an internal IP address that had previously been used by a node belonging to a different Redis Sentinel cluster.

So our "solution" was to give every VM node fixed internal IP addresses. So they don't change IP addresses as the IP address is allocated once then is then always assigned to the VM that used it before. That "fixed" this issue once and forever 😃

But AFAIK you can't do this in Kubernetes. So from what I've read so far in this thread, using NetworkPolicies or different ports for every Redis Sentinel deployment might be possible workarounds. Since Redis 6.2 you can also use hostnames instead of IP addresses: https://redis.io/docs/management/sentinel/#ip-addresses-and-dns-names Since Kubernetes has DNS autodiscovery, this might be a more general solution to this problem.

@github-actions

This issue is stale because it has been open for 45 days with no activity.

@github-actions github-actions bot added the stale label Apr 22, 2023
@github-actions

github-actions bot commented May 7, 2023

This issue was closed because it has been inactive for 14 days since being marked as stale.

@github-actions github-actions bot closed this as not planned May 7, 2023
@tparsa
Author

tparsa commented May 11, 2023

Any updates on this?

@zekena2

zekena2 commented Aug 21, 2023

I wanted to check whether the new 1.3.0-rc1 might solve it, but nothing has changed; clusters still join each other. @ese Can you please take a look at this bug before the stable release?

indiebrain added a commit to powerhome/redis-operator that referenced this issue Dec 14, 2023
In #4, we added a NetworkPolicy. The intent was to prevent the Redis
and/or Sentinel pods from differing RedisFailovers from joining up with
one another (See:
spotahome#550). These policies
have proven to be too coarse-grained. We end up deploying supplemental
NetworkPolicies to allow ingress traffic.

This change narrows the scope of the NetworkPolicies managed by the
operator. One policy allows traffic to the Redis node pods ONLY on the
redis port and monitoring port for traffic originating from within the
namespace. The other policy allows traffic to the Sentinel pods ONLY on
the sentinel port for traffic originating from within the namespace. All
other traffic to these pods will be dropped.

Connections to the HAProxy pods - IE access to the redis master node -
will now be allowed by default. This achieves the goals of #4 and allows
us to stop littering additional NetworkPolicies to allow external
communication with a Redis instance.
@dbackeus

dbackeus commented Jan 16, 2024

We just had this happen as well. Three sentinel clusters joined together, possibly connected to some flaky nodes. Luckily these were all staging deployments; otherwise this would have caused critical errors and data loss.

@sapisuper

For everyone facing this problem, I suggest moving from this redis-operator to the opstree redis-operator. In opstree you have the ability to bind a sentinel to a specific Redis server (redis-replication):

    redisReplicationName : [name redis-server]
    masterGroupName: [set sentinelName default mymaster]

This configuration saved us. Cheers 😃
Doc reference : https://ot-redis-operator.netlify.app/docs/getting-started/sentinel/

@dbackeus

dbackeus commented Aug 6, 2024

I think a good solution for this would be to migrate from using IP addresses to hostnames for both the Sentinel and Redis instances, i.e. use replica-announce-ip <hostname> for Redis instances and sentinel announce-ip <hostname> for Sentinel instances. This would imply starting to use a StatefulSet for running sentinels as well.

Hostname support is documented at: https://redis.io/docs/latest/operate/oss_and_stack/management/sentinel/#ip-addresses-and-dns-names

The issues around environments that swap IP addresses around are sort of documented at https://redis.io/docs/latest/operate/oss_and_stack/management/sentinel/#sentinel-docker-nat-and-possible-issues, but that page focuses on Docker environments rather than Kubernetes.

We are currently working on a DIY, hostname-based Helm template approach to replace this operator, after having had a critical data loss in our production environment due to Sentinels getting mixed up and telling random Redis instances from different clusters to start following each other.
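
For reference, a hostname-based setup would hinge on sentinel settings along these lines. This is only a sketch expressed through the existing customConfig mechanism; whether the operator's reconciliation (which writes IPs into sentinel monitor) and its config handling would actually cooperate with these options is exactly the open question, and the announce value below is a placeholder, not something the operator templates today:

apiVersion: databases.spotahome.com/v1
kind: RedisFailover
metadata:
  name: foo
spec:
  sentinel:
    replicas: 3
    customConfig:
      # Redis >= 6.2: allow sentinels to resolve and announce hostnames
      # instead of bare IPs.
      - "resolve-hostnames yes"
      - "announce-hostnames yes"
  redis:
    replicas: 3
    customConfig:
      # Each replica would also need a stable, per-pod DNS name to announce,
      # which is why a StatefulSet for sentinels/replicas is suggested above.
      - "replica-announce-ip <stable-pod-hostname>"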

@githubixx

I can recommend the Bitnami Redis Helm chart: https://artifacthub.io/packages/helm/bitnami/redis It also supports Redis Sentinel and works very well for us with 20+ Redis Sentinel deployments in a Kubernetes cluster. The chart handles various issues you can face with Redis Sentinel very well. They have quite a lot of Helm charts for other services too: https://github.com/bitnami/charts/tree/main/bitnami

@sapisuper

@githubixx Correct, it's better to use Bitnami Redis. I ran into an issue with OT-REDIS-OPERATOR regarding failover; the failover doesn't run smoothly.
