This repository has been archived by the owner on Feb 22, 2022. It is now read-only.

[stable/redis] fix sentinel restart problems #21168

Closed · wants to merge 8 commits
Conversation

@btrappe commented Mar 2, 2020

Is this a new chart

no

What this PR does / why we need it:

Which issue this PR fixes

This is a renewal of PR #19464, which was closed due to inactivity after successfully passing the automated test bed. There seems to be some interest in this change, which is why I am resubmitting it as an updated PR.

About the reason for my changes:
I tested the HA behaviour of the Redis cluster and the Sentinel-aware functionality of our services that use the Redis/Sentinel cluster. To simulate failovers as they would happen in production, I simply deleted the slave or the master pod and found some abnormalities.

1. Sentinel on the master pod doesn't start up if only the sentinel container on the master pod is restarted (commit 911d2d6)

Interestingly, the redis chart creates a non-symmetrical set of pods: there is one "master" pod that initially owns the Redis master node, and (optionally) several "slave" pods that initially own a set of Redis (and sentinel) nodes in slave mode. As long as the master pod is left untouched, scaling slave pods up/down and even recreating suddenly deleted slave pods works as intended (at least at first glance), because new slave pods can reach a Redis node via the master service. Only when the master pod is recreated does it face the problem of deciding whether it is the first pod of a new cluster (initial cluster startup) or whether there are remains of a running cluster that has lost its master pod. To address this, sameersbn added some lines in commit 1adb31c (chart version 8.0.16) that ask an existing sentinel for its list of sentinel IPs and append them directly to sentinel.conf. That works perfectly on fresh pod creation (no existing sentinel.conf, so the pod starts from a copy of a template sentinel.conf), but it leads to a persistent "error in sentinel configuration at line 21" (the line number may vary) when only the sentinel container is restarted. You can easily simulate this by attaching to the sentinel container on the master pod and killing the sentinel process with "kill 1": Kubernetes immediately restarts the container in place (without wiping the folders), so sameersbn's code appends the well-known sentinels a second time (they are still in sentinel.conf from the previously running sentinel process). The resulting duplicate lines make the sentinel process fail with the configuration error instead of starting up. One solution would be to always start with a fresh sentinel.conf, but it is worth preserving sentinel.conf whenever possible, as it always contains the most current state of all sentinel cluster nodes. That's why I changed sameersbn's code so that it runs only if no previous well-known sentinel node list is found in sentinel.conf. A minimal sketch of that guard is shown below.
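For illustration, a minimal sketch of that guard, assuming the Bitnami config path used by this chart; the grep marker and the `$SENTINEL_PORT`/`$MASTER_SET` variables are placeholders for how Sentinel persists its peers and for the templated chart values, not the literal diff:

```bash
SENTINEL_CONF=/opt/bitnami/redis-sentinel/etc/sentinel.conf

# Only pull the peer list from a running sentinel when this container does not
# already carry one from a previous run; appending it a second time produces
# duplicate lines and Sentinel aborts with a configuration error.
if ! grep -q '^sentinel known-sentinel' "$SENTINEL_CONF"; then
    existing_sentinels=$(redis-cli --raw -h "$REDIS_MASTER_HOST" -a "$REDIS_PASSWORD" \
        -p "$SENTINEL_PORT" SENTINEL sentinels "$MASTER_SET")
    echo "$existing_sentinels" | awk -f /health/parse_sentinels.awk | tee -a "$SENTINEL_CONF"
fi
```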

2. Reset the sentinel list after the master pod is gone (commit 9be66eb)

While dealing with the problems of 1. and reading https://redis.io/topics/sentinel, I stumbled over the note "Sentinels never forget already seen Sentinels". In a statically configured environment this has no implications, since nodes are restarted/recreated with their former IP addresses. On Kubernetes, however, every pod (even after in-place updates or partial container restarts) always gets a new IP address. This causes the known-sentinel list in sentinel.conf to grow larger and larger. Worse, at some point the Kubernetes cluster will recycle unused IP addresses, and if a foreign Redis/Sentinel cluster gets such an address, two independent clusters will (at least partially) be merged, which leads to unexpected results. I observed exactly this problem on our Kubernetes cluster, which runs several environments/stages of our microservice architecture and therefore multiple Redis clusters in different namespaces, and at the time I could not explain why some sentinels reported Redis nodes from a foreign Redis cluster.
The fix is simple: just before starting the sentinel, I fork a shell that waits 120 s (a fixed value, long enough for the sentinels to get to know each other), then asks the (newly started) sentinel on the master pod for all known sentinels, and finally resets all returned sentinels as well as the sentinel on the master pod. That should work, since by then all sentinels should follow the same Redis master node, from which they can acquire the current sentinel list. A rough sketch of the forked helper follows below.
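Roughly, the forked helper looks like this (a simplified sketch rather than the exact diff; it assumes the chart's /health/parse_sentinels_reset.awk helper emits one `redis-cli ... SENTINEL reset ...` command per discovered peer, and `$SENTINEL_PORT`/`$MASTER_SET` stand in for the templated chart values):

```bash
# Forked into the background just before starting redis-sentinel: after a
# fixed grace period, ask the freshly started local sentinel for its peer
# list, reset every peer, and finally reset the local sentinel itself so
# that IP addresses of deleted/recreated pods are forgotten.
(
    sleep 120
    redis-cli --raw -h localhost -a "$REDIS_PASSWORD" -p "$SENTINEL_PORT" \
        SENTINEL sentinels "$MASTER_SET" \
        | awk -f /health/parse_sentinels_reset.awk \
        | sh
    redis-cli --raw -h localhost -a "$REDIS_PASSWORD" -p "$SENTINEL_PORT" \
        SENTINEL reset "$MASTER_SET"
) &
```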

3. Reset sentinels after a Helm upgrade (commit baa5b79)

Finally, the IP address problem described in 2. also occurs when upgrading the Helm chart, because pods are potentially rebuilt. Since it is not guaranteed that the master service pod is recreated, nor that the whole upgrade takes less than 120 s, I added a Job that is hooked into the Helm chart install/upgrade events and executes the same procedure as described in 2. (reset all sentinels); a sketch of such a hook follows below.
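The hook Job looks roughly like the following (a sketch under stated assumptions, not the template from the commit: the metadata name, the image, and how the awk helper is mounted are placeholders, only the helm.sh/hook annotations are standard Helm hook syntax, and `$SENTINEL_PORT`/`$MASTER_SET` again stand in for templated values):

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: redis-sentinel-reset                 # placeholder name
  annotations:
    "helm.sh/hook": post-install,post-upgrade
    "helm.sh/hook-delete-policy": hook-succeeded
spec:
  template:
    spec:
      restartPolicy: OnFailure
      containers:
        - name: sentinel-reset
          image: docker.io/bitnami/redis:latest   # placeholder image
          command: ["sh", "-c"]
          args:
            - |
              # Ask the well-known sentinel behind the master service for its
              # peer list, reset each peer, then reset that sentinel itself.
              redis-cli --raw -h "$REDIS_MASTER_HOST" -a "$REDIS_PASSWORD" -p "$SENTINEL_PORT" \
                SENTINEL sentinels "$MASTER_SET" | awk -f /health/parse_sentinels_reset.awk | sh
              redis-cli --raw -h "$REDIS_MASTER_HOST" -a "$REDIS_PASSWORD" -p "$SENTINEL_PORT" \
                SENTINEL reset "$MASTER_SET"
```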

Special notes for your reviewer:

Checklist


  • DCO signed
  • Chart Version bumped
  • Title of the PR starts with chart name (e.g. [stable/mychartname])

btrappe and others added 5 commits March 1, 2020 22:10
If only the sentinel container on the master svc pod crashes/restarts, the existing sentinels were added twice to sentinel.conf, which leads to an "error in configuration at line <something around 21>"

Signed-off-by: btrappe <btrappe@web.de>
Signed-off-by: Bernd Trappe <btrappe@web.de>
If the redis Helm chart is upgraded, pods may be restarted after an in-place container upgrade and get new IP addresses. Sentinels never forget IP addresses they have already seen. This job resets the sentinel lists after Helm chart updates.

Signed-off-by: btrappe <btrappe@web.de>
Signed-off-by: Bernd Trappe <btrappe@web.de>
Signed-off-by: Bernd Trappe <btrappe@web.de>
Signed-off-by: Bernd Trappe <btrappe@web.de>
Signed-off-by: Bernd Trappe <btrappe@web.de>
@helm-bot helm-bot added the Contribution Allowed If the contributor has signed the DCO or the CNCF CLA (prior to the move to a DCO). label Mar 2, 2020
@k8s-ci-robot k8s-ci-robot added the needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. label Mar 2, 2020
@k8s-ci-robot (Contributor)

Hi @btrappe. Thanks for your PR.

I'm waiting for a helm member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot (Contributor)

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: btrappe
To complete the pull request process, please assign tompizmor
You can assign the PR to them by writing /assign @tompizmor in a comment when ready.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@btrappe btrappe changed the title [stable-redis] fix sentinel restart problems [stable/redis] fix sentinel restart problems Mar 2, 2020
@miguelaeh (Collaborator) left a comment


Thank you very much for the PR!
I left some comments.
Apart from that, I would like to share that we are working on a major Redis change that will use the Redis Cluster topology, so Sentinel won't be needed anymore. For this reason, I am interested in whether these changes break backward compatibility. If these changes are a major change too, we would like to avoid merging them.

@@ -1,6 +1,6 @@
 apiVersion: v1
 name: redis
-version: 10.5.6
+version: 10.5.7
Collaborator

This should be at least a minor version as you are adding a new feature. If this PR is going to break the backward compatibility it should be a major change.

@DaveOHenry (Contributor) commented Mar 4, 2020

This doesn't look like a new feature to me. It fixes some annoying bugs by using functionality that basically already exists in this chart.

Contributor

I stand corrected. Looks like resetting sentinels was never a "feature" of this chart.
The only difference I observed while using this patched version is that there are no longer any failed pods. Without the changes from this PR we had a randomly failing pod every 2-3 days. The upgrade process worked fine, too.

@btrappe (Author)

That's exactly my opinion. I don't want to introduce any features (the Helm chart already promises to provide an HA Redis cluster, otherwise it wouldn't need the sentinels); I just want to use the chart without the mentioned problems, which I suppose a cluster should keep away from me as a cluster consumer.

- -c
- |
$(timeout -s 9 {{ .Values.sentinel.initialCheckTimeout }} redis-cli --raw -h $REDIS_MASTER_HOST -a \"$REDIS_PASSWORD\" -p {{ .Values.sentinel.service.sentinelPort }} SENTINEL sentinels {{ .Values.sentinel.masterSet }} | awk -f /health/parse_sentinels_reset.awk)
redis-cli --raw -h $REDIS_MASTER_HOST -a \"$REDIS_PASSWORD\" -p 26379 SENTINEL reset {{ .Values.sentinel.masterSet }}
Collaborator

Do not hardcode the port here.

@btrappe (Author)

Fixed in d7c69fb

existing_sentinels=$(timeout -s 9 {{ .Values.sentinel.initialCheckTimeout }} redis-cli --raw -h {{ template "redis.fullname" . }} -a "$REDIS_PASSWORD" -p {{ .Values.sentinel.service.sentinelPort }} SENTINEL sentinels {{ .Values.sentinel.masterSet }})
echo "$existing_sentinels" | awk -f /health/parse_sentinels.awk | tee -a /opt/bitnami/redis-sentinel/etc/sentinel.conf
fi
bash -c "sleep 120; $(timeout -s 9 {{ .Values.sentinel.initialCheckTimeout }} redis-cli --raw -h localhost -a \"$REDIS_PASSWORD\" -p {{ .Values.sentinel.service.sentinelPort }} SENTINEL sentinels {{ .Values.sentinel.masterSet }} | awk -f /health/parse_sentinels_reset.awk) echo; redis-cli --raw -h localhost -a \"$REDIS_PASSWORD\" -p 26379 SENTINEL reset {{ .Values.sentinel.masterSet }}" &
Collaborator

Why is this sleep needed?

@btrappe (Author)

I know that sleeps are generally ugly. The problem is, however, that I have to ping (i.e. reset) all sentinel nodes. To get a full list of all (at least potential) sentinels, I ask a well-known sentinel: the one I just started on localhost. There are two cases: full pod creation with an empty/initial configuration, and a pod restart with configuration from before. The 120-second sleep ensures that my freshly started sentinel has enough time to communicate and negotiate with the other cluster members and can deliver a complete list of potential sentinels, even when the pod is newly created without any prior information from the master pod. On the other hand, I don't expect another master restart within 120 seconds (which actually shouldn't occur at all, of course).

@btrappe (Author) commented Mar 4, 2020

> For this reason, I am interested in whether these changes break backward compatibility. If these changes are a major change too, we would like to avoid merging them.

I don't expect any problems regarding backward compatibility: no new features are introduced, only bugs in the HA node creation process in the case of node/pod failure are fixed, and no configuration values were added or changed.

…entinel-job: wait for master node becoming ready

Signed-off-by: Bernd Trappe <btrappe@web.de>
@helm-bot helm-bot added the size/M Denotes a PR that changes 30-99 lines, ignoring generated files. label Mar 9, 2020
@btrappe btrappe requested a review from miguelaeh March 9, 2020 09:54
stale bot commented Apr 8, 2020

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Any further update will cause the issue/pull request to no longer be considered stale. Thank you for your contributions.

@stale stale bot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Apr 8, 2020
@DaveOHenry (Contributor)

ping

@stale stale bot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Apr 8, 2020
@carrodher (Collaborator)

Sorry, we missed the notification about this PR, we are going to re-review it. Thanks!

@miguelaeh (Collaborator) left a comment

Hi,
I am sorry for the delay; we didn't receive the notification about your comments.
It looks good to me, except for the REDIS_PASSWORD env var: redis-cli recognizes the REDISCLI_AUTH env var, so it is not necessary to use the -a option, which is insecure. In any case, as the full chart is using REDIS_PASSWORD, it is okay to keep it here to be consistent and to create another PR to change that across the full chart.

LGTM!
Let's wait for the CI/CD.
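For context, a minimal illustration of the REDISCLI_AUTH suggestion above (host, port, and master-set name are placeholders, not values from the chart):

```bash
# Insecure: the password is passed on the command line and shows up in the
# process list (redis-cli prints a warning unless --no-auth-warning is given).
redis-cli -h redis-master -p 26379 -a "$REDIS_PASSWORD" SENTINEL sentinels mymaster

# Preferred: redis-cli reads the password from the REDISCLI_AUTH environment variable.
export REDISCLI_AUTH="$REDIS_PASSWORD"
redis-cli -h redis-master -p 26379 SENTINEL sentinels mymaster
```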

@miguelaeh (Collaborator)

Hi again,
Given the stable deprecation timeline, this Bitnami-maintained Helm chart is now located at bitnami/charts. Please visit the bitnami/charts GitHub repository to create issues or PRs.
In this issue we tried to explain the reasons and motivations behind this transition more carefully; please don't hesitate to add a comment there if you have any questions related to the migration itself.
I am sorry for the noise; we didn't realize that before.

@zanhsieh (Collaborator)

/ok-to-test

@k8s-ci-robot k8s-ci-robot added ok-to-test and removed needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Apr 14, 2020
@helm-bot helm-bot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. and removed size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Apr 14, 2020
@helm-bot helm-bot added size/M Denotes a PR that changes 30-99 lines, ignoring generated files. and removed size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Apr 14, 2020
Signed-off-by: Bernd Trappe <btrappe@web.de>
stale bot commented May 15, 2020

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Any further update will cause the issue/pull request to no longer be considered stale. Thank you for your contributions.

@stale stale bot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label May 15, 2020
@DaveOHenry (Contributor)

ping

@stale stale bot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label May 15, 2020
stale bot commented Jun 14, 2020

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Any further update will cause the issue/pull request to no longer be considered stale. Thank you for your contributions.

@stale stale bot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jun 14, 2020
stale bot commented Jun 28, 2020

This issue is being automatically closed due to inactivity.

@stale stale bot closed this Jun 28, 2020
existing_sentinels=$(timeout -s 9 {{ .Values.sentinel.initialCheckTimeout }} redis-cli --raw -h {{ template "redis.fullname" . }} -a "$REDIS_PASSWORD" -p {{ .Values.sentinel.service.sentinelPort }} SENTINEL sentinels {{ .Values.sentinel.masterSet }})
echo "$existing_sentinels" | awk -f /health/parse_sentinels.awk | tee -a /opt/bitnami/redis-sentinel/etc/sentinel.conf
fi
bash -c "sleep 120; $(timeout -s 9 {{ .Values.sentinel.initialCheckTimeout }} redis-cli --raw --no-auth-warning -h localhost -a \"$REDIS_PASSWORD\" -p {{ .Values.sentinel.service.sentinelPort }} SENTINEL sentinels {{ .Values.sentinel.masterSet }} | awk -f /health/parse_sentinels_reset.awk) echo; redis-cli --raw --no-auth-warning -h localhost -a \"$REDIS_PASSWORD\" -p {{.Values.sentinel.porti }} SENTINEL reset {{ .Values.sentinel.masterSet }}" &


If anyone like me is looking for this fix because of Sentinel issues in this chart version: note the typo on this line, ".Values.sentinel.porti" -> ".Values.sentinel.port".

@btrappe (Author)

Oh thanks - that is in fact a typo. Unfortunately, none of the maintainers wants to merge it anymore: #21168 (comment)
I'll try to reopen this PR with the corrected port variable name, but I'm afraid the whole chart is really deprecated.

@btrappe (Author)

This repo will be closed the day after tomorrow, so it makes no sense to update my PR, but thanks for mentioning this error!

Labels: Contribution Allowed · lifecycle/stale · ok-to-test · size/M
8 participants