Issue with StatefulSet Rolling Update Strategy #180
Comments
I think you would have the same issue even if you used a Deployment; the new pods would crash-loop anyway. The way we solved this: we have our own implementation of the Kubernetes strategy that polls the k8s API and only joins nodes of the same version (taken from a version label) into a cluster. This makes it impossible to hand off state between application versions, but it makes sure that code that was never tested to co-live in a cluster can't end up crashing in production.
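For illustration, a minimal sketch of what such a version-gated topology could look like. The strategy module `MyApp.Cluster.VersionedKubernetes` and the `APP_VERSION` environment variable are assumptions, not part of libcluster:

```elixir
# config/runtime.exs (illustrative only)
import Config

config :libcluster,
  topologies: [
    cogynt: [
      # Hypothetical custom strategy that only connects same-version nodes;
      # not a module shipped with libcluster.
      strategy: MyApp.Cluster.VersionedKubernetes,
      config: [
        kubernetes_node_basename: "cogynt",
        kubernetes_selector: "app=cogynt",
        # assumed to be injected into the pod and also set as a `version` label
        version: System.get_env("APP_VERSION")
      ]
    ]
  ]
```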
@AndrewDryga do you think this custom k8s strategy is worth a PR, or can it be shared? Because I wonder how more people are not running into this same issue. Is everyone else using this library only ever deploying libcluster once and then never adding new features to its registry from that point forward?
@amacciola the problem with our strategy is that it is very opinionated (it uses specific labels named for our environment, node names from k8s labels, etc.). I will think about open-sourcing it, but it's a relatively easy change: just leverage
@AndrewDryga okay, I will try this. So if I am trying to extend the k8s DNS strategy here: you are suggesting that I need to tweak libcluster/lib/strategy/kubernetes_dns.ex Line 107 in 5240d23
to additionally query for a specific version, or at least only for matching version numbers?
@amacciola you can't extract that information from the DNS server; instead you should modify that function in lib/strategy/kubernetes.ex. The k8s API returns a lot of information about each pod, including its labels, which is where you should store the version.
@AndrewDryga I see. So it's changing libcluster/lib/strategy/kubernetes.ex Lines 232 to 252 in 5240d23
to include additional params so that only pods matching a certain version are returned?
@amacciola yes, you want to query for pods and return only the ones that match your current version.
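As a rough sketch of that idea (not the actual libcluster code; the module name, `version` label, and helper below are assumptions), the pod query against the Kubernetes API could simply append the current version to the label selector, so only same-version pods ever come back:

```elixir
defmodule MyApp.Cluster.VersionedPods do
  @moduledoc "Sketch: list only the pod IPs whose `version` label matches ours."

  @token_path "/var/run/secrets/kubernetes.io/serviceaccount/token"

  # base_selector e.g. "app=cogynt", version e.g. "1.2.3"
  def same_version_pod_ips(namespace, base_selector, version) do
    selector = URI.encode_www_form("#{base_selector},version=#{version}")
    token = @token_path |> File.read!() |> String.trim()

    url =
      ~c"https://kubernetes.default.svc/api/v1/namespaces/#{namespace}/pods?labelSelector=#{selector}"

    headers = [{~c"authorization", ~c"Bearer #{token}"}]

    # TLS options omitted for brevity; a real strategy would pass the cluster CA cert.
    case :httpc.request(:get, {url, headers}, [], []) do
      {:ok, {{_, 200, _}, _resp_headers, body}} ->
        body
        |> IO.iodata_to_binary()
        |> Jason.decode!()
        |> Map.get("items", [])
        |> Enum.map(&get_in(&1, ["status", "podIP"]))
        |> Enum.reject(&is_nil/1)

      other ->
        {:error, other}
    end
  end
end
```

In the approach discussed above, the equivalent change would live inside libcluster's Kubernetes strategy itself; this standalone module only illustrates the selector trick.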
@AndrewDryga I am working on testing this new strategy out now, so thanks for the insight. But I just wanted to make sure I understood how some of the Libcluster code combined with the Horde registry works under the hood. If we have 3 pods running for the same application, each of these pods has, let's say,
If we then trigger an update for Does it just pick a process_id from one of the 3 pods to try and start the new service on? So you will have a 2-in-3 chance it tries to start the new service on a pod with
I'm not using Horde, but the pods with version 1 would not see pods with version 2 in the Erlang cluster, so basically, for each of the islands (one per version), everything would behave like it's a cluster with the same codebase. If you have globally unique jobs, it also means that you will have two instances of each worker started (one per island).
For now I have just created a separate Horde registry for each GenServer we want to leverage the Libcluster strategies with. As long as we don't have too many, it's a minor annoyance to work around this issue.
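For reference, a rough sketch of that workaround as I read it (all module names here are illustrative, and `members: :auto` assumes a recent Horde version): one Horde.Registry / Horde.DynamicSupervisor pair per distributed GenServer, so adding a new worker adds a new pair rather than a new child in a shared supervisor that old-version pods don't know about.

```elixir
# In the application supervision tree (illustrative names):
children = [
  {Horde.Registry, name: Cogynt.CustomFieldsRegistry, keys: :unique, members: :auto},
  {Horde.DynamicSupervisor,
   name: Cogynt.CustomFieldsSupervisor, strategy: :one_for_one, members: :auto}
  # ...one Registry/DynamicSupervisor pair per distributed GenServer...
]

Supervisor.start_link(children, strategy: :one_for_one, name: Cogynt.Supervisor)
```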
Precursor:
Currently all of our applications are deployed with StatefulSets rather than Deployments. The current updateStrategy of our StatefulSets is RollingUpdate. Here is an explanation of what it does and the other option we have:
Issue:
The combination of rolling updates && Libcluster is making it so that we can never add new services to the libcluster/Horde registry, because during a rolling update we will have a mix of pods running the old code and pods running the new code in the same cluster.
Or at least, that is what I think is happening here. For the most part I think I have the issue right, and the error message on the pod that is crashing is
Even though I know the version on that pod has the code for
Cogynt.Servers.Workers.CustomFields.start_link
so it must be referring to one of the other 2 pods that had not gotten the new version yet. Has anyone else ever run into this problem?
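To make the failure mode concrete (the supervisor name below is assumed; the worker module is the one from the report above): when the new release starts the new worker with a call like this, Horde's distribution strategy may place the child on a member node that is still running the old release, where Cogynt.Servers.Workers.CustomFields is not loaded, so the start crashes there.

```elixir
# Hypothetical call on a new-version pod; Horde may place the child on an
# old-version pod that doesn't have this module yet.
Horde.DynamicSupervisor.start_child(
  Cogynt.DistributedSupervisor,
  {Cogynt.Servers.Workers.CustomFields, []}
)
```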