Rolling update problem #175
Comments
Hi @teodor-pripoae :)
How are you initialising Ringpop? Are you providing an Identity argument? If you are, then it's possible you are affected by #146. Otherwise, if you are using the built-in identity provider, then it's not possible for ringpop to bootstrap without TChannel listening (it will throw an error).
Hi, I'm initializing it using the `GossipAddr` from our config. It is printed as the first line in the logs above and it is equal to the pod IP.

```go
ch, err := tchannel.NewChannel("avcache-proxy", nil)
if err != nil {
	return nil, errors.Trace(fmt.Errorf("channel did not create successfully: %v", err))
}

rp, err := ringpop.New("avcache",
	ringpop.Channel(ch),
	ringpop.Identity(config.GossipAddr),
	ringpop.Logger(bark.NewLoggerFromLogrus(log.LogrusLogger())),
)
```
Are you calling `ListenAndServe` on the TChannel before `Bootstrap`?
Hi, I'm running `ListenAndServe` on the TChannel first. Here is all the code for my server:

```go
type Server struct {
	ringpop      *ringpop.Ringpop
	channel      *tchannel.Channel
	proxy        *proxy.Proxy
	peerProvider discovery.DiscoverProvider
}

func NewPeer(proxyServer *proxy.Proxy) (*Server, error) {
	ch, err := tchannel.NewChannel("avcache-proxy", nil)
	if err != nil {
		return nil, errors.Trace(fmt.Errorf("channel did not create successfully: %v", err))
	}

	rp, err := ringpop.New("avcache",
		ringpop.Channel(ch),
		ringpop.Identity(config.GossipAddr),
		ringpop.Logger(bark.NewLoggerFromLogrus(log.LogrusLogger())),
	)
	if err != nil {
		return nil, errors.Trace(fmt.Errorf("unable to create Ringpop: %v", err))
	}

	server := &Server{
		channel:      ch,
		ringpop:      rp,
		proxy:        proxyServer,
		peerProvider: config.PeerProvider(),
	}

	if err := server.RegisterProxy(); err != nil {
		return nil, errors.Trace(fmt.Errorf("could not register proxy handler: %v", err))
	}

	if err := server.channel.ListenAndServe(config.GossipAddr); err != nil {
		return nil, errors.Trace(fmt.Errorf("could not listen on given hostport: %v", err))
	}

	opts := new(swim.BootstrapOptions)
	opts.DiscoverProvider = server.peerProvider

	if _, err := server.ringpop.Bootstrap(opts); err != nil {
		return nil, errors.Trace(fmt.Errorf("ringpop bootstrap failed: %v", err))
	}

	return server, nil
}

func (w *Server) RegisterProxy() error {
	hmap := map[string]interface{}{"/proxy": w.ProxyHandler}
	return json.Register(w.channel, hmap, func(ctx context.Context, err error) {
		log.Debugf("error occurred: %v", err)
	})
}
```
It behaves the same with 3 nodes after a rolling restart. I think when one pod is stopped, the Kubernetes DNS still returns its IP, so when a fresh pod starts it will also get a few old IPs in the peer list. Then it may try to connect to the stopped pods and fail. Am I right?
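If stale DNS records are the cause, one common mitigation is to re-resolve the headless service on every `Hosts()` call rather than resolving once at startup, so bootstrap retries eventually see fresh pod IPs. Below is a minimal sketch of such a provider; the type name, service name, port, and the injected resolver are all illustrative assumptions (in a real setup the resolver would be `net.LookupHost` and the type would satisfy ringpop's discover-provider interface):

```go
package main

import (
	"fmt"
	"net"
)

// dnsProvider is a hypothetical discover-provider-style host source: it
// re-resolves a Kubernetes headless service on every Hosts() call, so a
// retrying bootstrap eventually sees fresh pod IPs instead of stale ones.
// The lookup function is injected so the logic can run without real DNS.
type dnsProvider struct {
	service string
	port    string
	lookup  func(host string) ([]string, error)
}

func (p *dnsProvider) Hosts() ([]string, error) {
	ips, err := p.lookup(p.service)
	if err != nil {
		return nil, err
	}
	hosts := make([]string, 0, len(ips))
	for _, ip := range ips {
		hosts = append(hosts, net.JoinHostPort(ip, p.port))
	}
	return hosts, nil
}

func main() {
	p := &dnsProvider{
		service: "avcache.default.svc.cluster.local", // hypothetical service name
		port:    "7946",                              // hypothetical gossip port
		// Fake resolver standing in for net.LookupHost in this sketch.
		lookup: func(string) ([]string, error) {
			return []string{"10.1.0.5", "10.1.0.7"}, nil
		},
	}
	hosts, _ := p.Hosts()
	fmt.Println(hosts) // [10.1.0.5:7946 10.1.0.7:7946]
}
```

Because the lookup happens per call, a bootstrap loop that polls the provider picks up new pod IPs as soon as the DNS records converge.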
@teodor-pripoae It sounds very interesting to run a ringpop application on Kubernetes, but we haven't done so in the past. To try to reproduce your issues during rolling restarts, I created an example on a separate branch with a raw README.md of the commands I ran to set up the cluster and execute a rolling upgrade. Unfortunately, I was not able to reproduce the problems you described.

If it is a DNS caching issue, I would not expect it to be a problem in a 3-node setup. From my example and testing in the linked branch, I observed that Kubernetes scales the running application up to a minimum of 4 pods during the upgrade of a 3-node replication controller. Even if old addresses showed up in the cached DNS results, that would still allow a new node to connect to 3 existing nodes.

Does the linked example represent your setup, or do you have a slightly different one? Can you make changes to the example to reproduce the problems you experienced during the rolling restart? That would make debugging a lot easier for me. For now my best guess would be that you run into conflicts of the number of hosts returned by your
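The maintainer's arithmetic above can be made concrete with a toy model. This is deliberately not ringpop's actual join algorithm, just an illustration of the point that stale DNS records add failed dials but do not block a join, while too few reachable peers does:

```go
package main

import "fmt"

// canJoin models, purely illustratively, whether a fresh node can satisfy a
// bootstrap join requirement: DNS advertises `advertised` peers, but only
// `reachable` of them are still alive.
func canJoin(advertised, reachable, joinSize int) bool {
	// A node can only ever join as many peers as are actually reachable,
	// regardless of how many stale records DNS advertises.
	_ = advertised // stale records cost failed dials but do not block the join
	return reachable >= joinSize
}

func main() {
	// Upgrade of a 3-node replication controller: DNS advertises 4 records,
	// 3 pods are alive, and a join requirement of 3 is still satisfiable.
	fmt.Println(canJoin(4, 3, 3)) // true
	// A 2-node cluster where the single surviving peer is unreachable.
	fmt.Println(canJoin(2, 0, 1)) // false
}
```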
Hi,
I'm experiencing some issues when doing a rolling update over my ringpop cluster.
I'm running the cluster on top of Kubernetes with a headless service for peer communication. Every DNS query to this service returns a list of all ringpop IPs in the cluster.
I implemented the Kubernetes host provider like this:
During a rolling update, old ringpop services are stopped one by one and new ringpop services are created with different IPs. When a new ringpop service starts, it may see a mix of old and new IPs in the hosts list.
I'm running 2 instances in the cluster right now; one simply fails to start:
The other one attempts to connect to the first and fails all the periodic health checks.
Eventually, the first pod times out, is restarted by the cluster manager, and successfully connects to the second pod.
Is it related to #146?