SRV records do not appear to be reloaded by agent before reconnecting to broker #1167

Open
sshipway opened this issue Mar 15, 2021 · 3 comments

@sshipway

Running the Choria 0.16.0 RPM on CentOS 7 with MCollective plugins, installed by the Puppet module.

We recently moved our Choria brokers: we added SRV records for the new hosts and removed the old ones. When the old brokers were shut down, forcing agents to reconnect to another available broker, the agents did not pick up the new SRV records. They kept trying to connect only to the old hosts, and failed because those hosts were gone.

Restarting the agents caused the SRV records to be re-read, and the agents then connected to the new brokers as expected.

Choria agents should not cache the SRV records for the broker addresses, but should instead re-query DNS each time they attempt to reconnect, in case of changes to the list of brokers.
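The behaviour requested above can be sketched roughly as follows. This is a minimal illustration, not Choria's actual code: the function names (`brokerURLs`, `connectWithFreshSRV`, `changingLookup`), the `dial` callback, the `mcollective-server` service name, and port 4222 are all assumptions made for the example. The point is only that the SRV lookup happens inside the retry loop rather than once before it.

```go
package main

import (
	"errors"
	"fmt"
	"net"
	"strings"
)

// errUnreachable is returned when no broker accepted a connection.
var errUnreachable = errors.New("no broker reachable")

// srvLookup has the same signature as net.LookupSRV, so a fake resolver
// can be swapped in for demonstrations.
type srvLookup func(service, proto, name string) (string, []*net.SRV, error)

// brokerURLs queries SRV records and turns each answer into a nats:// URL.
// The service name "mcollective-server" is an assumption for this sketch.
func brokerURLs(lookup srvLookup, domain string) ([]string, error) {
	_, records, err := lookup("mcollective-server", "tcp", domain)
	if err != nil {
		return nil, err
	}
	urls := make([]string, 0, len(records))
	for _, r := range records {
		host := strings.TrimSuffix(r.Target, ".")
		urls = append(urls, fmt.Sprintf("nats://%s:%d", host, r.Port))
	}
	return urls, nil
}

// connectWithFreshSRV re-queries DNS before every attempt instead of
// reusing the answer from an earlier attempt. A real agent would also
// sleep or back off between attempts; that is omitted here.
func connectWithFreshSRV(lookup srvLookup, dial func(url string) error, domain string, attempts int) (string, error) {
	for i := 0; i < attempts; i++ {
		urls, err := brokerURLs(lookup, domain)
		if err != nil {
			continue // DNS failed this round; try again
		}
		for _, u := range urls {
			if dial(u) == nil {
				return u, nil
			}
		}
	}
	return "", errUnreachable
}

// changingLookup is a fake resolver: each call returns the next set of
// targets, repeating the last set once the list is exhausted.
func changingLookup(answers ...[]string) srvLookup {
	call := 0
	return func(_, _, _ string) (string, []*net.SRV, error) {
		if len(answers) == 0 {
			return "", nil, errUnreachable
		}
		idx := call
		if idx >= len(answers) {
			idx = len(answers) - 1
		}
		call++
		records := make([]*net.SRV, 0, len(answers[idx]))
		for _, target := range answers[idx] {
			records = append(records, &net.SRV{Target: target, Port: 4222})
		}
		return "", records, nil
	}
}

func main() {
	// DNS names broker 1 on the first query and broker 2 afterwards.
	lookup := changingLookup(
		[]string{"broker1.example.net."},
		[]string{"broker2.example.net."},
	)
	// Broker 1 has been shut down; only broker 2 accepts connections.
	dial := func(url string) error {
		if url == "nats://broker2.example.net:4222" {
			return nil
		}
		return errUnreachable
	}
	url, err := connectWithFreshSRV(lookup, dial, "example.net", 3)
	fmt.Println(url, err)
}
```

Outside of a demonstration, `net.LookupSRV` from the Go standard library has the same signature as `srvLookup` and could be passed in directly.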

To duplicate

  • Create cluster of 2 choria brokers.
  • Set SRV records to identify only broker 1
  • Start choria agent; see it connects to broker 1
  • Restart choria broker 1; see that agent is disconnected, and reconnects to broker 1 when it is available
  • Change SRV record to point to broker 2 (only)
  • Shut down broker 1 service
  • Observe that choria agent restarts and attempts to connect to broker 1 but not to broker 2, indicating that it is not rechecking the SRV records defined

Possibly connected to the DNS TTL for the domain?

@ripienaar
Member

Agreed, this is a known issue. We use the NATS Go package, which does not let us update the server names like that :(

I will have a chat with the authors and see if there is something we can do, but as it stands it’s a bit orthogonal: the nats package doesn’t support SRV at all, so Choria does the lookup and configures the package when it first connects, but on reconnect we have no way to do so.
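To illustrate why the stale list survives a reconnect, here is a simplified model of the pattern described above, written against the Go standard library only. It does not use the real NATS package or Choria code: `fixedListClient`, `newClient`, `reconnectTarget`, and `dnsZone` are hypothetical names, and the service name and port are assumptions. The lookup runs once at startup, and reconnects cycle through whatever that one lookup returned.

```go
package main

import (
	"fmt"
	"net"
	"strings"
)

// fixedListClient is a hypothetical stand-in for a messaging client that
// is handed its server list once and never asks for it again.
type fixedListClient struct {
	servers []string
	next    int
}

// newClient performs the SRV lookup exactly once, at startup, and stores
// the resulting URLs in the client.
func newClient(lookup func(service, proto, name string) (string, []*net.SRV, error), domain string) (*fixedListClient, error) {
	_, records, err := lookup("mcollective-server", "tcp", domain)
	if err != nil {
		return nil, err
	}
	c := &fixedListClient{}
	for _, r := range records {
		host := strings.TrimSuffix(r.Target, ".")
		c.servers = append(c.servers, fmt.Sprintf("nats://%s:%d", host, r.Port))
	}
	return c, nil
}

// reconnectTarget returns the next server to try. It only cycles through
// the list captured at startup; DNS is never consulted here.
func (c *fixedListClient) reconnectTarget() string {
	if len(c.servers) == 0 {
		return ""
	}
	s := c.servers[c.next%len(c.servers)]
	c.next++
	return s
}

// dnsZone is a fake, mutable DNS answer used only for this demonstration.
type dnsZone struct{ target string }

func (z *dnsZone) lookup(_, _, _ string) (string, []*net.SRV, error) {
	return "", []*net.SRV{{Target: z.target, Port: 4222}}, nil
}

func main() {
	zone := &dnsZone{target: "broker1.example.net."}
	client, err := newClient(zone.lookup, "example.net")
	if err != nil {
		panic(err)
	}

	// The SRV record is changed after the client has started.
	zone.target = "broker2.example.net."

	// The client still offers only the server it learned at startup.
	fmt.Println(client.reconnectTarget())
}
```

In this model the change to `zone.target` is invisible to the client, which matches the symptom in the report: only constructing a new client (restarting the agent) triggers a fresh lookup.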

@ripienaar
Member

In your exact scenario I could perhaps improve the situation: nats can learn about new hosts on its own. So when you expanded the cluster to 2 nodes, the connected agents could have learned about the new node for reconnect purposes, but I don’t think we enable that behaviour.

Anyway, it’s a long-standing pain. I will ask the authors again whether we can do something.

@sshipway
Author

Thanks for the info on this. I know it's a bit of an edge case, but it's a definite caveat that you need to be aware of when migrating.
