Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

After upgrading Nomad to 1.4.2, services perpetually register in consul (/PUT to local consul agent) #15265

Closed
usovamaria opened this issue Nov 16, 2022 · 5 comments · Fixed by #15411
Assignees
Labels
Milestone

Comments

@usovamaria
Copy link

Nomad version and environment details

Nomad 1.1.2 -> 1.4.2
Consul 1.9.5
Ubuntu 20.04

Environment configuration:
Servers: 3 nomad servers, consul servers (services register here) + consul servers with ACL.
Clients: we've installed nomad client and consul agent.

Issue

After upgrading Nomad from 1.1.2 to 1.4.2 we saw that the rate of services registration in consul (using /PUT to consul agent) significantly increased according to metrics and logs.

We use metric rate(consul_http_PUT_v1_agent_service_register_count) in order to monitor consul agent API usage. Here is the graph based on this metric:

Screen Shot 2022-11-16 at 13 25 55

Also we've noticed that consul logs amount related to registering service increased too, e.g.:

Nov 16 13:29:18 hostname consul[48446]: 2022-11-16T13:29:18.585+0300 [DEBUG] agent.http: Request finished: method=PUT url=/v1/agent/service/register from=[::1]:23594 latency=7.206663ms.

Log amount obviously strongly correlates to the number of service containers on a host.

In contrast, our production environment, where nomad was not upgraded, the /PUT rate to consul agent is normal (0-0.1 per minute).

We do not experience performance issues on a current environment (with ~3K running containers) but we are going to update a production environment (with ~6K) and we can't predict an impact on consul and nomad performance. The only thing we noticed is that the number of goroutines on consul servers increased (~30K -> ~80K).

We assume that Nomad and Consul interaction during the process of service registration has changed.

Reproduction steps

  1. Bootstrap an environment with Nomad 1.1.2, Consul 1.9.5.
  2. Setup up monitoring of /PUT rate to a local consul agent (consul_http_PUT_v1_agent_service_register_count) and log rate in DEBUG mode.
  3. Run containers via Nomad.
  4. Update Nomad servers and clients to 1.4.2 according to official documentation.
  5. See the change in monitoring.

Expected Behaviour

Register rate of services remains the same or increases insignificantly after Nomad update

@shoenig shoenig added this to the 1.4.3 milestone Nov 16, 2022
@shoenig
Copy link
Member

shoenig commented Nov 16, 2022

Thanks for reporting, @usovamaria

We assume that Nomad and Consul interaction during the process of service registration has changed.

Yeah, ironically we were trying to fix a case (#14917) where Nomad would continuously re-register services into Nomad, but it seems it's broken elsewhere now.

Can you post the services block from a job definition that is experiencing the issue? That will help us track down the cause.

@shoenig shoenig self-assigned this Nov 16, 2022
@usovamaria
Copy link
Author

Hi @shoenig.

Yep, we've seen this issue so we were amazed by the /PUT rate graph and log rate.

There's a services section of an output on nomad job inspect command for a typical job:

                        "Services": [
                            {
                                "Address": "",
                                "AddressMode": "driver",
                                "CanaryMeta": null,
                                "CanaryTags": null,
                                "CheckRestart": null,
                                "Checks": [
                                    {
                                        "AddressMode": "driver",
                                        "Advertise": "",
                                        "Args": null,
                                        "Body": "",
                                        "CheckRestart": null,
                                        "Command": "",
                                        "Expose": false,
                                        "FailuresBeforeCritical": 0,
                                        "GRPCService": "",
                                        "GRPCUseTLS": false,
                                        "Header": null,
                                        "InitialStatus": "",
                                        "Interval": 1000000000,
                                        "Method": "",
                                        "Name": "servicename-http",
                                        "OnUpdate": "require_healthy",
                                        "Path": "/ping",
                                        "PortLabel": "",
                                        "Protocol": "",
                                        "SuccessBeforePassing": 0,
                                        "TLSSkipVerify": false,
                                        "TaskName": "servicename-task",
                                        "Timeout": 2000000000,
                                        "Type": "http"
                                    }
                                ],
                                "Connect": null,
                                "EnableTagOverride": false,
                                "Meta": null,
                                "Name": "servicename-http",
                                "OnUpdate": "require_healthy",
                                "PortLabel": "3000",
                                "Provider": "consul",
                                "TaggedAddresses": null,
                                "Tags": [
                                    some tags
                                ],
                                "TaskName": "servicename-task"
                            },
                            {
                                "Address": "",
                                "AddressMode": "driver",
                                "CanaryMeta": null,
                                "CanaryTags": null,
                                "CheckRestart": null,
                                "Checks": [
                                    {
                                        "AddressMode": "driver",
                                        "Advertise": "",
                                        "Args": null,
                                        "Body": "",
                                        "CheckRestart": null,
                                        "Command": "",
                                        "Expose": false,
                                        "FailuresBeforeCritical": 0,
                                        "GRPCService": "",
                                        "GRPCUseTLS": false,
                                        "Header": null,
                                        "InitialStatus": "",
                                        "Interval": 1000000000,
                                        "Method": "",
                                        "Name": "servicename-monitoring",
                                        "OnUpdate": "require_healthy",
                                        "Path": "/ping",
                                        "PortLabel": "",
                                        "Protocol": "",
                                        "SuccessBeforePassing": 0,
                                        "TLSSkipVerify": false,
                                        "TaskName": "servicename-task",
                                        "Timeout": 2000000000,
                                        "Type": "http"
                                    }
                                ],
                                "Connect": null,
                                "EnableTagOverride": false,
                                "Meta": null,
                                "Name": "servicename-monitoring",
                                "OnUpdate": "require_healthy",
                                "PortLabel": "81",
                                "Provider": "consul",
                                "TaggedAddresses": null,
                                "Tags": [
                                    some tags
                                ],
                                "TaskName": "servicename-task"
                            }
                        ],

@shoenig
Copy link
Member

shoenig commented Nov 22, 2022

Hey @usovamaria, so @jrasell and I both tried and failed to reproduce what you're seeing. To continue investigating we just shipped #15311 (trace logging around Consul service registrations) which should help pinpoint the underlying problem.

If you have a chance, could you update one of the Nomad clients to 1.4.3 and enable trace logging.

log_level = "TRACE"

Once we have the output from that we should be able to understand what's going on.

@tgross tgross modified the milestones: 1.4.3, 1.4.x Nov 22, 2022
@kemko
Copy link

kemko commented Nov 23, 2022

Hi, @shoenig.

We upgraded nomad to 1.4.3 and enabled tracing.

It seems like nomad continuously tries to replace the lan_ipv6 field with wan_ipv6 with the same IP address, but task stays immutable:

Nov 23 13:15:19 xxx nomad[725234]:     2022-11-23T13:15:19.631+0300 [TRACE] consul.sync: registrations different: id=_nomad-task-xxx field=tagged_addresses wanted=map[string]api.ServiceAddress{} existing="map[string]api.ServiceAddress{\"lan_ipv6\":api.ServiceAddress{Address:\"xxx\", Port:81}, \"wan_ipv6\":api.ServiceAddress{Address:\"xxx\", Port:81}}"

@github-actions
Copy link

github-actions bot commented Apr 1, 2023

I'm going to lock this issue because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active issues.
If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Apr 1, 2023
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
Projects
Development

Successfully merging a pull request may close this issue.

4 participants