
Consul Connect - Envoy Sidecars failing to bootstrap listeners after certain # of services #8627

Closed
splittingred opened this issue Aug 10, 2020 · 9 comments



splittingred commented Aug 10, 2020

Nomad version

Nomad v0.12.1 (14a6893)

Operating system and Environment details

Running a Nomad cluster on Debian Buster with Consul + Vault, Consul Connect enabled.

Versions:

  • Nomad v0.12.1 (14a6893) (issue still occurs on Nomad 0.12.2/3)
  • Consul v1.8.0/1.8.3
  • Docker 19.03.113-0debian-buster

Issue

We're having an issue where Envoy proxies, after a certain number of them are spun up across a Nomad cluster, become disconnected from Consul and stop pulling services as clusters or setting up the appropriate listeners. (Note: all logs here have names and public IPs redacted. We can send more detailed job/alloc logs via email if needed.)

This essentially causes all traffic across Consul Connect to fail, as all Envoy listeners are cleared out.

root@c080253c4b21:/# curl 0.0.0.0:19001/listeners
envoy_metrics_listener::0.0.0.0:1239

From our job file, this should be:

root@c080253c4b21:/# curl 0.0.0.0:19001/listeners
envoy_metrics_listener::0.0.0.0:1239
http-1:0.0.0.0:4140
grpc:0.0.0.0:4142

We proxy HTTP/1 and gRPC traffic out via those upstreams, and then use service-routers with path prefixes and headers to determine the appropriate routing. This worked until we hit a sufficiently high number of sidecars.
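
For illustration only (the real config entries can be shared via email along with the job file), the routing layer is a set of Consul service-router config entries roughly along these lines; the service, path, and header names here are placeholders:

# write a router for the http-1 upstream that splits traffic by path prefix + header
cat > service-1-router.hcl <<'EOF'
Kind = "service-router"
Name = "http-1"

Routes = [
  {
    Match {
      HTTP {
        PathPrefix = "/service-1/"
        Header = [
          {
            Name  = "x-service"
            Exact = "service-1"
          }
        ]
      }
    }
    Destination {
      Service = "service-1"
    }
  }
]
EOF
consul config write service-1-router.hcl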

However, the Envoy proxies don't seem to be reporting any failures:

envoy_cluster_manager_cds_update_failure{local_cluster="service-1"} 0
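
For completeness, the rest of the xDS update counters can be pulled from an affected sidecar's admin endpoint (port 19001 per our bootstrap); we'd expect the LDS/CDS failure counters to sit at 0 there as well while the success counters never increment:

curl -s 0.0.0.0:19001/stats | grep -E 'listener_manager.lds.update|cluster_manager.cds.update'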

Allocation ID fb594d5d-3d4f-edc0-8d8b-8184b54c71f7 shows two port bindings:

Label                   | Host IP        | Port
----------------------- | -------------- | -----
envoy_stats             | IP_REDACTED.41 | 23612
connect-proxy-service-1 | IP_REDACTED.41 | 22083

The biggest red flag we see is in syslog, from Consul (note that in reproduction attempts these failures show up in the logs before Envoy fails to pick up listeners from Consul):

Aug 10 18:20:27 consul[9156]:     2020-08-10T18:20:27.205Z [WARN]  agent: Check socket connection failed: check=service:_nomad-task-fb594d5d-3d4f-edc0-8d8b-8184b54c71f7-group-service-1--sidecar-proxy:1 error="dial tcp 127.0.0.1:22083: connect: connection refused"
Aug 10 18:20:27 consul[9156]:     2020-08-10T18:20:27.205Z [WARN]  agent: Check is now critical: check=service:_nomad-task-fb594d5d-3d4f-edc0-8d8b-8184b54c71f7-group-service-1--sidecar-proxy:1
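
That check is just a TCP dial against the sidecar's dynamic host port, so it can be replicated by hand from the client node (port taken from the log line above):

nc -vz 127.0.0.1 22083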

I do notice that the socket is there:

$:/var/lib/nomad/data/alloc/717b04a9-1068-8a85-d709-cb3b04d49f0e/alloc/tmp# ls -al | grep consul
srwxrwxrwx 1 root   root       0 Aug 10 17:43 consul_grpc.sock

And that it is connectable (nc -U also works):

$:/var/lib/nomad/data/alloc/717b04a9-1068-8a85-d709-cb3b04d49f0e/alloc/tmp# curl --unix-socket consul_grpc.sock http:/v1/config/ -v
* Expire in 0 ms for 6 (transfer 0x55bd76c4cf50)
*   Trying consul_grpc.sock...
* Expire in 200 ms for 4 (transfer 0x55bd76c4cf50)
* Connected to v1 (/var/lib/nomad/data/alloc/717b04a9-1068-8a85-) port 80 (#0)

We're consistently seeing these errors for all proxy sidecars. An Envoy bootstrap file:

{
  "admin": {
    "access_log_path": "/dev/null",
    "address": {
      "socket_address": {
        "address": "127.0.0.1",
        "port_value": 19001
      }
    }
  },
  "node": {
    "cluster": "service-2",
    "id": "_nomad-task-26398d84-9ad8-db8e-e55e-370dd59f198d-group-service-2-sidecar-proxy",
    "metadata": {
      "namespace": "default",
      "envoy_version": "1.14.2"
    }
  },
  "static_resources": {
    "clusters": [
      {
        "name": "local_agent",
        "connect_timeout": "1s",
        "type": "STATIC",
        "http2_protocol_options": {},
        "hosts": [
          {
            "pipe": {
              "path": "alloc/tmp/consul_grpc.sock"
            }
          }
        ]
      },
      {
        "name": "self_admin",
        "connect_timeout": "5s",
        "type": "STATIC",
        "http_protocol_options": {},
        "hosts": [
          {
            "socket_address": {
              "address": "127.0.0.1",
              "port_value": 19001
            }
          }
        ]
      }
    ],
    "listeners": [
      {
        "name": "envoy_metrics_listener",
        "address": {
          "socket_address": {
            "address": "0.0.0.0",
            "port_value": 1239
          }
        },
        "filter_chains": [
          {
            "filters": [
              {
                "name": "envoy.http_connection_manager",
                "config": {
                  "stat_prefix": "envoy_metrics",
                  "codec_type": "HTTP1",
                  "route_config": {
                    "name": "self_admin_route",
                    "virtual_hosts": [
                      {
                        "name": "self_admin",
                        "domains": [
                          "*"
                        ],
                        "routes": [
                          {
                            "match": {
                              "prefix": "/stats"
                            },
                            "route": {
                              "cluster": "self_admin",
                              "prefix_rewrite": "/stats"
                            }
                          },
                          {
                            "match": {
                              "prefix": "/"
                            },
                            "direct_response": {
                              "status": 404
                            }
                          }
                        ]
                      }
                    ]
                  },
                  "http_filters": [
                    {
                      "name": "envoy.router"
                    }
                  ]
                }
              }
            ]
          }
        ]
      }
    ]
  },
  "stats_config": {
    "stats_tags": [
      {
        "regex": "^cluster\\.((?:([^.]+)~)?(?:[^.]+\\.)?[^.]+\\.[^.]+\\.[^.]+\\.[^.]+\\.[^.]+\\.consul\\.)",
        "tag_name": "consul.custom_hash"
      },
      {
        "regex": "^cluster\\.((?:[^.]+~)?(?:([^.]+)\\.)?[^.]+\\.[^.]+\\.[^.]+\\.[^.]+\\.[^.]+\\.consul\\.)",
        "tag_name": "consul.service_subset"
      },
      {
        "regex": "^cluster\\.((?:[^.]+~)?(?:[^.]+\\.)?([^.]+)\\.[^.]+\\.[^.]+\\.[^.]+\\.[^.]+\\.consul\\.)",
        "tag_name": "consul.service"
      },
      {
        "regex": "^cluster\\.((?:[^.]+~)?(?:[^.]+\\.)?[^.]+\\.([^.]+)\\.[^.]+\\.[^.]+\\.[^.]+\\.consul\\.)",
        "tag_name": "consul.namespace"
      },
      {
        "regex": "^cluster\\.((?:[^.]+~)?(?:[^.]+\\.)?[^.]+\\.[^.]+\\.([^.]+)\\.[^.]+\\.[^.]+\\.consul\\.)",
        "tag_name": "consul.datacenter"
      },
      {
        "regex": "^cluster\\.((?:[^.]+~)?(?:[^.]+\\.)?[^.]+\\.[^.]+\\.[^.]+\\.([^.]+)\\.[^.]+\\.consul\\.)",
        "tag_name": "consul.routing_type"
      },
      {
        "regex": "^cluster\\.((?:[^.]+~)?(?:[^.]+\\.)?[^.]+\\.[^.]+\\.[^.]+\\.[^.]+\\.([^.]+)\\.consul\\.)",
        "tag_name": "consul.trust_domain"
      },
      {
        "regex": "^cluster\\.(((?:[^.]+~)?(?:[^.]+\\.)?[^.]+\\.[^.]+\\.[^.]+)\\.[^.]+\\.[^.]+\\.consul\\.)",
        "tag_name": "consul.target"
      },
      {
        "regex": "^cluster\\.(((?:[^.]+~)?(?:[^.]+\\.)?[^.]+\\.[^.]+\\.[^.]+\\.[^.]+\\.[^.]+)\\.consul\\.)",
        "tag_name": "consul.full_target"
      },
      {
        "tag_name": "local_cluster",
        "fixed_value": "metafields"
      }
    ],
    "use_all_default_tags": true
  },
  "dynamic_resources": {
    "lds_config": {
      "ads": {}
    },
    "cds_config": {
      "ads": {}
    },
    "ads_config": {
      "api_type": "GRPC",
      "grpc_services": {
        "initial_metadata": [
          {
            "key": "x-consul-token",
            "value": ""
          }
        ],
        "envoy_grpc": {
          "cluster_name": "local_agent"
        }
      }
    }
  },
  "layered_runtime": {
    "layers": [
      {
        "name": "static_layer",
        "static_layer": {
          "envoy.deprecated_features:envoy.api.v2.Cluster.tls_context": true,
          "envoy.deprecated_features:envoy.config.trace.v2.ZipkinConfig.HTTP_JSON_V1": true,
          "envoy.deprecated_features:envoy.config.filter.network.http_connection_manager.v2.HttpConnectionManager.Tracing.operation_name": true
        }
      }
    ]
  }
}
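
For what it's worth, the bootstrap can also be sanity-checked in isolation with Envoy's validate mode, run from inside the sidecar container using the same path the task passes to Envoy:

envoy --mode validate -c /secrets/envoy_bootstrap.json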

Output from the Envoy admin /clusters endpoint:

self_admin::default_priority::max_connections::1024
self_admin::default_priority::max_pending_requests::1024
self_admin::default_priority::max_requests::1024
self_admin::default_priority::max_retries::3
self_admin::high_priority::max_connections::1024
self_admin::high_priority::max_pending_requests::1024
self_admin::high_priority::max_requests::1024
self_admin::high_priority::max_retries::3
self_admin::added_via_api::false
self_admin::127.0.0.1:19001::cx_active::2
self_admin::127.0.0.1:19001::cx_connect_fail::0
self_admin::127.0.0.1:19001::cx_total::2
self_admin::127.0.0.1:19001::rq_active::0
self_admin::127.0.0.1:19001::rq_error::0
self_admin::127.0.0.1:19001::rq_success::647
self_admin::127.0.0.1:19001::rq_timeout::0
self_admin::127.0.0.1:19001::rq_total::647
self_admin::127.0.0.1:19001::hostname::
self_admin::127.0.0.1:19001::health_flags::healthy
self_admin::127.0.0.1:19001::weight::1
self_admin::127.0.0.1:19001::region::
self_admin::127.0.0.1:19001::zone::
self_admin::127.0.0.1:19001::sub_zone::
self_admin::127.0.0.1:19001::canary::false
self_admin::127.0.0.1:19001::priority::0
self_admin::127.0.0.1:19001::success_rate::-1
self_admin::127.0.0.1:19001::local_origin_success_rate::-1
local_agent::default_priority::max_connections::1024
local_agent::default_priority::max_pending_requests::1024
local_agent::default_priority::max_requests::1024
local_agent::default_priority::max_retries::3
local_agent::high_priority::max_connections::1024
local_agent::high_priority::max_pending_requests::1024
local_agent::high_priority::max_requests::1024
local_agent::high_priority::max_retries::3
local_agent::added_via_api::false
local_agent::alloc/tmp/consul_grpc.sock::cx_active::1
local_agent::alloc/tmp/consul_grpc.sock::cx_connect_fail::0
local_agent::alloc/tmp/consul_grpc.sock::cx_total::1
local_agent::alloc/tmp/consul_grpc.sock::rq_active::1
local_agent::alloc/tmp/consul_grpc.sock::rq_error::0
local_agent::alloc/tmp/consul_grpc.sock::rq_success::0
local_agent::alloc/tmp/consul_grpc.sock::rq_timeout::0
local_agent::alloc/tmp/consul_grpc.sock::rq_total::1
local_agent::alloc/tmp/consul_grpc.sock::hostname::
local_agent::alloc/tmp/consul_grpc.sock::health_flags::healthy
local_agent::alloc/tmp/consul_grpc.sock::weight::1
local_agent::alloc/tmp/consul_grpc.sock::region::
local_agent::alloc/tmp/consul_grpc.sock::zone::
local_agent::alloc/tmp/consul_grpc.sock::sub_zone::
local_agent::alloc/tmp/consul_grpc.sock::canary::false
local_agent::alloc/tmp/consul_grpc.sock::priority::0
local_agent::alloc/tmp/consul_grpc.sock::success_rate::-1
local_agent::alloc/tmp/consul_grpc.sock::local_origin_success_rate::-1

The concerning thing in there is success_rate::-1, but we're not sure if that's a red herring.

Consul agent config:

{
  "addresses": {
    "dns": "127.0.0.1",
    "grpc": "127.0.0.1",
    "http": "127.0.0.1"
  },
  "advertise_addr": "{{ GetInterfaceIP \"ens4\" }}",
  "config_entries": {},
  "connect": {
    "enabled": true
  },
  "data_dir": "/opt/consul",
  "datacenter": "REDACTED",
  "disable_remote_exec": true,
  "disable_update_check": true,
  "dns_config": {
    "allow_stale": true,
    "max_stale": "87600h"
  },
  "enable_central_service_config": true,
  "enable_script_checks": true,
  "enable_syslog": true,
  "log_level": "INFO",
  "node_meta": {
    "cluster": "nomad",
    "machine_type": "REDACTED",
    "zone": "REDACTED"
  },
  "node_name": "REDACTED",
  "ports": {
    "dns": 8600,
    "grpc": 8502,
    "http": 8500,
    "serf_lan": 8301,
    "serf_wan": 8302
  },
  "retry_join": [
    "REDACTED"
  ]
}

Reproduction steps

This has been very difficult to reproduce. Essentially, we've seen it start to occur at around 100-200+ sidecars across a Nomad cluster (8+ clients).
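
A rough sketch of one way to push the sidecar count up for reproduction (job and group names here are placeholders):

# scale a Connect-enabled group in steps until the cluster crosses ~100-200 sidecars
for count in 50 100 150 200; do
  nomad job scale example-connect-job service-1-group "$count"
  sleep 300   # give Nomad/Consul time to place and register the proxies
done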

Job file (if appropriate)

Can be sent via email. Here are the relevant parts:

Connect stanza:

{
  "Connect": {
    "Native": false,
    "SidecarService": {
      "Port": "",
      "Proxy": {
        "Config": {
          "envoy_stats_bind_addr": "0.0.0.0:1239",
          "local_connect_timeout_ms": 5000.0
        },
        "ExposeConfig": null,
        "LocalServiceAddress": "",
        "LocalServicePort": 9005,
        "Upstreams": [
          {
            "DestinationName": "http-1",
            "LocalBindPort": 4140
          },
          {
            "DestinationName": "grpc",
            "LocalBindPort": 4142
          }
        ]
      },
      "Tags": [
        "sidecar-for:service-1"
      ]
    },
    "SidecarTask": {
      "Config": {
        "image": "REDACTED/envoy:latest"
      },
      "Driver": "docker",
      "Env": null,
      "KillSignal": "",
      "KillTimeout": 5000000000,
      "LogConfig": {
        "MaxFileSizeMB": 10,
        "MaxFiles": 10
      },
      "Meta": null,
      "Name": "",
      "Resources": {
        "CPU": 64,
        "Devices": null,
        "DiskMB": 0,
        "IOPS": 0,
        "MemoryMB": 64,
        "Networks": null
      },
      "ShutdownDelay": 0,
      "User": ""
    }
  }
}
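
As a sanity check: in a healthy allocation, the two upstream bind ports from that stanza accept connections from inside the task's network namespace (once the listeners are cleared out, they obviously no longer do):

nc -vz 127.0.0.1 4140   # http-1 upstream
nc -vz 127.0.0.1 4142   # grpc upstream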

Service network block:

  "Name": "service-1",
  "Networks": [
    {
      "CIDR": "",
      "DNS": null,
      "Device": "",
      "DynamicPorts": [
        {
          "HostNetwork": "default",
          "Label": "envoy_stats",
          "To": 1239,
          "Value": 0
        },
        {
          "HostNetwork": "default",
          "Label": "connect-proxy-service-1",
          "To": -1,
          "Value": 0
        }
      ],
      "IP": "",
      "MBits": 8,
      "Mode": "bridge",
      "ReservedPorts": null
    }
  ]

Nomad Client logs (if appropriate)

Aug 10 18:20:27 consul[9156]:     2020-08-10T18:20:27.205Z [WARN]  agent: Check socket connection failed: check=service:_nomad-task-fb594d5d-3d4f-edc0-8d8b-8184b54c71f7-group-service-1--sidecar-proxy:1 error="dial tcp 127.0.0.1:22083: connect: connection refused"
Aug 10 18:20:27 consul[9156]:     2020-08-10T18:20:27.205Z [WARN]  agent: Check is now critical: check=service:_nomad-task-fb594d5d-3d4f-edc0-8d8b-8184b54c71f7-group-service-1--sidecar-proxy:1
tgross added the theme/consul/connect label on Aug 11, 2020

splittingred commented Aug 12, 2020

Diving in a bit further:

We weren't sure if Envoy was having issues talking to Consul through the consul_grpc.sock socket, so we attempted to override the Envoy bootstrap template to connect via address/port instead. This established the connection, but a similar error occurred:


Aug 12 21:58:44 consul[25943]:  agent: Check socket connection failed: check=service:_nomad-task-0fccf9ff-3bda-374b-5d1e-6d7705c575e2-group-service-1-sidecar-proxy:1 error="dial tcp 127.0.0.1:30036: connect: connection refused"
Aug 12 21:58:44 consul[25943]:  agent: Check is now critical: check=service:_nomad-task-0fccf9ff-3bda-374b-5d1e-6d7705c575e2-group-service-1-sidecar-proxy:1

Taking a deeper look at the stats, I noticed Envoy's initial xDS request is stuck in pending:

cluster.local_agent.upstream_rq_pending_total: 1
cluster.local_agent.upstream_cx_rx_bytes_buffered: 43
cluster.local_agent.upstream_cx_rx_bytes_total: 97
cluster.local_agent.upstream_cx_total: 1
cluster.local_agent.upstream_cx_tx_bytes_buffered: 0
cluster.local_agent.upstream_cx_tx_bytes_total: 16800

It seems the bidirectional gRPC stream is created and connects fine, but Consul never returns any services and the request hangs.
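
A couple of ways to confirm that from the Envoy side of the stream (admin port 19001 as above):

# gauge is 1 while the ADS stream is connected to the management server
curl -s 0.0.0.0:19001/stats | grep control_plane.connected_state

# on a stuck proxy we'd expect the dynamic listener/cluster sections here to be empty
curl -s 0.0.0.0:19001/config_dump | head -n 40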


splittingred commented Aug 13, 2020

More networking info from the Envoy sidecar, gathered from the Nomad client host:

32610 ?        Ssl    2:29 envoy -c /secrets/envoy_bootstrap.json -l debug --disable-hot-restart
~# sudo nsenter -t 32610 -n
:~# tcpdump -i any portrange 8500-8502 -v
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on any, link-type LINUX_SLL (Linux cooked), capture size 262144 bytes
# over about 5m...
^C
0 packets captured
0 packets received by filter
0 packets dropped by kernel

netstat dump:

~# netstat -ano
Active Internet connections (servers and established)
Proto Recv-Q Send-Q Local Address           Foreign Address         State       Timer
tcp        0      0 0.0.0.0:9999            0.0.0.0:*               LISTEN      off (0.00/0/0)
tcp        0      0 0.0.0.0:1239            0.0.0.0:*               LISTEN      off (0.00/0/0)
tcp        0      0 0.0.0.0:8985            0.0.0.0:*               LISTEN      off (0.00/0/0)
tcp        0      0 0.0.0.0:19001           0.0.0.0:*               LISTEN      off (0.00/0/0)
tcp        0      0 127.0.0.1:37716         127.0.0.1:9999          TIME_WAIT   timewait (27.55/0/0)
tcp        0      0 127.0.0.1:38044         127.0.0.1:9999          TIME_WAIT   timewait (37.54/0/0)
tcp        0      0 127.0.0.1:37442         127.0.0.1:9999          TIME_WAIT   timewait (17.56/0/0)
tcp        0      0 REDACTED_IP:33640     REDACTED_IP:5672      ESTABLISHED off (0.00/0/0)
tcp        0      0 REDACTED_IP:56002     REDACTED_IP:3306      ESTABLISHED keepalive (6830.12/0/0)
tcp        0      0 127.0.0.1:38350         127.0.0.1:9999          TIME_WAIT   timewait (47.51/0/0)
tcp        0      0 REDACTED_IP:54536     REDACTED_IP:3306      ESTABLISHED keepalive (6783.76/0/0)
tcp        0      0 127.0.0.1:38612         127.0.0.1:9999          TIME_WAIT   timewait (57.55/0/0)
tcp        0      0 REDACTED_IP:33628     REDACTED_IP:5672      ESTABLISHED off (0.00/0/0)
Active UNIX domain sockets (servers and established)
Proto RefCnt Flags       Type       State         I-Node   Path
unix  3      [ ]         STREAM     CONNECTED     349215242 /var/lib/nomad/data/alloc/4a45a6c1-0bef-1a15-bf52-7d566ed5c1e8/alloc/tmp/consul_grpc.sock
unix  3      [ ]         STREAM     CONNECTED     349212536
unix  2      [ ]         STREAM     CONNECTED     348799443
unix  2      [ ]         STREAM     CONNECTED     348793722

We're pretty convinced the bidirectional stream is hanging. Nomad is establishing connections to Consul over port 8502:

~# netstat -an | grep 8502
tcp        0      0 127.0.0.1:41350         127.0.0.1:8502          ESTABLISHED
tcp        0      0 127.0.0.1:41338         127.0.0.1:8502          ESTABLISHED
tcp        0      0 127.0.0.1:41462         127.0.0.1:8502          ESTABLISHED
tcp        0      0 127.0.0.1:41132         127.0.0.1:8502          ESTABLISHED
tcp        0      0 127.0.0.1:34678         127.0.0.1:8502          ESTABLISHED
tcp        0      0 127.0.0.1:41602         127.0.0.1:8502          ESTABLISHED
tcp        0      0 127.0.0.1:41616         127.0.0.1:8502          ESTABLISHED
tcp        0      0 127.0.0.1:41674         127.0.0.1:8502          ESTABLISHED
tcp        0      0 127.0.0.1:41392         127.0.0.1:8502          ESTABLISHED
tcp        0      0 127.0.0.1:40402         127.0.0.1:8502          ESTABLISHED
tcp        0      0 127.0.0.1:41932         127.0.0.1:8502          ESTABLISHED
tcp        0      0 127.0.0.1:41412         127.0.0.1:8502          ESTABLISHED
tcp        0      0 127.0.0.1:41352         127.0.0.1:8502          ESTABLISHED
tcp        0      0 127.0.0.1:41568         127.0.0.1:8502          ESTABLISHED
tcp        0      0 127.0.0.1:41308         127.0.0.1:8502          ESTABLISHED
tcp        0      0 127.0.0.1:41540         127.0.0.1:8502          ESTABLISHED
tcp        0      0 127.0.0.1:41454         127.0.0.1:8502          ESTABLISHED
tcp        0      0 127.0.0.1:41174         127.0.0.1:8502          ESTABLISHED
tcp        0      0 127.0.0.1:41430         127.0.0.1:8502          ESTABLISHED
tcp        0      0 127.0.0.1:41536         127.0.0.1:8502          ESTABLISHED
tcp        0      0 127.0.0.1:41176         127.0.0.1:8502          ESTABLISHED
tcp        0      0 127.0.0.1:41482         127.0.0.1:8502          ESTABLISHED
tcp        0      0 127.0.0.1:41292         127.0.0.1:8502          ESTABLISHED
tcp        0      0 127.0.0.1:41110         127.0.0.1:8502          ESTABLISHED
tcp        0      0 127.0.0.1:41290         127.0.0.1:8502          ESTABLISHED
tcp        0      0 127.0.0.1:34336         127.0.0.1:8502          ESTABLISHED
tcp        0      0 127.0.0.1:41476         127.0.0.1:8502          ESTABLISHED
tcp        0      0 127.0.0.1:41452         127.0.0.1:8502          ESTABLISHED
tcp        0      0 127.0.0.1:34334         127.0.0.1:8502          ESTABLISHED
tcp        0      0 127.0.0.1:41296         127.0.0.1:8502          ESTABLISHED
tcp        0      0 127.0.0.1:41742         127.0.0.1:8502          ESTABLISHED
tcp        0      0 127.0.0.1:41386         127.0.0.1:8502          ESTABLISHED
tcp        0      0 127.0.0.1:33780         127.0.0.1:8502          ESTABLISHED
tcp        0      0 127.0.0.1:41510         127.0.0.1:8502          ESTABLISHED
tcp        0      0 127.0.0.1:41282         127.0.0.1:8502          ESTABLISHED
tcp        0      0 127.0.0.1:41046         127.0.0.1:8502          ESTABLISHED
tcp        0      0 127.0.0.1:41410         127.0.0.1:8502          ESTABLISHED
tcp        0      0 127.0.0.1:41228         127.0.0.1:8502          ESTABLISHED
tcp        0      0 127.0.0.1:41166         127.0.0.1:8502          ESTABLISHED
tcp        0      0 127.0.0.1:41360         127.0.0.1:8502          ESTABLISHED
tcp        0      0 127.0.0.1:41212         127.0.0.1:8502          ESTABLISHED
tcp        0      0 127.0.0.1:41186         127.0.0.1:8502          ESTABLISHED
tcp        0      0 127.0.0.1:41722         127.0.0.1:8502          ESTABLISHED
tcp        0      0 127.0.0.1:41806         127.0.0.1:8502          ESTABLISHED
tcp        0      0 127.0.0.1:41348         127.0.0.1:8502          ESTABLISHED
tcp        0      0 127.0.0.1:41146         127.0.0.1:8502          ESTABLISHED
tcp        0      0 127.0.0.1:33772         127.0.0.1:8502          ESTABLISHED
tcp        0      0 127.0.0.1:41608         127.0.0.1:8502          ESTABLISHED
tcp        0      0 127.0.0.1:42014         127.0.0.1:8502          ESTABLISHED
tcp        0      0 127.0.0.1:41250         127.0.0.1:8502          ESTABLISHED
tcp        0      0 127.0.0.1:41120         127.0.0.1:8502          ESTABLISHED
tcp        0      0 127.0.0.1:41528         127.0.0.1:8502          ESTABLISHED
tcp        0      0 127.0.0.1:41748         127.0.0.1:8502          ESTABLISHED
tcp        0      0 127.0.0.1:41666         127.0.0.1:8502          ESTABLISHED
tcp        0      0 127.0.0.1:41636         127.0.0.1:8502          ESTABLISHED
tcp        0      0 127.0.0.1:34686         127.0.0.1:8502          ESTABLISHED
tcp        0      0 127.0.0.1:41504         127.0.0.1:8502          ESTABLISHED
tcp        0      0 127.0.0.1:41294         127.0.0.1:8502          ESTABLISHED
tcp        0      0 127.0.0.1:41116         127.0.0.1:8502          ESTABLISHED
tcp        0      0 127.0.0.1:41216         127.0.0.1:8502          ESTABLISHED
tcp6       0      0 :::8502                 :::*                    LISTEN
tcp6       0      0 127.0.0.1:8502          127.0.0.1:41386         ESTABLISHED
tcp6       0      0 127.0.0.1:8502          127.0.0.1:41528         ESTABLISHED
tcp6       0      0 127.0.0.1:8502          127.0.0.1:41216         ESTABLISHED
tcp6       0      0 127.0.0.1:8502          127.0.0.1:41674         ESTABLISHED
tcp6       0      0 127.0.0.1:8502          127.0.0.1:41294         ESTABLISHED
tcp6       0      0 127.0.0.1:8502          127.0.0.1:41452         ESTABLISHED
tcp6       0      0 127.0.0.1:8502          127.0.0.1:42014         ESTABLISHED
tcp6       0      0 127.0.0.1:8502          127.0.0.1:41110         ESTABLISHED
tcp6       0      0 127.0.0.1:8502          127.0.0.1:41932         ESTABLISHED
tcp6       0      0 127.0.0.1:8502          127.0.0.1:41174         ESTABLISHED
tcp6       0      0 127.0.0.1:8502          127.0.0.1:41166         ESTABLISHED
tcp6       0      0 127.0.0.1:8502          127.0.0.1:41352         ESTABLISHED
tcp6       0      0 127.0.0.1:8502          127.0.0.1:41282         ESTABLISHED
tcp6       0      0 127.0.0.1:8502          127.0.0.1:41806         ESTABLISHED
tcp6       0      0 127.0.0.1:8502          127.0.0.1:41636         ESTABLISHED
tcp6       0      0 127.0.0.1:8502          127.0.0.1:34336         ESTABLISHED
tcp6       0      0 127.0.0.1:8502          127.0.0.1:41742         ESTABLISHED
tcp6       0      0 127.0.0.1:8502          127.0.0.1:41392         ESTABLISHED
tcp6       0      0 127.0.0.1:8502          127.0.0.1:41748         ESTABLISHED
tcp6       0      0 127.0.0.1:8502          127.0.0.1:41476         ESTABLISHED
tcp6       0      0 127.0.0.1:8502          127.0.0.1:41410         ESTABLISHED
tcp6       0      0 127.0.0.1:8502          127.0.0.1:41186         ESTABLISHED
tcp6       0      0 127.0.0.1:8502          127.0.0.1:41412         ESTABLISHED
tcp6       0      0 127.0.0.1:8502          127.0.0.1:41430         ESTABLISHED
tcp6       0      0 127.0.0.1:8502          127.0.0.1:41292         ESTABLISHED
tcp6       0      0 127.0.0.1:8502          127.0.0.1:41722         ESTABLISHED
tcp6       0      0 127.0.0.1:8502          127.0.0.1:41360         ESTABLISHED
tcp6       0      0 127.0.0.1:8502          127.0.0.1:41228         ESTABLISHED
tcp6       0      0 127.0.0.1:8502          127.0.0.1:41116         ESTABLISHED
tcp6       0      0 127.0.0.1:8502          127.0.0.1:34678         ESTABLISHED
tcp6       0      0 127.0.0.1:8502          127.0.0.1:41046         ESTABLISHED
tcp6       0      0 127.0.0.1:8502          127.0.0.1:33780         ESTABLISHED
tcp6       0      0 127.0.0.1:8502          127.0.0.1:41146         ESTABLISHED
tcp6       0      0 127.0.0.1:8502          127.0.0.1:41350         ESTABLISHED
tcp6       0      0 127.0.0.1:8502          127.0.0.1:41290         ESTABLISHED
tcp6       0      0 127.0.0.1:8502          127.0.0.1:41308         ESTABLISHED
tcp6       0      0 127.0.0.1:8502          127.0.0.1:41568         ESTABLISHED
tcp6       0      0 127.0.0.1:8502          127.0.0.1:41212         ESTABLISHED
tcp6       0      0 127.0.0.1:8502          127.0.0.1:41462         ESTABLISHED
tcp6       0      0 127.0.0.1:8502          127.0.0.1:41482         ESTABLISHED
tcp6       0      0 127.0.0.1:8502          127.0.0.1:41296         ESTABLISHED
tcp6       0      0 127.0.0.1:8502          127.0.0.1:41176         ESTABLISHED
tcp6       0      0 127.0.0.1:8502          127.0.0.1:41616         ESTABLISHED
tcp6       0      0 127.0.0.1:8502          127.0.0.1:41454         ESTABLISHED
tcp6       0      0 127.0.0.1:8502          127.0.0.1:33772         ESTABLISHED
tcp6       0      0 127.0.0.1:8502          127.0.0.1:41338         ESTABLISHED
tcp6       0      0 127.0.0.1:8502          127.0.0.1:41120         ESTABLISHED
tcp6       0      0 127.0.0.1:8502          127.0.0.1:41348         ESTABLISHED
tcp6       0      0 127.0.0.1:8502          127.0.0.1:41250         ESTABLISHED
tcp6       0      0 127.0.0.1:8502          127.0.0.1:41608         ESTABLISHED
tcp6       0      0 127.0.0.1:8502          127.0.0.1:34334         ESTABLISHED
tcp6       0      0 127.0.0.1:8502          127.0.0.1:41666         ESTABLISHED
tcp6       0      0 127.0.0.1:8502          127.0.0.1:40402         ESTABLISHED
tcp6       0      0 127.0.0.1:8502          127.0.0.1:41132         ESTABLISHED
tcp6       0      0 127.0.0.1:8502          127.0.0.1:41602         ESTABLISHED
tcp6       0      0 127.0.0.1:8502          127.0.0.1:41510         ESTABLISHED
tcp6       0      0 127.0.0.1:8502          127.0.0.1:41536         ESTABLISHED
tcp6       0      0 127.0.0.1:8502          127.0.0.1:34686         ESTABLISHED
tcp6       0      0 127.0.0.1:8502          127.0.0.1:41540         ESTABLISHED
tcp6       0      0 127.0.0.1:8502          127.0.0.1:41504         ESTABLISHED
unix  2      [ ACC ]     STREAM     LISTENING     348850234 @/containerd-shim/moby/6cb297e012e3bcc795b93026affa51d7c0cb52f4c9aed79884a5f3df47fcc1cf/shim.sock@
unix  2      [ ACC ]     STREAM     LISTENING     286163   /tmp/plugin985025497
unix  3      [ ]         STREAM     CONNECTED     118418502
unix  3      [ ]         STREAM     CONNECTED     348850238

@splittingred

A note here: we had the Consul agent running via systemd, with:

LimitNOFILE=131072

We changed this to:

LimitNOFILE=infinity
LimitNPROC=infinity

We then restarted Consul, and proxies started picking up listeners again. We're wondering if Consul hit its ulimits.
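
For reference, a quick way to check whether the agent is actually near its descriptor limit (assuming a single consul process on the node):

CONSUL_PID=$(pidof consul)
grep "Max open files" /proc/$CONSUL_PID/limits
ls /proc/$CONSUL_PID/fd | wc -l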

@splittingred

Update: after fully pulling the cluster back up again, new Envoy proxies in new allocations are not gathering listeners again - the same issue. Adjusting ulimits has not solved it.

@splittingred

Any update on this by chance?


splittingred commented Feb 1, 2021

We're definitely still seeing this. A few updates:

The sidecars are failing health checks:

Feb 1, 2021 @ 09:58:30.000	2021-02-01T15:58:30.457Z [WARN]  agent: Check socket connection failed: check=service:_nomad-task-4a4cb768-17b8-b976-dc4e-9f98beb947d0-group-auth-grpc-auth-service-grpc-auth9004-sidecar-proxy:1 error="dial tcp 127.0.0.1:22720: connect: connection refused"

If you look at the Envoy sidecar proxy, it's not getting its listeners:

# curl 0.0.0.0:19001/listeners
envoy_metrics_listener::0.0.0.0:1239

The domain socket for the Consul API is active:

Active UNIX domain sockets (servers and established)
Proto RefCnt Flags       Type       State         I-Node   Path
unix  2      [ ]         DGRAM                    31352210
unix  3      [ ]         STREAM     CONNECTED     29069764
unix  3      [ ]         STREAM     CONNECTED     29073521 /var/lib/nomad/data/alloc/4256118f-716d-e743-89b9-d03fb197c17a/alloc/tmp/consul_grpc.sock

The only relevant log we see is:

[2021-02-01 16:13:57.729][7][info][main] [source/server/drain_manager_impl.cc:70] shutting down parent after drain

According to https://www.envoyproxy.io/docs/envoy/latest/intro/arch_overview/operations/draining, this happens when "Individual listeners are being modified or removed via LDS."

We note that this happens almost immediately after the container comes up, meaning the listeners are never populated; the Envoy sidecar therefore likely never fully initializes and then fails its checks. So for some reason, Envoy is failing to gather its listeners from Consul's API.
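
A quick way to confirm a given sidecar never finishes initializing (again via the 19001 admin port):

curl -s 0.0.0.0:19001/server_info | grep '"state"'   # a healthy proxy reports LIVE
curl -s 0.0.0.0:19001/ready                          # returns LIVE with a 200 once initialized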

Any ideas, or things we can try/inspect further?

@chrisboulton

We believe this might actually have been caused by a bootstrap deadlock in Consul: hashicorp/consul#9689. Since applying the patch there, I don't think we've seen any further occurrences of this.

@splittingred

Going to close this. As @chrisboulton said, we have not seen this recur since hashicorp/consul#9689.

@github-actions

I'm going to lock this issue because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active issues.
If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

github-actions bot locked as resolved and limited conversation to collaborators on Oct 21, 2022