
Nomad with Consul Connect Service Mesh randomly fails with a bootstrap error #20516

Closed
ryanhulet opened this issue May 2, 2024 · 16 comments · Fixed by #23381
Assignees: tgross
Labels: hcc/cst (Admin - internal), type/bug
Milestone: 1.8.2

Comments

@ryanhulet

Nomad version

Nomad v1.7.7
BuildDate 2024-04-16T19:26:43Z
Revision 0f34c85ee63f6472bd2db1e2487611f4b176c70c

Operating system and Environment details

NAME="Ubuntu"
VERSION="20.04.5 LTS (Focal Fossa)"

Issue

Ever since upgrading Nomad to 1.7.7 and Consul to 1.18.1, Nomad jobs fail repeatedly with the error:

envoy_bootstrap: error creating bootstrap configuration for Connect proxy sidecar: exit status 1; see: https://developer.hashicorp.com/nomad/s/envoy-bootstrap-error

This happens multiple times until the job eventually schedules and starts correctly, which leaves multiple jobs stuck in pending for hours on end because of rescheduling backoff.

Reproduction steps

Literally run a job similar to the one attached, using Consul with auto-config and mTLS and Envoy 1.27.3. I am also using Vault 1.15.4 for secrets, and Nomad is using Vault and Consul workload identities.
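
(For context, "Consul workload identities" refers to the Nomad 1.7+ flow in which Nomad signs workload identity JWTs that are exchanged for Consul ACL tokens. A minimal sketch of the kind of Nomad server configuration involved is shown below; the audience and TTL values are illustrative assumptions, not the reporter's actual settings.)

# Sketch only: Nomad server agent config enabling Consul workload identity
# (Nomad 1.7+). The values shown are assumptions, not the reporter's settings.
consul {
  address = "127.0.0.1:8500"

  # Identity used by Nomad to register services with Consul
  service_identity {
    aud = ["consul.io"]
    ttl = "1h"
  }

  # Identity used by tasks that talk to Consul (templates, Connect sidecars)
  task_identity {
    aud = ["consul.io"]
    ttl = "1h"
  }
}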

Expected Result

A job to schedule and register with Consul Connect successfully

Actual Result

REPEATED failures with the envoy_bootstrap error

Job file (if appropriate)

job "canonical-landscape" {
  datacenters = ["<DATACENTER>"]

  namespace = "ops"

  type = "service"

  vault {}
  
 affinity {
    attribute = "${node.unique.name}"
    operator = "set_contains_any"
    value = "<HOSTNAMES>"
    weight = -50
  }

  group "landscape" {

    volume "canonical-landscape" {
      type      = "host"
      read_only = false
      source    = "canonical-landscape"
    }

    volume "gpgs_keys" {
      type      = "host"
      read_only = false
      source    = "gpgs_keys"
    }

    volume "landscape-repository" {
      type      = "host"
      read_only = false
      source    = "landscape-repository"
    }

    network {
      mode = "bridge"

      port "landscape" {
        to = 443
      }

      port "landscape80" {
        to = 80
      }
    }

    service {
      name = "landscape"
      port = "landscape"
      address_mode = "alloc"

      connect {
        sidecar_service {
          proxy {
            upstreams {
              destination_name = "landscape-postgres"
              local_bind_port  = 5432
            }

            upstreams {
              destination_name = "landscape-rabbitmq"
              local_bind_port  = 5672
            }
          }
        }
      }
    }

    service {
      name = "landscape80"
      port = "landscape80"
      address_mode = "alloc"

      connect {
        sidecar_service {}
      }
    }

    task "landscape" {

      driver = "docker"

      resources {
          memory = 4096
        }

      template {
        data = <<EOH
[stores]
user = landscape
password = {{- with secret "nomad-kv/default/canonical-landscape" }}{{- .Data.data.landscape_pw -}}{{ end }}
host = 127.0.0.1
main = landscape-standalone-main
account-1 = landscape-standalone-account-1
resource-1 = landscape-standalone-resource-1
package = landscape-standalone-package
session = landscape-standalone-session
session-autocommit = landscape-standalone-session?isolation=autocommit
knowledge = landscape-standalone-knowledge

[global]
oops-path = /var/lib/landscape/landscape-oops
# Syslog address is either a socket path or server:port string.
syslog-address = /dev/log
deployment-mode = standalone

[broker]
port = 5672
host = 127.0.0.1 
user = landscape 
password = {{- with secret "nomad-kv/default/canonical-landscape" }}{{- .Data.data.rmq_pw -}}{{ end }}
vhost = landscape 
hostagent_virtual_host = landscape-hostagent
hostagent_task_queue = landscape-server-hostagent-task-queue

[landscape]
base-port = 8080
threads = 8
stores = main account-1 resource-1 package session session-autocommit knowledge
oops-key = DF%(instance-letter)s
#soft-timeout = 5
mailer = queue
mailer-path = /var/lib/landscape/landscape-mail-queue
reprepro-binary = /opt/canonical/landscape/scripts/reprepro-wrapper.sh
repository-path = /var/lib/landscape/landscape-repository

[message-server]
base-port = 8090
threads = 8
stores = main account-1 resource-1 package
oops-key = DM%(instance-letter)s
#soft-timeout = 5
#ping-interval = 30

[load-shaper]
critical-load = 10.0
good-load = 3.0
good-duration = 60.0

[pingserver]
base-port = 8070
stores = main account-1 resource-1
threads = 2

[job-handler]
stores = main account-1 resource-1 package
threads = 10
mailer = queue
mailer-path = /var/lib/landscape/landscape-mail-queue

[async-frontend]
base-port = 9090

[api]
stores = main account-1 resource-1 package
threads = 10
base-port = 9080
mailer = queue
mailer-path = /var/lib/landscape/landscape-mail-queue

[maintenance]
threads = 1
stores = main account-1 resource-1 package session session-autocommit knowledge
mailer = queue
mailer-path = /var/lib/landscape/landscape-mail-queue

[schema]
threads = 1
stores = main account-1 resource-1 package session knowledge
store_user = landscape_superuser
store_password = {{- with secret "nomad-kv/default/canonical-landscape" }}{{- .Data.data.postgres_pw -}}{{ end }}

[scripts]
threads = 1
stores = main account-1 resource-1 package session knowledge
mailer = queue
mailer-path = /var/lib/landscape/landscape-mail-queue

[package-upload]
stores = main account-1
threads = 10
port = 9100
root-url = http://localhost:9100
mailer = queue
mailer-path = /var/lib/landscape/landscape-mail-queue

[package-search]
port = 9099
stores = main package resource-1
pid-path = /var/run/landscape/landscape-package-search-1.pid
account-threshold = 0
EOH
        // Setting destination to secrets makes this file inaccessible from the host machine
        destination = "secrets/service.conf"
        change_mode = "noop"
      }

      template {
        data = <<EOH
{{- with secret "/nomad-kv/svwi-certs" -}}{{- .Data.data.cert }}{{- end }}
EOH

        destination = "local/svwi.crt"
        change_mode = "restart"
      }

      template {
        data = <<EOH
{{- with secret "/nomad-kv/svwi-certs" -}}{{- .Data.data.key }}{{- end }}
EOH

        destination = "secrets/svwi.key"
        change_mode = "restart"
      }

      template {
        data = <<EOH
{{- with secret "/nomad-kv/default/canonical-landscape" -}}{{- .Data.data.license }}{{- end }}
EOH

        destination = "secrets/license.txt"
        change_mode = "restart"
      }

      template {
        data = <<EOH
[smtp.sendgrid.net]:587 apikey:{{- with secret "/nomad-kv/default/canonical-landscape" -}}{{- .Data.data.sendgrid_pw }}{{- end }}
EOH
        destination = "secrets/postfix"
        change_mode = "restart"
      }

      template {
        data = <<EOH
DB_HOST=127.0.0.1
RMQ_HOST=127.0.0.1
DB_PORT=5432
DB_USER=landscape_superuser
DB_LANDSCAPE_USER=landscape
DB_NAME=postgres
INITIALIZE_SCHEMA=yes
EOH
        // Setting destination to local makes this file accessible from the host machine
        destination = "local/lds.env"
        env = true
        change_mode = "noop"
      }

      template {
        data = <<EOH
DB_LANDSCAPE_PASS="{{- with secret "nomad-kv/default/canonical-landscape" }}{{- .Data.data.landscape_pw -}}{{ end }}"
GCP_TOKEN="{{- with secret "/gcp/roleset/<ROLE>/token" }}{{- .Data.token | regexReplaceAll "\\.{3,}" "" }}{{- end }}"
EOH
        destination = "secrets/postgres.env"
        env = true
        change_mode = "noop"
      }

      volume_mount {
        volume      = "canonical-landscape"
        destination = "/var/lib/landscape/hash-id-databases"
        read_only   = false
      }

      volume_mount {
        volume      = "gpgs_keys"
        destination = "/root/.gnupg"
        read_only   = false
      }

      volume_mount {
        volume      = "landscape-repository"
        destination = "/var/lib/landscape/landscape-repository"
        read_only   = false
      }

      config {
        image = "canonical-landscape/app-server"

        volumes = [
          "/etc/ssl/certs:/etc/ssl/certs",
          "secrets/service.conf:/etc/landscape/service.conf",
          "secrets/license.txt:/etc/landscape/license.txt",
          "secrets/postfix:/etc/postfix/sasl_passwd"
          //
        ]

        ports = ["landscape","landscape80"]
      }
    }
  }

  group "postgres" {

    volume "landscape-postgres" {
      type      = "host"
      read_only = false
      source    = "landscape-postgres"
    }

    network {
      mode = "bridge"

      port "postgres" {
        to = 5432
      }
    }

    service {
      name = "landscape-postgres"
      port = "postgres"
      address_mode = "alloc"

      connect {
        sidecar_service {}

      }
    }

    task "postgres" {

      driver = "docker"

      resources {
          memory = 2048
          memory_max = 4096
        }

      template {
        data = <<EOH
# TYPE  DATABASE        USER            ADDRESS                 METHOD

# "local" is for Unix domain socket connections only
local   all             all                                     trust
# IPv4 local connections:
host    all             all             127.0.0.1/32            trust
# IPv6 local connections:
host    all             all             ::1/128                 trust
# Allow replication connections from localhost, by a user with the
# replication privilege.
local   replication     all                                     trust
host    replication     all             127.0.0.1/32            trust
host    replication     all             ::1/128                 trust
EOH
        // Setting destination to secrets makes this file inaccessible from the host machine
        destination = "local/pg_hba.conf"
        change_mode = "noop"
      }

      template {
        data = <<EOH
listen_addresses = '127.0.0.1'
max_connections = 100                   # (change requires restart)
max_prepared_transactions = 100
shared_buffers = 128MB                  # min 128kB
dynamic_shared_memory_type = posix      # the default is the first option
max_wal_size = 1GB
min_wal_size = 80MB
log_timezone = 'Etc/UTC'
datestyle = 'iso, mdy'
timezone = 'Etc/UTC'
lc_messages = 'en_US.utf8'                      # locale for system error message
lc_monetary = 'en_US.utf8'                      # locale for monetary formatting
lc_numeric = 'en_US.utf8'                       # locale for number formatting
lc_time = 'en_US.utf8'                          # locale for time formatting
default_text_search_config = 'pg_catalog.english'
EOH
        // Setting destination to secrets makes this file inaccessible from the host machine
        destination = "local/postgresql.conf"
        change_mode = "noop"
      }

      template {
        data = <<EOH
POSTGRES_PASSWORD="{{- with secret "nomad-kv/default/canonical-landscape" }}{{- .Data.data.postgres_pw -}}{{ end }}"        
SUPERUSER_PW="{{- with secret "nomad-kv/default/canonical-landscape" }}{{- .Data.data.postgres_pw -}}{{ end }}"
GCP_TOKEN="{{- with secret "/gcp/roleset/<ROLE>/token" }}{{- .Data.token | regexReplaceAll "\\.{3,}" "" }}{{- end }}"
EOH
        destination = "secrets/postgres.env"
        env = true
        change_mode = "noop"

      }

      volume_mount {
        volume      = "landscape-postgres"
        destination = "/var/lib/postgresql/data"
        read_only   = false
      }

      config {
        image = "postgres"
        volumes = [
          "/etc/ssl/certs:/etc/ssl/certs",
          // Remove these volumes if the db is being initialized, readd after persistent data is created
          //
        ]

        ports = ["postgres"]
      }
    }
  }

  group "rebbitmq" {


    network {
      mode = "bridge"

      port "rabbitmq" {
        to = 5672
      }
    }

    service {
      name = "landscape-rabbitmq"
      port = "rabbitmq"
      address_mode = "alloc"

      connect {
        sidecar_service {}

      }
    }

    task "rabbitmq" {

      driver = "docker"

      template {
        data = <<EOH
RABBITMQ_DEFAULT_USER=landscape
RABBITMQ_NODE_IP_ADDRESS=127.0.0.1
RABBITMQ_DEFAULT_VHOST=landscape
EOH
        destination = "local/rabbit.env"
        env = true
        change_mode = "noop"
      }

      template {
        data = <<EOH
RABBITMQ_DEFAULT_PASS="{{- with secret "nomad-kv/default/canonical-landscape" }}{{- .Data.data.rmq_pw -}}{{ end }}"
EOH
        destination = "secrets/rabbit.env"
        env = true
        change_mode = "noop"
      }

      config {
        image = "rabbitmq:3.12.7-management"

        volumes = [
          "/etc/ssl/certs:/etc/ssl/certs",
        ]

        ports = ["rabbitmq"]
      }
    }
  }
}

Nomad Server logs (if appropriate)

Nomad Client logs (if appropriate)

May  2 14:07:02 nomadrunner2 consul[3179549]: 2024-05-02T14:07:02.279Z [ERROR] agent.client: RPC failed to server: method=ACL.TokenRead server=10.138.0.12:7300 error="rpc error making call: token does not exist: ACL not found"
May  2 14:07:02 nomadrunner2 consul[3179549]: 2024-05-02T14:07:02.279Z [ERROR] agent.http: Request error: method=PUT url=/v1/agent/service/deregister/_nomad-task-7c5db155-9feb-ddb7-e4b8-fcc01f74d3e8-group-record_device_status-record-device-status- from=[::1]:60828 error="ACL not found"
May  2 14:07:27 nomadrunner2 nomad[372726]:     2024-05-02T14:07:27.544Z [INFO]  client.alloc_runner.task_runner: Task event: alloc_id=83f0ace8-c38a-d642-b902-fac10a19e252 task=connect-proxy-landscape-postgres type="Task hook failed" msg="envoy_bootstrap: error creating bootstrap configuration for Connect proxy sidecar: exit status 1; see: <https://developer.hashicorp.com/nomad/s/envoy-bootstrap-error>" failed=false
May  2 14:07:27 nomadrunner2 nomad[372726]:     2024-05-02T14:07:27.602Z [ERROR] client.alloc_runner.task_runner: prestart failed: alloc_id=83f0ace8-c38a-d642-b902-fac10a19e252 task=connect-proxy-landscape-postgres error="prestart hook \"envoy_bootstrap\" failed: error creating bootstrap configuration for Connect proxy sidecar: exit status 1; see: <https://developer.hashicorp.com/nomad/s/envoy-bootstrap-error>"
May  2 14:07:27 nomadrunner2 nomad[372726]:     2024-05-02T14:07:27.602Z [INFO]  client.alloc_runner.task_runner: restarting task: alloc_id=83f0ace8-c38a-d642-b902-fac10a19e252 task=connect-proxy-landscape-postgres reason="Restart within policy" delay=16.651674377s
May  2 14:07:27 nomadrunner2 nomad[372726]:     2024-05-02T14:07:27.602Z [INFO]  client.alloc_runner.task_runner: Task event: alloc_id=83f0ace8-c38a-d642-b902-fac10a19e252 task=connect-proxy-landscape-postgres type=Restarting msg="Task restarting in 16.651674377s" failed=false
May  2 14:07:32 nomadrunner2 consul[3179549]: 2024-05-02T14:07:32.281Z [ERROR] agent.client: RPC failed to server: method=ACL.TokenRead server=10.10.5.104:7300 error="rpc error making call: token does not exist: ACL not found"
May  2 14:07:32 nomadrunner2 consul[3179549]: 2024-05-02T14:07:32.281Z [ERROR] agent.http: Request error: method=PUT url=/v1/agent/service/deregister/_nomad-task-7c5db155-9feb-ddb7-e4b8-fcc01f74d3e8-group-record_device_status-record-device-status- from=[::1]:60828 error="ACL not found"
May  2 14:07:44 nomadrunner2 nomad[372726]:     2024-05-02T14:07:44.346Z [INFO]  client.alloc_runner.task_runner.task_hook.envoy_bootstrap: bootstrapping envoy: alloc_id=83f0ace8-c38a-d642-b902-fac10a19e252 task=connect-proxy-landscape-postgres namespace="" proxy_id=_nomad-task-83f0ace8-c38a-d642-b902-fac10a19e252-group-postgres-landscape-postgres-postgres-sidecar-proxy service=landscape-postgres gateway="" bootstrap_file=/opt/nomad/alloc/83f0ace8-c38a-d642-b902-fac10a19e252/connect-proxy-landscape-postgres/secrets/envoy_bootstrap.json grpc_addr=unix://alloc/tmp/consul_grpc.sock admin_bind=127.0.0.2:19001 ready_bind=127.0.0.1:19101
May  2 14:07:44 nomadrunner2 consul[3179549]: 2024-05-02T14:07:44.467Z [ERROR] agent.http: Request error: method=GET url=/v1/agent/service/_nomad-task-83f0ace8-c38a-d642-b902-fac10a19e252-group-postgres-landscape-postgres-postgres-sidecar-proxy from=[::1]:57568 error="unknown service ID: _nomad-task-83f0ace8-c38a-d642-b902-fac10a19e252-group-postgres-landscape-postgres-postgres-sidecar-proxy"
May  2 14:07:46 nomadrunner2 consul[3179549]: 2024-05-02T14:07:46.035Z [ERROR] agent.http: Request error: method=GET url=/v1/agent/service/_nomad-task-83f0ace8-c38a-d642-b902-fac10a19e252-group-postgres-landscape-postgres-postgres-sidecar-proxy from=[::1]:57570 error="unknown service ID: _nomad-task-83f0ace8-c38a-d642-b902-fac10a19e252-group-postgres-landscape-postgres-postgres-sidecar-proxy"
May  2 14:07:46 nomadrunner2 nomad[372726]:     2024-05-02T14:07:46.818Z [INFO]  client.gc: garbage collecting allocation: alloc_id=9df3c3a3-faa0-d023-f0e2-3addfd778f51 reason="forced collection"
May  2 14:07:46 nomadrunner2 nomad[372726]:     2024-05-02T14:07:46.926Z [INFO]  client.alloc_runner.task_runner: Task event: alloc_id=9df3c3a3-faa0-d023-f0e2-3addfd778f51 task=postgres type=Killing msg="Sent interrupt. Waiting 5s before force killing" failed=false
May  2 14:07:47 nomadrunner2 nomad[372726]:     2024-05-02T14:07:47.117Z [INFO]  client.alloc_runner.task_runner: Task event: alloc_id=9df3c3a3-faa0-d023-f0e2-3addfd778f51 task=connect-proxy-landscape-postgres type=Killing msg="Sent interrupt. Waiting 5s before force killing" failed=false
May  2 14:07:47 nomadrunner2 systemd[1]: opt-nomad-alloc-9df3c3a3\x2dfaa0\x2dd023\x2df0e2\x2d3addfd778f51-connect\x2dproxy\x2dlandscape\x2dpostgres-private.mount: Deactivated successfully.
May  2 14:07:47 nomadrunner2 systemd[1]: opt-nomad-alloc-9df3c3a3\x2dfaa0\x2dd023\x2df0e2\x2d3addfd778f51-connect\x2dproxy\x2dlandscape\x2dpostgres-secrets.mount: Deactivated successfully.
May  2 14:07:48 nomadrunner2 consul[3179549]: 2024-05-02T14:07:48.253Z [ERROR] agent.http: Request error: method=GET url=/v1/agent/service/_nomad-task-83f0ace8-c38a-d642-b902-fac10a19e252-group-postgres-landscape-postgres-postgres-sidecar-proxy from=[::1]:57578 error="unknown service ID: _nomad-task-83f0ace8-c38a-d642-b902-fac10a19e252-group-postgres-landscape-postgres-postgres-sidecar-proxy"
May  2 14:07:52 nomadrunner2 consul[3179549]: 2024-05-02T14:07:52.676Z [ERROR] agent.http: Request error: method=GET url=/v1/agent/service/_nomad-task-83f0ace8-c38a-d642-b902-fac10a19e252-group-postgres-landscape-postgres-postgres-sidecar-proxy from=[::1]:34052 error="unknown service ID: _nomad-task-83f0ace8-c38a-d642-b902-fac10a19e252-group-postgres-landscape-postgres-postgres-sidecar-proxy"
May  2 14:08:00 nomadrunner2 consul[3179549]: 2024-05-02T14:08:00.801Z [ERROR] agent.http: Request error: method=GET url=/v1/agent/service/_nomad-task-83f0ace8-c38a-d642-b902-fac10a19e252-group-postgres-landscape-postgres-postgres-sidecar-proxy from=[::1]:58406 error="unknown service ID: _nomad-task-83f0ace8-c38a-d642-b902-fac10a19e252-group-postgres-landscape-postgres-postgres-sidecar-proxy"
May  2 14:08:02 nomadrunner2 consul[3179549]: 2024-05-02T14:08:02.373Z [ERROR] agent.client: RPC failed to server: method=ACL.TokenRead server=10.138.0.8:7300 error="rpc error making call: token does not exist: ACL not found"
May  2 14:08:02 nomadrunner2 consul[3179549]: 2024-05-02T14:08:02.373Z [ERROR] agent.http: Request error: method=PUT url=/v1/agent/service/deregister/_nomad-task-7c5db155-9feb-ddb7-e4b8-fcc01f74d3e8-group-record_device_status-record-device-status- from=[::1]:60828 error="ACL not found"
May  2 14:08:17 nomadrunner2 consul[3179549]: 2024-05-02T14:08:17.113Z [ERROR] agent.http: Request error: method=GET url=/v1/agent/service/_nomad-task-83f0ace8-c38a-d642-b902-fac10a19e252-group-postgres-landscape-postgres-postgres-sidecar-proxy from=[::1]:34180 error="unknown service ID: _nomad-task-83f0ace8-c38a-d642-b902-fac10a19e252-group-postgres-landscape-postgres-postgres-sidecar-proxy"
May  2 14:08:32 nomadrunner2 consul[3179549]: 2024-05-02T14:08:32.465Z [ERROR] agent.client: RPC failed to server: method=ACL.TokenRead server=10.138.0.12:7300 error="rpc error making call: token does not exist: ACL not found"
May  2 14:08:32 nomadrunner2 consul[3179549]: 2024-05-02T14:08:32.465Z [ERROR] agent.http: Request error: method=PUT url=/v1/agent/service/deregister/_nomad-task-7c5db155-9feb-ddb7-e4b8-fcc01f74d3e8-group-record_device_status-record-device-status- from=[::1]:60828 error="ACL not found"
May  2 14:08:44 nomadrunner2 nomad[372726]:     2024-05-02T14:08:44.982Z [INFO]  client.alloc_runner.task_runner: Task event: alloc_id=83f0ace8-c38a-d642-b902-fac10a19e252 task=connect-proxy-landscape-postgres type="Task hook failed" msg="envoy_bootstrap: error creating bootstrap configuration for Connect proxy sidecar: exit status 1; see: <https://developer.hashicorp.com/nomad/s/envoy-bootstrap-error>" failed=false
May  2 14:08:45 nomadrunner2 nomad[372726]:     2024-05-02T14:08:45.133Z [ERROR] client.alloc_runner.task_runner: prestart failed: alloc_id=83f0ace8-c38a-d642-b902-fac10a19e252 task=connect-proxy-landscape-postgres error="prestart hook \"envoy_bootstrap\" failed: error creating bootstrap configuration for Connect proxy sidecar: exit status 1; see: <https://developer.hashicorp.com/nomad/s/envoy-bootstrap-error>"
May  2 14:08:45 nomadrunner2 nomad[372726]:     2024-05-02T14:08:45.133Z [INFO]  client.alloc_runner.task_runner: not restarting task: alloc_id=83f0ace8-c38a-d642-b902-fac10a19e252 task=connect-proxy-landscape-postgres reason="Exceeded allowed attempts 2 in interval 30m0s and mode is \"fail\""
May  2 14:08:45 nomadrunner2 nomad[372726]:     2024-05-02T14:08:45.133Z [INFO]  client.alloc_runner.task_runner: Task event: alloc_id=83f0ace8-c38a-d642-b902-fac10a19e252 task=connect-proxy-landscape-postgres type="Not Restarting" msg="Exceeded allowed attempts 2 in interval 30m0s and mode is \"fail\"" failed=true
@colinbruner

I've observed this as well.. more information in support case #147547

@louievandyke louievandyke added the hcc/cst Admin - internal label May 10, 2024
@ryanhulet
Author

I've observed this as well.. more information in support case #147547

got a link?

@colinbruner

I've observed this as well.. more information in support case #147547

got a link?

Sorry, I reported this to HashiCorp's Enterprise support portal, behind auth. They are now tracking it internally, but public issues help raise awareness... and might provide some restored sanity for anyone else experiencing this issue as well.

@t-davies
Contributor

++ also seeing this fairly repeatedly, unsure if the same cause.
See support case #149941.

@ryanhulet @colinbruner - does restarting Consul fix this for you (temporarily at least)?

@ryanhulet
Author

ryanhulet commented May 28, 2024

Sometimes.

I finally gave up and destroyed my entire cluster and brought up a new one on 1.6.1

@t-davies
Contributor

t-davies commented May 28, 2024

Possibly related to #20185?

I wonder if the Consul sync is getting stuck because it doesn't have tokens to deregister services - exiting early - and therefore never registering newly created services, meaning that Envoy can't bootstrap?

[root@xx logs]# cat envoy_bootstrap.stderr.0
failed fetch proxy config from local agent: Unexpected response code: 404 (unknown service ID: default/xx-xxx/_nomad-task-29d6f08e-ee4a-d2e0-f6e2-39d32be26e68-group-default-xxx-000-sidecar-proxy)

Not sure why it would have gotten worse in newer releases, though. We also see it much more often recently (on 1.7.5 currently), but that could just be because we run more deploys now.

@ryanhulet
Author

As far as I can tell it is 100% related to workload identities; 1.6.1 doesn't use them, and I haven't had that error a single time since downgrading.

@colinbruner

Possibly related to #20185?

This is interesting... the Consul acl stanza settings the quoted issue lists in step #1 do match my Consul ACL configuration:

acl = {
  enabled                  = true
  default_policy           = "deny"
  enable_token_persistence = true
}

I wonder if setting enable_token_persistence to false would mitigate this at all...
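
(For reference, the mitigation discussed in the comments below amounts to flipping that one flag. A minimal sketch, assuming the same agent settings quoted above:)

# Sketch only: same acl settings as above, with token persistence disabled.
acl {
  enabled                  = true
  default_policy           = "deny"
  enable_token_persistence = false
}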

@t-davies
Contributor

t-davies commented May 29, 2024

Yeah, I'm also wondering if disabling persistence will fix it; that was also suggested by support on my ticket.

We're (obviously) not going in and manually deleting the ACL tokens like the original issue repro, but if it's getting stuck with some old expired token because it's been persisted, then maybe we could end up with similar behaviour. In our case we are reloading Consul once per day (with a SIGHUP) when we rotate its agent token - which may also be a source for this; not sure if you're doing similar?

@colinbruner

In our case we are reloading Consul once per day (with a SIGHUP) when we rotate its agent token - which may also be a source for this; not sure if you're doing similar?

Similar, but less frequent. We have consul-template generating a Consul ACL token from Vault's Consul secrets engine... this is regenerated and a SIGHUP is issued at the half-life, which I believe to be ~16 days.

However, I've seen this on Nomad clients that have only been around for a few hours... so it's not necessarily correlated with the token being rolled.

Yeah, I'm also wondering if disabling persistence will fix it; that was also suggested by support on my ticket.

I'm going to do this in a non-prod environment today, and will report back if I'm still seeing errors after the change is made.

@colinbruner

@ryanhulet @t-davies upon the advice of HashiCorp support, I set enable_token_persistence to false and upgraded to Nomad 1.8.0 in a non-production environment, and I have not seen these bootstrap errors so far.

Will continue to monitor and update if this is not a full mitigation of the error.

@t-davies
Contributor

t-davies commented Jun 4, 2024

Thanks, yeah we are in a similar situation with enable_token_persistence off and running 1.8.0 - it hasn't recurred since either.

@sorenisanerd
Contributor

I've been seeing this problem in a variety of forms for a while now, possibly since I started using workload identities.

Many times, I see entries like this in my logs:

Jun 05 17:13:40 direct-newt consul[2616748]: 2024-06-05T17:13:40.746Z [ERROR] agent.http: Request error: method=PUT url=/v1/agent/service/deregister/_nomad-task-003d7834-ed06-fb31-9052-23774fb0ce3c-group-db-demoapp1--db-db-sidecar-proxy from=127.0.0.1:37742 error="ACL not found"

So it looks like Nomad is failing to bootstrap Envoy because there's something else that's failing to get deregistered. Why is the token not found?

I have an AWFUL Nomad job that scans through Consul's logs for those errors and uses a management token to do the deregistration. Here's a slimmed-down version of it:

job "one-off" {
  type      = "sysbatch"
  group "cmdgroup" {
    task "cmd" {
      driver = "raw_exec"
      config {
        command = "/usr/bin/bash"
        args = ["-c",
          <<-EOA
            journalctl -u consul -n 5000 | grep ACL.not.found | grep deregister | cut -f3 -d= | cut -f1 -d' ' | sort -u | while read x
            do
              echo "Overriding $x request with management token"
              curl -X PUT -v -H "X-Consul-Token: 12345678-abcd-1234-dcba-1234567890ab" http://127.0.0.1:8500$x
              rv=1
            done
            EOA
        ]
      }
    }
  }
}

enable_token_persistence should only affect the Consul agent's own tokens, shouldn't it? I.e. the default, agent, replication, config_file_service_registration, and possibly recovery tokens.
a) If token persistence is disabled, do you inject the tokens whenever the agent starts, or do you keep them in your config files?
b) What ACLs are assigned to these tokens? Might I be missing something there?

@t-davies
Contributor

t-davies commented Jun 9, 2024

You're quite right, enable_token_persistence should only affect the Consul agent's own tokens. Nomad should be using the WLI token to register/deregister the service as far as I can tell. In our case, because we're rotating the Consul agent token pretty frequently, the hypothesis was that something is happening within Consul (possibly even unrelated to Nomad) that is then causing the sync loop within Nomad to get stuck - but it's quite possible that enable_token_persistence being set is completely unrelated.

In our case we use Vault agent to rotate the token and store it in the config file, so we don't really need token persistence anyway since Consul will just read it from the file if it restarts.
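
(For anyone wiring up the same pattern, a minimal sketch of that rotation as a consul-template / Vault Agent template stanza follows. The Vault Consul secrets engine mount "consul/" and role name "consul-agent" are assumptions for illustration, not the setup described above.)

# Sketch only: render the Consul agent token from Vault's Consul secrets
# engine into a config file and reload Consul when it rotates. The mount
# path and role name are hypothetical.
template {
  contents = <<-EOT
    acl {
      tokens {
        agent = "{{ with secret "consul/creds/consul-agent" }}{{ .Data.token }}{{ end }}"
      }
    }
  EOT
  destination = "/etc/consul.d/agent-token.hcl"
  command     = "consul reload"
}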

@tgross
Member

tgross commented Jun 17, 2024

Hey folks, just wanted to post a heads-up that this is on our radar. I've got a strong suspicion that the root cause is the same as what we think is the problem described here: #9307 (comment). That this is showing up for folks when they switch to WI might be related to the timing involved in the two workflows (the old workflow sends the request to the Nomad server, so by the time the request comes back to the Nomad client, more time has passed than in the WI workflow).

In any case, I'm working on that fix and I'll see if I can reproduce the specific WI workflow issues y'all are seeing here with and without that fix to see if that resolves this issue as well.

@tgross tgross self-assigned this Jun 17, 2024
tgross added a commit that referenced this issue Jun 20, 2024
Nomad creates a Consul ACL token for each service for registering it in Consul
or bootstrapping the Envoy proxy (for service mesh workloads). Nomad always
talks to the local Consul agent and never directly to the Consul servers. But
the local Consul agent talks to the Consul servers in stale consistency mode to
reduce load on the servers. This can result in the Nomad client making the Envoy
bootstrap request with a token that has not yet replicated to the follower that
the local client is connected to. This request gets a 404 on the ACL token and
that negative entry gets cached, preventing any retries from succeeding.

To work around this, we'll use a method described by our friends over on
`consul-k8s` where after creating the service token we try to read the token
from the local agent in stale consistency mode (which prevents a failed read
from being cached). This cannot completely eliminate this source of error
because it's possible that Consul cluster replication is unhealthy at the time
we need it, but this should make Envoy bootstrap significantly more robust.

In this changeset, we add the preflight check after we log in via Workload
Identity and in the function we use to derive tokens in the legacy
workflow. We've added the timeouts to be configurable via node metadata rather
than the usual static configuration because for most cases, users should not
need to touch or even know these values are configurable; the configuration is
mostly available for testing.

Fixes: #9307
Fixes: #20516
Fixes: #10451

Ref: hashicorp/consul-k8s#887
Ref: https://hashicorp.atlassian.net/browse/NET-10051
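
(As a concrete illustration of the "node metadata" knob mentioned above: node metadata is set in the Nomad client configuration as sketched below. The key name shown is hypothetical; the real key is defined by the fix in #23381, and most users should never need to set it.)

# Sketch only: setting node metadata in a Nomad client config. The key below
# is hypothetical and normally does not need to be set at all.
client {
  meta {
    "consul.token_preflight_check.timeout" = "10s" # hypothetical key name
  }
}
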
tgross added a commit that referenced this issue Jun 21, 2024
@tgross tgross moved this to In Progress in Nomad - Community Issues Triage Jun 24, 2024
tgross added a commit that referenced this issue Jun 26, 2024
@tgross tgross closed this as completed in df67e74 Jun 27, 2024
@github-project-automation github-project-automation bot moved this from In Progress to Done in Nomad - Community Issues Triage Jun 27, 2024
@tgross
Member

tgross commented Jun 27, 2024

I've just merged #23381, which should hopefully close out this issue. That's planned for release in Nomad 1.8.2 (with backports to Nomad Enterprise 1.7.x and 1.6.x). Once you've deployed that, if you're still running into issues with Envoy bootstrap, please report a new issue after going through Nomad's troubleshooting guides, Service Mesh troubleshooting and Resolving Common Errors in Envoy Proxy. That'll help us separate out additional problems above and beyond the ones we've identified here.

@tgross tgross added this to the 1.8.2 milestone Jun 27, 2024