
Troubleshooting blocked evaluation #19827

Closed
suikast42 opened this issue Jan 26, 2024 · 14 comments · Fixed by #19933

Comments

@suikast42
Contributor

I had a similar issue in the past but don't understand why my evaluation is blocked.

See #19446

Now I can reproduce the issue.

I have deployed an MSSQL DB with a static port mapping.
Then I accidentally tried to deploy a second job with the same static port mapping, with only one worker node available.

It's not a bug that Nomad denies the allocation. But the information about why the allocation is blocked is not listed anywhere.


{
  "priority": 50,
  "type": "service",
  "triggeredBy": "job-register",
  "status": "complete",
  "statusDescription": null,
  "failedTGAllocs": [
    {
      "Name": "debezium_server_assan",
      "CoalescedFailures": 0,
      "NodesEvaluated": 1,
      "NodesExhausted": 0,
      "NodesAvailable": {
        "nomadder1": 1
      },
      "ClassFiltered": null,
      "ConstraintFiltered": null,
      "ClassExhausted": null,
      "DimensionExhausted": null,
      "QuotaExhausted": null,
      "Scores": null
    }
  ],
  "previousEval": null,
  "nextEval": null,
  "blockedEval": "4e0cfef8-208a-16c3-f648-f0bdb775b5f9",
  "modifyIndex": 43323,
  "modifyTime": "2024-01-26T08:29:33.664Z",
  "createIndex": 43320,
  "createTime": "2024-01-26T08:29:33.653Z",
  "waitUntil": null,
  "namespace": "default",
  "plainJobId": "assan_cdc",
  "relatedEvals": [
    "4e0cfef8-208a-16c3-f648-f0bdb775b5f9"
  ],
  "job": "[\"assan_cdc\",\"default\"]",
  "node": null
}
{
  "priority": 50,
  "type": "service",
  "triggeredBy": "queued-allocs",
  "status": "blocked",
  "statusDescription": "created to place remaining allocations",
  "failedTGAllocs": [
    {
      "Name": "debezium_server_assan",
      "CoalescedFailures": 0,
      "NodesEvaluated": 1,
      "NodesExhausted": 0,
      "NodesAvailable": {
        "nomadder1": 1
      },
      "ClassFiltered": null,
      "ConstraintFiltered": null,
      "ClassExhausted": null,
      "DimensionExhausted": null,
      "QuotaExhausted": null,
      "Scores": null
    }
  ],
  "previousEval": "2ba2855b-629d-e743-eff3-71fb17b479b4",
  "nextEval": null,
  "blockedEval": null,
  "modifyIndex": 43321,
  "modifyTime": "2024-01-26T08:29:33.657Z",
  "createIndex": 43321,
  "createTime": "2024-01-26T08:29:33.657Z",
  "waitUntil": null,
  "namespace": "default",
  "plainJobId": "assan_cdc",
  "relatedEvals": [
    "2ba2855b-629d-e743-eff3-71fb17b479b4"
  ],
  "job": "[\"assan_cdc\",\"default\"]",
  "node": null
}

nomad job status

ID            = assan_cdc
Name          = assan_cdc
Submit Date   = 2024-01-26T08:29:33Z
Type          = service
Priority      = 50
Datacenters   = *
Namespace     = default
Node Pool     = default
Status        = pending
Periodic      = false
Parameterized = false

Summary
Task Group             Queued  Starting  Running  Failed  Complete  Lost  Unknown
debezium_server_assan  1       0         0        0       0         0     0

Placement Failure
Task Group "debezium_server_assan":


Latest Deployment
ID          = 292a527f
Status      = running
Description = Deployment is running

Deployed
Task Group             Desired  Placed  Healthy  Unhealthy  Progress Deadline
debezium_server_assan  1        0       0        0          N/A

Allocations
No allocations placed

deployment status 292a527f

ID          = 292a527f
Job ID      = assan_cdc
Job Version = 0
Status      = running
Description = Deployment is running

Deployed
Task Group             Desired  Placed  Healthy  Unhealthy  Progress Deadline
debezium_server_assan  1        0       0        0          N/A

Information like 'not enough cpu/mem' or 'port conflict and no more nodes available' would be very handy for troubleshooting.
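
A minimal sketch of where that detail would normally surface, using the eval ID from the relatedEvals field above (assumes the Nomad CLI environment is already pointed at this cluster):

# Inspect the evaluation that attempted the placement; FailedTGAllocs normally
# carries the exhausted dimensions (e.g. a reserved port collision).
nomad eval status -json 2ba2855b-629d-e743-eff3-71fb17b479b4

# The same data via the CLI's HTTP API wrapper:
nomad operator api /v1/evaluation/2ba2855b-629d-e743-eff3-71fb17b479b4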

@lgfa29
Contributor

lgfa29 commented Feb 6, 2024

Hi @suikast42 👋

Which version of Nomad are you running? I just tested on Nomad 1.7.3 and I do get the expected results on port collision:

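For reference, the expected port-collision placement failure looks roughly like this (illustrative only; task group name and exact wording vary by job and Nomad version):

Placement Failure
Task Group "example" (failed to place 1 allocation):
  * Resources exhausted on 1 nodes
  * Dimension "network: reserved port collision web=8080" exhausted on 1 nodes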

@suikast42
Contributor Author

This is strange:

nomad --version
Nomad v1.7.3
BuildDate 2024-01-15T16:55:40Z
Revision 60ee328

@suikast42
Contributor Author

OK, I tried it with a simple deployment:

job "whoami" {

  group "whoami" {
    count = 1

    network {
      mode = "bridge"
      port "web" {
        to=8080
        static = 8080
      }
    }

    service {
      name = "${NOMAD_NAMESPACE}-${NOMAD_GROUP_NAME}"
      port = "web"

      tags = [
        "traefik.enable=true",
        "traefik.http.routers.${NOMAD_GROUP_NAME}-${NOMAD_ALLOC_ID}.rule=Host(`${NOMAD_NAMESPACE}.${NOMAD_GROUP_NAME}.cloud.private`)",
        "traefik.http.routers.${NOMAD_GROUP_NAME}-${NOMAD_ALLOC_ID}.tls=true",
      ]

      check {
        type     = "http"
        path     = "/health"
        port     = "web"
        interval = "10s"
        timeout  = "2s"
      }
    }

    task "whoami" {
      driver = "docker"
#      driver = "containerd-driver"
      config {
        image = "traefik/whoami"
        ports = ["web"]
        args  = ["--port", "${NOMAD_PORT_web}"]
      }

      resources {
        cpu    = 100
        memory = 128
      }
    }
  }
}

The second time I deployed the same job under the name whoami2 and left the rest of the definition the same.
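
A reproduction sketch, assuming the spec above is saved as whoami.nomad.hcl (the filename is only illustrative):

# Run the job once, then submit it again under a different name so both
# groups request the same static port 8080 on the single worker node:
nomad job run whoami.nomad.hcl
sed 's/job "whoami"/job "whoami2"/' whoami.nomad.hcl | nomad job run -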

The result

 nomad job status whoami2
ID            = whoami2
Name          = whoami2
Submit Date   = 2024-02-08T09:28:05Z
Type          = service
Priority      = 50
Datacenters   = *
Namespace     = default
Node Pool     = default
Status        = pending
Periodic      = false
Parameterized = false

Summary
Task Group  Queued  Starting  Running  Failed  Complete  Lost  Unknown
whoami      1       0         0        0       0         0     0

Placement Failure
Task Group "whoami":


Latest Deployment
ID          = 70630165
Status      = running
Description = Deployment is running

Deployed
Task Group  Desired  Placed  Healthy  Unhealthy  Progress Deadline
whoami      1        0       0        0          N/A

Allocations
No allocations placed
 nomad eval  list
ID        Priority  Triggered By        Job ID         Namespace  Node ID   Status    Placement Failures
96fd4b1b  50        queued-allocs       whoami2        default    <none>    blocked   N/A - In Progress
ca193e9c  50        job-register        whoami2        default    <none>    complete  true
nomad eval status 96fd4b1b
ID                 = 96fd4b1b
Create Time        = 4m45s ago
Modify Time        = 4m45s ago
Status             = blocked
Status Description = created to place remaining allocations
Type               = service
TriggeredBy        = queued-allocs
Job ID             = whoami2
Namespace          = default
Priority           = 50
Placement Failures = N/A - In Progress

Failed Placements
Task Group "whoami" (failed to place 1 allocation):
nomad eval status ca193e9c
ID                 = ca193e9c
Create Time        = 5m59s ago
Modify Time        = 5m59s ago
Status             = complete
Status Description = complete
Type               = service
TriggeredBy        = job-register
Job ID             = whoami2
Namespace          = default
Priority           = 50
Placement Failures = true

Failed Placements
Task Group "whoami" (failed to place 1 allocation):


Evaluation "96fd4b1b" waiting for additional capacity to place remainder
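
To double-check that the static port really is the blocker, a sketch (the alloc ID placeholder has to be taken from the first command's output):

# The allocation of the first job should show the reserved static port:
nomad job status whoami
nomad alloc status <alloc-id-of-whoami>

# Or directly on the worker node, see what is already listening on 8080:
ss -tlnp | grep ':8080'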

@lgfa29
Contributor

lgfa29 commented Feb 8, 2024

Hum...sorry I still can't reproduce the problem 🤔

How many clients do you have? Could you share the full output of when you run nomad job run for the second time?

@suikast42
Contributor Author

I have one worker and one master.

2024-02-09 12:11:08.264	
[nomad.service 💻 master-01] [🐞] []  nomad.job.service_sched.binpack: preemption not possible : eval_id=25b134d9-d664-2128-a0d9-6bf682a66539 job_id=whoami2 namespace=default network_resource="&{bridge     0 <nil> [{web 42000 42000 default}] []}"     



2024-02-09 12:11:08.264	
[nomad.service 💻 master-01] [🐞] []  nomad.job.service_sched: failed to place all allocations, blocked eval created: eval_id=25b134d9-d664-2128-a0d9-6bf682a66539 job_id=whoami2 namespace=default blocked_eval_id=fb84782d-35c5-d3e6-2fb3-85404e7c1a98     
2024-02-09 12:11:08.264	
[nomad.service 💻 master-01] [🐞] []  nomad.job.service_sched: reconciled current state with desired state: eval_id=25b134d9-d664-2128-a0d9-6bf682a66539 job_id=whoami2 namespace=default     
2024-02-09 12:11:08.264	
[nomad.service 💻 master-01] [🐞] []  nomad.job.service_sched: setting eval status: eval_id=25b134d9-d664-2128-a0d9-6bf682a66539 job_id=whoami2 namespace=default status=complete     
2024-02-09 12:11:08.264	
[nomad.service 💻 master-01] [✅] []    | Desired Changes for "whoami2": (place 1) (inplace 0) (destructive 0) (stop 0) (migrate 0) (ignore 0) (canary 0)     
2024-02-09 12:11:08.265	
[nomad.service 💻 master-01] [🐞] []  http: request complete: method=POST path=/v1/job/whoami2/plan duration=2.186051ms     
2024-02-09 12:11:12.063	
[nomad.service 💻 master-01] [🐞] []  worker.service_sched.binpack: preemption not possible : eval_id=c4362777-c47e-a814-3dd3-9031a69144d8 job_id=whoami2 namespace=default worker_id=3ba79f80-9ba2-cdfe-ba09-9ce8fc0955e1 network_resource="&{bridge     0 <nil> [{web 42000 42000 default}] []}"     
2024-02-09 12:11:12.063	
[nomad.service 💻 master-01] [🐞] []  worker.service_sched: reconciled current state with desired state: eval_id=c4362777-c47e-a814-3dd3-9031a69144d8 job_id=whoami2 namespace=default worker_id=3ba79f80-9ba2-cdfe-ba09-9ce8fc0955e1     
2024-02-09 12:11:12.063	
[nomad.service 💻 master-01] [🐞] []  worker: dequeued evaluation: worker_id=3ba79f80-9ba2-cdfe-ba09-9ce8fc0955e1 eval_id=c4362777-c47e-a814-3dd3-9031a69144d8 type=service namespace=default job_id=whoami2 node_id="" triggered_by=job-register     
2024-02-09 12:11:12.064	
[nomad.service 💻 master-01] [✅] []    | Desired Changes for "whoami2": (place 1) (inplace 0) (destructive 0) (stop 0) (migrate 0) (ignore 0) (canary 0)     
2024-02-09 12:11:12.068	
[nomad.service 💻 master-01] [🐞] []  worker.service_sched: failed to place all allocations, blocked eval created: eval_id=c4362777-c47e-a814-3dd3-9031a69144d8 job_id=whoami2 namespace=default worker_id=3ba79f80-9ba2-cdfe-ba09-9ce8fc0955e1 blocked_eval_id=84b2b984-606f-5e1b-7ac9-0cfc0a88debe     
2024-02-09 12:11:12.068	
[nomad.service 💻 master-01] [🐞] []  worker: created evaluation: worker_id=3ba79f80-9ba2-cdfe-ba09-9ce8fc0955e1 eval="<Eval \"84b2b984-606f-5e1b-7ac9-0cfc0a88debe\" JobID: \"whoami2\" Namespace: \"default\">" waitUntil="\"0001-01-01 00:00:00 +0000 UTC\""     
2024-02-09 12:11:12.073	
[nomad.service 💻 master-01] [🐞] []  worker.service_sched: setting eval status: eval_id=c4362777-c47e-a814-3dd3-9031a69144d8 job_id=whoami2 namespace=default worker_id=3ba79f80-9ba2-cdfe-ba09-9ce8fc0955e1 status=complete     
2024-02-09 12:11:12.078	
[nomad.service 💻 master-01] [🐞] []  worker: ack evaluation: worker_id=3ba79f80-9ba2-cdfe-ba09-9ce8fc0955e1 eval_id=c4362777-c47e-a814-3dd3-9031a69144d8 type=service namespace=default job_id=whoami2 node_id="" triggered_by=job-register     
2024-02-09 12:11:12.078	
[nomad.service 💻 master-01] [🐞] []  worker: updated evaluation: worker_id=3ba79f80-9ba2-cdfe-ba09-9ce8fc0955e1 eval="<Eval \"c4362777-c47e-a814-3dd3-9031a69144d8\" JobID: \"whoami2\" Namespace: \"default\">"     
2024-02-09 12:11:12.090	
[nomad.service 💻 master-01] [🐞] []  http: request complete: method=GET path=/v1/job/whoami2 duration="475.382µs"     
2024-02-09 12:11:12.106	
[nomad.service 💻 master-01] [🐞] []  http: request complete: method=GET path=/v1/job/whoami2/allocations duration="289.615µs"     
2024-02-09 12:11:12.110	
[nomad.service 💻 master-01] [🐞] []  http: request complete: method=GET path=/v1/job/whoami2/evaluations duration="300.334µs"     
2024-02-09 12:11:12.187	
[nomad.service 💻 master-01] [🐞] []  http: request complete: method=GET path=/v1/job/whoami2/deployment?index=1 duration="315.753µs"     
2024-02-09 12:11:12.188	
[nomad.service 💻 master-01] [🐞] []  http: request complete: method=GET path=/v1/job/whoami2/summary?index=1 duration="337.219µs"     
2024-02-09 12:11:12.190	
[nomad.service 💻 master-01] [🐞] []  http: request complete: method=GET path=/v1/job/whoami2/deployment duration="420.686µs"     
2024-02-09 12:11:12.190	
[nomad.service 💻 master-01] [🐞] []  http: request complete: method=GET path="/v1/vars?prefix=nomad%2Fjobs%2Fwhoami2" duration="310.217µs"     
2024-02-09 12:11:12.197	
[nomad.service 💻 master-01] [🐞] []  http: request complete: method=GET path=/v1/job/whoami2/deployment duration="332.207µs"     
2024-02-09 12:11:12.207	
[nomad.service 💻 master-01] [🐞] []  http: request complete: method=GET path=/v1/job/whoami2 duration="364.175µs"     
2024-02-09 12:11:14.125	
[nomad.service 💻 master-01] [🐞] []  http: request complete: method=GET path=/v1/job/whoami2/deployment?index=58495 duration="296.495µs"     
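
A sketch for pulling just the relevant scheduler lines out of journald (assuming Nomad runs as the systemd unit nomad.service on the master):

journalctl -u nomad.service --since "1 hour ago" | grep -E 'preemption not possible|blocked eval created'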

@suikast42
Contributor Author

I tried it with both bridge and host network mode. Same result in both cases.

@suikast42
Contributor Author

suikast42 commented Feb 9, 2024

My Nomad and Consul configs. Maybe that helps?

Consul server

datacenter = "nomadder1"
data_dir =  "/opt/services/core/consul/data"
log_level = "INFO"
node_name = "master-01"
server = true
bind_addr = "0.0.0.0"
advertise_addr = "172.42.1.10"
client_addr = "0.0.0.0"
encrypt = "G1CHAD7wwu0tU28BlKkirSahTJ/Tqpo9ClOAycQAUwE="
server_rejoin_age_max = "8640h"
# https://developer.hashicorp.com/consul/docs/connect/observability/ui-visualization
ui_config{
   enabled = true
   dashboard_url_templates {
       service = "https://grafana.cloud.private/d/lDlaj-NGz/service-overview?orgId=1&var-service={{Service.Name}}&var-namespace={{Service.Namespace}}&var-partition={{Service.Partition}}&var-dc={{Datacenter}}"
   }
   metrics_provider = "prometheus"
   metrics_proxy {
     base_url = "http://mimir.service.consul:9009/prometheus"

     add_headers = [
 #      {
 #         name = "Authorization"
 #         value = "Bearer <token>"
 #      }
       {
          name = "X-Scope-OrgID"
          value = "1"
       }
     ]
     path_allowlist = ["/prometheus/api/v1/query_range", "/prometheus/api/v1/query"]
   }
}
addresses {
  #  grpc = "127.0.0.1"
    grpc_tls = "127.0.0.1"
}
ports {
    http = -1
    https = 8501
   # grpc = 8502
    grpc_tls = 8503
}
connect {
     enabled = true
}
retry_join =  ["172.42.1.10"]

bootstrap_expect = 1

auto_encrypt{
    allow_tls = true
}
performance{
    raft_multiplier = 1
}

node_meta{
  node_type = "server"
}
tls{
    defaults {
        ca_file = "/usr/local/share/ca-certificates/cloudlocal/cluster-ca-bundle.pem"
        cert_file = "/etc/opt/certs/consul/consul.pem"
        key_file = "/etc/opt/certs/consul/consul-key.pem"
        verify_incoming = true
        verify_outgoing = true
    }
    internal_rpc {
        verify_server_hostname = true
    }
}
#watches = [
#  {
#    type = "checks"
#    handler = "/usr/bin/health-check-handler.sh"
#  }
#]

telemetry {
  disable_hostname = true
  prometheus_retention_time = "72h"
}

nomad server

log_level = "DEBUG"
name = "master-01"
datacenter = "nomadder1"
data_dir =  "/opt/services/core/nomad/data"

#You should only set this value to true on server agents
#if the terminated server will never join the cluster again
#leave_on_interrupt= false

#You should only set this value to true on server agents
#if the terminated server will never join the cluster again
#leave_on_terminate = false

server {
  enabled = true
  job_max_priority = 100 # 100 is the default
  job_default_priority = 50 # 50 is the default
  bootstrap_expect =  1
  encrypt = "4PRfoE6Mj9dHTLpnzmYD1+THdlyAo2Ji4U6ewMumpAw="
  rejoin_after_leave = true
  server_join {
    retry_join =  ["172.42.1.10"]
    retry_max = 0
    retry_interval = "15s"
  }
}

bind_addr = "0.0.0.0" # the default
advertise {
  # Defaults to the first private IP address.
  http = "172.42.1.10"
  rpc  = "172.42.1.10"
  serf = "172.42.1.10"
}

tls {
  http = true
  rpc  = true

  ca_file   = "/usr/local/share/ca-certificates/cloudlocal/cluster-ca-bundle.pem"
  cert_file = "/etc/opt/certs/nomad/nomad.pem"
  key_file  = "/etc/opt/certs/nomad/nomad-key.pem"

  verify_server_hostname = true
  verify_https_client    = true
}



ui {
  enabled =  true
  label {
   text =  "💙💛 Fenerbaçhe 1907 💛💙"
   background_color = "#163962"
   text_color = "##ffed00"
  }
  consul {
    ui_url = "https://consul.cloud.private"
  }

  vault {
    ui_url = "https://vault.cloud.private"
  }
}

consul{
 ssl= true
 address = "127.0.0.1:8501"
 grpc_address = "127.0.0.1:8503"
 # this works only with ACL enabled
 allow_unauthenticated= true
 ca_file   = "/usr/local/share/ca-certificates/cloudlocal/cluster-ca-bundle.pem"
 grpc_ca_file   = "/usr/local/share/ca-certificates/cloudlocal/cluster-ca-bundle.pem"
 cert_file = "/etc/opt/certs/consul/consul.pem"
 key_file  = "/etc/opt/certs/consul/consul-key.pem"
}

telemetry {
  collection_interval = "1s"
  disable_hostname = true
  prometheus_metrics = true
  publish_allocation_metrics = true
  publish_node_metrics = true
}

consul agent

datacenter = "nomadder1"
data_dir =  "/opt/services/core/consul/data"
log_level = "INFO"
node_name = "worker-01"
bind_addr = "0.0.0.0"
advertise_addr = "172.42.1.20"
client_addr = "0.0.0.0"
encrypt = "G1CHAD7wwu0tU28BlKkirSahTJ/Tqpo9ClOAycQAUwE="

addresses {
  #  grpc = "127.0.0.1"
    grpc_tls = "127.0.0.1"
}
ports {
    http = -1
    https = 8501
  #  grpc = 8502
    grpc_tls = 8503
}
connect {
     enabled = true
}
retry_join =  ["172.42.1.10"]

auto_encrypt{
    tls = true
}
performance{
    raft_multiplier = 1
}

node_meta{
  node_type = "worker"
}
tls{
    defaults {
        ca_file = "/usr/local/share/ca-certificates/cloudlocal/cluster-ca-bundle.pem"
        cert_file = "/etc/opt/certs/consul/consul.pem"
        key_file = "/etc/opt/certs/consul/consul-key.pem"
        verify_incoming = false
        verify_outgoing = true
    }
    internal_rpc {
        verify_server_hostname = true
    }
}
#watches = [
#  {
#    type = "checks"
#    handler = "/usr/bin/health-check-handler.sh"
#  }
#]

telemetry {
  disable_hostname = true
}

nomad agent

log_level = "DEBUG"
name = "worker-01"
datacenter = "nomadder1"
data_dir =  "/opt/services/core/nomad/data"
bind_addr = "0.0.0.0" # the default

leave_on_interrupt= true
#https://github.com/hashicorp/nomad/issues/17093
#systemctl kill -s SIGTERM nomad will suppress node drain if
#leave_on_terminate set to false
leave_on_terminate = true

advertise {
  # Defaults to the first private IP address.
  http = "172.42.1.20"
  rpc  = "172.42.1.20"
  serf = "172.42.1.20"
}
client {
  enabled = true
  network_interface = "eth1"
  meta {
    node_type= "worker"
    connect.log_level = "debug"
    connect.sidecar_image= "registry.cloud.private/envoyproxy/envoy:v1.29.0"
  }
  server_join {
    retry_join =  ["172.42.1.10"]
    retry_max = 0
    retry_interval = "15s"
  }
  # Either leave_on_interrupt or leave_on_terminate must be set
  # for this to take effect.
  drain_on_shutdown {
    deadline           = "2m"
    force              = false
    ignore_system_jobs = false
  }
  host_volume "ca_cert" {
    path      = "/usr/local/share/ca-certificates/cloudlocal"
    read_only = true
  }
  host_volume "cert_ingress" {
    path      = "/etc/opt/certs/ingress"
    read_only = true
  }
  ## Cert consul client
  ## Needed for consul_sd_configs
  ## Should be deleted after resolve https://github.com/suikast42/nomadder/issues/100
  host_volume "cert_consul" {
    path      = "/etc/opt/certs/consul"
    read_only = true
  }

  ## Cert consul client
  ## Needed for jenkins
  ## Should be deleted after resolve https://github.com/suikast42/nomadder/issues/100
  host_volume "cert_nomad" {
    path      = "/etc/opt/certs/nomad"
    read_only = true
  }

  ## Cert docker client
  ## Needed for jenkins
  ## Should be deleted after migrating to vault
  host_volume "cert_docker" {
    path      = "/etc/opt/certs/docker"
    read_only = true
  }

  host_network "public" {
    interface = "eth0"
    #cidr = "203.0.113.0/24"
    #reserved_ports = "22,80"
  }
  host_network "default" {
      interface = "eth1"
  }
  host_network "private" {
    interface = "eth1"
  }
  host_network "local" {
    interface = "lo"
  }

  reserved {
  # cpu (int: 0) - Specifies the amount of CPU to reserve, in MHz.
  # cores (int: 0) - Specifies the number of CPU cores to reserve.
  # memory (int: 0) - Specifies the amount of memory to reserve, in MB.
  # disk (int: 0) - Specifies the amount of disk to reserve, in MB.
  # reserved_ports (string: "") - Specifies a comma-separated list of ports to reserve on all fingerprinted network devices. Ranges can be specified by using a hyphen separating the two inclusive ends. See also host_network for reserving ports on specific host networks.
    cpu    = 1000
    memory = 2048
  }
  max_kill_timeout  = "1m"
}

tls {
  http = true
  rpc  = true

  ca_file   = "/usr/local/share/ca-certificates/cloudlocal/cluster-ca-bundle.pem"
  cert_file = "/etc/opt/certs/nomad/nomad.pem"
  key_file  = "/etc/opt/certs/nomad/nomad-key.pem"

  verify_server_hostname = true
  verify_https_client    = true
}

consul{
  ssl= true
  address = "127.0.0.1:8501"
  grpc_address = "127.0.0.1:8503"
  # this works only with ACL enabled
  allow_unauthenticated= true
  ca_file   = "/usr/local/share/ca-certificates/cloudlocal/cluster-ca-bundle.pem"
  grpc_ca_file   = "/usr/local/share/ca-certificates/cloudlocal/cluster-ca-bundle.pem"
  cert_file = "/etc/opt/certs/consul/consul.pem"
  key_file  = "/etc/opt/certs/consul/consul-key.pem"
}


telemetry {
  collection_interval = "1s"
  disable_hostname = true
  prometheus_metrics = true
  publish_allocation_metrics = true
  publish_node_metrics = true
}

plugin "docker" {
  config {
    allow_privileged = false
    disable_log_collection  = false
#    volumes {
#      enabled = true
#      selinuxlabel = "z"
#    }
    infra_image = "registry.cloud.private/google_containers/pause-amd64:3.2"
    infra_image_pull_timeout ="30m"
    extra_labels = ["job_name", "job_id", "task_group_name", "task_name", "namespace", "node_name", "node_id"]
    logging {
      type = "journald"
       config {
          labels-regex =".*"
       }
    }
    gc{
      container = true
      dangling_containers{
        enabled = true
      # period = "3m"
      # creation_grace = "5m"
      }
    }

  }
}

@lgfa29
Contributor

lgfa29 commented Feb 9, 2024

Thank you for the extra information @suikast42!

The server logs allowed me to find the problem. I believe you have service job preemption enabled, which triggered a different code path from the default configuration I was using. I opened #19933 to fix this issue.

To confirm that this is the case, could you share the output of the command nomad operator scheduler get-config?

@suikast42
Contributor Author

Interesting 👌

Here is the output. By the way, I updated to 1.7.4, but nothing changed, of course 😂

Scheduler Algorithm           = spread
Memory Oversubscription       = true
Reject Job Registration       = false
Pause Eval Broker             = false
Preemption System Scheduler   = true
Preemption Service Scheduler  = true
Preemption Batch Scheduler    = true
Preemption SysBatch Scheduler = true
Modify Index                  = 30913

@lgfa29
Contributor

lgfa29 commented Feb 9, 2024

Thanks! Yeah, Preemption Service Scheduler = true would trigger this. The fix will be available in the next Nomad release.

Thank you again for the report!

@suikast42
Contributor Author

I can confirm.

After setting nomad operator scheduler set-config -preempt-service-scheduler=false I see the details ;-)
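
For later reference, a sketch of reverting this once a release containing #19933 is installed (flag syntax follows the documented form):

# Re-enable preemption for service jobs after upgrading:
nomad operator scheduler set-config -preempt-service-scheduler=true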


@suikast42
Contributor Author

> Thanks! Yeah, Preemption Service Scheduler = true would trigger this. The fix will be available in the next Nomad release.
>
> Thank you again for the report!

Yes, I did this because I activated MemoryOversubscription, so I thought that would be more dynamic for my use case 😁
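
For context, a minimal sketch of what oversubscription adds in a job spec; the numbers are illustrative, and memory_max only takes effect once oversubscription is enabled cluster-wide:

resources {
  cpu        = 100
  memory     = 128   # reserved amount used for scheduling
  memory_max = 512   # hard cap the task may burst to when the node has headroom
}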

@lgfa29
Contributor

lgfa29 commented Feb 10, 2024

Oh yes, preemption is a very nice feature. But it triggers some different code paths that sometimes are not kept up-to-date 😬

But I'm glad we were able to get to the bottom of this. I was really confused why it wasn't happening to me 😅


github-actions bot commented Jan 1, 2025

I'm going to lock this issue because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active issues.
If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.
