Setting reserved_ports results in jobs being blocked #8421

Closed
evandam opened this issue Jul 10, 2020 · 7 comments
Labels: stage/needs-verification, theme/config, theme/core

Comments


evandam commented Jul 10, 2020

Nomad version

Nomad v0.11.3

Issue

Similar to #1046, we're seeing jobs stuck in a pending state when reserved_ports is set.

/etc/nomad/client.json
{
  "client": {
    "disable_remote_exec": false,
    "enabled": true,
    "max_kill_timeout": "30s",
    "reserved": {
      "reserved_ports": "0-19999,24224"
    },
    "meta": {
      "chef_role": "nomad-compute",
      "role": "nomad-compute"
    },
    "no_host_uuid": false,
    "node_class": "compute",
    "server_join": {
      "retry_join": [
        "provider=aws tag_key=role tag_value=nomad-server region=us-west-2 addr_type=private_v4"
      ]
    }
  }
}

Running a job results in it being stuck in a pending state, creating evals with errors like so:

ID                 = 2a3d639d-d030-7313-923f-829ad9b8e125
Create Time        = 2020-07-09T19:10:32-07:00
Modify Time        = 2020-07-09T19:10:32-07:00
Status             = blocked
Status Description = created due to placement conflicts
Type               = batch
TriggeredBy        = max-plan-attempts
Previous Eval      = e9674379-8362-cf7f-c004-d42eab81a88b
Priority           = 50
Placement Failures = N/A - In Progress
Previous Eval      = e9674379-8362-cf7f-c004-d42eab81a88b
Next Eval          = <none>
Blocked Eval       = <none>

ID                 = e9674379-8362-cf7f-c004-d42eab81a88b
Create Time        = 2020-07-09T19:09:32-07:00
Modify Time        = 2020-07-09T19:10:32-07:00
Status             = failed
Status Description = maximum attempts reached (2)
Type               = batch
TriggeredBy        = max-plan-attempts
Previous Eval      = 40c308c7-2caa-ee63-0e6e-ad6e7dea9b71
Priority           = 50
Placement Failures = false
Previous Eval      = 40c308c7-2caa-ee63-0e6e-ad6e7dea9b71
Next Eval          = <none>
Blocked Eval       = 2a3d639d-d030-7313-923f-829ad9b8e125

Here is the job HCL being used:

prod-platform-core-rake.hcl
job "prod-platform-core-rake" {
  meta {
    uuid      = "99de99de-4daa-42c6-a193-a3b192d2c112"
    image_tag = "prod-c98af5b"
  }

  region      = "us-west-2"
  datacenters = ["prod-usw2-prod1"]

  type = "batch"

  parameterized {
    payload       = "forbidden"
    meta_required = ["RAKE_TASK_NAME"]
  }

  constraint {
    attribute = "${meta.chef_role}"
    operator  = "="
    value     = "nomad-compute"
  }
  spread {
    attribute = "${node.datacenter}"
    weight    = 100
  }

  reschedule {
    attempts  = 0
    unlimited = false
  }

  group "app" {
    count = 1

    restart {
      attempts = 0
      interval = "10m"
      delay    = "15s"
      mode     = "fail"
    }

    ephemeral_disk {
      sticky  = false
      migrate = false
      size    = 300
    }

    task "prod-platform-core-rake" {
      driver = "docker"

      config {
        image = "quay.io/wonolo/platform-core:${NOMAD_META_image_tag}"

        entrypoint = ["bundle", "exec", "rake"]
        args       = ["${NOMAD_META_RAKE_TASK_NAME}"]

        force_pull  = true
        dns_servers = ["172.17.0.1"]

        logging {
          type = "syslog"

          config {
            syslog-address = "udp://127.0.0.1:5140"
            tag            = "{{.ID}}"
            syslog-format  = "rfc5424"
          }
        }
      }

      logs {
        max_files     = 10
        max_file_size = 15
      }

      resources {
        memory = 1000 # 1GB
      }
    }
  }
}

I'm not sure if this is related to #4605 since Nomad does not seem to pick up the network interfaces on our EC2 instances.
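If interface detection does turn out to matter here, one thing to try is pinning the interface explicitly in the client config. A minimal sketch in agent-config HCL, assuming the primary NIC is named ens5 (the real name varies by instance type, so check ip link on the host):

# Sketch only: force network fingerprinting onto a specific NIC.
# "ens5" is an assumed interface name, not taken from our config.
client {
  enabled           = true
  network_interface = "ens5"
}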

scalp42 (Contributor) commented Jul 31, 2020

Any chance someone can look into this? @shoenig 🙏

nickethier (Member) commented:

Hey @evandam @scalp42, I spent some time today trying to reproduce this with a few different combinations of reserved ports and service/batch jobs, and I'm unable to hit the described scenario. If you have a job file and config I can drop in and run to reproduce this, that would be great, though I know how hard and time consuming it can be to get to that.

Would you be able to share the full output of the evaluation from the API and/or the list of evals for the job? Thanks!
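For example, something like the following should pull that data, using the job name and an eval ID from the original report (substitute the real ones):

# Full list of evaluations for the job via the HTTP API
$ curl "$NOMAD_ADDR/v1/job/prod-platform-core-rake/evaluations"

# Full detail for a single evaluation via the CLI
$ nomad eval status -json 2a3d639d-d030-7313-923f-829ad9b8e125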

nickethier self-assigned this Aug 6, 2020
evandam (Author) commented Aug 18, 2020

Hi @nickethier

I just tried adding reserved_ports back in with Nomad 0.12.2, but we're still seeing evals failing and allocations stuck in "pending".

Here's the output of curl $NOMAD_ADDR/v1/job/prod-platform-sidekiq/evaluations and an HCL file that should be close enough to ours to be helpful: https://gist.github.com/evandam/4ec92bb619be59b6e27e2e81cb110c2b

There are a ton of evaluations in that output that I believe all hit the same error, but I'm dropping them all in just in case it helps.

Running nomad job run with reserved_ports set resulted in the following:

$ nomad job run prod/prod-platform-sidekiq.hcl
==> Monitoring evaluation "1877a872"
    Evaluation triggered by job "prod-platform-sidekiq"
    Evaluation within deployment: "ce4cb9f4"
    Allocation "29da0d4a" modified: node "66393720", group "queue-default"
    Evaluation status changed: "pending" -> "failed"
==> Evaluation "1877a872" finished with status "failed"

$ nomad eval status 1877a872
ID                 = 1877a872
Create Time        = 13s ago
Modify Time        = 11s ago
Status             = failed
Status Description = maximum attempts reached (5)
Type               = service
TriggeredBy        = job-register
Job ID             = prod-platform-sidekiq
Priority           = 50
Placement Failures = false

Removing the reserved_ports config from all of our nodes and redeploying was successful, though.
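For completeness, a trimmed sketch of that workaround in agent-config HCL form (the JSON file above behaves the same way once the reserved block is deleted):

client {
  enabled = true
  # Workaround: the reserved block is removed entirely; jobs place
  # normally again after the clients restart without it.
  # reserved {
  #   reserved_ports = "0-19999,24224"
  # }
}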


sfs77 commented Aug 28, 2020

Yeah, I encountered something similar on Nomad v0.11.1.

After setting reserved.reserved_ports = "31000-32000" and restarting Nomad, jobs can't be updated and stay blocked (queued).


sfs77 commented Aug 28, 2020

@evandam In my testing, if a currently running job already holds a port that falls inside reserved_ports, then setting reserved_ports and restarting the Nomad node causes job updates to be blocked. After stopping and starting (rescheduling) all the jobs that held ports in the reserved range, things work again.

@nickethier I haven't dug into the code; does that guess sound reasonable?
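A sketch of that stop-and-restart workaround, using the service job from the earlier comment (whether it helps depends on which jobs actually hold ports inside the new reserved range):

# Stop the job whose allocation holds a now-reserved port ...
$ nomad job stop prod-platform-sidekiq

# ... then register it again so a fresh port is picked outside reserved_ports
$ nomad job run prod/prod-platform-sidekiq.hcl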

evandam (Author) commented Aug 31, 2020

@Sea-Flying in my case, the job that was already running did not use a port in reserved_ports. We get a port dynamically between 20000 and 32000 as usual, and the only port we're reserving in that range is 24224.
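One possible experiment (not a confirmed fix) is to keep the reservation entirely below the dynamic range so nothing overlaps 20000-32000. A sketch in agent-config HCL:

client {
  enabled = true
  reserved {
    # Experiment only: drop 24224 so no reserved port overlaps the
    # 20000-32000 dynamic range.
    reserved_ports = "0-19999"
  }
}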

tgross (Member) commented Mar 9, 2023

We believe that this is closed by #16401, which will ship in Nomad 1.5.1 (with backports).
