Setting reserved_ports results in jobs being blocked #8421

Closed
evandam opened this issue Jul 10, 2020 · 7 comments
Labels: stage/needs-verification, theme/config, theme/core

Comments


evandam commented Jul 10, 2020

Nomad version

Nomad v0.11.3

Issue

Similar to #1046, we're seeing jobs stuck in a pending state when reserved_ports is set.

/etc/nomad/client.json
{
  "client": {
    "disable_remote_exec": false,
    "enabled": true,
    "max_kill_timeout": "30s",
    "reserved": {
      "reserved_ports": "0-19999,24224"
    },
    "meta": {
      "chef_role": "nomad-compute",
      "role": "nomad-compute"
    },
    "no_host_uuid": false,
    "node_class": "compute",
    "server_join": {
      "retry_join": [
        "provider=aws tag_key=role tag_value=nomad-server region=us-west-2 addr_type=private_v4"
      ]
    }
  }
}

Running a job results in it being stuck in a pending state, creating evals with errors like so:

ID                 = 2a3d639d-d030-7313-923f-829ad9b8e125
Create Time        = 2020-07-09T19:10:32-07:00
Modify Time        = 2020-07-09T19:10:32-07:00
Status             = blocked
Status Description = created due to placement conflicts
Type               = batch
TriggeredBy        = max-plan-attempts
Previous Eval      = e9674379-8362-cf7f-c004-d42eab81a88b
Priority           = 50
Placement Failures = N/A - In Progress
Previous Eval      = e9674379-8362-cf7f-c004-d42eab81a88b
Next Eval          = <none>
Blocked Eval       = <none>

ID                 = e9674379-8362-cf7f-c004-d42eab81a88b
Create Time        = 2020-07-09T19:09:32-07:00
Modify Time        = 2020-07-09T19:10:32-07:00
Status             = failed
Status Description = maximum attempts reached (2)
Type               = batch
TriggeredBy        = max-plan-attempts
Previous Eval      = 40c308c7-2caa-ee63-0e6e-ad6e7dea9b71
Priority           = 50
Placement Failures = false
Previous Eval      = 40c308c7-2caa-ee63-0e6e-ad6e7dea9b71
Next Eval          = <none>
Blocked Eval       = 2a3d639d-d030-7313-923f-829ad9b8e125

Here is the job HCL being used:

prod-platform-core-rake.hcl
job "prod-platform-core-rake" {
  meta {
    uuid      = "99de99de-4daa-42c6-a193-a3b192d2c112"
    image_tag = "prod-c98af5b"
  }

  region      = "us-west-2"
  datacenters = ["prod-usw2-prod1"]

  type = "batch"

  parameterized {
    payload       = "forbidden"
    meta_required = ["RAKE_TASK_NAME"]
  }

  constraint {
    attribute = "${meta.chef_role}"
    operator  = "="
    value     = "nomad-compute"
  }
  spread {
    attribute = "${node.datacenter}"
    weight    = 100
  }

  reschedule {
    attempts  = 0
    unlimited = false
  }

  group "app" {
    count = 1

    restart {
      attempts = 0
      interval = "10m"
      delay    = "15s"
      mode     = "fail"
    }

    ephemeral_disk {
      sticky  = false
      migrate = false
      size    = 300
    }

    task "prod-platform-core-rake" {
      driver = "docker"

      config {
        image = "quay.io/wonolo/platform-core:${NOMAD_META_image_tag}"

        entrypoint = ["bundle", "exec", "rake"]
        args       = ["${NOMAD_META_RAKE_TASK_NAME}"]

        force_pull  = true
        dns_servers = ["172.17.0.1"]

        logging {
          type = "syslog"

          config {
            syslog-address = "udp://127.0.0.1:5140"
            tag            = "{{.ID}}"
            syslog-format  = "rfc5424"
          }
        }
      }

      logs {
        max_files     = 10
        max_file_size = 15
      }

      resources {
        memory = 1000 # 1GB
      }
    }
  }
}

I'm not sure if this is related to #4605 since Nomad does not seem to pick up the network interfaces on our EC2 instances.
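If interface detection does turn out to matter here, one thing to try is pinning the interface explicitly in the client config. A minimal sketch in agent-config HCL, assuming the primary NIC is named ens5 (the real name varies by instance type, so check ip link on the host):

# Sketch only: force network fingerprinting onto a specific NIC.
# "ens5" is an assumed interface name, not taken from our config.
client {
  enabled           = true
  network_interface = "ens5"
}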

scalp42 (Contributor) commented Jul 31, 2020

Any chance someone can look into this? @shoenig 🙏

nickethier (Member) commented:

Hey @evandam @scalp42, I spent some time today trying to reproduce this with a few different combinations of reserved ports and service/batch jobs, and I'm unable to hit the described scenario. If you have a job file and config I can drop in and run to reproduce this, that would be great, though I know how hard and time consuming it can be to get to that.

Would you be able to share the full output of the evaluation from the API and/or the list of evals for the job? Thanks!
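For example, something like the following should pull that data, using the job name and an eval ID from the original report (substitute the real ones):

# Full list of evaluations for the job via the HTTP API
$ curl "$NOMAD_ADDR/v1/job/prod-platform-core-rake/evaluations"

# Full detail for a single evaluation via the CLI
$ nomad eval status -json 2a3d639d-d030-7313-923f-829ad9b8e125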

nickethier self-assigned this Aug 6, 2020
evandam (Author) commented Aug 18, 2020

Hi @nickethier

I just tried adding reserved_ports back in with Nomad 0.12.2, but we're still seeing evals failing and allocations stuck in "pending".

Here's the output of curl $NOMAD_ADDR/v1/job/prod-platform-sidekiq/evaluations and an HCL file that should be close enough to ours to be helpful: https://gist.github.com/evandam/4ec92bb619be59b6e27e2e81cb110c2b

There are a ton of evaluations in that output that I believe all hit the same error, but I'm dropping them all in just in case it helps.

Running nomad job run with reserved_ports set resulted in the following:

$ nomad job run prod/prod-platform-sidekiq.hcl
==> Monitoring evaluation "1877a872"
    Evaluation triggered by job "prod-platform-sidekiq"
    Evaluation within deployment: "ce4cb9f4"
    Allocation "29da0d4a" modified: node "66393720", group "queue-default"
    Evaluation status changed: "pending" -> "failed"
==> Evaluation "1877a872" finished with status "failed"

$ nomad eval status 1877a872
ID                 = 1877a872
Create Time        = 13s ago
Modify Time        = 11s ago
Status             = failed
Status Description = maximum attempts reached (5)
Type               = service
TriggeredBy        = job-register
Job ID             = prod-platform-sidekiq
Priority           = 50
Placement Failures = false

Removing the reserved_ports config from all of our nodes and redeploying was successful, though.
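For completeness, a trimmed sketch of that workaround in agent-config HCL form (the JSON file above behaves the same way once the reserved block is deleted):

client {
  enabled = true
  # Workaround: the reserved block is removed entirely; jobs place
  # normally again after the clients restart without it.
  # reserved {
  #   reserved_ports = "0-19999,24224"
  # }
}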


sfs77 commented Aug 28, 2020

Yeah, I encountered something similar on Nomad v0.11.1.

After setting reserved.reserved_ports = "31000-32000" and restarting Nomad, jobs can't be updated and stay blocked (queued).


sfs77 commented Aug 28, 2020

@evandam In my testing, if a currently running job already holds a port that falls inside reserved_ports, then setting reserved_ports and restarting the Nomad node causes job updates to be blocked. After stopping and starting (rescheduling) all the jobs that held ports in the reserved range, things work again.

@nickethier I haven't dug into the code; does that guess sound reasonable?
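A sketch of that stop-and-restart workaround, using the service job from the earlier comment (whether it helps depends on which jobs actually hold ports inside the new reserved range):

# Stop the job whose allocation holds a now-reserved port ...
$ nomad job stop prod-platform-sidekiq

# ... then register it again so a fresh port is picked outside reserved_ports
$ nomad job run prod/prod-platform-sidekiq.hcl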

evandam (Author) commented Aug 31, 2020

@Sea-Flying in my case, the job that was already running did not use a port in reserved_ports. We get a port dynamically between 20000 and 32000 as usual, and the only port we're reserving in that range is 24224.
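One possible experiment (not a confirmed fix) is to keep the reservation entirely below the dynamic range so nothing overlaps 20000-32000. A sketch in agent-config HCL:

client {
  enabled = true
  reserved {
    # Experiment only: drop 24224 so no reserved port overlaps the
    # 20000-32000 dynamic range.
    reserved_ports = "0-19999"
  }
}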

tgross (Member) commented Mar 9, 2023

We believe that this is closed by #16401, which will ship in Nomad 1.5.1 (with backports).
