Improve CLI feedback when job cannot be placed because the preferred datacenter is missing #241

Closed
djsly opened this issue Oct 9, 2015 · 9 comments

@djsly

djsly commented Oct 9, 2015

When the datacenter specified in the job spec does not match any datacenter used by the nodes in the Nomad cluster, the job reports that it cannot be placed but does not say why.

We should be able to indicate that the evaluation failed because no matching datacenter can be found.

- @cbednarski

Original issue below.


I am running the DigitalOcean demo you have under the demo folder.

Created two snapshots using packer for the nomad service and statsite
Created a 3 server cluster for nomad and two clients using
all droplets have 512mb

I even removed all of the resource constraints in the example, since I do not know the capabilities of my nodes (I read about some network speed identification issues):

            resources {
#               cpu = 500 # 500 Mhz
#               memory = 256 # 256MB
#               network {
#                   mbits = 10
#                   dynamic_ports = ["redis"]
#               }
            }
==> Monitoring evaluation "11dd8cde-932c-a02b-3974-57f2576c4508"
    Evaluation triggered by job "example"
    Scheduling error for group "cache" (failed to find a node for placement)
    Allocation "b9e5af1c-7524-6464-4c00-01289c399d20" status "failed" (0/0 nodes filtered)
      * No nodes were eligible for evaluation
    Evaluation status changed: "pending" -> "complete"
==> Evaluation "11dd8cde-932c-a02b-3974-57f2576c4508" finished with status "complete"

ID                                    DC    Name                 Class        Drain  Status
6f7abc0a-be46-d22a-c3db-0d9d654d0462  nyc3  nomad-client-nyc3-1  linux-64bit  false  ready
b540e87f-665a-c0c9-43ba-bb8a6626f4b1  nyc3  nomad-client-nyc3-0  linux-64bit  false  ready

I looked at the code but couldn't easily find why I would have 0 nodes available.

@cbednarski
Contributor

If you look at the agent logs from the nomad leader there may be some more information to help assess. The CLI output is not super verbose.

Off the top of my head, though, do you have docker installed on your nodes? If you are running nomad as a non-root user, does nomad have permission to read/write to the docker socket? Something like usermod -aG docker nomad-user.

@djsly
Author

djsly commented Oct 9, 2015

I currently cannot see anything obvious... Nomad was started as root (using your packer image)

I executed a fresh trial:

==> Monitoring evaluation "5e3b5693-50c1-321d-6aae-ff40ded59127"
    Evaluation triggered by job "example"
    Scheduling error for group "cache" (failed to find a node for placement)
    Allocation "d37da00a-bb78-a851-0e37-8f4c3af502a0" status "failed" (0/0 nodes filtered)
      * No nodes were eligible for evaluation
    Evaluation status changed: "pending" -> "complete"
==> Evaluation "5e3b5693-50c1-321d-6aae-ff40ded59127" finished with status "complete"

From the leader:

root@nomad-server-nyc3-2:~# ps aux | grep nomad
root      1515  3.0  2.6 215284 13520 ?        Ssl  19:59   2:58 /usr/local/bin/nomad agent -config /usr/local/etc/nomad
root@nomad-server-nyc3-2:/var/log# tail -f nomad.log 
    2015/10/08 21:25:34 [DEBUG] memberlist: TCP connection from: 104.131.44.151:44764
    2015/10/08 21:25:57 [DEBUG] memberlist: Initiating push/pull sync with: 104.131.44.151:4648
    2015/10/08 21:26:34 [DEBUG] memberlist: TCP connection from: 159.203.87.96:37494
    2015/10/08 21:26:57 [DEBUG] memberlist: Initiating push/pull sync with: 104.131.44.151:4648


    2015/10/08 21:30:16 [DEBUG] http: Request /v1/jobs (7.321553ms)
    2015/10/08 21:30:16 [DEBUG] worker: dequeued evaluation 5e3b5693-50c1-321d-6aae-ff40ded59127
    2015/10/08 21:30:16 [DEBUG] sched: <Eval '5e3b5693-50c1-321d-6aae-ff40ded59127' JobID: 'example'>: allocs: (place 2) (update 0) (migrate 0) (stop 0) (ignore 0)

    2015/10/08 21:30:16 [DEBUG] worker: submitted plan for evaluation 5e3b5693-50c1-321d-6aae-ff40ded59127
    2015/10/08 21:30:16 [DEBUG] sched: <Eval '5e3b5693-50c1-321d-6aae-ff40ded59127' JobID: 'example'>: setting status to complete
    2015/10/08 21:30:16 [DEBUG] worker: updated evaluation <Eval '5e3b5693-50c1-321d-6aae-ff40ded59127' JobID: 'example'>
    2015/10/08 21:30:16 [DEBUG] worker: ack for evaluation 5e3b5693-50c1-321d-6aae-ff40ded59127
    2015/10/08 21:30:16 [DEBUG] http: Request /v1/evaluation/5e3b5693-50c1-321d-6aae-ff40ded59127 (173.304µs)
    2015/10/08 21:30:16 [DEBUG] http: Request /v1/evaluation/5e3b5693-50c1-321d-6aae-ff40ded59127/allocations (191.792µs)
    2015/10/08 21:30:16 [DEBUG] http: Request /v1/allocation/d37da00a-bb78-a851-0e37-8f4c3af502a0 (251.792µs)
    2015/10/08 21:30:34 [DEBUG] memberlist: TCP connection from: 159.203.87.96:37498
    2015/10/08 21:30:34 [DEBUG] memberlist: TCP connection from: 104.131.44.151:44769

From the clients

root@nomad-client-nyc3-1:/var/log# ps aux | grep nomad
root      1509  0.0  1.3 205776  6848 ?        Ssl  20:04   0:03 /usr/local/bin/nomad agent -config /usr/local/etc/nomad
Nomad agent configuration:

                 Atlas: <disabled>
                Client: true
             Log Level: DEBUG
                Region: global (DC: nyc3)
                Server: false

==> Nomad agent started! Log data will stream in below:

    2015/10/08 20:01:54 [INFO] client: using state directory /opt/nomad/client
    2015/10/08 20:01:54 [INFO] client: using alloc directory /opt/nomad/alloc
    2015/10/08 20:01:54 [WARN] fingerprint.network: Unable to parse Speed in output of '/sbin/ethtool eth0'
    2015/10/08 20:01:54 [WARN] fingerprint.network: Unable to read link speed from /sys/class/net/eth0/speed
    2015/10/08 20:01:54 [DEBUG] fingerprint.network: Unable to read link speed; setting to default 100
    2015/10/08 20:01:54 [DEBUG] client: applied fingerprints [arch cpu host memory storage network]
    2015/10/08 20:01:54 [DEBUG] client: available drivers [docker exec]
    2015/10/08 20:01:54 [DEBUG] client: node registration complete
    2015/10/08 20:01:54 [DEBUG] client: updated allocations at index 1 (0 allocs)
    2015/10/08 20:01:54 [DEBUG] client: allocs: (added 0) (removed 0) (updated 0) (ignore 0)
    2015/10/08 20:01:57 [DEBUG] client: state updated to ready
    2015/10/08 20:17:30 [DEBUG] client: updated allocations at index 27 (0 allocs)
    2015/10/08 20:17:30 [DEBUG] client: allocs: (added 0) (removed 0) (updated 0) (ignore 0)
    2015/10/08 20:22:47 [DEBUG] client: updated allocations at index 36 (0 allocs)
    2015/10/08 20:22:47 [DEBUG] client: allocs: (added 0) (removed 0) (updated 0) (ignore 0)
    2015/10/08 20:27:53 [DEBUG] client: updated allocations at index 51 (0 allocs)
    2015/10/08 20:27:53 [DEBUG] client: allocs: (added 0) (removed 0) (updated 0) (ignore 0)
    2015/10/08 20:32:54 [DEBUG] client: updated allocations at index 64 (0 allocs)
    2015/10/08 20:32:54 [DEBUG] client: allocs: (added 0) (removed 0) (updated 0) (ignore 0)
    2015/10/08 20:38:02 [DEBUG] client: updated allocations at index 74 (0 allocs)
    2015/10/08 20:38:02 [DEBUG] client: allocs: (added 0) (removed 0) (updated 0) (ignore 0)
    2015/10/08 20:43:09 [DEBUG] client: updated allocations at index 84 (0 allocs)
    2015/10/08 20:43:09 [DEBUG] client: allocs: (added 0) (removed 0) (updated 0) (ignore 0)
    2015/10/08 21:03:57 [DEBUG] client: updated allocations at index 107 (0 allocs)
    2015/10/08 21:03:57 [DEBUG] client: allocs: (added 0) (removed 0) (updated 0) (ignore 0)
    2015/10/08 21:19:15 [DEBUG] client: updated allocations at index 129 (0 allocs)
    2015/10/08 21:19:15 [DEBUG] client: allocs: (added 0) (removed 0) (updated 0) (ignore 0)
    2015/10/08 21:24:28 [DEBUG] client: updated allocations at index 134 (0 allocs)
    2015/10/08 21:24:28 [DEBUG] client: allocs: (added 0) (removed 0) (updated 0) (ignore 0)
    2015/10/08 21:29:37 [DEBUG] client: updated allocations at index 142 (0 allocs)
    2015/10/08 21:29:37 [DEBUG] client: allocs: (added 0) (removed 0) (updated 0) (ignore 0)
==> Nomad agent configuration:

                 Atlas: <disabled>
                Client: true
             Log Level: DEBUG
                Region: global (DC: nyc3)
                Server: false

==> Nomad agent started! Log data will stream in below:

    2015/10/08 20:04:36 [INFO] client: using state directory /opt/nomad/client
    2015/10/08 20:04:36 [INFO] client: using alloc directory /opt/nomad/alloc
    2015/10/08 20:04:36 [WARN] fingerprint.network: Unable to parse Speed in output of '/sbin/ethtool eth0'
    2015/10/08 20:04:36 [WARN] fingerprint.network: Unable to read link speed from /sys/class/net/eth0/speed
    2015/10/08 20:04:36 [DEBUG] fingerprint.network: Unable to read link speed; setting to default 100
    2015/10/08 20:04:36 [DEBUG] client: applied fingerprints [arch cpu host memory storage network]
    2015/10/08 20:04:36 [DEBUG] client: available drivers [exec docker]
    2015/10/08 20:04:36 [DEBUG] client: node registration complete
    2015/10/08 20:04:36 [DEBUG] client: updated allocations at index 1 (0 allocs)
    2015/10/08 20:04:36 [DEBUG] client: allocs: (added 0) (removed 0) (updated 0) (ignore 0)
    2015/10/08 20:04:39 [DEBUG] client: state updated to ready
    2015/10/08 20:15:10 [DEBUG] client: updated allocations at index 27 (0 allocs)
    2015/10/08 20:15:10 [DEBUG] client: allocs: (added 0) (removed 0) (updated 0) (ignore 0)
    2015/10/08 20:20:20 [DEBUG] client: updated allocations at index 36 (0 allocs)
    2015/10/08 20:20:20 [DEBUG] client: allocs: (added 0) (removed 0) (updated 0) (ignore 0)
    2015/10/08 20:25:26 [DEBUG] client: updated allocations at index 45 (0 allocs)
    2015/10/08 20:25:26 [DEBUG] client: allocs: (added 0) (removed 0) (updated 0) (ignore 0)
    2015/10/08 20:30:38 [DEBUG] client: updated allocations at index 51 (0 allocs)
    2015/10/08 20:30:38 [DEBUG] client: allocs: (added 0) (removed 0) (updated 0) (ignore 0)
    2015/10/08 20:35:47 [DEBUG] client: updated allocations at index 74 (0 allocs)
    2015/10/08 20:35:47 [DEBUG] client: allocs: (added 0) (removed 0) (updated 0) (ignore 0)
    2015/10/08 20:40:48 [DEBUG] client: updated allocations at index 84 (0 allocs)
    2015/10/08 20:40:48 [DEBUG] client: allocs: (added 0) (removed 0) (updated 0) (ignore 0)
    2015/10/08 21:01:28 [DEBUG] client: updated allocations at index 107 (0 allocs)
    2015/10/08 21:01:28 [DEBUG] client: allocs: (added 0) (removed 0) (updated 0) (ignore 0)
    2015/10/08 21:21:45 [DEBUG] client: updated allocations at index 134 (0 allocs)
    2015/10/08 21:21:45 [DEBUG] client: allocs: (added 0) (removed 0) (updated 0) (ignore 0)
    2015/10/08 21:26:47 [DEBUG] client: updated allocations at index 142 (0 allocs)
    2015/10/08 21:26:47 [DEBUG] client: allocs: (added 0) (removed 0) (updated 0) (ignore 0)
    2015/10/08 21:31:54 [DEBUG] client: updated allocations at index 153 (0 allocs)
    2015/10/08 21:31:54 [DEBUG] client: allocs: (added 0) (removed 0) (updated 0) (ignore 0)

@djsly
Author

djsly commented Oct 9, 2015

That's the only WARN that I can digest:

    2015/10/08 20:01:54 [WARN] fingerprint.network: Unable to parse Speed in output of '/sbin/ethtool eth0'
    2015/10/08 20:01:54 [WARN] fingerprint.network: Unable to read link speed from /sys/class/net/eth0/speed
    2015/10/08 20:01:54 [DEBUG] fingerprint.network: Unable to read link speed; setting to default 100

Hence the reason I removed the network mbits = 10 resource.
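
The log lines above describe a fallback: when the link speed cannot be determined, a default of 100 is used. As a rough illustration only, the behaviour could be sketched in Python as below. This is not Nomad's fingerprinting code; `read_link_speed`, its parameters, and the choice to treat non-positive values as unreadable are assumptions made for the example, as is reading the default of 100 as Mbit/s.

```python
from pathlib import Path

# Assumption: the "default 100" in the log is in Mbit/s, like the sysfs value.
DEFAULT_LINK_SPEED_MBITS = 100


def read_link_speed(speed_file, default=DEFAULT_LINK_SPEED_MBITS):
    """Read a link speed from a sysfs-style file (hypothetical helper).

    Falls back to `default` when the file is missing, unreadable,
    or does not contain a positive integer.
    """
    try:
        speed = int(Path(speed_file).read_text().strip())
    except (OSError, ValueError):
        # Missing file, read error, or non-numeric content.
        return default
    return speed if speed > 0 else default


if __name__ == "__main__":
    # Path taken from the warning in the log above.
    print(read_link_speed("/sys/class/net/eth0/speed"))
```

On virtual machines the speed file often cannot be read or holds no usable value, which would be consistent with the warnings shown here.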

@cbednarski
Contributor

Ah, I think you are missing -server from your startup command. Server is false here:

==> Nomad agent configuration:

                 Atlas: <disabled>
                Client: true
             Log Level: DEBUG
                Region: global (DC: nyc3)
                Server: false

And you started with this:

/usr/local/bin/nomad agent -config /usr/local/etc/nomad

I think you will want to start your server nodes like this:

/usr/local/bin/nomad agent -server -config /usr/local/etc/nomad

The difference between an agent and a server agent is that the server agent can perform scheduling decisions and tracks the state of your cluster. Non-server agent nodes simply start jobs and keep track of what's running on their host.

@djsly
Author

djsly commented Oct 9, 2015

Thanks. The snippets I provided were from both clients (I realize that's a lot of text to try to separate; I will try to use bold text next time to make it clearer ;) )

The good news is that I was able to run the example!

The issue was that the datacenter in the job file was left as dc1, while my nodes are registered in nyc3. The job file now reads:

    # Specify the datacenters within the region this job can run in.
    datacenters = ["nyc3"]

==> Monitoring evaluation "ac417394-a1ec-10b3-5d73-98ca04dcb34a"
    Evaluation triggered by job "example"
    Allocation "b886839c-9a87-ec7e-4af3-8c91d1372853" created: node "b540e87f-665a-c0c9-43ba-bb8a6626f4b1", group "cache"
    Allocation "d25ab556-bb10-aac3-4ee0-5e7f9492b810" created: node "6f7abc0a-be46-d22a-c3db-0d9d654d0462", group "cache"
    Evaluation status changed: "pending" -> "complete"
==> Evaluation "ac417394-a1ec-10b3-5d73-98ca04dcb34a" finished with status "complete"
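
For readers hitting the same symptom, the mismatch can be checked by comparing the job's `datacenters` list with the DC column of the node listing pasted earlier in this thread. The following is a minimal Python sketch, not part of Nomad; `parse_node_datacenters` and `missing_datacenters` are hypothetical helper names, and the code assumes the whitespace-separated table layout shown above, with no spaces inside any field.

```python
# Hypothetical helpers, not part of Nomad: compare a job's datacenters
# against the DC column of a pasted node listing.

NODE_TABLE = """\
ID                                    DC    Name                 Class        Drain  Status
6f7abc0a-be46-d22a-c3db-0d9d654d0462  nyc3  nomad-client-nyc3-1  linux-64bit  false  ready
b540e87f-665a-c0c9-43ba-bb8a6626f4b1  nyc3  nomad-client-nyc3-0  linux-64bit  false  ready
"""


def parse_node_datacenters(table):
    """Return the set of datacenters that have at least one 'ready' node."""
    lines = [line for line in table.splitlines() if line.strip()]
    header = lines[0].split()
    dc_col = header.index("DC")
    status_col = header.index("Status")
    datacenters = set()
    for line in lines[1:]:
        fields = line.split()
        if fields[status_col] == "ready":
            datacenters.add(fields[dc_col])
    return datacenters


def missing_datacenters(job_datacenters, table):
    """Return the job datacenters that no ready node belongs to."""
    available = parse_node_datacenters(table)
    return [dc for dc in job_datacenters if dc not in available]


if __name__ == "__main__":
    print(missing_datacenters(["dc1"], NODE_TABLE))   # the original job file
    print(missing_datacenters(["nyc3"], NODE_TABLE))  # the corrected job file
```

A non-empty result for the original job file points at the datacenter name as the thing to fix, before looking at drivers or resources.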

Sorry about the trouble. I guess this goes to show that we need a bit more verbosity on the client side to help identify these rookie mistakes :)

Something like this would have been helpful:

Trying to evaluate job on the following DC : ['dc1','dc2']
* No nodes were eligible for evaluation
** Available nodes are
*** Node1: nyc3, 1cpu, 2gb
*** Node2: nyc3, 1cpu, 4gb
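
The suggestion above could be prototyped roughly as follows. This is a Python sketch for discussion only: it is not Nomad's CLI code and not necessarily what was eventually implemented. The `Node` type, `placement_feedback`, and the exact wording are hypothetical, and the node resources in the usage example are the illustrative values from the mock-up.

```python
from collections import namedtuple

# Hypothetical record for the example; not a Nomad data structure.
Node = namedtuple("Node", ["name", "datacenter", "cpus", "memory_gb"])


def placement_feedback(job_datacenters, nodes):
    """Return diagnostic lines when no node is in a requested datacenter.

    Returns an empty list when at least one node matches, since the
    datacenter is then not the reason for a failed placement.
    """
    if any(n.datacenter in job_datacenters for n in nodes):
        return []
    lines = [
        "Trying to evaluate job on the following DC : %s" % list(job_datacenters),
        "* No nodes were eligible for evaluation",
        "** Available nodes are",
    ]
    for n in nodes:
        lines.append(
            "*** %s: %s, %dcpu, %dgb" % (n.name, n.datacenter, n.cpus, n.memory_gb)
        )
    return lines


if __name__ == "__main__":
    nodes = [Node("Node1", "nyc3", 1, 2), Node("Node2", "nyc3", 1, 4)]
    for line in placement_feedback(["dc1", "dc2"], nodes):
        print(line)
```

Listing the available nodes next to the requested datacenters puts both sides of the mismatch in one place, which is the point of the request.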

@cbednarski
Contributor

Thanks for following up with the solution! We can certainly improve the feedback here.

@cbednarski cbednarski changed the title from "0/0 nodes filtered while I have 2 active node -- running the digital ocean demo" to "Improve CLI feedback when job cannot be placed because the region is mismatched" Oct 9, 2015
@cbednarski cbednarski changed the title from "Improve CLI feedback when job cannot be placed because the region is mismatched" to "Improve CLI feedback when job cannot be placed because the datacenter is missing" Oct 9, 2015
@cbednarski cbednarski changed the title from "Improve CLI feedback when job cannot be placed because the datacenter is missing" to "Improve CLI feedback when job cannot be placed because the preferred datacenter is missing" Oct 9, 2015
@djsly
Author

djsly commented Oct 9, 2015

Feel free to rename or close this for the improvement request.
Thanks again!
Next step: running my 15-container stack :/


@dadgar
Contributor

dadgar commented Jan 9, 2016

This is fixed by #619

@dadgar dadgar closed this as completed Jan 9, 2016
@github-actions

I'm going to lock this issue because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active issues.
If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Dec 27, 2022