Use of reserved_ports causes job-register evaluations to fail without a reason #1617
Comments
We hit this issue when trying to configure reserved ports.
Here is the exact error from our application code:
Some background: We schedule services with dynamic port allocation on the exec driver. The stack trace pasted by @jshaw86 is from the service failing to listen on the port dynamically allocated by Nomad. The port is already in use by another process, most likely another instance of this service that has opened an outbound socket. The kernel allocates local ports for outbound sockets from the same port range that Nomad uses for allocating dynamic ports. With enough services running, the chances of local port collisions get pretty high. We were hoping that reserving those ports would avoid the collisions. (FWIW, we are considering setting up a secondary interface for all outbound sockets. It should be a usable workaround in the meantime.)
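For context, a minimal sketch of the kind of client configuration the commenter seems to have been hoping for (the range is an assumption based on a common Linux default for net.ipv4.ip_local_port_range, not taken from the thread; note that setting reserved_ports is itself what triggers the failures this issue describes):

```hcl
client {
  reserved {
    # Hypothetical: reserve the kernel's ephemeral port range so Nomad's
    # dynamic port allocation never hands out ports the kernel may also
    # pick for outbound sockets. 32768-61000 is a common Linux default.
    reserved_ports = "32768-61000"
  }
}
```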
@bagelswitch @parasyte Could you guys give a runnable job file and client configs that will reproduce this (hopefully just using the Vagrant box)? I could not reproduce it by running a system job on three nodes, one with a reserved port and the others without.
We run CentOS on AWS and encountered the same problem when using ReservedPorts. Could it be related to AWS?
@kaskavalci the above example is on Vagrant, so it's unrelated to AWS. It's a function of the total number of jobs scheduled and the total amount of network bandwidth being used on the Nomad agent. If you don't use much network, you can schedule more jobs before you see the issue; if you use a lot of network, you can schedule fewer jobs before you see it.
@jshaw86 That actually looks like a different issue than what @bagelswitch reported. His problem was that the scheduler wouldn't even create the allocation; this is a failure once the allocation is on the client. Could you create a new issue with the details you provided?
@kaskavalci Which problem did you hit: the one @jshaw86 showed, or the original problem? I have a good idea about @jshaw86's, but I'm still trying to reproduce the original one!
I do not have logs since the cluster is now terminated, but I can describe our issue. When Nomad clients are configured with ReservedPorts on AWS, the Nomad client will not return an evaluation. There are no info messages such as exhausted disk or anything like that.
Yes, there are two different issues; the port collision in #1728 is different. We also experienced this issue (no allocations are created when reserved ports are configured).
FWIW, my case was also on AWS. Re: the job file, I can repro with just the example job created by nomad init - it doesn't matter whether the job specifies static or dynamic ports. Client/server configs are as originally posted, although I'm reasonably certain nothing matters other than the use of reserved port ranges in the client config.
@bagelswitch Can you give your client config?
Here is the allocation log from my side. It goes on like this. Nothing on the Nomad servers at INFO level, unfortunately. If you need DEBUG traces, I can turn them on.
Client configuration:
My client config:
I failed to reproduce this on nomad 0.5-rc1. I posted my client configs in a gist. Everything worked as expected:
I unset the region, removed your custom service checks, and switched the Docker container to redis -- hopefully none of those could trigger what you saw. Let me know if you see any other differences between my setup and yours. I'll try to downgrade my cluster to 0.4.1 and try again. This is a Debian Jessie (8.6) cluster in GCE on n1-standard-2 VMs (3 servers + 3 clients), kernel 3.16.0 and Docker 1.12.3.
Started with a fresh 0.4.1 cluster, removed TLS settings, and still could not reproduce the bug. nomad run produces normal output and I verified the job ran on the 2 nodes with the runner node class:
Are you still encountering this issue reliably? Can you try upgrading to 0.5-rc1, on the off chance the behavior has at least changed enough to give us some idea of where the problem might lie?
@schmichael - note that I can only repro this on AWS - I believe that may be the case for the others reporting in this thread as well, based on the comments above. I do not see the problem in an otherwise identical setup, running on an internal OpenStack environment.
@bagelswitch Argh, shoot. VPC or EC2 classic?
VPC
Hey, reproduced this. Will try to get a fix out soon.
Hooray! Thanks for the update, Alex. We have a hack in our local Nomad build to work around it for the moment. Looking forward to the fix.
Nomad version
Nomad v0.4.1-rc1 ('c7bbcb56905c90a7567183c0c6dbffc050f52567+CHANGES')
Operating system and Environment details
Ubuntu 14.04 LTS, kernel 3.13.0-93-generic on AWS t2.medium VMs
Issue
I believe I am seeing the same behavior described in #1046 while running v0.4.1-rc1.
In a small cluster with 3 servers and 5 clients, while running a system job, if any client config specifies any reserved ports, job-register evaluations fail with no allocation on that particular client - successful allocations do occur for other clients that don't specify reserved ports. This occurs for service jobs as well, but is easier to see with system jobs due to the predictable placement.
Reproduction steps
Create one or more server nodes with config like:
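A minimal sketch of a server config along these lines (an assumption; the data directory and server count are illustrative, not the reporter's actual config):

```hcl
# Hypothetical server config for reproduction; values are illustrative.
data_dir = "/opt/nomad/data"

server {
  enabled          = true
  bootstrap_expect = 3
}
```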
Create one or more client nodes and configure at least one to specify reserved ports:
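And a sketch of a client config with a reserved port range (the node class, server address, and port range are assumptions; any reserved_ports value appears sufficient to trigger the behavior):

```hcl
# Hypothetical client config; the reserved_ports setting is what matters.
data_dir = "/opt/nomad/data"

client {
  enabled    = true
  node_class = "runner"
  servers    = ["<server-ip>:4647"]

  reserved {
    reserved_ports = "22,8000-8100"
  }
}
```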
Create a job specifying any network resources (problem occurs with either static or dynamic ports, in my case I'm trying a job with only static ports), see job file below.
Run the job, in this case I have 5 client nodes, 3 are of the appropriate node class for the job, and one of the 3 has reserved ports configured:
Note the 2 rather than the expected 3 allocations. Job status shows:
Leader server log is below, logs on the client with reserved ports show nothing at all.
If the reserved port config is removed from this client, and it is restarted, the same job can be run and it will receive an allocation, with no other changes to config or job definition.
Nomad Server logs (if appropriate)
Leader server log:
Nomad Client logs (if appropriate)
Job file (if appropriate)
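A sketch of a comparable job file (an assumption modeled on the nomad init example referenced earlier, constrained to the runner node class and using a static port):

```hcl
# Illustrative job file; any job requesting network resources reproduces the issue.
job "example" {
  datacenters = ["dc1"]
  type        = "system"

  constraint {
    attribute = "${node.class}"
    value     = "runner"
  }

  group "cache" {
    task "redis" {
      driver = "docker"

      config {
        image = "redis:3.2"
        port_map {
          db = 6379
        }
      }

      resources {
        cpu    = 500
        memory = 256
        network {
          mbits = 10
          port "db" {
            static = 6379
          }
        }
      }
    }
  }
}
```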