Troubleshooting blocked evaluation #19827
Comments
Hi @suikast42 👋 Which version of Nomad are you running? I just tested on Nomad 1.7.3 and I do get the expected results on port collision (screenshots omitted).
This is strange:
nomad --version
Ok, I tried it with a simple deployment:
job "whoami" {
group "whoami" {
count = 1
network {
mode = "bridge"
port "web" {
to=8080
static = 8080
}
}
service {
name = "${NOMAD_NAMESPACE}-${NOMAD_GROUP_NAME}"
port = "web"
tags = [
"traefik.enable=true",
"traefik.http.routers.${NOMAD_GROUP_NAME}-${NOMAD_ALLOC_ID}.rule=Host(`${NOMAD_NAMESPACE}.${NOMAD_GROUP_NAME}.cloud.private`)",
"traefik.http.routers.${NOMAD_GROUP_NAME}-${NOMAD_ALLOC_ID}.tls=true",
]
check {
type = "http"
path = "/health"
port = "web"
interval = "10s"
timeout = "2s"
}
}
task "whoami" {
driver = "docker"
# driver = "containerd-driver"
config {
image = "traefik/whoami"
ports = ["web"]
args = ["--port", "${NOMAD_PORT_web}"]
}
resources {
cpu = 100
memory = 128
}
}
}
}
The second time I deployed the same job with the name whoami2 and kept the rest of the definition the same.
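For illustration, here is a minimal sketch of what that second job could look like. It assumes only the job name changes to whoami2; it is not the exact spec that was run:

job "whoami2" {
  group "whoami" {
    count = 1
    network {
      mode = "bridge"
      port "web" {
        to     = 8080
        static = 8080   # same static host port as the "whoami" job, so placement collides on a single worker
      }
    }
    # ... service and task stanzas identical to the "whoami" job above ...
  }
}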
Hmm... sorry, I still can't reproduce the problem 🤔 How many clients do you have? Could you share the full output of when you run it?
I have one worker and one master.
I tried it with both bridge and host network mode. Both behave the same.
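For comparison, a rough sketch of the host-mode variant of the network block for that test (an assumption, not the exact stanza used):

network {
  mode = "host"       # task binds directly to the host interface, no bridge mapping
  port "web" {
    static = 8080     # same static host port, so the collision behaves the same way
  }
}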
My Nomad and Consul configs. Maybe that helps?
Consul server
datacenter = "nomadder1"
data_dir = "/opt/services/core/consul/data"
log_level = "INFO"
node_name = "master-01"
server = true
bind_addr = "0.0.0.0"
advertise_addr = "172.42.1.10"
client_addr = "0.0.0.0"
encrypt = "G1CHAD7wwu0tU28BlKkirSahTJ/Tqpo9ClOAycQAUwE="
server_rejoin_age_max = "8640h"
# https://developer.hashicorp.com/consul/docs/connect/observability/ui-visualization
ui_config{
enabled = true
dashboard_url_templates {
service = "https://grafana.cloud.private/d/lDlaj-NGz/service-overview?orgId=1&var-service={{Service.Name}}&var-namespace={{Service.Namespace}}&var-partition={{Service.Partition}}&var-dc={{Datacenter}}"
}
metrics_provider = "prometheus"
metrics_proxy {
base_url = "http://mimir.service.consul:9009/prometheus"
add_headers = [
# {
# name = "Authorization"
# value = "Bearer <token>"
# }
{
name = "X-Scope-OrgID"
value = "1"
}
]
path_allowlist = ["/prometheus/api/v1/query_range", "/prometheus/api/v1/query"]
}
}
addresses {
# grpc = "127.0.0.1"
grpc_tls = "127.0.0.1"
}
ports {
http = -1
https = 8501
# grpc = 8502
grpc_tls = 8503
}
connect {
enabled = true
}
retry_join = ["172.42.1.10"]
bootstrap_expect = 1
auto_encrypt{
allow_tls = true
}
performance{
raft_multiplier = 1
}
node_meta{
node_type = "server"
}
tls{
defaults {
ca_file = "/usr/local/share/ca-certificates/cloudlocal/cluster-ca-bundle.pem"
cert_file = "/etc/opt/certs/consul/consul.pem"
key_file = "/etc/opt/certs/consul/consul-key.pem"
verify_incoming = true
verify_outgoing = true
}
internal_rpc {
verify_server_hostname = true
}
}
#watches = [
# {
# type = "checks"
# handler = "/usr/bin/health-check-handler.sh"
# }
#]
telemetry {
disable_hostname = true
prometheus_retention_time = "72h"
}
nomad server
log_level = "DEBUG"
name = "master-01"
datacenter = "nomadder1"
data_dir = "/opt/services/core/nomad/data"
#You should only set this value to true on server agents
#if the terminated server will never join the cluster again
#leave_on_interrupt= false
#You should only set this value to true on server agents
#if the terminated server will never join the cluster again
#leave_on_terminate = false
server {
enabled = true
job_max_priority = 100 # 100 is the default
job_default_priority = 50 # 50 is the default
bootstrap_expect = 1
encrypt = "4PRfoE6Mj9dHTLpnzmYD1+THdlyAo2Ji4U6ewMumpAw="
rejoin_after_leave = true
server_join {
retry_join = ["172.42.1.10"]
retry_max = 0
retry_interval = "15s"
}
}
bind_addr = "0.0.0.0" # the default
advertise {
# Defaults to the first private IP address.
http = "172.42.1.10"
rpc = "172.42.1.10"
serf = "172.42.1.10"
}
tls {
http = true
rpc = true
ca_file = "/usr/local/share/ca-certificates/cloudlocal/cluster-ca-bundle.pem"
cert_file = "/etc/opt/certs/nomad/nomad.pem"
key_file = "/etc/opt/certs/nomad/nomad-key.pem"
verify_server_hostname = true
verify_https_client = true
}
ui {
enabled = true
label {
text = "💙💛 Fenerbaçhe 1907 💛💙"
background_color = "#163962"
text_color = "##ffed00"
}
consul {
ui_url = "https://consul.cloud.private"
}
vault {
ui_url = "https://vault.cloud.private"
}
}
consul{
ssl= true
address = "127.0.0.1:8501"
grpc_address = "127.0.0.1:8503"
# this works only with ACL enabled
allow_unauthenticated= true
ca_file = "/usr/local/share/ca-certificates/cloudlocal/cluster-ca-bundle.pem"
grpc_ca_file = "/usr/local/share/ca-certificates/cloudlocal/cluster-ca-bundle.pem"
cert_file = "/etc/opt/certs/consul/consul.pem"
key_file = "/etc/opt/certs/consul/consul-key.pem"
}
telemetry {
collection_interval = "1s"
disable_hostname = true
prometheus_metrics = true
publish_allocation_metrics = true
publish_node_metrics = true
}
consul agent
datacenter = "nomadder1"
data_dir = "/opt/services/core/consul/data"
log_level = "INFO"
node_name = "worker-01"
bind_addr = "0.0.0.0"
advertise_addr = "172.42.1.20"
client_addr = "0.0.0.0"
encrypt = "G1CHAD7wwu0tU28BlKkirSahTJ/Tqpo9ClOAycQAUwE="
addresses {
# grpc = "127.0.0.1"
grpc_tls = "127.0.0.1"
}
ports {
http = -1
https = 8501
# grpc = 8502
grpc_tls = 8503
}
connect {
enabled = true
}
retry_join = ["172.42.1.10"]
auto_encrypt{
tls = true
}
performance{
raft_multiplier = 1
}
node_meta{
node_type = "worker"
}
tls{
defaults {
ca_file = "/usr/local/share/ca-certificates/cloudlocal/cluster-ca-bundle.pem"
cert_file = "/etc/opt/certs/consul/consul.pem"
key_file = "/etc/opt/certs/consul/consul-key.pem"
verify_incoming = false
verify_outgoing = true
}
internal_rpc {
verify_server_hostname = true
}
}
#watches = [
# {
# type = "checks"
# handler = "/usr/bin/health-check-handler.sh"
# }
#]
telemetry {
disable_hostname = true
}
nomad agent
log_level = "DEBUG"
name = "worker-01"
datacenter = "nomadder1"
data_dir = "/opt/services/core/nomad/data"
bind_addr = "0.0.0.0" # the default
leave_on_interrupt= true
#https://github.com/hashicorp/nomad/issues/17093
#systemctl kill -s SIGTERM nomad will suppress node drain if
#leave_on_terminate set to false
leave_on_terminate = true
advertise {
# Defaults to the first private IP address.
http = "172.42.1.20"
rpc = "172.42.1.20"
serf = "172.42.1.20"
}
client {
enabled = true
network_interface = "eth1"
meta {
node_type= "worker"
connect.log_level = "debug"
connect.sidecar_image= "registry.cloud.private/envoyproxy/envoy:v1.29.0"
}
server_join {
retry_join = ["172.42.1.10"]
retry_max = 0
retry_interval = "15s"
}
# Either leave_on_interrupt or leave_on_terminate must be set
# for this to take effect.
drain_on_shutdown {
deadline = "2m"
force = false
ignore_system_jobs = false
}
host_volume "ca_cert" {
path = "/usr/local/share/ca-certificates/cloudlocal"
read_only = true
}
host_volume "cert_ingress" {
path = "/etc/opt/certs/ingress"
read_only = true
}
## Cert consul client
## Needed for consul_sd_configs
## Should be deleted after resolve https://github.com/suikast42/nomadder/issues/100
host_volume "cert_consul" {
path = "/etc/opt/certs/consul"
read_only = true
}
## Cert consul client
## Needed for jenkins
## Should be deleted after resolve https://github.com/suikast42/nomadder/issues/100
host_volume "cert_nomad" {
path = "/etc/opt/certs/nomad"
read_only = true
}
## Cert docker client
## Needed for jenkins
## Should be deleted after migrating to vault
host_volume "cert_docker" {
path = "/etc/opt/certs/docker"
read_only = true
}
host_network "public" {
interface = "eth0"
#cidr = "203.0.113.0/24"
#reserved_ports = "22,80"
}
host_network "default" {
interface = "eth1"
}
host_network "private" {
interface = "eth1"
}
host_network "local" {
interface = "lo"
}
reserved {
# cpu (int: 0) - Specifies the amount of CPU to reserve, in MHz.
# cores (int: 0) - Specifies the number of CPU cores to reserve.
# memory (int: 0) - Specifies the amount of memory to reserve, in MB.
# disk (int: 0) - Specifies the amount of disk to reserve, in MB.
# reserved_ports (string: "") - Specifies a comma-separated list of ports to reserve on all fingerprinted network devices. Ranges can be specified by using a hyphen separating the two inclusive ends. See also host_network for reserving ports on specific host networks.
cpu = 1000
memory = 2048
}
max_kill_timeout = "1m"
}
tls {
http = true
rpc = true
ca_file = "/usr/local/share/ca-certificates/cloudlocal/cluster-ca-bundle.pem"
cert_file = "/etc/opt/certs/nomad/nomad.pem"
key_file = "/etc/opt/certs/nomad/nomad-key.pem"
verify_server_hostname = true
verify_https_client = true
}
consul{
ssl= true
address = "127.0.0.1:8501"
grpc_address = "127.0.0.1:8503"
# this works only with ACL enabled
allow_unauthenticated= true
ca_file = "/usr/local/share/ca-certificates/cloudlocal/cluster-ca-bundle.pem"
grpc_ca_file = "/usr/local/share/ca-certificates/cloudlocal/cluster-ca-bundle.pem"
cert_file = "/etc/opt/certs/consul/consul.pem"
key_file = "/etc/opt/certs/consul/consul-key.pem"
}
telemetry {
collection_interval = "1s"
disable_hostname = true
prometheus_metrics = true
publish_allocation_metrics = true
publish_node_metrics = true
}
plugin "docker" {
config {
allow_privileged = false
disable_log_collection = false
# volumes {
# enabled = true
# selinuxlabel = "z"
# }
infra_image = "registry.cloud.private/google_containers/pause-amd64:3.2"
infra_image_pull_timeout ="30m"
extra_labels = ["job_name", "job_id", "task_group_name", "task_name", "namespace", "node_name", "node_id"]
logging {
type = "journald"
config {
labels-regex =".*"
}
}
gc{
container = true
dangling_containers{
enabled = true
# period = "3m"
# creation_grace = "5m"
}
}
}
}
Thank you for the extra information @suikast42! The server logs allowed me to find the problem. I believe you have service job preemption enabled, which triggered a different code path from the default configuration I was using. I opened #19933 to fix this issue. To confirm that this is the case, could you share the output of the command?
Interesting 👌 Here is the output. By the way, I updated to 1.7.4, but nothing changed, of course 😂
Thanks! Yeah, thank you again for the report!
Yes, I do this because I activated MemoryOversubscription. I thought that would be more dynamic for my use case 😁
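For context, turning those on server-side looks roughly like the sketch below. This is an assumption of how it might be configured, not a copy of the actual setup; the same settings can also be adjusted at runtime via the scheduler configuration API (nomad operator scheduler set-config):

server {
  enabled = true
  default_scheduler_config {
    # assumed values: enable memory oversubscription plus preemption for service jobs
    memory_oversubscription_enabled = true
    preemption_config {
      system_scheduler_enabled  = true
      service_scheduler_enabled = true
      batch_scheduler_enabled   = false
    }
  }
}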
Oh yes, preemption is a very nice feature. But it triggers some different code paths that sometimes are not kept up-to-date 😬 But I'm glad we were able to get to the bottom of this. I was really confused why it wasn't happening to me 😅
I'm going to lock this issue because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active issues.
I had a similar issue in the past but don't understand why my evaluation is blocked.
See #19446
Now I can reproduce the issue.
I have deployed an MSSQL DB with a static port mapping.
Then I accidentally deployed a second job with the same static port mapping, with only one worker node.
It's not a bug that Nomad denies the allocation. But the information about why the allocation is blocked is not shown anywhere.
nomad job status
deployment status 292a527f
Information like 'not enough cpu, mem' or 'port conflict and no more nodes available' would be very handy for troubleshooting.
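As a rough interim sketch of where that information currently hides, the blocked evaluation itself usually carries the placement-failure details (the evaluation ID and job name below are placeholders):

# list recent evaluations and look for the one with status "blocked"
nomad eval list

# inspect the blocked evaluation; substitute the real evaluation ID
nomad eval status <eval-id>

# the job status output typically also shows a "Placement Failure" section
# naming the exhausted dimension (e.g. reserved port collision on the single node)
nomad job status whoami2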