Connect Gateways: Resources are not respected/supported #10899

Closed
chuckyz opened this issue Jul 14, 2021 · 13 comments · Fixed by #11927
Labels
stage/accepted (Confirmed, and intend to work on. No timeline commitment though.) · theme/consul/connect (Consul Connect integration) · type/bug

Comments

@chuckyz
Contributor

chuckyz commented Jul 14, 2021

Nomad version

1.0.2

Operating system and Environment details

Ubuntu 18.04

Issue

resources{} under `connect { sidecar_task { ... } }` is not respected/supported when the service defines a Connect gateway; the gateway task launches with the default resources instead.

Reproduction steps

Run the job below.

Expected Result

This launches a task named ingress-gateway-grpc with 2000 MHz CPU and 2048 MB memory.

Actual Result

This launches a task named ingress-gateway-grpc with 250 MHz CPU and 128 MB memory.

Job file (if appropriate)

job "ingress-grpc" {
  datacenters = ["dc1"]
  type = "service"
  meta {
    GATEWAY_NAME = "ingress-gateway-grpc"
    DEPLOY_TIME = "2021-07-12T13:31:02-07:00"
  }
  group "ingress-group" {
    network {
      mode = "bridge"
      port "inbound" {
        static = 7000
        to     = 7001
      }
      port "envoy_prom" { to = 9102 }
    }
    service {
      name = "${NOMAD_META_GATEWAY_NAME}"
      port = "7000"
      tags = ["http"]
      meta {
        envoy_metrics_port = "${NOMAD_HOST_PORT_envoy_prom}"
      }
      connect {
        sidecar_task {
          resources {
            cpu    = 2000
            memory = 2048
          }
        }
        gateway {
          proxy {
          }
          ingress {
            listener {
              port     = 7000
              protocol = "http"
              service {
                name = "none"
                hosts = ["none"]
              }
            }
          }
        }
      }
    }
  }
}

We've tried to put resources{} under gateway but it doesn't seem to work.

Notes:

sidecar_task seems to get its resources from

Resources *Resources `hcl:"resources,block"`

whereas all gateway tasks seem to get their config from

type ConsulGateway struct {

which doesn't include any *Resources field.
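
For reference, a simplified sketch of that struct as I read it (fields abbreviated for illustration; see nomad/structs for the real definition):

// Abbreviated for illustration only -- not the full definition. The point:
// ConsulGateway carries proxy/listener configuration but no Resources
// field, so gateway tasks fall back to the built-in defaults.
type ConsulGateway struct {
    Proxy   *ConsulGatewayProxy
    Ingress *ConsulIngressConfigEntry
}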

Looking into

type ConsulProxy struct {

it seems there are no task definitions anywhere?

I looked through the envoy_bootstrap_hook.go file as well and was unable to find the actual task definition for the gateways.

@shoenig do you know where in the code we should be looking? I'm more than happy to submit a PR passing resources through to the config. I think it should probably go under gateway { ingress { resources {} } } or gateway { proxy { resources {} } }.
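
For example, a hypothetical placement (not valid jobspec today; purely to illustrate the proposal):

gateway {
  ingress {
    # hypothetical block -- not currently supported by Nomad
    resources {
      cpu    = 2000
      memory = 2048
    }
  }
}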

@jrasell
Copy link
Member

jrasell commented Jul 15, 2021

Hi @chuckyz and thanks for the report.

I have not been able to reproduce this locally yet. When running on main I had to modify the job slightly, and when the job registered, the resources of the sidecar task were reflected correctly as set within the jobspec. I was unable to get the job running with 1.0.2 at this point. If you have any additional reproduction steps that could help, that would be appreciated. I'll mark the issue as needing further investigation.

@idrennanvmware
Contributor

We use this pretty extensively IIRC, and I have a vague recollection of this being addressed in a subsequent patch

#9854

@jrasell jrasell self-assigned this Jul 15, 2021
@shoenig
Member

shoenig commented Jul 15, 2021

Heh good memory @idrennanvmware, but I think #9854 was about a precondition when using expose on a service check for a Connect service, preventing folks from being able to make use of sidecar_task & expose together.

I'm not sure yet what's going on here. I tweaked the jobspec (out of laziness; this shouldn't be related) to make it work:

           ingress {
             listener {
               port     = 7000
-              protocol = "http"
+              protocol = "tcp"
               service {
                 name = "none"
-                hosts = ["none"]
+                # hosts = ["none"]
               }
             }
           }

and it seems to run fine with the expected resources:

Task "connect-ingress-ingress-gateway-grpc" is "running"
Task Resources
CPU         Memory          Disk     Addresses
2/2000 MHz  14 MiB/2.0 GiB  300 MiB  
➜ nomad version 
Nomad v1.0.2 (fff533a3fefe848b6997f56855327f653e4ec491)

@chuckyz can I ask, where are you getting the reported allocated resources metrics from?

@notnoop
Contributor

notnoop commented Jul 21, 2021

Hi @chuckyz. Like Seth, I was able to set the resources as expected. Honoring resources for Consul Connect proxies was added in #9639, which shipped in Nomad 1.0.2. Since it's a server-side change, all servers must be running 1.0.2 or later.

Here is my attempt at replication:

$ nomad job run -detach ./job.hcl
Job registration successful
Evaluation ID: e360b99c-fcb1-d1cc-e023-8b342fbc8cd7
$ nomad job inspect ingress-grpc | jq '.Job.TaskGroups[0].Tasks[0].Resources'
{
  "CPU": 2000,
  "Devices": null,
  "DiskMB": 0,
  "IOPS": 0,
  "MemoryMB": 2048,
  "Networks": null
}
$ nomad version
Nomad v1.0.2 (4c1d4fc6a5823ebc8c3e748daec7b4fda3f11037)
$ nomad server members
Name                         Address    Port  Status  Leader  Protocol  Build  Datacenter  Region
notnoop-C02X1N38JG5H.global  127.0.0.1  4648  alive   true    2         1.0.2  dc1         global

Where job.hcl is:

job "ingress-grpc" {
  datacenters = ["dc1"]
  type = "service"
  meta {
    GATEWAY_NAME = "ingress-gateway-grpc"
    DEPLOY_TIME = "2021-07-12T13:31:02-07:00"
  }
  group "ingress-group" {
    network {
      mode = "bridge"
      port "inbound" {
        static = 7000
        to     = 7001
      }
      port "envoy_prom" { to = 9102 }
    }
    service {
      name = "${NOMAD_META_GATEWAY_NAME}"
      port = "7000"
      tags = ["http"]
      meta {
        envoy_metrics_port = "${NOMAD_HOST_PORT_envoy_prom}"
      }
      connect {
        sidecar_task {
          resources {
            cpu    = 2000
            memory = 2048
          }
        }
        gateway {
          proxy {
          }
          ingress {
            listener {
              port     = 7000
              protocol = "tcp"
              service {
                name = "none"
                #hosts = ["none"]
              }
            }
          }
        }
      }
    }
  }
}

I have also confirmed that I get 250 MHz CPU / 128 MB RAM when the server is running 1.0.1:

$ ./nomad job run -detach ./job.hcl
Job registration successful
Evaluation ID: 256be59a-0d24-f859-dccb-218bafcc62b2
$ ./nomad job inspect ingress-grpc | jq '.Job.TaskGroups[0].Tasks[0].Resources'
{
  "CPU": 250,
  "Devices": null,
  "DiskMB": 0,
  "IOPS": 0,
  "MemoryMB": 128,
  "Networks": null
}
$ ./nomad version
Nomad v1.0.1 (c9c68aa55a7275f22d2338f2df53e67ebfcb9238)
$ ./nomad server members
Name                         Address    Port  Status  Leader  Protocol  Build  Datacenter  Region
notnoop-C02X1N38JG5H.global  127.0.0.1  4648  alive   true    2         1.0.1  dc1         global

@chuckyz
Contributor Author

chuckyz commented Jul 21, 2021

We're running 1.0.2+ent, so I wonder if maybe the change was missed there somehow?

We'll upgrade to 1.1.2+ent soonish and confirm if this is fixed or not.

@notnoop
Contributor

notnoop commented Jul 21, 2021

This is quite puzzling indeed. I just verified the behavior on 1.0.2+ent and also tested the resulting allocation and docker container:

ubuntu@ip-172-31-69-72:~/gh-10899$ nomad job run ./job.hcl
==> Monitoring evaluation "743bd901"
    Evaluation triggered by job "ingress-grpc"
    Evaluation within deployment: "40b90b51"
    Allocation "b88313c1" created: node "d3455e40", group "ingress-group"
    Evaluation status changed: "pending" -> "complete"
==> Evaluation "743bd901" finished with status "complete"
ubuntu@ip-172-31-69-72:~/gh-10899$ nomad alloc status b88313c1
ID                  = b88313c1-60c9-f701-d22a-6e7ea1339bcc
Eval ID             = 743bd901
Name                = ingress-grpc.ingress-group[0]
Node ID             = d3455e40
Node Name           = ip-172-31-69-72
Job ID              = ingress-grpc
Job Version         = 0
Client Status       = running
Client Description  = Tasks are running
Desired Status      = run
Desired Description = <none>
Created             = 13s ago
Modified            = 1s ago
Deployment ID       = 40b90b51
Deployment Health   = healthy

Allocation Addresses (mode = "bridge")
Label        Dynamic  Address
*inbound     yes      172.31.69.72:7000 -> 7001
*envoy_prom  yes      172.31.69.72:30316 -> 9102

Task "connect-ingress-ingress-gateway-grpc" is "running"
Task Resources
CPU         Memory          Disk     Addresses
4/2000 MHz  14 MiB/2.0 GiB  300 MiB

Task Events:
Started At     = 2021-07-21T18:46:35Z
Finished At    = N/A
Total Restarts = 0
Last Restart   = N/A

Recent Events:
Time                  Type        Description
2021-07-21T18:46:35Z  Started     Task started by client
2021-07-21T18:46:35Z  Task Setup  Building Task Directory
2021-07-21T18:46:33Z  Received    Task received by client
ubuntu@ip-172-31-69-72:~/gh-10899$ sudo docker ps
CONTAINER ID   IMAGE                                      COMMAND                  CREATED         STATUS         PORTS     NAMES
b645f5e33369   envoyproxy/envoy:v1.18.3                   "/docker-entrypoint.…"   2 minutes ago   Up 2 minutes             connect-ingress-ingress-gateway-grpc-b88313c1-60c9-f701-d22a-6e7ea1339bcc
43f724fb4018   gcr.io/google_containers/pause-amd64:3.1   "/pause"                 2 minutes ago   Up 2 minutes             nomad_init_b88313c1-60c9-f701-d22a-6e7ea1339bcc
ubuntu@ip-172-31-69-72:~/gh-10899$ sudo docker inspect b645f5e33369 | jq '.[0].HostConfig | [.CpuShares, .Memory / 1024 /1024]'
[
  2000,
  2048
]
ubuntu@ip-172-31-69-72:~/gh-10899$ nomad version
Nomad v1.0.2+ent (8b533dbc27ed9293871b0c6041c2e285e18f2a97)
ubuntu@ip-172-31-69-72:~/gh-10899$ nomad server members
Name                    Address       Port  Status  Leader  Protocol  Build      Datacenter  Region
ip-172-31-69-72.global  172.31.69.72  4648  alive   true    2         1.0.2+ent  dc1         global

@shoenig
Member

shoenig commented Jul 22, 2021

In playing around with this I've realized there's a difference between an absent sidecar_task and an empty sidecar_task: the default resources aren't the same. However, from either starting point, doing a plan suggests the desired resources would be applied as expected. @chuckyz, any chance you could show the output of a job plan with that job file?
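
(For reference, assuming the jobspec above is saved as job.hcl, that would be:)

$ nomad job plan job.hcl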

from absent sidecar_task

+/- Job: "a"
+/- Task Group: "ingress-group" (1 create/destroy update)
  +/- Service {
        AddressMode:              "auto"
        EnableTagOverride:        "false"
        Meta[envoy_metrics_port]: "${NOMAD_HOST_PORT_envoy_prom}"
        Name:                     "ingress-gateway-grpc"
        PortLabel:                "7000"
        TaskName:                 ""
        Tags {
          Tags: "http"
        }
    +/- ConsulConnect {
        Native: "false"
      + SidecarTask {
        + Resources {
          + CPU:      "2000"
          + DiskMB:   "0"
          + IOPS:     "0"
          + MemoryMB: "2048"
          }
        + LogConfig {
          + MaxFileSizeMB: "10"
          + MaxFiles:      "10"
          }
        }
        }
      }
  +/- Task: "connect-ingress-ingress-gateway-grpc" (forces create/destroy update)
    +/- ShutdownDelay: "5000000000" => "0"
    +/- Resources {
      +/- CPU:      "250" => "2000"
          DiskMB:   "0"
          IOPS:     "0"
      +/- MemoryMB: "128" => "2048"
        }
    +/- LogConfig {
      +/- MaxFileSizeMB: "2" => "10"
      +/- MaxFiles:      "2" => "10"
        }

from empty sidecar_task

+/- Job: "a"
+/- Task Group: "ingress-group" (1 create/destroy update)
  +/- Service {
        AddressMode:              "auto"
        EnableTagOverride:        "false"
        Meta[envoy_metrics_port]: "${NOMAD_HOST_PORT_envoy_prom}"
        Name:                     "ingress-gateway-grpc"
        PortLabel:                "7000"
        TaskName:                 ""
        Tags {
          Tags: "http"
        }
    +/- ConsulConnect {
          Native: "false"
      +/- SidecarTask {
        +/- Resources {
          +/- CPU:      "100" => "2000"
              DiskMB:   "0"
              IOPS:     "0"
          +/- MemoryMB: "300" => "2048"
            }
          }
        }
      }
  +/- Task: "connect-ingress-ingress-gateway-grpc" (forces create/destroy update)
    +/- Resources {
      +/- CPU:      "100" => "2000"
          DiskMB:   "0"
          IOPS:     "0"
      +/- MemoryMB: "300" => "2048"
        }

@gulavanir

gulavanir commented Dec 15, 2021

I am also seeing the same issue: resources added under sidecar_task to configure an ingress gateway job are not honored.

Job file:

job "ingress" {
  region      = "global"
  datacenters = ["dev-setup"]
  type        = "service"

  group "ingress-group" {
    count = 1

    scaling {
      enabled = false
      min     = 1
      max     = 1
      policy  = {}
    }

    constraint {
      operator = "distinct_hosts"
      value    = "true"
    }

    constraint {
      attribute = "${meta.general_compute_linux}"
      value     = "true"
    }

    network {
      mode = "bridge"

      port "inbound" {
        static = 8080
      }

      port "hc" {
        static = 8081
      }
    }

    service {
      name = "ingress"
      port = "inbound"

      // https://doc.traefik.io/traefik/routing/providers/consul-catalog/
      
      tags = [
        "https",
        
        "traefik.enable=true",
        "traefik.http.services.ingress.loadbalancer.server.scheme=https",
        "traefik.http.services.ingress.loadbalancer.serverstransport=ingress-transport@file",
        
      ]

      check {
        type     = "http"
        name     = "ingress-gateway-hc"
        port     = "hc"
        path     = "/ready"
        interval = "5s"
        timeout  = "1s"
      }

      connect {
        sidecar_task {
          resource {
            cpu    = 140
            memory = 200
          }
        }
        gateway {
          proxy {
            connect_timeout = "500ms"

            config {

            }
          }

          ingress {
            tls {
              enabled = true
            }

            listener {
              port     = 8080
              protocol = "http"

              service {
                name = "ingress-http-backend"
                # https://www.envoyproxy.io/docs/envoy/latest/faq/debugging/why_is_my_route_not_found
                hosts = [
                  "localhost",
                  "localhost:29999",
                  "ingress.service.consul",           
                ]
              }
            }
          }
        }
      }
    }
  }
}

The deployed job shows the SidecarTask value as null, and nomad job plan also doesn't show any changes.

"Connect": { "Native": false, "SidecarService": null, "SidecarTask": null, "Gateway": {...} }

We are running Nomad version v1.1.5.
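
For reference, that Connect JSON can be pulled with something like the following, mirroring the inspect/jq usage earlier in this thread (the exact jq path is an assumption):

$ nomad job inspect ingress | jq '.Job.TaskGroups[0].Services[0].Connect'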

@shoenig
Member

shoenig commented Jan 24, 2022

@gulavanir, are you submitting jobs with -hcl1? That jobspec is not valid, but this is only correctly enforced by the new HCL2 parser, e.g.

➜ nomad job run test.nomad 
Error getting job struct: Failed to parse using HCL 2. Use the HCL 1 parser with `nomad run -hcl1`, or address the following issues:
test.nomad:31,4-12: Unsupported block type; Blocks of type "resource" are not expected here. Did you mean "resources"?

Indeed if I submit that job with -hcl1 it gets accepted, but the resources are not set.

➜ nomad job run -hcl1 test.nomad
Task "connect-ingress-ingress" is "running"
Task Resources
CPU        Memory          Disk     Addresses
2/250 MHz  14 MiB/128 MiB  300 MiB 

If I fix that job to set resources (with an s), it runs correctly:

Task "connect-ingress-ingress" is "running"
Task Resources
CPU        Memory          Disk     Addresses
1/140 MHz  14 MiB/200 MiB  300 MiB  
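
For clarity, the corrected block from the job above would be:

sidecar_task {
  resources {   # "resources", not "resource"
    cpu    = 140
    memory = 200
  }
}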

@shoenig
Member

shoenig commented Jan 24, 2022

Ahh @chuckyz, are you also using -hcl1? It seems the resources block is just ignored by the old parser altogether.

@idrennanvmware
Contributor

idrennanvmware commented Jan 24, 2022

@shoenig we enforce the HCL1 spec because we have templates with variables that are not HCL2 compliant. See reference here: #9838

So without an escape character (IIRC a few others suggested something similar) we're caught between a bit of a rock and a hard place.

Am I misreading the referenced GitHub issue, or is this actually now supported?
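
(For anyone following along: the conflict is that HCL2 interprets ${...} inside jobspec strings, including template bodies, as interpolation. HCL2 does offer a $$ escape, sketched below with a made-up variable name, but rewriting every occurrence across existing templates is the impractical part:)

template {
  data        = <<EOF
host = "$${DB_HOST}"   # HCL2 escape; renders literally as ${DB_HOST}
EOF
  destination = "local/app.conf"
}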

@shoenig
Member

shoenig commented Jan 24, 2022

@idrennanvmware you haven't missed anything; we aren't going to drop HCL1 support until there is a usable path to HCL2, which, as you note, doesn't exist yet.

But there are bugs in the hand-written HCL1 parser, and this particular one is right here: we make an outdated assumption that if sidecar_service is absent, there is no need to parse sidecar_task. I'll get a fix ready for this, which we can backport.
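
A minimal, runnable sketch of the flow being described (hypothetical types and names, not the actual Nomad source):

package main

import "fmt"

// Stand-ins for the real parser structures.
type SidecarService struct{}
type SidecarTask struct{ CPU, MemoryMB int }
type Connect struct {
	SidecarService *SidecarService
	SidecarTask    *SidecarTask
}

// parseConnect mimics the buggy HCL1 flow: sidecar_task is only parsed
// when sidecar_service is present, so a gateway-only service
// (SidecarService == nil) silently loses its resources block.
func parseConnect(hasSidecarService bool, raw *SidecarTask) *Connect {
	c := &Connect{}
	if hasSidecarService {
		c.SidecarService = &SidecarService{}
		c.SidecarTask = raw // resources honored on this path only
	}
	return c // gateway-only path: raw was never copied in
}

func main() {
	got := parseConnect(false, &SidecarTask{CPU: 2000, MemoryMB: 2048})
	fmt.Println(got.SidecarTask) // <nil> -> defaults (250 MHz / 128 MB) apply
}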

@shoenig shoenig assigned shoenig and unassigned jrasell Jan 25, 2022
@shoenig shoenig added this to the 1.3.0 milestone Jan 25, 2022
@shoenig shoenig added the stage/accepted (Confirmed, and intend to work on. No timeline commitment though.) and stage/needs-backporting labels and removed the stage/needs-investigation label Jan 25, 2022
shoenig added a commit that referenced this issue Jan 25, 2022
The HCL1 parser did not respect connect.sidecar_task.resources if the
connect.sidecar_service block was not set (an optimization that no longer
makes sense with Connect gateways).

Fixes #10899
@github-actions

I'm going to lock this issue because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active issues.
If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Oct 12, 2022