Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

connect: Add local_request_timeout_ms to allow configuring the Envoy request timeout on local_app #9554

Merged
merged 1 commit into from
Jan 25, 2021

Conversation

chrisboulton
Copy link
Contributor

@chrisboulton chrisboulton commented Jan 12, 2021

Per #6382, there are two HTTP request timeouts that are configurable for meshed traffic: the egress Envoy, and the ingress Envoy on local_app. The request timeout in Envoy defaults to 15s. Today, using a service-router configuration entry, you can change the Egress envoy timeout, but it's not possible without using an escape hatch override to tweak the timeout for local_app. This change makes that possible in the same way that you can change local_connect_timeout_ms: via the proxy configuration or a proxy default.

The exact timeout we're interested in tweaking in Envoy is the route action timeout: https://www.envoyproxy.io/docs/envoy/latest/api-v3/config/route/v3/route_components.proto#envoy-v3-api-field-config-route-v3-routeaction-timeout

By default, today's behaviour remains the same: the Envoy route timeout is not configured, which means Consul Connect/Envoy will use the default in Envoy: 15s. I could go either way on this: always explicitly configuring the timeout (same way as local_connect_timeout_ms works, or just leaving out of the configuration unless the user explicitly configures it.

If you explicitly set local_request_timeout_ms = 0, it will disable the timeout in local_app completely (to allow for streaming connections with an infinite lifetime). Other values will set the appropriate timeout.

The request timeout only applies to HTTP based listeners.

Configured via proxy-defaults:

Kind      = "proxy-defaults"
Name      = "global"
Namespace = "default"
Config {
  local_request_timeout_ms = 1000
}

Configurable via proxy config on a service: (snip from Nomad job)

connect {
  sidecar_service {
    proxy {
      local_service_port  = 8080
      config {
        local_request_timeout_ms = 0
      }
    }
  }
}

Resulting route configuration:

{
 "route_config": {
  "@type": "type.googleapis.com/envoy.api.v2.RouteConfiguration",
  "name": "public_listener",
  "virtual_hosts": [
   {
    "name": "public_listener",
    "domains": [
     "*"
    ],
    "routes": [
     {
      "match": {
       "prefix": "/"
      },
      "route": {
       "cluster": "local_app",
       "timeout": "0s"
      }
     }
    ]
   }
  ]
 },
 "last_updated": "2021-01-12T16:42:47.038Z"
}

Most of what I'm doing with Connect code in Consul at the moment is based on exploratory code spelunking - I'd welcome suggestions if there's a better to this.

Alternatives and Other Things

In my comment on #6382 (comment), I discussed the other request timeout, and its configurability only existing with the explicit creation of a service-router - I think this is a bit of a usability issue, it'd be great with some kind of default to be able to globally change this timeout in one hit. There's also some commentary I left regarding enabling the respect_expected_rq_timeout flag, which allows timeout headers to be propagated between Envoys.

@hashicorp-cla
Copy link

hashicorp-cla commented Jan 12, 2021

CLA assistant check
All committers have signed the CLA.

@chrisboulton
Copy link
Contributor Author

@rboyer - apologies for the direct mention, but I was curious if you or the team had any feedback on this or the callouts in #6382, where I did notice you tagged/triaged the issue.

@rboyer
Copy link
Member

rboyer commented Jan 22, 2021

The specific changes in this PR make sense and you even managed to do the extra bits of adding tests and documentation updates. 🎉

Could you do one last thing on this branch before I merge it? Please create a .changelog/9554.txt file like this with the contents:

```release-note:feature
connect: Add local_request_timeout_ms to allow configuring the Envoy request timeout on local_app
```

Then our automated changelog generation code will make sure it's added to the right places during release time.

@vercel
Copy link

vercel bot commented Jan 23, 2021

@chrisboulton is attempting to deploy a commit to the HashiCorp Team on Vercel.

A member of the Team first needs to authorize it.

@vercel
Copy link

vercel bot commented Jan 23, 2021

This pull request is being automatically deployed with Vercel (learn more).
To see the status of your deployments, click below or on the icon next to each commit.

consul-ui-staging – ./ui

🔍 Inspect: https://vercel.com/hashicorp/consul-ui-staging/etnjiqkb5
✅ Preview: Canceled

[Deployment for 68aee94 canceled]

@vercel vercel bot temporarily deployed to Preview – consul-ui-staging January 23, 2021 17:18 Inactive
@chrisboulton
Copy link
Contributor Author

chrisboulton commented Jan 23, 2021

@rboyer Yup, have just pushed a change log entry up, and also rebased. Super appreciate you taking a look at this!

Copy link
Member

@rboyer rboyer left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@rboyer rboyer merged commit 8a35df8 into hashicorp:master Jan 25, 2021
@hashicorp-ci
Copy link
Contributor

🍒 If backport labels were added before merging, cherry-picking will start automatically.

To retroactively trigger a backport after merging, add backport labels and re-run https://circleci.com/gh/hashicorp/consul/316661.

@fredwangwang
Copy link
Contributor

Would this change be added to 1.9.x?

@blake
Copy link
Member

blake commented Feb 20, 2021

Hi @fredwangwang since this is a new feature and not a bug/stability/security fix, it won't be backported to 1.9.x. It will be available in the Consul 1.10.

@arushi315
Copy link

arushi315 commented May 5, 2021

Hi @chrisboulton
Thank you so much for contributing to this issue. For my service that has long polling connections (~2mins), I am exploring to increase the timeout.
I upgraded to consul 1.10 beta and this is what I am trying: (snippet from nomad job for my service)

 connect {
        sidecar_service {
          proxy {
            config {
              local_request_timeout_ms = 0
            }
          }
        }
      }

Tried with specific timeout as well - local_request_timeout_ms = 360000
No luck with this. I still get 504 Gateway Timeout - "upstream request timeout" after 15seconds.
I tried updating the timeout in route config as well for ingress job but that didn't help.

 "route": {
          "prefix_rewrite": "/ready",
           "cluster": "self_admin", 
           "timeout": "0s"
   }

I am new to this so still trying to understand how the configuration works. Would really appreciate any insight/suggestion on this.
Thank you.

@chrisboulton
Copy link
Contributor Author

Hey @arushi315 -- I haven't pulled down the Consul 1.0 beta to mess with this myself yet, but keep in mind there are two places that can be timing out. Based on my understanding of your setup, these would be:

  • The Envoy/Consul Connect on your ingress gateway that talks to the service (we'll call this the egress Envoy)
  • The Envoy/Consul Connect that connects locally to your service (we'll call this the ingress or Envoy)

This PR changes the local_app of the Envoy ingress cluster, only.

I think you've found that you also need to change the timeout on the egress Envoy - the only way (per my comment here: #6382 (comment) - "Upstream Request Timeouts") to do that today is with a service-router entry` -- or my branch/change that hardcodes all those values to zero.

Looking at what you pasted, I also think you pasted the wrong block from the ingress gateway -- that one is specifically a route for handling the self_admin Envoy cluster, not the actual traffic entering the ingress gateway for your service. You'll want to dump the Envoy config, and look in the dynamic_route_configs section for a type.googleapis.com/envoy.api.v2.RouteConfiguration entry which includes an entry like so to see if you have a timeout configured there, for example:

      "virtual_hosts": [
       {
        "name": "xxx",
        "domains": [
         "*",
         "*:14145"
        ],
        "routes": [
         {
          "match": {
           "prefix": "/"
          },
          "route": {
           "cluster": "xxx.internal.7e225fe1-9399-9945-8548-77c97d033ff2.consul",
           "timeout": "0s"
          }

@chrisboulton chrisboulton deleted the request-timeout branch May 5, 2021 23:12
@arushi315
Copy link

Thank you @chrisboulton for the quick response and explaining this.
Let me review the envoy configuration you suggested and also look into setting up service-router.

@arushi315
Copy link

Hi @chrisboulton,
Just to clarify I can have "local_request_timeout_ms" either in proxy default or proxy config for my service, right?
Configured via proxy-defaults:

Kind      = "proxy-defaults"
Name      = "global"
Namespace = "default"
Config {
  local_request_timeout_ms = 1000
}

OR

Configurable via proxy config on a service: (snip from Nomad job)

connect {
  sidecar_service {
    proxy {
      config {
        local_request_timeout_ms = 0
      }
    }
  }
}

As of now I have this configuration:
proxy config on a service: (snip from Nomad job):

connect {
       sidecar_service {
         proxy {
           config {
             local_request_timeout_ms = 0
           }
         }
       }
     }

and service-router:

{
            "Kind": "service-router",
            "Name": "myservice",
            "Routes": [
                {
                    "Destination": {
                        "Service": "",
                        "RetryOnConnectFailure": true,
                        "RequestTimeout": "0s"
                    }
                }
            ]
        }

Checking the envoy configuration, I do not see "timeout"

"dynamic_route_configs": [
    {
     "version_info": "00000006",
     "route_config": {
      "@type": "type.googleapis.com/envoy.config.route.v3.RouteConfiguration",
      "name": "8080",
      "virtual_hosts": [
       {
        "name": "ingress-http-backend",
        "domains": [
         "*",
         "*:8080"
        ],
        "routes": [
         {
          "match": {
           "prefix": "/myservice"
          },
          "route": {
           "cluster": "myservice.default.dev-setup.internal.0d84ced6-a530-9c50-5c7d-28f07b1018a8.consul",
           "hash_policy": [
            {
             "terminal": true,
             "query_parameter": {
              "name": "myservicesessionid"
             }
            },
            {
             "header": {
              "header_name": "myservicesessionid"
             },
             "terminal": true
            }
           ]
          }
         },
         {
          "match": {
           "prefix": "/"
          },
          "route": {
           "cluster": "ingress-http-backend.default.dev-setup.internal.0d84ced6-a530-9c50-5c7d-28f07b1018a8.consul"
          }
         }
        ]
       }
      ]

Anything you see that I am missing here?

@arushi315
Copy link

@chrisboulton I finally got it working. My teammate corrected my configuration :)

"dynamic_route_configs": [
    {
     "version_info": "00000006",
     "route_config": {
      "@type": "type.googleapis.com/envoy.config.route.v3.RouteConfiguration",
      "name": "8080",
      "virtual_hosts": [
       {
        "name": "ingress-http-backend",
        "domains": [
         "*",
         "*:8080"
        ],
        "routes": [
         {
          "match": {
           "prefix": "/myservice"
          },
          "route": {
           "cluster": "myservice.default.dev-setup.internal.0d84ced6-a530-9c50-5c7d-28f07b1018a8.consul",
           "timeout": "999s",
           "hash_policy": [
            {
             "terminal": true,
             "query_parameter": {
              "name": "myservicesessionid"
             }
            },
            {
             "header": {
              "header_name": "myservicesessionid"
             },
             "terminal": true
            }
           ]
          }
         },
         {
          "match": {
           "prefix": "/"
          },
          "route": {
           "cluster": "ingress-http-backend.default.dev-setup.internal.0d84ced6-a530-9c50-5c7d-28f07b1018a8.consul"
          }
         }
        ]
       }
      ]

Thank you again @chrisboulton for helping.

@kds-rune
Copy link

kds-rune commented Jun 16, 2021

Using:

  • Nomad 1.1.1
  • Consul 1.10.0-beta4

I'm sorry; itis a bit confusing for me how this should work. What did you change in your example when you say you got it working, @arushi315 ?

I have: client -> traefik proxy -> consul ingress gateway -> api-service (soap)

When I made a long-running request, it timed out after 15s ("upstream request timeout")
I then upgraded to consul 1.10.0-beta4 (from 1.9.6), and added the configuration(s) shown below.

Consul

  • Setting "RequestTimeout" to 0 or "0s" in "service-router" still caused "upstream request timeout" after 15s
  • Setting "RequestTimeout" to higher value in "service-router" fixed the timeout. I now i get "java socket timeout exception" <-> "broken pipe" instead, after 60s. Down the rabbithole; another envoy timeout setting?
  • Adding "proxy-defaults" (below) didn't seem to make any difference for me. Tried with value of 0 & value of 3600000. The "RequestTimeout" on "service-router" seemed to control the timeout. I also removed the "local_request_timeout_ms" from the service settings on the nomad job, and it didn't seem to make any difference with/without "proxy-defaults".

Proxy defaults

{
    "Kind": "proxy-defaults",
    "Name": "global",
    "Config": {
        "local_request_timeout_ms": 600000
    }
}

Backend service-router

{
    "Kind": "service-router",
    "Name": "my-service",
    "Routes": [
        {
            "Match": {
                "HTTP": {}
            },
            "Destination": {
                "RequestTimeout": "10m0s"
            }
        }
    ],
 }

@arushi315
Copy link

Hi @kds-rune

Using Consul 1.10.0 beta version and setting the "RequestTimeout" did the trick for me and I didn't need to add anything in the proxy-defaults.
In my scenario I am not using traefik proxy so not aware of any timeout limitation there and there is no timeout set on the backend service itself.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

8 participants