
Implement prepared query upstreams watching for envoy #5224

Merged: 11 commits into master from feature/4969 (Jan 18, 2019)

Conversation

@mkeeler (Member) commented Jan 15, 2019

Fixes #4969

This implements non-blocking request polling at the cache layer, which is currently only used for prepared queries.

I have everything set up to poll every ~30s (there is a little bit of jitter added). Is 30s reasonable? I'm still thinking about that, but regardless, the general code should be ready.
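
As a rough sketch of the jittered poll interval described here (the constant value and helper name below are illustrative only, not the actual implementation in agent/cache):

```go
package cache

import (
	"math/rand"
	"time"
)

// Illustrative base interval; the real value lives in the cache package and may differ.
const basePollInterval = 30 * time.Second

// nextPollWait returns the base poll interval plus a small random stagger so
// that many agents polling the same prepared query do not all hit the servers
// at exactly the same moment.
func nextPollWait() time.Duration {
	// Add up to 10% jitter on top of the base interval.
	jitter := time.Duration(rand.Int63n(int64(basePollInterval / 10)))
	return basePollInterval + jitter
}
```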

In addition to the new unit tests, I also tested by going through the Connect + Envoy guide with a few modifications:

  1. I used hashicorp/http-echo instead of a regular tcp-echo server, and a container with curl instead of netcat.
  2. I spawned 3 instances of the http-echo servers (echo, echo2, and echo3), each with a different text output so I could tell which node was being used as the upstream.
  3. I used "echo-query" with destination_type = "prepared_query" in my client sidecar upstream definition (roughly like the sketch after this list).
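
For reference, the upstream definition from step 3 would look roughly like the following. This is a hedged sketch: the registering service name and local bind port are assumptions; only destination_type = "prepared_query" and the "echo-query" name come from the description above.

```hcl
service {
  name = "client"
  port = 8080

  connect {
    sidecar_service {
      proxy {
        upstreams = [
          {
            # Route this upstream through the prepared query rather than a service name.
            destination_type = "prepared_query"
            destination_name = "echo-query"
            local_bind_port  = 9191
          }
        ]
      }
    }
  }
}
```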

When creating the query I had it using the "echo" service. I curled the client proxy's local port and got back the first echo server's text response. I then updated the query to use the "echo2" service. After about 15s of curling I started getting back the echo2 server's text response. I flipped it back to "echo" and, again after some time, I started getting responses back from the first echo server.

In general things seem to just work.

@mkeeler requested a review from a team on January 15, 2019 17:59
@mkeeler added this to the 1.4.1 milestone on Jan 15, 2019
[Inline review comments on agent/cache/watch.go and agent/cache/watch_test.go (resolved)]
@banks (Member) left a comment

I think in general this looks awesome and way simpler than I was fearing - good spot that this could be abstracted in Notify.

The only actual change request I have is in code you didn't write (and I did!). The backoff-on-error strategy doesn't add any jitter at all. While different clients might end up at different failure counts during a server outage, it's likely that as the outage length increases (or especially if servers are just slow, so RPCs queue and end up being grouped together), clients will end up on about the same period, so when servers come back up there will be waves of close-together requests as clients in each failure band retry.

I was going to comment that I thought the jitter should be big to avoid this, but actually the jitter on the happy path is less of a concern, since it's harder to get lots of agents starting their notify at exactly the same time than to get ones that implicitly coordinate due to server slowness or an outage.

If we fix error backoff with a bunch of jitter I think this is good for now!

The only other point was that it took a lot of reasoning to figure out the polWait calculation as mentioned inline, and I'm bound to forget that reasoning by tomorrow, so a comment that hints at the rationale might be useful!

[Inline review comments on agent/cache/watch.go and agent/proxycfg/state.go]
@mkeeler (Member, Author) commented Jan 16, 2019

@rboyer and @banks I think the wait time calculations should be much clearer now (and better documented).

@banks (Member) left a comment

Changes all LGTM!

What do you think about changing backOffWait to include lots of lovely jitter @mkeeler?

It's possibly scope creep, but I think it's important here, and arguably more important than with existing blocking clients, since blocking tends to be cheaper than a full prepared query execution, which also demands an immediate response from the server. I'd be wary of landing this PR without that, as it would make server failure recovery potentially much worse once these queries are being used widely.

@mkeeler (Member, Author) commented Jan 17, 2019

@banks I did add some jitter into backOffWait. One thing I am debating is whether the case where failures <= CacheRefreshBackoffMin should have any jitter. In the original implementation the wait time would be 0s until 3 consecutive failures. Is that still okay in the polling scenario? Right now the code might do 3 query executions in succession with no waiting before we start backing off. Then again, in the error case even blocking queries could return immediately and initiate another retry.

It's probably fine as it is, unless we wanted to introduce a minimum error wait time.
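
To make the behavior under discussion concrete, here is a sketch of a backOffWait-style function: zero wait until the failure count passes a minimum threshold, then exponential backoff with jitter up to a cap. The constant names and values below are illustrative, not necessarily what the PR ships:

```go
package cache

import (
	"math/rand"
	"time"
)

const (
	// Illustrative values only.
	backoffMinFailures uint = 3               // no backoff until this many consecutive failures
	backoffMaxWait          = 1 * time.Minute // cap on the backoff interval
)

// backOffWait sketches the discussed behavior: the first few failures retry
// immediately, after which the wait doubles per failure and gets random jitter
// added so that clients don't retry in lockstep.
func backOffWait(failures uint) time.Duration {
	if failures <= backoffMinFailures {
		return 0
	}
	shift := failures - backoffMinFailures
	wait := backoffMaxWait
	if shift < 31 {
		wait = (1 << shift) * time.Second
	}
	if wait > backoffMaxWait {
		wait = backoffMaxWait
	}
	// Jitter: add up to 100% of the computed wait.
	return wait + time.Duration(rand.Int63n(int64(wait)))
}
```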

@banks (Member) left a comment

Hmm yeah maybe a min of a few hundred ms at least is a good plan in general! I'll unblock for now though.

[Inline review comment on agent/cache/cache.go]
@mkeeler (Member, Author) commented Jan 17, 2019

I tried adding a minimum wait time and tons of tests started failing. They assume that the minimum wait is 0.

Note that the tests override this minimum time to zero it out so we don’t add tons of time to tests for no reason.
@mkeeler (Member, Author) commented Jan 17, 2019

So I thought that making the min wait overridable in tests would work, but the added cache latency then breaks other tests outside that package, which are unable to override the value at runtime. Therefore I am going to hold off on changing that, as it is the current behavior anyway.

One other approach that would work would be to add two new files declaring the constants: one for the test config and another for the non-test config. That would be extremely ugly though.

@banks (Member) commented Jan 18, 2019

I agree; happy to land it as-is. The worst case is not any worse than it was and quickly dissipates.

@mkeeler merged commit 7e6b3e6 into master on Jan 18, 2019
@mkeeler deleted the feature/4969 branch on January 23, 2019 14:19