
Implement prepared query upstreams watching for envoy #5224

Merged: 11 commits into master from feature/4969 (Jan 18, 2019)

Conversation

@mkeeler (Member) commented Jan 15, 2019

Fixes #4969

This implements non-blocking request polling at the cache layer, which is currently only used for prepared queries.

I have everything set up to poll every ~30s (there is a little bit of jitter added). Is 30s reasonable? I'm still thinking about that, but regardless, the general code should be ready.
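
As a rough sketch of the jittered poll interval described here (the constant value and helper name below are illustrative only, not the actual implementation in agent/cache):

```go
package cache

import (
	"math/rand"
	"time"
)

// Illustrative base interval; the real value lives in the cache package and may differ.
const basePollInterval = 30 * time.Second

// nextPollWait returns the base poll interval plus a small random stagger so
// that many agents polling the same prepared query do not all hit the servers
// at exactly the same moment.
func nextPollWait() time.Duration {
	// Add up to 10% jitter on top of the base interval.
	jitter := time.Duration(rand.Int63n(int64(basePollInterval / 10)))
	return basePollInterval + jitter
}
```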

In addition to the new unit tests, I also tested by going through the Connect + Envoy guide with a few modifications:

  1. I used hashicorp/http-echo instead of a regular tcp-echo server, and a container with curl instead of netcat.
  2. I spawned 3 instances of the http-echo servers (echo, echo2, and echo3), each with a different text output so I could tell which node was being used as the upstream.
  3. I used "echo-query" with destination_type = "prepared_query" in my client sidecar upstream definition (roughly like the sketch after this list).
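
For reference, the upstream definition from step 3 would look roughly like the following. This is a hedged sketch: the registering service name and local bind port are assumptions; only destination_type = "prepared_query" and the "echo-query" name come from the description above.

```hcl
service {
  name = "client"
  port = 8080

  connect {
    sidecar_service {
      proxy {
        upstreams = [
          {
            # Route this upstream through the prepared query rather than a service name.
            destination_type = "prepared_query"
            destination_name = "echo-query"
            local_bind_port  = 9191
          }
        ]
      }
    }
  }
}
```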

When creating the query I had it using the "echo" service. I curled the client proxy's local port and got back the first echo server's text response. I then updated the query to use the "echo2" service. After about 15s of curling I started getting back the echo2 server's text response. I flipped it back to "echo" and, again after some time, I started getting responses back from the first echo server.

In general things seem to just work.

@mkeeler requested a review from a team on January 15, 2019 17:59
@mkeeler added this to the 1.4.1 milestone on Jan 15, 2019
[Inline review comments on agent/cache/watch.go and agent/cache/watch_test.go (resolved)]
@banks (Member) left a comment

I think in general this looks awesome and way simpler than I was fearing - good spot that this could be abstracted in Notify.

The only actual change request I have is in code you didn't write (and I did!). The backoff-on-error strategy doesn't add any jitter at all. While different clients might end up at different failure counts during a server outage, it's likely that as the outage length increases (or especially if servers are just slow, so RPCs queue and end up being grouped together), clients will end up on about the same period, so when servers come back up there will be waves of close-together requests as clients in each failure band retry.

I was going to comment that I thought the jitter should be big to avoid this, but actually the jitter on the happy path is less of a concern, since it's harder to get lots of agents starting their notify at exactly the same time than to get ones that implicitly coordinate due to server slowness or an outage.

If we fix error backoff with a bunch of jitter I think this is good for now!

The only other point was that it took a lot of reasoning to figure out the polWait calculation as mentioned inline, and I'm bound to forget that reasoning by tomorrow, so a comment that hints at the rationale might be useful!

[Inline review comments on agent/cache/watch.go and agent/proxycfg/state.go]
@mkeeler (Member, Author) commented Jan 16, 2019

@rboyer and @banks I think the wait time calculations should be much clearer now (and better documented).

@banks (Member) left a comment

Changes all LGTM!

What do you think about changing backOffWait to include lots of lovely jitter @mkeeler?

It's possibly scope creep, but I think it's important here, and arguably more important than with existing blocking clients, since blocking tends to be cheaper than a full prepared query execution, which also demands an immediate response from the server. I'd be wary of landing this PR without that, as it would make server failure recovery potentially much worse once these queries are being used widely.

@mkeeler (Member, Author) commented Jan 17, 2019

@banks I did add some jitter into backOffWait. One thing I am debating is whether the case where failures <= CacheRefreshBackoffMin should have any jitter. In the original implementation the wait time would be 0s until 3 consecutive failures. Is that still okay in the polling scenario? Right now the code might do 3 query executions in succession with no waiting before we start backing off. Then again, in the error case even blocking queries could return immediately and initiate another retry.

It's probably fine as it is, unless we wanted to introduce a minimum error wait time.
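
To make the behavior under discussion concrete, here is a sketch of a backOffWait-style function: zero wait until the failure count passes a minimum threshold, then exponential backoff with jitter up to a cap. The constant names and values below are illustrative, not necessarily what the PR ships:

```go
package cache

import (
	"math/rand"
	"time"
)

const (
	// Illustrative values only.
	backoffMinFailures uint = 3               // no backoff until this many consecutive failures
	backoffMaxWait          = 1 * time.Minute // cap on the backoff interval
)

// backOffWait sketches the discussed behavior: the first few failures retry
// immediately, after which the wait doubles per failure and gets random jitter
// added so that clients don't retry in lockstep.
func backOffWait(failures uint) time.Duration {
	if failures <= backoffMinFailures {
		return 0
	}
	shift := failures - backoffMinFailures
	wait := backoffMaxWait
	if shift < 31 {
		wait = (1 << shift) * time.Second
	}
	if wait > backoffMaxWait {
		wait = backoffMaxWait
	}
	// Jitter: add up to 100% of the computed wait.
	return wait + time.Duration(rand.Int63n(int64(wait)))
}
```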

@banks (Member) left a comment

Hmm yeah maybe a min of a few hundred ms at least is a good plan in general! I'll unblock for now though.

[Inline review comment on agent/cache/cache.go]
@mkeeler (Member, Author) commented Jan 17, 2019

I tried adding a minimum wait time and tons of tests started failing. They assume that the minimum wait is 0.

Note that the tests override this minimum time to zero it out so we don’t add tons of time to tests for no reason.
@mkeeler (Member, Author) commented Jan 17, 2019

So I thought that making the min wait overridable in tests would work, but the added cache latency then breaks other tests outside that package, which are unable to override the value at runtime. Therefore I am going to hold off on changing that, as it is the current behavior anyway.

One other approach that would work would be to add two new files declaring the constants: one for the test config and another for the non-test config. That would be extremely ugly though.

@banks (Member) commented Jan 18, 2019

I agree; happy to land it as-is. The worst case is not any worse than it was and quickly dissipates.

@mkeeler merged commit 7e6b3e6 into master on Jan 18, 2019
@mkeeler deleted the feature/4969 branch on January 23, 2019 14:19