Implement prepared query upstreams watching for envoy #5224
Conversation
Co-Authored-By: mkeeler <mkeeler@users.noreply.github.com>
I think in general this looks awesome and way simpler than I was fearing - good spot that this could be abstracted in Notify.
The only actual change request I have is in code you didn't write (and I did!). The backoff-on-error strategy doesn't add any jitter at all. While different clients might end up at different failure counts during a server outage, as the outage length increases (or especially if servers are just slow, so RPCs queue and end up grouped together) clients will likely converge on about the same period, so when servers come back up there will be waves of close-together requests as the clients in each failure band retry.
I was going to comment that I thought jitter should be big to avoid this but actually the jitter on the happy path is less of a concern since it's harder to get lots of agents starting their notify at exactly the same time than ones that implicitly coordinate due to server slowness or outage.
If we fix error backoff with a bunch of jitter I think this is good for now!
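To make the concern concrete, here is a minimal sketch of a jittered error backoff. The name `backOffWait` echoes the conversation, but the base, cap, and "full jitter" strategy are illustrative assumptions, not Consul's actual implementation:

```go
package main

import (
	"fmt"
	"math/rand"
	"time"
)

// Illustrative constants; the real values would be tuned for Consul.
const (
	retryBackoffBase = 1 * time.Second
	retryBackoffMax  = 1 * time.Minute
)

// backOffWait returns how long to wait after `failures` consecutive
// errors, with full jitter applied to the exponential backoff.
func backOffWait(failures uint) time.Duration {
	if failures == 0 {
		return 0
	}
	// Cap the shift so the exponential term cannot overflow.
	if failures > 7 {
		failures = 7
	}
	wait := retryBackoffBase * time.Duration(1<<(failures-1))
	if wait > retryBackoffMax {
		wait = retryBackoffMax
	}
	// Full jitter: a uniform random wait in [0, wait) spreads out
	// clients that all failed at the same moment, avoiding the
	// thundering-herd waves described above.
	return time.Duration(rand.Int63n(int64(wait)))
}

func main() {
	for f := uint(1); f <= 5; f++ {
		fmt.Printf("failures=%d wait=%v\n", f, backOffWait(f))
	}
}
```

With full jitter, two clients at the same failure count still retry at different times, which is exactly the decorrelation the review is asking for.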
The only other point is that it took a lot of reasoning to figure out the polWait calculation, as mentioned inline, and I'm bound to forget that reasoning by tomorrow, so a comment that hints at the rationale might be useful!
Changes all LGTM!
What do you think about changing backOffWait to include lots of lovely jitter @mkeeler?
It's possibly scope creep, but I think it's important here, and arguably more important than with existing blocking clients, since blocking tends to be cheaper than a full prepared query execution, which also demands an immediate response from the server. I'd be wary of landing this PR without it, as that would make recovery from server failures potentially much worse once these queries are widely used.
@banks I did add some jitter into the backoff. It's probably fine as it is unless we wanted to introduce a minimum error wait time.
Hmm, yeah, maybe a minimum of a few hundred ms at least is a good plan in general! I'll unblock for now though.
I tried adding a minimum wait time and tons of tests start failing; they assume that the minimum wait is 0.
Note that the tests override this minimum time to zero it out so we don’t add tons of time to tests for no reason.
… just override the value
So I thought that making the min wait overridable in tests would work, but the added cache latency then breaks other tests outside that package, which are unable to override the value at runtime. Therefore I am going to hold off on changing that, as it is the current behavior anyway. One other approach that would work would be to declare the constants in two new files, one for the test config and another for the non-test config, but that would be extremely ugly.
I agree, happy to land it as-is; the worst case is no worse than before and quickly dissipates.
Fixes #4969
This implements non-blocking request polling at the cache layer, which is currently only used for prepared queries.
I have everything set up to poll every ~30s (with a little bit of jitter added). Is 30s reasonable? I'm still thinking on that, but regardless the general code should be ready.
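The "~30s plus a little jitter" schedule could be sketched like this; the 30s base and 10% jitter factor are assumptions standing in for the PR's exact numbers:

```go
package main

import (
	"fmt"
	"math/rand"
	"time"
)

// basePollInterval is the nominal poll period; the PR describes ~30s.
const basePollInterval = 30 * time.Second

// nextPoll returns the base interval plus up to 10% random jitter so
// agents don't all re-poll the servers in lockstep.
func nextPoll() time.Duration {
	jitter := time.Duration(rand.Int63n(int64(basePollInterval / 10)))
	return basePollInterval + jitter
}

func main() {
	// Each call lands somewhere in [30s, 33s).
	fmt.Println(nextPoll())
}
```

A polling loop would then sleep for `nextPoll()` between non-blocking fetches instead of using a fixed timer.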
In addition to the new unit tests I also tested going through the connect + envoy guide with a few modifications.
destination_type = "prepared_query"
in my client sidecar upstream definition. When creating the query I had it use the "echo" service. I curled the client proxy's local port and got back the first echo server's text response. I then updated the query to use the "echo2" service. After about 15s of curling I started getting back the echo2 server's text response. I flipped it back to echo1, and again after some time I started getting responses back from the first echo server.
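For context, a sidecar upstream using that destination type might look roughly like the following; only `destination_type = "prepared_query"` comes from the text above, while the surrounding stanza, service name, and port are assumptions based on Consul's connect proxy configuration:

```hcl
proxy {
  upstreams = [
    {
      destination_name = "echo"
      destination_type = "prepared_query"
      local_bind_port  = 9191
    }
  ]
}
```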
In general things seem to just work.