http: delaying attach pending requests #2871
Conversation
Signed-off-by: Lizan Zhou <zlizan@google.com>
Thanks a ton for iterating on this to reach a solution that everyone is happy with. Overall looks good, just a few comments. Thank you.
if (!pending_requests_.empty() && !upstream_ready_posted_) {
  upstream_ready_posted_ = true;
  dispatcher_.post([this]() { onUpstreamReady(); });
Instead of using post(), I would allocate a new "timer" and just enable it with a 0ms timeout here. The reason for this is that when the conn pool is destroyed (as might happen in certain cases), the wakeup is disabled. This pattern is used in a bunch of places for similar reasons.
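A minimal sketch of the suggested pattern, assuming the pool keeps a timer member and the upstream_ready_enabled_ flag that shows up later in the diff (names here are illustrative, not the exact PR code):

// Created once, e.g. in the ConnPoolImpl constructor. Because the Timer object is
// owned by the pool, a pending wakeup is cancelled when the pool is destroyed,
// which is exactly the problem with a raw dispatcher_.post().
upstream_ready_timer_ = dispatcher_.createTimer([this]() { onUpstreamReady(); });

// At the point where the wakeup is needed:
if (!pending_requests_.empty() && !upstream_ready_enabled_) {
  upstream_ready_enabled_ = true;
  upstream_ready_timer_->enableTimer(std::chrono::milliseconds(0));
}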
Done
} else {
void ConnPoolImpl::onUpstreamReady() {
  upstream_ready_posted_ = false;
  while (!pending_requests_.empty() && !ready_clients_.empty()) {
I believe that because of https://github.com/lizan/envoy/blob/11426feab2df2feb5b0cceb0e2b9876840a65c5a/source/common/http/http1/conn_pool.cc#L163, there is no chance of starvation here. I.e., if you lose a ready client we will create new clients if we need them. Please make sure to add a test case for this exact case (if you didn't already).
Do you mean pending_requests_ should always be smaller than ready_clients_? I think that line only makes sure that the sum of ready_clients_ and busy_clients_ is larger than pending_requests_, so we should check both of them here, no?
I mean that when connections die, we should make a new connection if the number of existing ready + busy connections is not sufficient. I think the code already does that. I just want to make sure we have test coverage of this case. Does the integration test you have cover it? (Basically I think the code you have is correct. I just want to make sure we have good coverage).
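For reference, the check being discussed is roughly the following (a hedged paraphrase of the linked conn_pool.cc logic, not a verbatim quote): when a client goes away, a replacement connection is created if the remaining clients can no longer cover the pending requests.

// Paraphrased sketch: replace a lost client when the ready + busy clients no
// longer cover the pending requests; this is what prevents starvation.
if (pending_requests_.size() > (ready_clients_.size() + busy_clients_.size())) {
  createNewConnection();
}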
OK, the new integration test I added tests that behavior (and it would have failed before this PR because of #2715).
Signed-off-by: Lizan Zhou <zlizan@google.com>
LGTM, thanks. One test comment.
mock_dispatcher_(dispatcher), mock_upstream_ready_timer_(upstream_ready_timer) {
  ON_CALL(*mock_upstream_ready_timer_, enableTimer(std::chrono::milliseconds(0)))
      .WillByDefault(Invoke(
          [&](const std::chrono::milliseconds&) { mock_upstream_ready_timer_->callback_(); }));
The fact that this gets invoked inline is not an accurate view of how the real code will work. I would not add this ON_CALL, and would explicitly fire the ready event when needed, outside of the immediate call stack. I realize this is tedious, but given how important it is that this code is correct and how we would like it to track the real code as much as possible, I think it's worth it.
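A sketch of the alternative being asked for (illustrative, assuming the MockTimer with its callback_ member as used above): drop the ON_CALL and fire the captured timer callback explicitly, outside the call stack that armed it.

// Expect the pool to arm the 0ms upstream-ready timer...
EXPECT_CALL(*mock_upstream_ready_timer_, enableTimer(std::chrono::milliseconds(0)));

// ...drive the code path that completes a response and schedules the wakeup...

// ...then, outside the immediate call stack, fire the timer by hand, which is
// closer to how the real dispatcher delivers the wakeup on the next loop.
mock_upstream_ready_timer_->callback_();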
Done.
Going to unassign myself as I'll be unavailable next week. @alyssawilk can assign another non-senior reviewer on Monday as necessary.
@mattklein123 is assigned and he can cover both senior and cross-company. Matt, I think you're most familiar with this code anyway, but let me know if you want a second pair of eyes on this one and I can take a look as well!
Signed-off-by: Lizan Zhou <zlizan@google.com>
@lizan can you check tests?
Signed-off-by: Lizan Zhou <zlizan@google.com>
Yep, I'm trying to reproduce them locally.
Hmm, I cannot reproduce them outside CircleCI (on my macOS, or on GCE with or without docker)... Will continue to investigate; does anyone have an idea?
@lizan I'm guessing the test failures are spurious. Can you just push a dummy commit?
It seems consistent in CircleCI; I've pushed several merge commits, each of which triggered a CI run.
@lizan there must be some timing issue here. I haven't looked, but given that ASAN/TSAN failed, maybe that's a clue? Or try running in a loop on your machine?
Yeah, that's my suspicion too. I ran with runs_per_test=100 with no luck; will try more...
The latest ASAN/TSAN failure is a JVM OOM.
Force-pushed from ed84a4e to 76d7827
Oops, enabling CircleCI in my fork resulted in the CI results from there (with fewer resources, hence the OOM) being reported here... I'll add some dummy commits to debug the CI; feel free to ignore the notifications.
@lizan friendly ping. What's the status of this one?
I'm still trying to fix the test but have had no luck yet. It never reproduces on my local machine and has failed only once on Istio CircleCI; it seems to fail only the first time it runs on a machine, or something similarly weird. I'll get back to this later this week.
Signed-off-by: Lizan Zhou <zlizan@google.com>
Force-pushed from cdfa577 to 49e831e
@mattklein123 I changed the logic a bit (not using the timer for the onConnected event); it seems to have fixed the tests and it also performs better. PTAL again. Thanks!
LGTM, thanks for the follow up here. Would like @alyssawilk to reconfirm that this works for her.
source/common/http/http1/conn_pool.h
@@ -134,6 +134,9 @@ class ConnPoolImpl : Logger::Loggable<Logger::Id::pool>, public ConnectionPool::
   std::list<DrainedCb> drained_callbacks_;
   Upstream::ResourcePriority priority_;
   const Network::ConnectionSocket::OptionsSharedPtr socket_options_;
nit: del newline
I definitely like this approach; thanks for getting this working, @lizan. A few nits, one concern about the test, and I would love a test for "stray data" killing off the connection (which I think may remove a latent race anyway).
   client.stream_wrapper_.reset();
-  if (pending_requests_.empty()) {
+  if (pending_requests_.empty() || delay) {
     // There is nothing to service so just move the connection into the ready list.
Update comment please
@@ -209,25 +215,44 @@ void ConnPoolImpl::onResponseComplete(ActiveClient& client) {
     host_->cluster().stats().upstream_cx_max_requests_.inc();
     onDownstreamReset(client);
   } else {
-    processIdleClient(client);
+    processIdleClient(client, true);
Can we comment why, perhaps with a link to the original issue to make it clear why we are doing this?
void ConnPoolImpl::onUpstreamReady() {
  upstream_ready_enabled_ = false;
  while (!pending_requests_.empty() && !ready_clients_.empty()) {
sanity check: this is simply deferring work we used to do in the last dispatcher loop to the next loop, so there's no corner case we'll end up batching together too much work and triggering the watchdog, right?
Batching would only happen when multiple upstream connections complete responses at the same time (in the same epoll callback); I don't think that will be enough work to trigger the watchdog.
// Response 1.
upstream_request_->encodeHeaders(Http::TestHeaderMapImpl{{":status", "200"}}, false);
upstream_request_->encodeData(512, true);
fake_upstream_connection_->close();
Sorry, I may not have had sufficient caffeine, but how are we making sure that the test Envoy is receiving the FIN before assigning the next request? If there were a context switch between encodeData and the close(), couldn't Envoy read the whole response and reassign the upstream before the close() occurs? We may need to unit test this rather than integration test it. Alternatively, we could do a raw TCP connection and send the response and stray data in one write call. I think that would guarantee no race, and I'd hope any activity on the delayed-use connection would cause it to be removed.
The dispatcher is not running until the next wait* call, so on the very next line (waitForEndStream) the dispatcher runs and the connection reads the 512 bytes and the FIN in the same event. So the request is not scheduled before the close() occurs.
I also added a unit test in conn_pool_test.
Yeah, I'd like to think this will be fine; we've just had a spate of macOS integration flakes due to slightly different network semantics. We can keep an eye out for test flakes and remove this if it causes problems, now that we have a unit test. Tagging @zuercher just so it's on his radar.
Signed-off-by: Lizan Zhou <zlizan@google.com>
Signed-off-by: Lizan Zhou <zlizan@google.com>
@@ -236,7 +239,8 @@ void ConnPoolImpl::onUpstreamReady() {
 void ConnPoolImpl::processIdleClient(ActiveClient& client, bool delay) {
   client.stream_wrapper_.reset();
   if (pending_requests_.empty() || delay) {
-    // There is nothing to service so just move the connection into the ready list.
+    // There is nothing to service or delay processing is requested, so just move the connection
delay -> delayed
Done
Looks great! Let's fix that one last typo and I'll merge it in.
Attach pending upstream requests in next event after onResponseComplete.
Risk Level: Medium
Testing: unit test, integration test
Docs Changes: N/A
Release Notes: N/A
Fixes envoyproxy#2715
Signed-off-by: Lizan Zhou <zlizan@google.com>
Signed-off-by: Rama <rama.rao@salesforce.com>
   if (pending_requests_.empty() || delay) {
-    // There is nothing to service so just move the connection into the ready list.
+    // There is nothing to service or delayed processing is requested, so just move the connection
+    // into the ready list.
     ENVOY_CONN_LOG(debug, "moving to ready", *client.codec_client_);
     client.moveBetweenLists(busy_clients_, ready_clients_);
@lizan It's too early to mark the client ready in this branch; we expect another read attempt to see if there is an EOF. This client could be immediately used by another downstream request. At the next cycle, the dispatcher always handles the write attempt first, so even if there is an EOF in the read buffer, that means a 503 for this upstream request.
At the end of the upstream read event when we reach EOF the client will be removed so this is safe. https://github.com/envoyproxy/envoy/blob/v1.10.0/source/common/network/connection_impl.cc#L486
Signed-off-by: Lizan Zhou zlizan@google.com
Description:
Another approach to #2715: attach pending requests in the next event after onResponseComplete.
Risk Level: Low
Testing: unit test, integration test
Docs Changes: N/A
Release Notes: N/A
Fixes #2715