
set_node_on_first_message_only seems to never set node ID #9682

Closed
howardjohn opened this issue Jan 14, 2020 · 20 comments · Fixed by #14080

@howardjohn
Contributor

Issue Template

Title: set_node_on_first_message_only seems to never set node ID

Description:
I turned on set_node_on_first_message_only, and now things are breaking because it appears the node ID is never actually sent. Turning on verbose logging on the XDS server shows that the first request we receive looks like:

version_info:"2020-01-14T23:12:41Z/1" type_url:"type.googleapis.com/envoy.api.v2.Cluster" response_nonce:"uj8hmiCQEes=2abe5090-3bef-43b8-be61-899e9e61623a"

Note the missing node ID.
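For context, with set_node_on_first_message_only enabled the server is expected to cache the node from the first request on each stream and reuse it for later requests that omit it. A minimal Go sketch of that expectation, using go-control-plane v2 types (the handler shape and error handling are illustrative, not Istio's actual code):

```go
package xds

import (
	"errors"
	"io"

	core "github.com/envoyproxy/go-control-plane/envoy/api/v2/core"
	discovery "github.com/envoyproxy/go-control-plane/envoy/service/discovery/v2"
)

// streamAggregated caches the node from the first DiscoveryRequest on the
// stream; later requests legitimately omit it. The failure in this issue is
// that the first request of a *reconnected* stream also omits it, so the
// server can never identify the client.
func streamAggregated(stream discovery.AggregatedDiscoveryService_StreamAggregatedResourcesServer) error {
	var node *core.Node // per-stream cache, filled from the first request
	for {
		req, err := stream.Recv()
		if err == io.EOF {
			return nil
		}
		if err != nil {
			return err
		}
		if req.GetNode().GetId() != "" {
			node = req.Node
		}
		if node == nil {
			// The symptom reported here: no node on the stream's first request.
			return errors.New("no node on first request of stream")
		}
		// ... look up config for node and respond ...
	}
}
```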

@kyessenov
Contributor

Can you point me to a binary and confirm this is a CDS request?

@howardjohn
Contributor Author

@kyessenov it's from this PR on the Istio side: istio/istio#19266.

The above is the very first request we see. It's possible we are somehow dropping a request, but I was logging right after Recv, so it would have to be a pretty low-level issue.
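"Logging right after Recv" means something like the hedged helper below: the request is printed before any other processing, so a node present on the wire would necessarily show up here (go-control-plane v2 types; recvAndLog is a made-up name, not Istio's actual code):

```go
package xds

import (
	"log"

	v2 "github.com/envoyproxy/go-control-plane/envoy/api/v2"
	discovery "github.com/envoyproxy/go-control-plane/envoy/service/discovery/v2"
)

// recvAndLog logs each DiscoveryRequest immediately after Recv, before any
// processing, so a missing node here means the node was genuinely absent on
// the wire rather than dropped by server-side handling.
func recvAndLog(stream discovery.AggregatedDiscoveryService_StreamAggregatedResourcesServer) (*v2.DiscoveryRequest, error) {
	req, err := stream.Recv()
	if err != nil {
		return nil, err
	}
	log.Printf("ads request: node=%q type=%q nonce=%q",
		req.GetNode().GetId(), req.GetTypeUrl(), req.GetResponseNonce())
	return req, nil
}
```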

@kyessenov
Contributor

I'm not seeing an issue with Envoy on master; let me check the Istio build.

@kyessenov
Contributor

I fetched the latest Envoy build from Istio and tested both ADS and CDS. I seem to get the node ID supplied. Can you help reproduce this?

@lambdai
Contributor

lambdai commented Jan 22, 2020

CC: @lambdai

@mattklein123 mattklein123 added investigate Potential bug that needs verification and removed bug help wanted Needs help! labels Jan 30, 2020
@stale

stale bot commented Feb 29, 2020

This issue has been automatically marked as stale because it has not had activity in the last 30 days. It will be closed in the next 7 days unless it is tagged "help wanted" or other activity occurs. Thank you for your contributions.

@stale stale bot added the stale stalebot believes this issue/PR has not been touched recently label Feb 29, 2020
@stale

stale bot commented Mar 7, 2020

This issue has been automatically closed because it has not had activity in the last 37 days. If this issue is still valid, please ping a maintainer and ask them to label it as "help wanted". Thank you for your contributions.

@kyessenov
Contributor

Can this issue be re-opened please?

@howardjohn
Contributor Author

It looks like what happens is:

  • Pod starts up, sends a CDS request with node metadata, then all the other XDS requests without metadata. 👍
  • I restart the control plane pod. Client envoy logs:
[Envoy (Epoch 0)] [2020-03-09 17:17:05.214][22][debug][connection] [external/envoy/source/common/network/connection_impl.cc:192] [C5] closing socket: 0
[Envoy (Epoch 0)] [2020-03-09 17:17:05.214][22][debug][client] [external/envoy/source/common/http/codec_client.cc:91] [C5] disconnect. resetting 0 pending requests
[Envoy (Epoch 0)] [2020-03-09 17:17:05.214][22][debug][pool] [external/envoy/source/common/http/conn_pool_base.cc:265] [C5] client disconnected, failure reason:
[Envoy (Epoch 0)] [2020-03-09 17:17:05.214][22][debug][router] [external/envoy/source/common/router/router.cc:991] [C0][S9517281317529450485] upstream reset: reset reason connection failure
[Envoy (Epoch 0)] [2020-03-09 17:17:05.214][22][debug][http] [external/envoy/source/common/http/async_client_impl.cc:96] async http request response headers (end_stream=true):
':status', '200'
'content-type', 'application/grpc'
'grpc-status', '14'
'grpc-message', 'upstream connect error or disconnect/reset before headers. reset reason: connection failure'

[Envoy (Epoch 0)] [2020-03-09 17:17:05.214][22][warning][config] [bazel-out/k8-opt/bin/external/envoy/source/common/config/_virtual_includes/grpc_stream_lib/common/config/grpc_stream.h:92] StreamAggregatedResources gRPC config stream closed: 14, upstream connect error or disconnect/reset before headers. reset reason: connection failure
  • New control plane comes up, the client envoy connects. It sends a CDS request with no node ID. Logs from the client envoy side:
[Envoy (Epoch 0)] [2020-03-09 17:07:10.207][23][debug][config] [bazel-out/k8-opt/bin/external/envoy/source/common/config/_virtual_includes/grpc_stream_lib/common/config/grpc_stream.h:47] Establishing new gRPC bidi stream for rpc StreamAggregatedResources(stream .envoy.api.v2.DiscoveryRequest) returns (stream .envoy.api.v2.DiscoveryResponse);

[Envoy (Epoch 0)] [2020-03-09 17:07:10.207][23][debug][router] [external/envoy/source/common/router/router.cc:475] [C0][S13281822872239397772] cluster 'xds-grpc' match for URL '/envoy.service.discovery.v2.AggregatedDiscoveryService/StreamAggregatedResources'
[Envoy (Epoch 0)] [2020-03-09 17:07:10.207][23][debug][router] [external/envoy/source/common/router/router.cc:615] [C0][S13281822872239397772] router decoding headers:
':method', 'POST'
':path', '/envoy.service.discovery.v2.AggregatedDiscoveryService/StreamAggregatedResources'
':authority', 'xds-grpc'
':scheme', 'https'
'te', 'trailers'
'content-type', 'application/grpc'
'x-envoy-internal', 'true'
'x-forwarded-for', '10.244.0.56'

[Envoy (Epoch 0)] [2020-03-09 17:07:10.207][23][debug][pool] [external/envoy/source/common/http/conn_pool_base.cc:337] queueing request due to no available connections
[Envoy (Epoch 0)] [2020-03-09 17:07:10.207][23][debug][pool] [external/envoy/source/common/http/conn_pool_base.cc:47] creating a new connection
[Envoy (Epoch 0)] [2020-03-09 17:07:10.207][23][debug][client] [external/envoy/source/common/http/codec_client.cc:34] [C71] connecting
[Envoy (Epoch 0)] [2020-03-09 17:07:10.207][23][debug][connection] [external/envoy/source/common/network/connection_impl.cc:703] [C71] connecting to 10.96.184.120:15012
[Envoy (Epoch 0)] [2020-03-09 17:07:10.207][23][debug][connection] [external/envoy/source/common/network/connection_impl.cc:712] [C71] connection in progress
[Envoy (Epoch 0)] [2020-03-09 17:07:10.207][23][debug][http2] [external/envoy/source/common/http/http2/codec_impl.cc:890] [C71] setting stream-level initial window size to 268435456
[Envoy (Epoch 0)] [2020-03-09 17:07:10.207][23][debug][http2] [external/envoy/source/common/http/http2/codec_impl.cc:912] [C71] updating connection-level initial window size to 268435456
[Envoy (Epoch 0)] [2020-03-09 17:07:10.381][23][debug][http] [external/envoy/source/common/http/conn_manager_impl.cc:264] [C2] new stream
[Envoy (Epoch 0)] [2020-03-09 17:07:10.381][23][debug][http] [external/envoy/source/common/http/conn_manager_impl.cc:733] [C2][S15525249081796867195] request headers complete (end_stream=true):
':authority', '127.0.0.1:15000'
':path', '/stats?usedonly&filter=^(server.state|listener_manager.workers_started)'
':method', 'GET'
'user-agent', 'Go-http-client/1.1'
'accept-encoding', 'gzip'

[Envoy (Epoch 0)] [2020-03-09 17:07:10.381][23][debug][http] [external/envoy/source/common/http/conn_manager_impl.cc:1285] [C2][S15525249081796867195] request end stream
[Envoy (Epoch 0)] [2020-03-09 17:07:10.381][23][debug][admin] [external/envoy/source/server/http/admin.cc:1350] [C2][S15525249081796867195] request complete: path: /stats?usedonly&filter=^(server.state|listener_manager.workers_started)
[Envoy (Epoch 0)] [2020-03-09 17:07:11.234][23][debug][connection] [external/envoy/source/common/network/connection_impl.cc:578] [C71] delayed connection error: 111
[Envoy (Epoch 0)] [2020-03-09 17:07:11.234][23][debug][connection] [external/envoy/source/common/network/connection_impl.cc:192] [C71] closing socket: 0
[Envoy (Epoch 0)] [2020-03-09 17:07:11.234][23][debug][client] [external/envoy/source/common/http/codec_client.cc:91] [C71] disconnect. resetting 0 pending requests
[Envoy (Epoch 0)] [2020-03-09 17:07:11.234][23][debug][pool] [external/envoy/source/common/http/conn_pool_base.cc:265] [C71] client disconnected, failure reason:
[Envoy (Epoch 0)] [2020-03-09 17:07:11.234][23][debug][router] [external/envoy/source/common/router/router.cc:991] [C0][S13281822872239397772] upstream reset: reset reason connection failure
[Envoy (Epoch 0)] [2020-03-09 17:07:11.234][23][debug][http] [external/envoy/source/common/http/async_client_impl.cc:96] async http request response headers (end_stream=true):
':status', '200'
'content-type', 'application/grpc'
'grpc-status', '14'
'grpc-message', 'upstream connect error or disconnect/reset before headers. reset reason: connection failure'

[Envoy (Epoch 0)] [2020-03-09 17:07:11.234][23][warning][config] [bazel-out/k8-opt/bin/external/envoy/source/common/config/_virtual_includes/grpc_stream_lib/common/config/grpc_stream.h:92] StreamAggregatedResources gRPC config stream closed: 14, upstream connect error or disconnect/reset before headers. reset reason: connection failure
[Envoy (Epoch 0)] [2020-03-09 17:07:11.234][23][debug][config] [external/envoy/source/common/config/grpc_subscription_impl.cc:88] gRPC update for type.googleapis.com/envoy.api.v2.RouteConfiguration failed
...
[Envoy (Epoch 0)] [2020-03-09 17:07:11.234][23][debug][config] [external/envoy/source/common/config/grpc_subscription_impl.cc:88] gRPC update for type.googleapis.com/envoy.api.v2.ClusterLoadAssignment failed
...
[Envoy (Epoch 0)] [2020-03-09 17:07:11.236][23][debug][config] [external/envoy/source/common/config/grpc_subscription_impl.cc:88] gRPC update for type.googleapis.com/envoy.api.v2.Listener failed
[Envoy (Epoch 0)] [2020-03-09 17:07:11.236][23][debug][config] [external/envoy/source/common/config/grpc_subscription_impl.cc:88] gRPC update for type.googleapis.com/envoy.api.v2.Cluster failed

So basically we just need to make sure that when reconnecting to an upstream we still send the node metadata on the first request of the new stream.
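In other words, the "node already sent" bit has to be stream-scoped. A sketch of the intended client-side semantics, in Go for illustration (this is not Envoy's actual C++ implementation; the type and field names are made up):

```go
package xds

import (
	v2 "github.com/envoyproxy/go-control-plane/envoy/api/v2"
	core "github.com/envoyproxy/go-control-plane/envoy/api/v2/core"
)

// adsClient models the desired behavior: the node is attached to the first
// request of *every* stream, not just the first stream of the process.
type adsClient struct {
	node     *core.Node
	nodeSent bool // must be stream-scoped, not process-scoped
}

// onNewStream is called whenever the gRPC stream is (re)established. The new
// server instance has no cached state for us, so forget we ever sent the node.
func (c *adsClient) onNewStream() {
	c.nodeSent = false
}

func (c *adsClient) buildRequest(typeURL string) *v2.DiscoveryRequest {
	req := &v2.DiscoveryRequest{TypeUrl: typeURL}
	if !c.nodeSent {
		req.Node = c.node // only the first request on each stream carries the node
		c.nodeSent = true
	}
	return req
}
```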

@mattklein123 mattklein123 reopened this Mar 9, 2020
@stale stale bot removed the stale stalebot believes this issue/PR has not been touched recently label Mar 9, 2020
@mattklein123 mattklein123 added the help wanted Needs help! label Mar 9, 2020
@kyessenov
Contributor

It appears that the flag was disabled in tests by @fredlas during the incremental xDS work and never restored. Something broke it in the meantime. I'll take a look.

@fredlas
Contributor

fredlas commented Mar 9, 2020

I know the one you mean - #8478 restored it, but it was later rolled back. This sounds like more reason to get that change fully in.

@kyessenov
Contributor

@fredlas I don't have a clear picture of the state of things. Should I attempt to restore set_node_on_first_message_only, or wait for that change to be brought back in?

@mattklein123
Member

#8478 was abandoned and @wgallagher is working on the merge now. He is on vacation this week though. @kyessenov if this is not super urgent I would wait until next week to talk to @wgallagher. If urgent I think it's OK to figure out a fix on top of current master.

@kyessenov
Contributor

I'm fine waiting. It's unfortunate, though, since the flag is considered stable.

@mattklein123 mattklein123 added bug and removed investigate Potential bug that needs verification labels Mar 9, 2020
@mattklein123 mattklein123 added this to the 1.14.0 milestone Mar 9, 2020
@ahmelsayed

Is this by any chance related to envoyproxy/go-control-plane#269?

Also, as a side note: I couldn't find any docs describing set_node_on_first_message_only other than this PR (istio/istio#17361) mentioning that it's a perf optimization. Any details on how it impacts perf?

@mattklein123
Member

@wgallagher is going to fix this. Thank you!

@mattklein123 mattklein123 modified the milestones: 1.14.0, 1.15.0 Apr 1, 2020
@howardjohn
Contributor Author

Also, as a side note: I couldn't find any docs describing set_node_on_first_message_only other than this PR (istio/istio#17361) mentioning that it's a perf optimization. Any details on how it impacts perf?

The node metadata is pretty large, so sending it on every request adds a lot of overhead.
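For a rough sense of the difference, here is a hedged Go comparison of the same ACK with and without a node attached (the node contents below are assumptions, not measurements from this issue; a real Istio sidecar node also carries a large metadata struct, which is where most of the bytes go):

```go
package main

import (
	"fmt"

	"github.com/golang/protobuf/proto"
	v2 "github.com/envoyproxy/go-control-plane/envoy/api/v2"
	core "github.com/envoyproxy/go-control-plane/envoy/api/v2/core"
)

func main() {
	// Illustrative Istio-style node ID and cluster; real nodes also carry
	// locality and a metadata struct with labels, ports, etc.
	node := &core.Node{
		Id:      "sidecar~10.244.0.56~productpage-v1-abc~default.svc.cluster.local",
		Cluster: "productpage.default",
	}
	ack := &v2.DiscoveryRequest{
		TypeUrl:       "type.googleapis.com/envoy.api.v2.Cluster",
		VersionInfo:   "2020-01-14T23:12:41Z/1",
		ResponseNonce: "uj8hmiCQEes=",
	}
	withNode := proto.Clone(ack).(*v2.DiscoveryRequest)
	withNode.Node = node

	// Every ACK repeats this cost when the node is sent on every request.
	fmt.Printf("ACK without node: %d bytes\n", proto.Size(ack))
	fmt.Printf("ACK with node:    %d bytes\n", proto.Size(withNode))
}
```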

@mattklein123 mattklein123 modified the milestones: 1.15.0, 1.16.0 Jun 24, 2020
@howardjohn
Contributor Author

@wgallagher are you still working on this?

rboyer added a commit to hashicorp/consul that referenced this issue Aug 5, 2020
…ating envoy bootstrap config

When consul is restarted and an envoy that had already sent
DiscoveryRequests to the previous consul process sends a request to the
new process, it doesn't respect the setting and never populates
DiscoveryRequest.Node for the life of the new consul process, due to this
bug: envoyproxy/envoy#9682

Fixes #8430
rboyer added a commit to hashicorp/consul that referenced this issue Aug 5, 2020
…ating envoy bootstrap config (#8440)

When consul is restarted and an envoy that had already sent
DiscoveryRequests to the previous consul process sends a request to the
new process, it doesn't respect the setting and never populates
DiscoveryRequest.Node for the life of the new consul process, due to this
bug: envoyproxy/envoy#9682

Fixes #8430
@howardjohn
Contributor Author

Any updates? The node info from requests represents a sizeable amount of our network traffic.

@tbarrella
Contributor

Is anyone still working on this? I think I found the cause and a fix, but the time-consuming part would probably be writing a test and confirming the fix works; I don't want to duplicate effort.

htuch pushed a commit that referenced this issue Dec 17, 2020
Previously, first_stream_request_ wasn't working as intended because api_state_ was still cached from the previous stream.

Risk Level: Low
Testing: Modified test was failing before the accompanying change
Docs Changes: N/A

Fixes #9682

Signed-off-by: Taylor Barrella <tabarr@google.com>
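For readers arriving from search, the commit message above points at a common bug pattern: a stream-scoped bit stored in state that deliberately outlives the stream. A hedged Go rendering of the pattern and the fix (names are illustrative, not Envoy's C++ internals):

```go
package xds

import (
	v2 "github.com/envoyproxy/go-control-plane/envoy/api/v2"
	core "github.com/envoyproxy/go-control-plane/envoy/api/v2/core"
)

// apiState is per-type-URL subscription state that is intentionally kept
// across stream restarts (watched resource names, versions, ...). The bug
// pattern: a "node already sent" bit stored in state like this is never
// reset when the stream is re-established.
type apiState struct {
	resourceNames []string
}

type mux struct {
	node               *core.Node
	apiStates          map[string]*apiState // survives reconnects by design
	firstStreamRequest bool                 // the fix: keep this bit stream-scoped
}

// onStreamEstablished runs on every (re)connect and resets the stream-scoped
// flag, so the next request sent on the new stream carries the node again.
func (m *mux) onStreamEstablished() {
	m.firstStreamRequest = true
}

func (m *mux) buildRequest(typeURL string) *v2.DiscoveryRequest {
	var names []string
	if st := m.apiStates[typeURL]; st != nil {
		names = st.resourceNames // cached state is fine for resource names
	}
	req := &v2.DiscoveryRequest{TypeUrl: typeURL, ResourceNames: names}
	if m.firstStreamRequest {
		req.Node = m.node
		m.firstStreamRequest = false
	}
	return req
}
```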