
set_node_on_first_message_only seems to never set node ID #9682

Closed
howardjohn opened this issue Jan 14, 2020 · 20 comments · Fixed by #14080

@howardjohn
Contributor

Issue Template

Title: set_node_on_first_message_only seems to never set node ID

Description:
I turned on set_node_on_first_message_only, and now things are breaking because it appears the node ID is never actually sent. Turning on verbose logging on the XDS server shows that the first request we receive looks like:

version_info:"2020-01-14T23:12:41Z/1" type_url:"type.googleapis.com/envoy.api.v2.Cluster" response_nonce:"uj8hmiCQEes=2abe5090-3bef-43b8-be61-899e9e61623a"

Note the missing node ID.
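For context, with set_node_on_first_message_only enabled the server is expected to cache the node from the first request on each stream and reuse it for later requests that omit it. A minimal Go sketch of that expectation, using go-control-plane v2 types (the handler shape and error handling are illustrative, not Istio's actual code):

```go
package xds

import (
	"errors"
	"io"

	core "github.com/envoyproxy/go-control-plane/envoy/api/v2/core"
	discovery "github.com/envoyproxy/go-control-plane/envoy/service/discovery/v2"
)

// streamAggregated caches the node from the first DiscoveryRequest on the
// stream; later requests legitimately omit it. The failure in this issue is
// that the first request of a *reconnected* stream also omits it, so the
// server can never identify the client.
func streamAggregated(stream discovery.AggregatedDiscoveryService_StreamAggregatedResourcesServer) error {
	var node *core.Node // per-stream cache, filled from the first request
	for {
		req, err := stream.Recv()
		if err == io.EOF {
			return nil
		}
		if err != nil {
			return err
		}
		if req.GetNode().GetId() != "" {
			node = req.Node
		}
		if node == nil {
			// The symptom reported here: no node on the stream's first request.
			return errors.New("no node on first request of stream")
		}
		// ... look up config for node and respond ...
	}
}
```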

@kyessenov
Contributor

Can you point me to a binary and confirm this is a CDS request?

@howardjohn
Contributor Author

@kyessenov it's from this PR on the Istio side: istio/istio#19266.

The above is the very first request we see. It's possible we are somehow dropping a request, but I was logging right after Recv, so it would have to be a pretty low-level issue.
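"Logging right after Recv" means something like the hedged helper below: the request is printed before any other processing, so a node present on the wire would necessarily show up here (go-control-plane v2 types; recvAndLog is a made-up name, not Istio's actual code):

```go
package xds

import (
	"log"

	v2 "github.com/envoyproxy/go-control-plane/envoy/api/v2"
	discovery "github.com/envoyproxy/go-control-plane/envoy/service/discovery/v2"
)

// recvAndLog logs each DiscoveryRequest immediately after Recv, before any
// processing, so a missing node here means the node was genuinely absent on
// the wire rather than dropped by server-side handling.
func recvAndLog(stream discovery.AggregatedDiscoveryService_StreamAggregatedResourcesServer) (*v2.DiscoveryRequest, error) {
	req, err := stream.Recv()
	if err != nil {
		return nil, err
	}
	log.Printf("ads request: node=%q type=%q nonce=%q",
		req.GetNode().GetId(), req.GetTypeUrl(), req.GetResponseNonce())
	return req, nil
}
```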

@kyessenov
Contributor

I'm not seeing an issue with Envoy on master; let me check the Istio build.

@kyessenov
Contributor

I fetched the latest Envoy build from Istio and tested both ADS and CDS. I seem to get the node ID supplied. Can you help reproduce this?

@lambdai
Contributor

lambdai commented Jan 22, 2020

CC: @lambdai

@mattklein123 mattklein123 added investigate Potential bug that needs verification and removed bug help wanted Needs help! labels Jan 30, 2020
@stale

stale bot commented Feb 29, 2020

This issue has been automatically marked as stale because it has not had activity in the last 30 days. It will be closed in the next 7 days unless it is tagged "help wanted" or other activity occurs. Thank you for your contributions.

@stale stale bot added the stale stalebot believes this issue/PR has not been touched recently label Feb 29, 2020
@stale

stale bot commented Mar 7, 2020

This issue has been automatically closed because it has not had activity in the last 37 days. If this issue is still valid, please ping a maintainer and ask them to label it as "help wanted". Thank you for your contributions.

@kyessenov
Contributor

Can this issue be re-opened please?

@howardjohn
Contributor Author

It looks like what happens is:

  • Pod starts up, sends a CDS request with node metadata, then all the other XDS requests without metadata. 👍
  • I restart the control plane pod. Client envoy logs:
[Envoy (Epoch 0)] [2020-03-09 17:17:05.214][22][debug][connection] [external/envoy/source/common/network/connection_impl.cc:192] [C5] closing socket: 0
[Envoy (Epoch 0)] [2020-03-09 17:17:05.214][22][debug][client] [external/envoy/source/common/http/codec_client.cc:91] [C5] disconnect. resetting 0 pending requests
[Envoy (Epoch 0)] [2020-03-09 17:17:05.214][22][debug][pool] [external/envoy/source/common/http/conn_pool_base.cc:265] [C5] client disconnected, failure reason:
[Envoy (Epoch 0)] [2020-03-09 17:17:05.214][22][debug][router] [external/envoy/source/common/router/router.cc:991] [C0][S9517281317529450485] upstream reset: reset reason connection failure
[Envoy (Epoch 0)] [2020-03-09 17:17:05.214][22][debug][http] [external/envoy/source/common/http/async_client_impl.cc:96] async http request response headers (end_stream=true):
':status', '200'
'content-type', 'application/grpc'
'grpc-status', '14'
'grpc-message', 'upstream connect error or disconnect/reset before headers. reset reason: connection failure'

[Envoy (Epoch 0)] [2020-03-09 17:17:05.214][22][warning][config] [bazel-out/k8-opt/bin/external/envoy/source/common/config/_virtual_includes/grpc_stream_lib/common/config/grpc_stream.h:92] StreamAggregatedResources gRPC config stream closed: 14, upstream connect error or disconnect/reset before headers. reset reason: connection failure
  • New control plane comes up, the client envoy connects. It sends a CDS request with no node ID. Logs from the client envoy side:
[Envoy (Epoch 0)] [2020-03-09 17:07:10.207][23][debug][config] [bazel-out/k8-opt/bin/external/envoy/source/common/config/_virtual_includes/grpc_stream_lib/common/config/grpc_stream.h:47] Establishing new gRPC bidi stream for rpc StreamAggregatedResources(stream .envoy.api.v2.DiscoveryRequest) returns (stream .envoy.api.v2.DiscoveryResponse);

[Envoy (Epoch 0)] [2020-03-09 17:07:10.207][23][debug][router] [external/envoy/source/common/router/router.cc:475] [C0][S13281822872239397772] cluster 'xds-grpc' match for URL '/envoy.service.discovery.v2.AggregatedDiscoveryService/StreamAggregatedResources'
[Envoy (Epoch 0)] [2020-03-09 17:07:10.207][23][debug][router] [external/envoy/source/common/router/router.cc:615] [C0][S13281822872239397772] router decoding headers:
':method', 'POST'
':path', '/envoy.service.discovery.v2.AggregatedDiscoveryService/StreamAggregatedResources'
':authority', 'xds-grpc'
':scheme', 'https'
'te', 'trailers'
'content-type', 'application/grpc'
'x-envoy-internal', 'true'
'x-forwarded-for', '10.244.0.56'

[Envoy (Epoch 0)] [2020-03-09 17:07:10.207][23][debug][pool] [external/envoy/source/common/http/conn_pool_base.cc:337] queueing request due to no available connections
[Envoy (Epoch 0)] [2020-03-09 17:07:10.207][23][debug][pool] [external/envoy/source/common/http/conn_pool_base.cc:47] creating a new connection
[Envoy (Epoch 0)] [2020-03-09 17:07:10.207][23][debug][client] [external/envoy/source/common/http/codec_client.cc:34] [C71] connecting
[Envoy (Epoch 0)] [2020-03-09 17:07:10.207][23][debug][connection] [external/envoy/source/common/network/connection_impl.cc:703] [C71] connecting to 10.96.184.120:15012
[Envoy (Epoch 0)] [2020-03-09 17:07:10.207][23][debug][connection] [external/envoy/source/common/network/connection_impl.cc:712] [C71] connection in progress
[Envoy (Epoch 0)] [2020-03-09 17:07:10.207][23][debug][http2] [external/envoy/source/common/http/http2/codec_impl.cc:890] [C71] setting stream-level initial window size to 268435456
[Envoy (Epoch 0)] [2020-03-09 17:07:10.207][23][debug][http2] [external/envoy/source/common/http/http2/codec_impl.cc:912] [C71] updating connection-level initial window size to 268435456
[Envoy (Epoch 0)] [2020-03-09 17:07:10.381][23][debug][http] [external/envoy/source/common/http/conn_manager_impl.cc:264] [C2] new stream
[Envoy (Epoch 0)] [2020-03-09 17:07:10.381][23][debug][http] [external/envoy/source/common/http/conn_manager_impl.cc:733] [C2][S15525249081796867195] request headers complete (end_stream=true):
':authority', '127.0.0.1:15000'
':path', '/stats?usedonly&filter=^(server.state|listener_manager.workers_started)'
':method', 'GET'
'user-agent', 'Go-http-client/1.1'
'accept-encoding', 'gzip'

[Envoy (Epoch 0)] [2020-03-09 17:07:10.381][23][debug][http] [external/envoy/source/common/http/conn_manager_impl.cc:1285] [C2][S15525249081796867195] request end stream
[Envoy (Epoch 0)] [2020-03-09 17:07:10.381][23][debug][admin] [external/envoy/source/server/http/admin.cc:1350] [C2][S15525249081796867195] request complete: path: /stats?usedonly&filter=^(server.state|listener_manager.workers_started)
[Envoy (Epoch 0)] [2020-03-09 17:07:11.234][23][debug][connection] [external/envoy/source/common/network/connection_impl.cc:578] [C71] delayed connection error: 111
[Envoy (Epoch 0)] [2020-03-09 17:07:11.234][23][debug][connection] [external/envoy/source/common/network/connection_impl.cc:192] [C71] closing socket: 0
[Envoy (Epoch 0)] [2020-03-09 17:07:11.234][23][debug][client] [external/envoy/source/common/http/codec_client.cc:91] [C71] disconnect. resetting 0 pending requests
[Envoy (Epoch 0)] [2020-03-09 17:07:11.234][23][debug][pool] [external/envoy/source/common/http/conn_pool_base.cc:265] [C71] client disconnected, failure reason:
[Envoy (Epoch 0)] [2020-03-09 17:07:11.234][23][debug][router] [external/envoy/source/common/router/router.cc:991] [C0][S13281822872239397772] upstream reset: reset reason connection failure
[Envoy (Epoch 0)] [2020-03-09 17:07:11.234][23][debug][http] [external/envoy/source/common/http/async_client_impl.cc:96] async http request response headers (end_stream=true):
':status', '200'
'content-type', 'application/grpc'
'grpc-status', '14'
'grpc-message', 'upstream connect error or disconnect/reset before headers. reset reason: connection failure'

[Envoy (Epoch 0)] [2020-03-09 17:07:11.234][23][warning][config] [bazel-out/k8-opt/bin/external/envoy/source/common/config/_virtual_includes/grpc_stream_lib/common/config/grpc_stream.h:92] StreamAggregatedResources gRPC config stream closed: 14, upstream connect error or disconnect/reset before headers. reset reason: connection failure
[Envoy (Epoch 0)] [2020-03-09 17:07:11.234][23][debug][config] [external/envoy/source/common/config/grpc_subscription_impl.cc:88] gRPC update for type.googleapis.com/envoy.api.v2.RouteConfiguration failed
...
[Envoy (Epoch 0)] [2020-03-09 17:07:11.234][23][debug][config] [external/envoy/source/common/config/grpc_subscription_impl.cc:88] gRPC update for type.googleapis.com/envoy.api.v2.ClusterLoadAssignment failed
...
[Envoy (Epoch 0)] [2020-03-09 17:07:11.236][23][debug][config] [external/envoy/source/common/config/grpc_subscription_impl.cc:88] gRPC update for type.googleapis.com/envoy.api.v2.Listener failed
[Envoy (Epoch 0)] [2020-03-09 17:07:11.236][23][debug][config] [external/envoy/source/common/config/grpc_subscription_impl.cc:88] gRPC update for type.googleapis.com/envoy.api.v2.Cluster failed

So basically we just need to make sure that when reconnecting to an upstream we still send the node metadata on the first request of the new stream.
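In other words, the "node already sent" bit has to be stream-scoped. A sketch of the intended client-side semantics, in Go for illustration (this is not Envoy's actual C++ implementation; the type and field names are made up):

```go
package xds

import (
	v2 "github.com/envoyproxy/go-control-plane/envoy/api/v2"
	core "github.com/envoyproxy/go-control-plane/envoy/api/v2/core"
)

// adsClient models the desired behavior: the node is attached to the first
// request of *every* stream, not just the first stream of the process.
type adsClient struct {
	node     *core.Node
	nodeSent bool // must be stream-scoped, not process-scoped
}

// onNewStream is called whenever the gRPC stream is (re)established. The new
// server instance has no cached state for us, so forget we ever sent the node.
func (c *adsClient) onNewStream() {
	c.nodeSent = false
}

func (c *adsClient) buildRequest(typeURL string) *v2.DiscoveryRequest {
	req := &v2.DiscoveryRequest{TypeUrl: typeURL}
	if !c.nodeSent {
		req.Node = c.node // only the first request on each stream carries the node
		c.nodeSent = true
	}
	return req
}
```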

@mattklein123 mattklein123 reopened this Mar 9, 2020
@stale stale bot removed the stale stalebot believes this issue/PR has not been touched recently label Mar 9, 2020
@mattklein123 mattklein123 added the help wanted Needs help! label Mar 9, 2020
@kyessenov
Contributor

It appears that the flag was disabled in tests by @fredlas during the incremental xDS work and never restored. Something broke it in the meantime. I'll take a look.

@fredlas
Contributor

fredlas commented Mar 9, 2020

I know the one you mean - #8478 restored it, but it was later rolled back. This sounds like more reason to get that change fully in.

@kyessenov
Contributor

@fredlas I don't have a clear picture of the state of things. Should I attempt to restore set_node_on_first_message_only, or wait for that change to be brought back in?

@mattklein123
Member

#8478 was abandoned and @wgallagher is working on the merge now. He is on vacation this week though. @kyessenov if this is not super urgent I would wait until next week to talk to @wgallagher. If urgent I think it's OK to figure out a fix on top of current master.

@kyessenov
Contributor

I'm fine waiting. It's unfortunate, though, since the flag is considered stable.

@mattklein123 mattklein123 added bug and removed investigate Potential bug that needs verification labels Mar 9, 2020
@mattklein123 mattklein123 added this to the 1.14.0 milestone Mar 9, 2020
@ahmelsayed

Is this by any chance related to envoyproxy/go-control-plane#269?

Also, as a side note: I couldn't find any docs describing set_node_on_first_message_only other than this PR (istio/istio#17361) mentioning that it's a perf optimization. Any details on how it impacts perf?

@mattklein123
Member

@wgallagher is going to fix this. Thank you!

@mattklein123 mattklein123 modified the milestones: 1.14.0, 1.15.0 Apr 1, 2020
@howardjohn
Contributor Author

Also, as a side note: I couldn't find any docs describing set_node_on_first_message_only other than this PR (istio/istio#17361) mentioning that it's a perf optimization. Any details on how it impacts perf?

The node metadata is pretty large, so sending it on every request adds a lot of overhead.
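For a rough sense of the difference, here is a hedged Go comparison of the same ACK with and without a node attached (the node contents below are assumptions, not measurements from this issue; a real Istio sidecar node also carries a large metadata struct, which is where most of the bytes go):

```go
package main

import (
	"fmt"

	"github.com/golang/protobuf/proto"
	v2 "github.com/envoyproxy/go-control-plane/envoy/api/v2"
	core "github.com/envoyproxy/go-control-plane/envoy/api/v2/core"
)

func main() {
	// Illustrative Istio-style node ID and cluster; real nodes also carry
	// locality and a metadata struct with labels, ports, etc.
	node := &core.Node{
		Id:      "sidecar~10.244.0.56~productpage-v1-abc~default.svc.cluster.local",
		Cluster: "productpage.default",
	}
	ack := &v2.DiscoveryRequest{
		TypeUrl:       "type.googleapis.com/envoy.api.v2.Cluster",
		VersionInfo:   "2020-01-14T23:12:41Z/1",
		ResponseNonce: "uj8hmiCQEes=",
	}
	withNode := proto.Clone(ack).(*v2.DiscoveryRequest)
	withNode.Node = node

	// Every ACK repeats this cost when the node is sent on every request.
	fmt.Printf("ACK without node: %d bytes\n", proto.Size(ack))
	fmt.Printf("ACK with node:    %d bytes\n", proto.Size(withNode))
}
```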

@mattklein123 mattklein123 modified the milestones: 1.15.0, 1.16.0 Jun 24, 2020
@howardjohn
Contributor Author

@wgallagher are you still working on this?

rboyer added a commit to hashicorp/consul that referenced this issue Aug 5, 2020
…ating envoy bootstrap config

When consul is restarted and an envoy that had already sent
DiscoveryRequests to the previous consul process sends a request to the
new process, it doesn't respect the setting and never populates
DiscoveryRequest.Node for the life of the new consul process, due to this
bug: envoyproxy/envoy#9682

Fixes #8430
rboyer added a commit to hashicorp/consul that referenced this issue Aug 5, 2020
…ating envoy bootstrap config (#8440)

When consul is restarted and an envoy that had already sent
DiscoveryRequests to the previous consul process sends a request to the
new process, it doesn't respect the setting and never populates
DiscoveryRequest.Node for the life of the new consul process, due to this
bug: envoyproxy/envoy#9682

Fixes #8430
@howardjohn
Contributor Author

Any updates? The node info from requests represents a sizeable amount of our network traffic.

@tbarrella
Contributor

Is anyone still working on this? I think I found the cause and a fix, but the time-consuming part would probably be writing a test and confirming the fix works; I don't want to duplicate effort.

htuch pushed a commit that referenced this issue Dec 17, 2020
Previously, first_stream_request_ wasn't working as intended because api_state_ was still cached from the previous stream.

Risk Level: Low
Testing: Modified test was failing before the accompanying change
Docs Changes: N/A

Fixes #9682

Signed-off-by: Taylor Barrella <tabarr@google.com>
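For readers arriving from search, the commit message above points at a common bug pattern: a stream-scoped bit stored in state that deliberately outlives the stream. A hedged Go rendering of the pattern and the fix (names are illustrative, not Envoy's C++ internals):

```go
package xds

import (
	v2 "github.com/envoyproxy/go-control-plane/envoy/api/v2"
	core "github.com/envoyproxy/go-control-plane/envoy/api/v2/core"
)

// apiState is per-type-URL subscription state that is intentionally kept
// across stream restarts (watched resource names, versions, ...). The bug
// pattern: a "node already sent" bit stored in state like this is never
// reset when the stream is re-established.
type apiState struct {
	resourceNames []string
}

type mux struct {
	node               *core.Node
	apiStates          map[string]*apiState // survives reconnects by design
	firstStreamRequest bool                 // the fix: keep this bit stream-scoped
}

// onStreamEstablished runs on every (re)connect and resets the stream-scoped
// flag, so the next request sent on the new stream carries the node again.
func (m *mux) onStreamEstablished() {
	m.firstStreamRequest = true
}

func (m *mux) buildRequest(typeURL string) *v2.DiscoveryRequest {
	var names []string
	if st := m.apiStates[typeURL]; st != nil {
		names = st.resourceNames // cached state is fine for resource names
	}
	req := &v2.DiscoveryRequest{TypeUrl: typeURL, ResourceNames: names}
	if m.firstStreamRequest {
		req.Node = m.node
		m.firstStreamRequest = false
	}
	return req
}
```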