This repository has been archived by the owner on Jul 11, 2023. It is now read-only.

[OSM-Restart] Sidecar Envoys are not able to connect with OSM after osm controller was restarted. #2145

Closed
fredstanley opened this issue Dec 4, 2020 · 10 comments
Labels: kind/bug (Something isn't working), size/XL (20 days / 4 weeks)
Milestone: v0.7.0

Comments

@fredstanley

I am testing an osm-controller restart scenario. When I restart the controller, the service sidecars (Envoy proxies) are unable to reconnect to the osm-controller due to a certificate verification failure.

Environment:

  • OSM version (use osm version): Version: dev; Commit: c495d19; Date: 2020-09-10-19:20
  • Kubernetes version (use kubectl version):
    Client Version: version.Info{Major:"1", Minor:"18", GitVersion:"v1.18.8", GitCommit:"9f2892aab98fe339f3bd70e3c470144299398ace", GitTreeState:"clean", BuildDate:"2020-08-13T16:12:48Z", GoVersion:"go1.13.15", Compiler:"gc", Platform:"darwin/amd64"}
    Server Version: version.Info{Major:"1", Minor:"18", GitVersion:"v1.18.3", GitCommit:"2e7996e3e2712684bc73f0dec0200d64eec7fe40", GitTreeState:"clean", BuildDate:"2020-05-20T12:43:34Z", GoVersion:"go1.13.9", Compiler:"gc", Platform:"linux/amd64"}
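
For reference, a minimal way to reproduce this, assuming the default osm-system control-plane namespace, an app=osm-controller pod label, and a sidecar container named envoy (these may differ in your install):

```sh
# Restart the controller by deleting its pod; the Deployment recreates it.
kubectl delete pod -n osm-system -l app=osm-controller

# Then watch a meshed workload's sidecar for xDS/TLS errors.
kubectl logs <app-pod> -n <app-namespace> -c envoy -f --tail=100
```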

Logs from the sidecar Envoy:

[2020-12-04 19:27:54.988][1][debug][config] [bazel-out/k8-opt/bin/source/common/config/_virtual_includes/grpc_stream_lib/common/config/grpc_stream.h:48] Establishing new gRPC bidi stream for rpc StreamAggregatedResources(stream .envoy.service.discovery.v3.DiscoveryRequest) returns (stream .envoy.service.discovery.v3.DiscoveryResponse);

[2020-12-04 19:27:54.989][1][debug][router] [source/common/router/router.cc:426] [C0][S16105689025710459465] cluster 'osm-controller' match for URL '/envoy.service.discovery.v3.AggregatedDiscoveryService/StreamAggregatedResources'
[2020-12-04 19:27:54.989][1][debug][router] [source/common/router/router.cc:583] [C0][S16105689025710459465] router decoding headers:
':method', 'POST'
':path', '/envoy.service.discovery.v3.AggregatedDiscoveryService/StreamAggregatedResources'
':authority', 'osm-controller'
':scheme', 'https'
'te', 'trailers'
'content-type', 'application/grpc'
'x-envoy-internal', 'true'
'x-forwarded-for', '10.52.185.122'

[2020-12-04 19:27:54.989][1][debug][pool] [source/common/http/conn_pool_base.cc:71] queueing request due to no available connections
[2020-12-04 19:27:54.989][1][debug][pool] [source/common/conn_pool/conn_pool_base.cc:53] creating a new connection
[2020-12-04 19:27:54.989][1][debug][client] [source/common/http/codec_client.cc:35] [C22] connecting
[2020-12-04 19:27:54.990][1][debug][connection] [source/common/network/connection_impl.cc:753] [C22] connecting to 10.50.196.193:15128
[2020-12-04 19:27:54.990][1][debug][connection] [source/common/network/connection_impl.cc:769] [C22] connection in progress
[2020-12-04 19:27:54.990][1][debug][http2] [source/common/http/http2/codec_impl.cc:1063] [C22] updating connection-level initial window size to 268435456
[2020-12-04 19:27:54.991][1][debug][connection] [source/common/network/connection_impl.cc:616] [C22] connected
[2020-12-04 19:27:54.992][1][debug][connection] [source/extensions/transport_sockets/tls/ssl_socket.cc:190] [C22] handshake expecting read
[2020-12-04 19:27:55.007][1][debug][connection] [source/extensions/transport_sockets/tls/ssl_socket.cc:197] [C22] handshake error: 1
[2020-12-04 19:27:55.007][1][debug][connection] [source/extensions/transport_sockets/tls/ssl_socket.cc:225] [C22] TLS error: 67108971:RSA routines:OPENSSL_internal:BLOCK_TYPE_IS_NOT_01 67109000:RSA routines:OPENSSL_internal:PADDING_CHECK_FAILED 184549382:X.509 certificate routines:OPENSSL_internal:public key routines 268435581:SSL routines:OPENSSL_internal:CERTIFICATE_VERIFY_FAILED
[2020-12-04 19:27:55.008][1][debug][connection] [source/common/network/connection_impl.cc:208] [C22] closing socket: 0
[2020-12-04 19:27:55.010][1][debug][client] [source/common/http/codec_client.cc:92] [C22] disconnect. resetting 0 pending requests
[2020-12-04 19:27:55.010][1][debug][pool] [source/common/conn_pool/conn_pool_base.cc:255] [C22] client disconnected, failure reason: TLS error: 67108971:RSA routines:OPENSSL_internal:BLOCK_TYPE_IS_NOT_01 67109000:RSA routines:OPENSSL_internal:PADDING_CHECK_FAILED 184549382:X.509 certificate routines:OPENSSL_internal:public key routines 268435581:SSL routines:OPENSSL_internal:CERTIFICATE_VERIFY_FAILED
[2020-12-04 19:27:55.010][1][debug][router] [source/common/router/router.cc:1022] [C0][S16105689025710459465] upstream reset: reset reason connection failure
[2020-12-04 19:27:55.010][1][debug][http] [source/common/http/async_client_impl.cc:99] async http request response headers (end_stream=true):
':status', '200'
'content-type', 'application/grpc'
'grpc-status', '14'
'grpc-message', 'upstream connect error or disconnect/reset before headers. reset reason: connection failure'

[2020-12-04 19:27:55.010][1][warning][config] [bazel-out/k8-opt/bin/source/common/config/_virtual_includes/grpc_stream_lib/common/config/grpc_stream.h:93] StreamAggregatedResources gRPC config stream closed: 14, upstream connect error or disconnect/reset before headers. reset reason: connection failure
[2020-12-04 19:27:55.010][1][debug][config] [source/common/config/grpc_subscription_impl.cc:87] gRPC update for type.googleapis.com/envoy.config.listener.v3.Listener failed
[2020-12-04 19:27:55.010][1][debug][config] [source/common/config/grpc_subscription_impl.cc:87] gRPC update for type.googleapis.com/envoy.config.route.v3.RouteConfiguration failed
[2020-12-04 19:27:55.010][1][debug][config] [source/common/config/grpc_subscription_impl.cc:87] gRPC update for type.googleapis.com/envoy.config.route.v3.RouteConfiguration failed
[2020-12-04 19:27:55.010][1][debug][config] [source/common/config/grpc_subscription_impl.cc:87] gRPC update for type.googleapis.com/envoy.config.endpoint.v3.ClusterLoadAssignment failed
[2020-12-04 19:27:55.010][1][debug][config] [source/common/config/grpc_subscription_impl.cc:87] gRPC update for type.googleapis.com/envoy.config.endpoint.v3.ClusterLoadAssignment failed
[2020-12-04 19:27:55.010][1][debug][config] [source/common/config/grpc_subscription_impl.cc:87] gRPC update for type.googleapis.com/envoy.config.endpoint.v3.ClusterLoadAssignment failed
[2020-12-04 19:27:55.010][1][debug][config] [source/common/config/grpc_subscription_impl.cc:87] gRPC update for type.googleapis.com/envoy.config.endpoint.v3.ClusterLoadAssignment failed
[2020-12-04 19:27:55.010][1][debug][config] [source/common/config/grpc_subscription_impl.cc:87] gRPC update for type.googleapis.com/envoy.config.endpoint.v3.ClusterLoadAssignment failed
[2020-12-04 19:27:55.010][1][debug][config] [source/common/config/grpc_subscription_impl.cc:87] gRPC update for type.googleapis.com/envoy.config.endpoint.v3.ClusterLoadAssignment failed
[2020-12-04 19:27:55.010][1][debug][config] [source/common/config/grpc_subscription_impl.cc:87] gRPC update for type.googleapis.com/envoy.config.endpoint.v3.ClusterLoadAssignment failed
[2020-12-04 19:27:55.010][1][debug][config] [source/common/config/grpc_subscription_impl.cc:87] gRPC update for type.googleapis.com/envoy.config.endpoint.v3.ClusterLoadAssignment failed
[2020-12-04 19:27:55.010][1][debug][config] [source/common/config/grpc_subscription_impl.cc:87] gRPC update for type.googleapis.com/envoy.config.endpoint.v3.ClusterLoadAssignment failed
[2020-12-04 19:27:55.010][1][debug][config] [source/common/config/grpc_subscription_impl.cc:87] gRPC update for type.googleapis.com/envoy.config.endpoint.v3.ClusterLoadAssignment failed
[2020-12-04 19:27:55.010][1][debug][config] [source/common/config/grpc_subscription_impl.cc:87] gRPC update for type.googleapis.com/envoy.config.endpoint.v3.ClusterLoadAssignment failed
[2020-12-04 19:27:55.010][1][debug][config] [source/common/config/grpc_subscription_impl.cc:87] gRPC update for type.googleapis.com/envoy.config.cluster.v3.Cluster failed

fredstanley added the kind/bug label on Dec 4, 2020
@shashankram (Member)

@fredstanley Thanks for reporting this issue. May I ask how you are restarting the controller?

I also see that you are using a forked version of OSM (Version: dev; Commit: c495d19; Date: 2020-09-10-19:20). Do you see this issue in the latest release, https://github.com/openservicemesh/osm/releases/tag/v0.5.0?

I recently tried restarting the controller pod using kubectl delete pod osm-controller-xxx -n osm-system, and once the new controller pod came up, things continued to work. The certificate verification errors you see seem to be related to a mismatch between the certificates used by the controller and the proxy sidecar.
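
One way to check for such a mismatch, as a sketch assuming the default osm-ca-bundle secret in osm-system with the CA stored under a ca.crt key (the secret name and key may differ in a forked install; 15128 is the ADS port seen in the log above):

```sh
# Dump the CA certificate the mesh was bootstrapped with.
kubectl get secret osm-ca-bundle -n osm-system -o jsonpath='{.data.ca\.crt}' \
  | base64 -d | openssl x509 -noout -subject -serial -dates

# Compare it against the certificate the restarted controller presents on its xDS port.
kubectl port-forward -n osm-system deploy/osm-controller 15128:15128 &
sleep 2
openssl s_client -connect localhost:15128 -showcerts </dev/null 2>/dev/null \
  | openssl x509 -noout -issuer -serial -dates
```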

@fredstanley (Author)

@shashankram We have not moved to the latest version yet. Is the fix available only in the v0.5.0 tag?

Yes, I tried the restart the same way (by issuing the delete pod command).

@shashankram (Member)

> @shashankram We have not moved to the latest version yet. Is the fix available only in the v0.5.0 tag?
>
> Yes, I tried the restart the same way (by issuing the delete pod command).

I don't know which upstream version your fork is based on, but a lot has changed over the last few releases. I recommend trying the upstream v0.5.0 release to see whether the issue is reproducible. If it is, I can look into that specific version and test it in my environment. Let me know if that sounds reasonable.

@fredstanley (Author)

OK, I will try to upgrade to v0.5.0 and update this issue.

@addozhang (Contributor)

@fredstanley I encountered the same issue before. It was fixed after specifying the CA bundle secret name with the --ca-bundle-secret-name argument. With it, the controller saves the generated CA in the bundle secret and loads it from that secret on the next start. Without it, the controller generates a new CA on every start.

@shashankram (Member)

> @fredstanley I encountered the same issue before. It was fixed after specifying the CA bundle secret name with the --ca-bundle-secret-name argument. With it, the controller saves the generated CA in the bundle secret and loads it from that secret on the next start. Without it, the controller generates a new CA on every start.

By default the CA bundle secret argument is always passed: https://github.com/openservicemesh/osm/blob/main/charts/osm/templates/osm-deployment.yaml#L37
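
To confirm the flag is actually set on a running controller, something like the following should show it (a sketch assuming the standard osm-controller Deployment in the osm-system namespace):

```sh
# Print the controller container arguments; expect --ca-bundle-secret-name=<secret> among them.
kubectl get deployment osm-controller -n osm-system \
  -o jsonpath='{.spec.template.spec.containers[0].args}'
```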

michelleN added this to the v0.7.0 milestone on Dec 9, 2020
draychev added the size/XL (20 days / 4 weeks) label on Dec 10, 2020
@shashankram (Member)

@fredstanley are you able to reproduce this issue with the latest release?

@fredstanley (Author)

@shashankram The rebase to upstream OSM is a bit involved in our case (we have some private changes). Give us a couple of weeks; I will update you on this issue.

shashankram added a commit to shashankram/osm that referenced this issue Dec 17, 2020
This change adds an e2e test to test the connectivity between
client and server before/during/after osm-controller restarts.
Previously this was resulting in 503s due to issue openservicemesh#2131 which
has been fixed.

Resolves openservicemesh#2146 and tests openservicemesh#2145.

Signed-off-by: Shashank Ram <shashr2204@gmail.com>
shashankram added a commit that referenced this issue Dec 17, 2020
This change adds an e2e test to test the connectivity between
client and server before/during/after osm-controller restarts.
Previously this was resulting in 503s due to issue #2131 which
has been fixed.

Resolves #2146 and tests #2145.

Signed-off-by: Shashank Ram <shashr2204@gmail.com>
@shashankram (Member)

@fredstanley We added a test in #2212 to ensure connectivity is not impacted after the OSM controller restarts, and it's passing right now.

There was a bug where connectivity was impacted after a controller restart, which has been addressed: #2131

eduser25 pushed a commit to eduser25/osm that referenced this issue Dec 21, 2020
… (openservicemesh#2212)

This change adds an e2e test to test the connectivity between
client and server before/during/after osm-controller restarts.
Previously this was resulting in 503s due to issue openservicemesh#2131 which
has been fixed.

Resolves openservicemesh#2146 and tests openservicemesh#2145.

Signed-off-by: Shashank Ram <shashr2204@gmail.com>
@fredstanley (Author)

@shashankram I moved to v0.6.0 and it seems to work fine. Thanks.
