
External servers in 1.0.0 no longer honour discovery #1742

Closed
mr-miles opened this issue Nov 18, 2022 · 22 comments
Assignees: curtbushko
Labels: type/bug (Something isn't working), waiting-reply (Waiting on the issue creator for a response before taking further action)

Comments

@mr-miles
Contributor

Community Note

  • Please vote on this issue by adding a 👍 reaction to the original issue to help the community and maintainers prioritize this request. Searching for pre-existing feature requests helps us consolidate datapoints for identical requirements into a single place, thank you!
  • Please do not leave "+1" or other comments that do not add relevant new information or questions, they generate extra noise for issue followers and do not help prioritize the request.
  • If you are interested in working on this issue or have submitted a pull request, please leave a comment.

Overview of the Issue

I upgraded from the 0.48 chart to 1.0.0. I am running Consul servers outside Kubernetes and have externalServers.hosts set to discover their addresses via AWS lookups. The new agentless setup treated the provider string as a DNS name to look up, and errored because the name did not exist.

Is this still a supported setup? Should I expect the provider/discovery templates to work or is there a different way to set this up now?

Thanks for the 1.0.0 release - appreciate all the hard work that has gone into it!

Reproduction Steps

Add to the Helm chart values:

externalServers.hosts = ["provider=aws tag_key=XX_KEY tag_value=XX_VALUE"]

where XX_KEY and XX_VALUE are the EC2 instance tags in use
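In values.yaml form, the reproduction configuration looks roughly like this (a sketch; XX_KEY and XX_VALUE are placeholder EC2 tags):

```yaml
# Hypothetical values.yaml fragment for the consul Helm chart.
externalServers:
  enabled: true
  hosts:
    - "provider=aws tag_key=XX_KEY tag_value=XX_VALUE"
```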

Expected behavior

Expect the consul-dataplane component to respect the same discovery rules as the consul client.

Environment details

AWS EKS

@mr-miles mr-miles added the type/bug Something isn't working label Nov 18, 2022
@david-yu
Contributor

Hi @mr-miles, we're working on updating our docs given the large surface area of changes we made for 1.0. As of 1.0, we no longer use go-discover and instead use another library called go-netaddrs for discovery.

You can read about the changes here, which are still being staged for publishing on our docs page: https://github.com/hashicorp/consul/blob/328d081bc9c339cc48f95c8d4d0cce1fd3b4b81a/website/content/docs/k8s/deployment-configurations/servers-outside-kubernetes.mdx#join-external-servers-to-consul-on-kubernetes
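For reference, the linked doc describes go-netaddrs resolving each entry either as a plain DNS name or, with an exec= prefix, via an external command that prints IP addresses. A hedged sketch of the two forms (hostnames and tags are placeholders):

```
# DNS name form
externalServers.hosts = ["consul.example.internal"]

# exec form: shell out to a command that prints one IP address per line
externalServers.hosts = ["exec=discover -q addrs provider=aws tag_key=XX_KEY tag_value=XX_VALUE"]
```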

@david-yu david-yu added the waiting-reply Waiting on the issue creator for a response before taking further action label Nov 18, 2022
@mr-miles
Contributor Author

Got it, thanks ... I was too keen! I have been awaiting this release, so thanks again.

@mr-miles
Contributor Author

Sorry - I tried it and it nearly works, but it looks like the discover binary isn't included in the container image (or I need to enable it somehow).

My connect-injector pod has this environment variable:
CONSUL_ADDRESSES : exec=discover -q addrs provider=aws tag_key=ConsulCluster tag_value=048c3bea3db2df8a

But I see this in the pod logs, and the pods fail to reach readiness:
[ERROR] consul-server-connection-manager: connection error: error="failed to discover Consul server addresses: failed to retrieve IP addresses from executable: exec: "discover": executable file not found in $PATH"

@mr-miles mr-miles reopened this Nov 18, 2022
@david-yu
Contributor

Thanks! We'll take a look!

@curtbushko
Contributor

curtbushko commented Nov 21, 2022

Hi @mr-miles! Thanks for finding this and helping us out!

I put up a PR to add go-discover, and it should land in the 1.0.1 release in the next few days.

If you'd like to test this beforehand, you can use the control-plane image hashicorpdev/consul-k8s-control-plane:40c1024b; please let me know how it goes.

> docker run -it hashicorpdev/consul-k8s-control-plane:40c1024b /bin/discover

Usage: discover addrs key=val key=val ...
The options for discovering ip addresses are provided as a
  single string value in "key=value key=value ..." format where
  the values are URL encoded.

    provider=aws region=eu-west-1 ...

.... snip ....
 

@curtbushko curtbushko self-assigned this Nov 21, 2022
@mr-miles
Contributor Author

mr-miles commented Nov 21, 2022

Thanks for the quick turnaround. I've tested it out and discovery is now finding the server addresses as expected! So victory.

HOWEVER, something is still amiss. My connect-injector pod and terminating-gateway-init pod are both failing like this:

2022-11-21T19:09:40.769Z [INFO] consul-server-connection-manager: trying to connect to a Consul server
2022-11-21T19:09:40.770Z [DEBUG] consul-server-connection-manager: Executing command: command=discover args=["-q", "addrs", "provider=aws", "tag_key=ConsulCluster", "tag_value=xxxx"]
2022-11-21T19:09:42.284Z [DEBUG] consul-server-connection-manager: Addresses retrieved from the executable: ip-addrs=["{10.142.xxx }", "{10.142.xxx }", "{10.142.xxx }", "{10.142.xxx }", "{10.142.xxx }"]
2022-11-21T19:09:42.284Z [INFO] consul-server-connection-manager: discovered Consul servers: addresses=[10.142.xxx:8502, 10.142.xxx:8502, 10.142.xxx:8502, 10.142.xxx:8502, 10.142.xxx:8502]
2022-11-21T19:09:42.284Z [INFO] consul-server-connection-manager: current prioritized list of known Consul servers: addresses=[10.142.xxx:8502, 10.142.xxx:8502, 10.142.xxx:8502, 10.142.xxx:8502, 10.142.xxxx:8502]
2022-11-21T19:09:42.284Z [DEBUG] consul-server-connection-manager: switching to Consul server: address=10.142.xxx:8502
2022-11-21T19:09:52.384Z [DEBUG] consul-server-connection-manager: gRPC resolver failed to update connection address: error="bad resolver state"
2022-11-21T19:09:52.384Z [ERROR] consul-server-connection-manager: connection error: error="failed to switch to Consul server "10.142.xxx:8502": target sub-connection is not ready (state=TRANSIENT_FAILURE)"
2022-11-21T19:09:52.384Z [DEBUG] consul-server-connection-manager: backoff: retry after=746.889161ms
2022-11-21T19:09:53.132Z [INFO] consul-server-connection-manager: trying to connect to a Consul server
2022-11-21T19:09:53.132Z [INFO] consul-server-connection-manager: current prioritized list of known Consul servers: addresses=[10.142.xxx:8502, 10.142.xxx:8502, 10.142.xxx:8502, 10.142.xxx:8502, 10.142.xxx:8502]
2022-11-21T19:09:53.132Z [DEBUG] consul-server-connection-manager: switching to Consul server: address=10.142.xxx:8502
2022-11-21T19:10:03.151Z [DEBUG] consul-server-connection-manager: gRPC resolver failed to update connection address: error="bad resolver state"
2022-11-21T19:10:03.151Z [ERROR] consul-server-connection-manager: connection error: error="failed to switch to Consul server "10.142.xxx:8502": target sub-connection is not ready (state=TRANSIENT_FAILURE)"
2022-11-21T19:10:03.151Z [DEBUG] consul-server-connection-manager: backoff: retry after=579.142529ms

The Consul servers are on 1.14.0, are at the addresses it discovered, and have ports.grpc_tls set to 8502. I've also verified connectivity to port 8502 with telnet.

Connect injector has these environment variables:

CONSUL_ADDRESSES : exec=discover -q addrs provider=aws tag_key=ConsulCluster tag_value=xxx
CONSUL_API_TIMEOUT : 5s
CONSUL_CACERT_FILE : /vault/secrets/serverca.crt
CONSUL_DATACENTER : dc1
CONSUL_GRPC_PORT : 8502
CONSUL_HTTP_PORT : 8501
CONSUL_LOGIN_AUTH_METHOD : consul-k8s-component-auth-method
CONSUL_LOGIN_DATACENTER : dc1
CONSUL_LOGIN_META : component=connect-injector,pod=$(NAMESPACE)/$(POD_NAME)
CONSUL_USE_TLS : true
NAMESPACE : fieldRef(v1:metadata.namespace)
POD_NAME : fieldRef(v1:metadata.name)

Do you know if there are any other fixes pending that might relate to this? And/or a simple setting that would get me beyond the TRANSIENT_FAILURE error message to something more detailed?

@ndhanushkodi
Contributor

This seems to be an issue with gRPC resolver failed to update connection address: error="bad resolver state": when consul-server-connection-manager loops through the addresses to attempt to dial, it's not able to set the address to dial due to the "bad resolver state". Currently looking into it. Maybe some input to the gRPC resolver used to dial the servers is malformed or invalid.

@mr-miles
Contributor Author

Thanks - is it expecting the gRPC port to be using TLS? I've seen a few changes in that area in Consul 1.14.0: do both grpc and grpc_tls need to be open?

@david-yu
Contributor

@mr-miles Just to double check, could you perhaps send over your server config? Also, are your external servers also on 1.14.0?

@mr-miles
Contributor Author

Hi

Thanks! The server config is below. The external servers are all on 1.14.0 (I initially upgraded the clients to 1.14 while the servers were still on 1.13.3 and nothing worked; upgrading the servers sorted that out).

{
  "advertise_addr": "10.142.xxx",
  "bind_addr": "10.142.xxx",
  "bootstrap_expect": 5,
  "client_addr": "0.0.0.0",
  "datacenter": "dc1",
  "node_name": "i-063xxx",
  "retry_join": [
    "provider=aws region=eu-west-1 tag_key=ConsulCluster tag_value=xxx"
  ],
  "server": true,
  "encrypt": "yF2umsdsfdsf",
  "ca_path": "/opt/consul/root-cert/ca.crt",
  "cert_file": "/opt/consul/tls/agent.crt",
  "key_file": "/opt/consul/tls/agent.key",
  "autopilot": {
    "cleanup_dead_servers": true,
    "last_contact_threshold": "200ms",
    "max_trailing_logs": 250,
    "server_stabilization_time": "10s",
    "redundancy_zone_tag": "az",
    "disable_upgrade_migration": false,
    "upgrade_version_tag": ""
  },
  "ui_config": {
    "enabled": true
  },
  "domain": "consul",
  "ports": {
    "http": -1,
    "https": 8501,
    "dns": 8600,
    "grpc_tls": 8502
  },
  "verify_incoming": false,
  "verify_outgoing": true,
  "verify_incoming_rpc": false,
  "acl": {
    "enabled": true,
    "default_policy": "deny",
    "down_policy": "deny",
    "enable_token_persistence": true,
    "tokens": { "master": "$CONSUL_HTTP_TOKEN" }
  },
  "disable_update_check": true,
  "auto_encrypt": {
    "allow_tls": true
  }
}

@ndhanushkodi
Contributor

ndhanushkodi commented Nov 21, 2022

@mr-miles ah yes, you should only need one or the other (no requirement to have both). There are some features (like cluster peering using mesh gateways) that require the gRPC TLS port. But if you are able to test with only the non-TLS gRPC port by setting "grpc": 8502, we'd be able to narrow down whether this is a TLS issue (as opposed to a connectivity issue).
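As a sketch, that non-TLS test would change the ports stanza of the server config above from grpc_tls to grpc:

```json
"ports": {
  "http": -1,
  "https": 8501,
  "dns": 8600,
  "grpc": 8502
}
```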

In the meantime, I'm also looking into a way to enable trace logs for consul-server-connection-manager, as that would help confirm whether it's a TLS issue.

@ndhanushkodi
Contributor

ndhanushkodi commented Nov 21, 2022

@mr-miles can you try setting global.logLevel=trace in your cluster and post the logs from the connect-injector pod again? I think that would help us see if there are any gRPC TLS issues with the connection.
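For reference, that Helm value is a minimal fragment:

```yaml
global:
  logLevel: trace
```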

@mr-miles
Contributor Author

Well that was easier than expected ... so now I see things like this in the logs:

2022-11-21T23:21:31.687Z [TRACE] consul-server-connection-manager: clientConnWrapper.NewSubConn: addrs=["{\n "Addr": "10.142.xxx:8502",\n "ServerName": "",\n "Attributes": null,\n "BalancerAttributes": null,\n "Type": 0,\n "Metadata": null\n}"]
2022-11-21T23:21:31.696Z [TRACE] consul-server-connection-manager: balancer.UpdateSubConnState: sc="&{{0 0} 0xc0004d6580}" state="{IDLE connection error: desc = "transport: authentication handshake failed: tls: either ServerName or InsecureSkipVerify must be specified in the tls.Config"}"
2022-11-21T23:21:31.696Z [TRACE] consul-server-connection-manager: balancer.UpdateSubConnState: sc="&{{0 0} 0xc0004d6580}" state="{CONNECTING }"
2022-11-21T23:21:31.697Z [TRACE] consul-server-connection-manager: balancer.UpdateSubConnState: sc="&{{0 0} 0xc0004d6580}" state="{TRANSIENT_FAILURE connection error: desc = "transport: authentication handshake failed: tls: either ServerName or InsecureSkipVerify must be specified in the tls.Config"}"
2022-11-21T23:21:33.270Z [TRACE] consul-server-connection-manager: balancer.UpdateSubConnState: sc="&{{0 0} 0xc0004d6580}" state="{IDLE connection error: desc = "transport: authentication handshake failed: tls: either ServerName or InsecureSkipVerify must be specified in the tls.Config"}"
2022-11-21T23:21:33.271Z [TRACE] consul-server-connection-manager: balancer.UpdateSubConnState: sc="&{{0 0} 0xc0004d6580}" state="{CONNECTING }"
2022-11-21T23:21:33.272Z [TRACE] consul-server-connection-manager: balancer.UpdateSubConnState: sc="&{{0 0} 0xc0004d6580}" state="{TRANSIENT_FAILURE connection error: desc = "transport: authentication handshake failed: tls: either ServerName or InsecureSkipVerify must be specified in the tls.Config"}"
2022-11-21T23:21:35.754Z [TRACE] consul-server-connection-manager: balancer.UpdateSubConnState: sc="&{{0 0} 0xc0004d6580}" state="{IDLE connection error: desc = "transport: authentication handshake failed: tls: either ServerName or InsecureSkipVerify must be specified in the tls.Config"}"
2022-11-21T23:21:35.754Z [TRACE] consul-server-connection-manager: balancer.UpdateSubConnState: sc="&{{0 0} 0xc0004d6580}" state="{CONNECTING }"
2022

@ndhanushkodi
Contributor

ndhanushkodi commented Nov 21, 2022

@mr-miles thank you! The way we set the ServerName is through the environment variable CONSUL_TLS_SERVER_NAME, which should be added to the connect-injector pod; it is set through the externalServers.tlsServerName Helm value. Do you have that set on this cluster?

I see that we should probably also update the doc for the tlsServerName value, since it only references HTTPS connections but it is used for more than that.

@mr-miles
Contributor Author

mr-miles commented Nov 21, 2022

OK, no, I don't have it set. Somehow it worked previously without it. Should it be set to server..?

Also, if it's not set and it's just requesting via IP, can it not validate using the IP SANs?

@ndhanushkodi
Contributor

ndhanushkodi commented Nov 21, 2022

It should be server.DATACENTER.DOMAIN. I'd have to look into why it doesn't work with IP SANs; I'm not sure of the details of how the gRPC TLS cert is generated. But I see that when setting up clusters with external servers, we definitely use that value. It may not have been required before because the 1.0.0 release had a large refactor removing the need for client agents, so it's possible a change now requires setting tlsServerName.
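Given the datacenter dc1 and domain consul from the server config posted earlier in this thread, the Helm value would look like this (a sketch):

```yaml
externalServers:
  tlsServerName: server.dc1.consul
```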

@mr-miles
Contributor Author

Yay! Setting it to that has made everything green again - thanks for your help!

I did have a quick look around: the TLS server name is set to server.DATACENTER.DOMAIN by default, but only if cloud.enabled is true, so maybe that should be a bit more permissive given the setting is needed to make this work.
https://github.com/hashicorp/consul-k8s/blob/main/charts/consul/templates/_helpers.tpl#L333

Anyhow, thanks again!

@ndhanushkodi
Contributor

Sounds good, I'll get some additional opinions on whether it makes sense to have a default TLS server name with external servers. It's possible a default wouldn't be desirable because it's hard to know what configuration external servers might have. But for sure we will at least update the values.yaml documentation on that value!

@mr-miles
Contributor Author

I had a look through the code, as I agree with you that having a default name doesn't sit quite right.

There is no way to configure things so it will accept any valid certificate: you either specify an explicit hostname, or set it to not verify anything.

An alternative fix could be, in consul-server-connection-manager: if no server name is specified but TLS verification is still required, use the hostname that initiated the connection. In the case where that hostname is an IP address, the current logic would then correctly check the IP SANs.

@david-yu
Contributor

david-yu commented Dec 7, 2022

@mr-miles could you file the feature request in https://github.com/hashicorp/consul-server-connection-manager? Thanks! I do think it's better to track it there as a feature request; it's a good area for UX improvement where the hostname is an IP address.

@david-yu
Contributor

david-yu commented Dec 7, 2022

Thanks @mr-miles will go ahead and close here and track on that repo moving forward.
