Feature Request: Allow the ACL Replication Time to be a Configurable Value or Wait for the Replication to Happen #2626

Closed
mkentala opened this issue Jul 21, 2023 · 1 comment · Fixed by #2656
Labels
type/enhancement New feature or request

Comments

@mkentala

mkentala commented Jul 21, 2023

Community Note

  • Please vote on this issue by adding a 👍 reaction to the original issue to help the community and maintainers prioritize this request. Searching for pre-existing feature requests helps us consolidate datapoints for identical requirements into a single place, thank you!
  • Please do not leave "+1" or other comments that do not add relevant new information or questions, they generate extra noise for issue followers and do not help prioritize the request.
  • If you are interested in working on this issue or have submitted a pull request, please leave a comment.

Is your feature request related to a problem? Please describe.

During installation of Consul via the Helm chart, the mesh-gateway pod crash-loops with the following errors in the mesh-gateway-init initContainer:

2023-05-18T11:40:53.115Z [ERROR] Unable to read ACL token; retrying: err="Unexpected response code: 403 (ACL not found)"
2023-05-18T11:40:53.115Z [ERROR] Unable to read ACL token from a Consul server; please check that your server cluster is healthy: err="Unexpected response code: 403 (ACL not found)"
2023-05-18T11:40:53.115Z [ERROR] Consul login failed: error="Unexpected response code: 403 (ACL not found)"

To work around this, the mesh-gateway-init container retries reading the ACL token. The number of retries is computed as follows:

numTokenReadRetries := uint64(raftReplicationTimeout.Milliseconds() / tokenReadPollingInterval.Milliseconds())

However, the retry logic only accounts for the local Raft replication timeout, which is currently hard-coded to 2 seconds.

When using WAN Federation via Mesh Gateways, a customer may hit this timeout, because the token must first be replicated from the primary datacenter to the secondary.

In the secondary datacenter, the server makes the blocking-query RPC call that replicates the token only after the 2-second timeout in the mesh-gateway-init container has already expired, as the server logs show:

2023-05-19T23:46:49.205-0400 [TRACE] agent.server: rpc_server_call: method=ACL.TokenList errored=false request_type=read rpc_type=net/rpc leader=true allow_stale=true blocking=true target_datacenter=staging locality=forwarded
2023-05-19T23:46:49.205-0400 [DEBUG] agent.server.replication.acl.token: finished fetching acls: amount=12251
2023-05-19T23:46:49.236-0400 [DEBUG] agent.server.replication.acl.token: acl replication: local=12250 remote=12251
2023-05-19T23:46:49.246-0400 [DEBUG] agent.server.replication.acl.token: acl replication: deletions=0 updates=1
2023-05-19T23:46:49.283-0400 [TRACE] agent.server: rpc_server_call: method=Status.Leader errored=false request_type=read rpc_type=net/rpc leader=true allow_stale=false blocking=false target_datacenter=tce9e0c4-tcapse2001-production locality=local
2023-05-19T23:46:49.405-0400 [TRACE] agent.server: rpc_server_call: method=ACL.TokenBatchRead errored=false request_type=read rpc_type=net/rpc leader=true allow_stale=true blocking=false target_datacenter=staging locality=forwarded
2023-05-19T23:46:49.405-0400 [DEBUG] agent.server.replication.acl.token: acl replication - downloaded updates: amount=1
2023-05-19T23:46:49.405-0400 [DEBUG] agent.server.replication.acl.token: acl replication - performing updates
2023-05-19T23:46:49.410-0400 [TRACE] agent.server: rpc_server_call: method=ACL.TokenSet errored=false request_type=unreported rpc_type=internal leader=true
2023-05-19T23:46:49.410-0400 [DEBUG] agent.server.replication.acl.token: acl replication - upserted batch: number_upserted=1 batch_size=323
2023-05-19T23:46:49.410-0400 [DEBUG] agent.server.replication.acl.token: acl replication - finished updates
2023-05-19T23:46:49.410-0400 [DEBUG] agent.server.replication.acl.token: ACL replication completed through remote index: index=13500598

Manual Workaround

  • Redirect the output (stdout and stderr) of consul-k8s-control-plane acl-init to a file and run it in the background.
  • Add a sleep timer (30 to 45 seconds) after it.

Both changes are made in consul-k8s/charts/consul/templates/mesh-gateway-deployment.yaml:

....
consul-k8s-control-plane acl-init \
    -component-name=mesh-gateway \
    -token-sink-file=/consul/service/acl-token \
    -acl-auth-method=consul-consul-k8s-component-auth-method-tc207a81-production \
    -primary-datacenter=staging \
    -consul-api-timeout=5s \
    -log-level=info \
    -log-json=false > /tmp/aclinit.log 2>&1 &


sleep 30

consul-k8s-control-plane service-address \
    -log-level=info \
    -log-json=false \
    -k8s-namespace=consul \
    -name=consul-consul-mesh-gateway \
    -output-file=/tmp/address.txt
WAN_ADDR="$(cat /tmp/address.txt)"
WAN_PORT="8443"

Feature Description

We request that the ACL replication time between datacenters be taken into account when setting this timeout, for example by making the timeout value configurable or by waiting for replication to complete.

Use Case(s)

Consul on Kubernetes WAN Federation via Mesh Gateways
Consul: v1.13.3+ent
Consul-k8s: v0.49.8

@mkentala mkentala added the type/enhancement New feature or request label Jul 21, 2023
@mkentala mkentala changed the title Feature Request: All the ACL Replication Time to be a Configurable Value or Wait for the Replication to Happen Feature Request: Allow the ACL Replication Time to be a Configurable Value or Wait for the Replication to Happen Jul 21, 2023
@thisisnotashwin
Contributor

The timeout has now been bumped from 2 seconds to 60 seconds in #2656 to prevent this from happening. We chose not to make it configurable because this field is nestled in the weeds of how we perform login, and it felt like it would be very hard, from a UX perspective, to describe it cleanly via a line item in the values file. Bumping it to 60 seconds, though, should allow ample time for replication to complete successfully, even in environments where ping times are very high.
