kube2iam never recovers after node failures #78
Also, I'm fairly certain that this is not the same as #46, because I had a simple workaround script for that problem that would wait for the role to be present on my actual application containers (tested before and consistently worked against 1000 concurrently scheduled jobs):

```sh
metadata_url="http://169.254.169.254/latest/meta-data/iam/security-credentials/arn:aws:iam::${account_id}:role/route53-kubernetes-role"
until curl -f --silent "$metadata_url" --output /dev/null; do
  echo "waiting for kube2iam..."
  sleep 3
done
```
Looking at the output from debug, did you manually change the account id?
Yes, sorry. I forgot to mention that.
I can confirm that when I kill the container (i.e. the kube2iam container itself), it recovers properly. I am not sure why restarting a node would be any different than adding a new node. Could it be an issue with the iptables rule, or something else that did not happen when the node was restarted? How do you run the daemon, i.e. with which flags, etc.?
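In case it helps frame that question, below is a minimal sketch of how kube2iam is commonly run per node and of the NAT rule it manages when `--iptables=true`; the interface name, port, host IP, and role ARN prefix are assumptions and vary by cluster/CNI setup.

```sh
# Rough sketch (not the reporter's actual setup) of a per-node kube2iam
# invocation. HOST_IP, account_id, the interface (cbr0), and port 8181 are
# placeholders/assumptions.
kube2iam \
  --iptables=true \
  --host-ip="${HOST_IP}" \
  --host-interface=cbr0 \
  --app-port=8181 \
  --base-role-arn="arn:aws:iam::${account_id}:role/" \
  --verbose

# Roughly the DNAT rule kube2iam installs, redirecting pod traffic bound for
# the metadata IP to the local kube2iam agent:
iptables -t nat -A PREROUTING -d 169.254.169.254/32 -i cbr0 -p tcp --dport 80 \
  -j DNAT --to-destination "${HOST_IP}:8181"
```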
I'm not an expert at iptables, but my initial thought is that the routing is fine. The metadata service is definitely still resolvable from within my containers when the new node comes up; however, it no longer returns my annotated role from the security-credentials path. My only real guess is that the metadata service, or the state around it, is somehow different when an actual node dies/restarts.
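A quick way to check which role the intercepted metadata service is serving to a given pod is to list the security-credentials path from inside it, along the lines of the sketch below; `<my-pod>` is a placeholder and it assumes `curl` is available in the container image. With kube2iam working, this should print the annotated role name; if it prints the node's instance-profile role instead, the intercept isn't taking effect.

```sh
# List the role(s) the metadata endpoint serves to this pod.
# <my-pod> is a placeholder; assumes curl is present in the container image.
kubectl exec <my-pod> -- \
  curl -s http://169.254.169.254/latest/meta-data/iam/security-credentials/
```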
You mean the …?
I've been able to partially replicate this by repeatedly terminating instances and using the above templates, but the error only shows up every now and then. When it does, the errors also appear to stop within a few minutes.
For what it's worth, I haven't tested this issue since I reported it in early June, because we didn't end up using kube2iam. However, newer versions may have fixed the problem or at least made it more self-healing. Feel free to close this if there's confidence that the issue is not prevalent.
I'm relatively new to Kubernetes and kube2iam, but I've been using both for a little over a month now. Earlier today, I attempted to test the fault tolerance of my cluster (created via kops) by terminating nodes and letting the ASG bring them back. My test cluster had one master and two worker nodes, and I terminated two worker nodes at the same time.
However, when kube2iam's Pods were recreated on the newly started nodes, they did not appear to restore the right metadata from the EC2 metadata service as it was before the outage. There are no errors reported in the logs for any of the Pods either (all `debug` or `info`). I can confirm that kube2iam was definitely forwarding credentials properly before.

Below is an example config that I was using to verify before/after, which will now error consistently:
Below is the config I'm using for kube2iam:
This is what the debug service was outputting at localhost:8181/debug/store, if it helps:
... and the Kubernetes cluster is running on 1.5.2.
This is all of the info I can think to provide for now, but please let me know if there's anything else I can provide to help us troubleshoot.