Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Pods are sometimes assigned to the incorrect IAM role #46

Closed
smelchior opened this issue Jan 31, 2017 · 21 comments
Closed

Pods are sometimes assigned to the incorrect IAM role #46

smelchior opened this issue Jan 31, 2017 · 21 comments

Comments

@smelchior
Copy link

smelchior commented Jan 31, 2017

In our cluster sometimes the pods that have a proper "iam.amazonaws.com/role" annotation do not receive their role when they start up. kube2iam returns the default role to them which in our case does not have any permissions. After some time of the application requests the credentials again, it gets the proper assignment.
Relevant log messages:

level=info msg="Requesting /latest/meta-data/iam/security-credentials/"
level=warning msg="Using fallback role for IP 10.233.109.12"
level=info msg="Requesting /latest/meta-data/iam/security-credentials/kube.no-permissions"
level=warning msg="Using fallback role for IP 10.233.109.12"
.... some time later ....
level=info msg="Requesting /latest/meta-data/iam/security-credentials/"
level=info msg="Requesting /latest/meta-data/iam/security-credentials/kube.kube-system.route53-kubernetes"

I am not really sure how to debug this further, it might be related to the issues described in #32

@jtblin
Copy link
Owner

jtblin commented Feb 23, 2017

I guess it's possible that the pod tried to access its IAM role before the onAdd method is completed which retrieves the annotation via the kubernetes API. Perhaps retrying there for a short amount of time could solve this issue. Will look into this soon.

@struz
Copy link
Collaborator

struz commented Mar 14, 2017

After investigating a little bit it seems that the precision on this race condition is quite tight (i.e. sub-second).

time="2017-03-10T00:48:25Z" level=warning msg="Using fallback role for IP 10.254.30.13"
time="2017-03-10T00:48:25Z" level=warning msg="Role:arn:aws:iam::<account>:<role> on namespace: not found."
time="2017-03-10T00:48:25Z" level=debug msg="Pod OnUpdate kube2iam-test - 10.254.30.13"
time="2017-03-10T00:48:25Z" level=debug msg="Pod OnDelete kube2iam-test - "
time="2017-03-10T00:48:25Z" level=debug msg="Pod OnAdd kube2iam-test - 10.254.30.13"
time="2017-03-10T00:48:25Z" level=debug msg="- Role arn:aws:iam::<account>:role/kube-test"

There are two edge cases I can think of here:

  • pod created, role requested before OnAdd finishes, returns default role even if pod is annotated
  • pod updated / restarted, restarts and requests role before OnUpdate finishes, applies the old role that was stored

To fix both of these at once I can only think of adding a static delay of somewhere between 0 and 1 second. Exponential backoff isn't really useful to fix the second case. Hopefully it won't have too much of an impact on apps either, since they should be caching credentials.

We could add this delay to replace (or augment) the existing exponential backoff in server.getRole.

@smelchior
Copy link
Author

Thanks for looking into this. I guess adding the static delay is the best option for now. It would be great if that could be implemented.

@jtblin
Copy link
Owner

jtblin commented Mar 14, 2017

I don't think we'd want to add a static delay before getting the IAM role but maybe I misunderstand what you mean. There is another issue that getting the role on-demand (when a request for credentials comes in) triggers timeouts in some AWS clients.

The issue from @smelchior here is specifically about fallback i.e. "Using fallback role for IP 10.233.109.12" so I think an exponential backoff wrapping the block that retrieves from the rolesByIP (https://github.com/jtblin/kube2iam/blob/master/cmd/store.go#L28) map would work and cover for both cases. Perhaps we should wait that we move to IndexInformer before doing that though.

@struz
Copy link
Collaborator

struz commented Mar 14, 2017

IndexInformer will probably fix this too, so maybe we should wait.

@ash2k
Copy link
Contributor

ash2k commented Mar 14, 2017

I agree with Jerome - sleeps are not the best way.
I think a better option is to decouple requests for roles from lookup from store using a channel (a la pub-sub). Something like this:
Each credentials lookup submits a subscription request to a channel. Request itself contains a channel for response and a context to tell when it is no longer needed. Submitting goroutine blocks on read from that response channel (with a timeout of course).
There is a goroutine which reads requests from that channel and keeps record of all the IPs with pending requests. This goroutine either immediately requests credentials if pod is found or puts the requests in the queue (map ip->requests slice). Every time informer gets an event about a pod it looks up the queue to see if there are any interested parties. If there are, then execute the credentials lookup and return the result.
This is somewhat similar to what I've implemented in https://github.com/atlassian/gostatsd/blob/master/pkg/statsd/cloud_handler.go

The best course of action at the moment is to refactor it to use indexed informer, but it will not magically fix this issue (if I understand it correctly).

p.s. I'm sorry if I misunderstood the issue and this is completely off.

@ankon
Copy link
Contributor

ankon commented Mar 14, 2017

Looking at the code for updating a pod (OnAdd/OnDelete) and the store logic: isn't the easiest fix here to extend the coverage of the store mutex so that it covers the whole of OnAdd/OnDelete? Now there are points in the process where Get would see an inconsistent state.

@struz
Copy link
Collaborator

struz commented Mar 14, 2017

What @ankon said is correct, I didn't even think of that. It will work unless the request for the role is received before the api update is sent. As he said, I suspect they are received almost in tandem right now, and the mutex covering more space might always allow the api server event to complete firing first.

@ash2k
Copy link
Contributor

ash2k commented Mar 19, 2017

BTW looks like store has quite a few data races - this can be a reason for flaky behaviour as well. What I can see without digging/testing:

  • rolesByNamespace, namespaceByIP fields are read without holding a mutex in few places while written with the mutex held
  • DumpRolesBy*() methods return map without copying it under a mutex. Reads from debug endpoint will by racy

@jtblin
Copy link
Owner

jtblin commented Mar 19, 2017

Good catch! Although unlikely to be the root cause of this issue as it was opened before namespaces were added.

I don't think the store mutex is going to cover everything e.g. pod making trying to get IAM credentials before it has been added to the store but it would be a good improvement already. The exponential backoff strategy is actually probably not the right answer either because it would delay the fallback even when fallback is desired. I am not sure the channel approach would solve the underlying issue either.

I think the underlying issue is the same as #31. Prefetching credentials would solve the issue here as well, there was a PR originally but I rejected it because it could potentially DDOS the AWS API limit and create too many routines in kube2iam. We may want to review the approach and coming up with a gentler approach that prefetches credentials and refresh them.

@ash2k
Copy link
Contributor

ash2k commented Mar 20, 2017

I'm pretty sure the channel approach will at least reduce the response latency. It is instantaneous "push" as soon as information is available instead of polling with exponential backoff.

@jtblin
Copy link
Owner

jtblin commented Mar 20, 2017

There is no polling with exponential backoff here, it's reading directly from a map which is instantaneous. I was proposing to add exponential backoff instead of the sleep proposed by @struz but as said above, I don't think that's the right strategy either.

The only exponential backoff in the code is around this block for the entire store.Get i.e. it cannot be the original issue here as it would have fallen back to the default role already at this point.

@struz
Copy link
Collaborator

struz commented Mar 20, 2017

One problem is that it will fall back to the default role even when a role is specified sometimes. Because the role is asked for before the map is finished updating the system just assumes it has no role and serves it the default one with no back off

@jtblin
Copy link
Owner

jtblin commented Mar 20, 2017

It uses a mutex to read and write from the map so I don't think that's correct:

// Get returns the iam role based on IP address.
func (s *store) Get(IP string) (string, error) {
	s.mutex.RLock()
	defer s.mutex.RUnlock()
	if role, ok := s.rolesByIP[IP]; ok {
		return role, nil
	}
	if s.defaultRole != "" {
		log.Warnf("Using fallback role for IP %s", IP)
		return s.defaultRole, nil
	}
	return "", fmt.Errorf("Unable to find role for IP %s", IP)
}

func (s *store) AddRoleToIP(pod *api.Pod, role string) {
	s.mutex.Lock()
	s.rolesByIP[pod.Status.PodIP] = role
	s.mutex.Unlock()
}

@jtblin
Copy link
Owner

jtblin commented Mar 20, 2017

Sorry you're right for the update case, there could be a race condition that a mutex around delete and add could solve.

@pingles
Copy link

pingles commented Jun 1, 2017

We had this problem in one of our production clusters just over a week ago- we noticed that some processes would go unhealthy periodically and through the logs noticed they were attempting to use credentials for a different role (and thus being denied access to the AWS resources they needed). We'd been fine for a few months but we crossed some kind of cluster load that meant stuff started breaking.

I started working on moving code around to support broader locks but made enough changes (both to address this and to incorporate prefetching and other bits) it ended up easier to just start from scratch. To address this (the incorrect IAM role) issue we found a thread-safe cache inside k8s.io/client-go that can watch + sync pod state.

kiam (which I'd like to think is something like the child of kube2iam 😄 ) is at https://github.com/uswitch/kiam if its helpful.

@jtblin jtblin closed this as completed in e815c20 Jun 5, 2017
@jtblin
Copy link
Owner

jtblin commented Jun 5, 2017

Ah... I just added a lock to the update of pods and namespaces as originally pointed out by @ankon. That should be enough to fix the problem.

@pingles I'd be interested to know how many pods you have running on your cluster with the prefetch of credentials and if you're hitting issues with AWS API limits. It's one of the things I'm wary of with this, especially when the daemon starts and need to fetch the credentials for all the pods (potentially on multiple nodes), as after a cursory glance at your code, it doesn't seem you have a rate limiter or backoff with exponential retries.

@pingles
Copy link

pingles commented Jun 5, 2017

@jtblin we're running 10s of nodes in each cluster (around 15-30 depending on load) with 6,600 pods running now, although a far smaller number would have role annotations (we don't attach default roles to stuff that doesn't need AWS access).

We've not had any errors from AWS API limits as yet. As you say, the number of API calls grows as nodes * roles and the credential cache tracks around 30-40 roles currently. The prefetcher runs in its own go routine so fetches are serial per-node but with a 400ms per AssumeRole call (95th percentile) there's a kind of built-in rate-limit :) Across all clusters we see a peak API call per-second of around 50 so its not optimal but puts it under whatever limit AWS specify internally (although maybe that could be different per-account).

We were planning on wrapping the AWS client with a rate limiter/retry gateway but we needed something we could run asap and we didn't hit the problem so pushed it down the priority.

@smelchior
Copy link
Author

Thanks for fixing this @jtblin! Can you tell already when there will be a new release that includes the fix?

@sibrahim001982
Copy link

Hi

Was this fixed?

@jrnt30
Copy link
Collaborator

jrnt30 commented May 8, 2018

@sibrahim001982 The known causes for this situation have been fixed and integrated, yes. There have been a few explicit improvements to the way that Kube2iam handles the pod inventory, but there may be times when there is a very large spike in pods that would still result in some observed issues.

If you are experiencing a specific issue I would recommend opening up a ticket and providing as much information as possible about the issue.

Some things to do beforehand are:

  • Increase the logging level on the server
  • Enable the debug endpoint and see if that is telling
  • Ensure you're on the latest version
  • Provide any reproduction instructions

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

8 participants