Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ExpiredTokenException Errors #220

Closed
cstewart87 opened this issue Nov 2, 2016 · 32 comments
Closed

ExpiredTokenException Errors #220

cstewart87 opened this issue Nov 2, 2016 · 32 comments

Comments

@cstewart87
Copy link

Hi, is there any way to refresh the boto connection? We have some long running processes that are causing this connection to expire. Or, is there any way to pass a connection length?

@jantman
Copy link
Owner

jantman commented Nov 3, 2016

@cstewart87 I've honestly never seen this issue before...

How are you running awslimitchecker, and how are you getting your credentials? Could you please post the information that's requested in the Issue Template, specifically your awslimitchecker version, how you installed it and your Python version, and the output of a problematic run (ideally with -vv for debug-level output)?

All of the credential handling is performed by boto, and I'm not sure there's a real way to do this... but I might be able to get a better idea with the above information...

@aebie
Copy link

aebie commented Dec 6, 2016

@jantman I've been working on this issue with Curtis, and it looks like the problem may be in the _boto3_connection_kwargs method from the Connectable class. The credentials is global to the class, and once it's set I do not see a place where it gets reset, regardless of whether a new instance is created. Is there a reason that credentials is not an instance variable?

class Connectable(object):

"""
Mix-in helper class for connecting to AWS APIs. Centralizes logic of
connecting via regions and/or STS.
"""

# Class attribute to reuse credentials between calls
credentials = None

@property
def _boto3_connection_kwargs(self):
    """
    Generate keyword arguments for boto3 connection functions.
    If ``self.account_id`` is None, this will just include
    ``region_name=self.region``. Otherwise, call
    :py:meth:`~._get_sts_token` to get STS token credentials using
    `boto3.STS.Client.assume_role <https://boto3.readthedocs.org/en/
    latest/reference/services/sts.html#STS.Client.assume_role>`_ and include
    those credentials in the return value.

    :return: keyword arguments for boto3 connection functions
    :rtype: dict
    """
    kwargs = {'region_name': self.region}
    if self.account_id is not None:
        if Connectable.credentials is None:
            logger.debug("Connecting for account %s role '%s' with STS "
                         "(region: %s)", self.account_id, self.account_role,
                         self.region)
            Connectable.credentials = self._get_sts_token()
        else:
            if self.account_id == Connectable.credentials.account_id:
                logger.debug("Reusing previous STS credentials for "
                             "account %s", self.account_id)
            else:
                logger.debug("Previous STS credentials are for account %s; "
                             "getting new credentials for current account "
                             "(%s)", Connectable.credentials.account_id,
                             self.account_id)
                Connectable.credentials = self._get_sts_token()
        kwargs['aws_access_key_id'] = Connectable.credentials.access_key
        kwargs['aws_secret_access_key'] = Connectable.credentials.secret_key
        kwargs['aws_session_token'] = Connectable.credentials.session_token
    elif self.profile_name is not None:
        # use boto3.Session to get credentials from the named profile
        logger.debug("Using credentials profile: %s", self.profile_name)
        session = boto3.Session(profile_name=self.profile_name)
        credentials = session._session.get_credentials()
        kwargs['aws_access_key_id'] = credentials.access_key
        kwargs['aws_secret_access_key'] = credentials.secret_key
        kwargs['aws_session_token'] = credentials.token
    else:
        logger.debug("Connecting to region %s", self.region)
    return kwargs

def connect(self):
    """
    Connect to an AWS API via boto3 low-level client and set ``self.conn``
    to the `boto3.client <https://boto3.readthed
    ocs.org/en/latest/reference/core/boto3.html#boto3.client>`_ object
    (a ``botocore.client.*`` instance). If ``self.conn`` is not None,
    do nothing. This connects to the API name given by ``self.api_name``.

    :returns: None
    """
    if self.conn is not None:
        return
    kwargs = self._boto3_connection_kwargs
    self.conn = boto3.client(self.api_name, **kwargs)
    logger.info("Connected to %s in region %s", self.api_name,
                self.conn._client_config.region_name)

def connect_resource(self):
    """
    Connect to an AWS API via boto3 high-level resource connection and set
    ``self.resource_conn`` to the `boto3.resource <https://boto3.readthed
    ocs.org/en/latest/reference/core/boto3.html#boto3.resource>`_ object
    (a ``boto3.resources.factory.*.ServiceResource`` instance).
    If ``self.resource_conn`` is not None,
    do nothing. This connects to the API name given by ``self.api_name``.

    :returns: None
    """
    if self.resource_conn is not None:
        return
    kwargs = self._boto3_connection_kwargs
    self.resource_conn = boto3.resource(self.api_name, **kwargs)
    logger.info("Connected to %s (resource) in region %s", self.api_name,
                self.resource_conn.meta.client._client_config.region_name)

@aebie
Copy link

aebie commented Dec 7, 2016

I would add that out application is multi threaded with each thread handling a different account. I suspect maybe the credentials is getting overwritten with credentials for another account?

@jantman
Copy link
Owner

jantman commented Dec 9, 2016

@cstewart87 @aebie

I'm sorry for the issues you're experiencing.

First off, awslimitchecker was never written with multi-threaded apps in mind; I have no idea what code paths may be there that aren't thread-safe.

I can say, for certain, that this is also the default behavior of boto3/botocore, the underlying libraries; boto3 uses a global to store its session information and credentials. There may be a way to override this behavior, but I'm not sure offhand, and awslimitchecker certainly doesn't do it now.

It was also really not written with multiple connections or long-running processes in mind. Connectable.credentials is intentionally not an instance variable specifically so that every connection will use the same credentials... mainly because that's the only use case that's come up so far, and because that greatly simplifies credential handling internally.

There's a way around this, but it's also not thread-safe. The workaround I use in the integration tests, which make multiple connections in the same process, is just to reset the globals back to None between connections:

Connectable.credentials = None
boto3.DEFAULT_SESSION = None

I've honestly never thought of a use case where awslimitchecker would be doing so much work that multithreading would be required. My gut reaction would be to suggest distributing the work among a queue of separate processes, and collecting the results.

Also, I don't know if this application is purely internal or customer-facing, but it's probably worth a reminder that awslimitchecker is licensed under the Affero GPL (AGPL) v3; anyone who can interact with the program over a network or is receiving the results of awslimitchecker is entitled to a clear attribution aod source code offer.

@aebie
Copy link

aebie commented Dec 9, 2016

@jantman we have plugged awslimitchecker into an existing framework running 50+ other checks that run across multiple accounts. There are plans in place to distribute the workload using Lambda functions, but it's not going to happen soon so we would need to do something in the interim. As far as thread-safeness, there is the issue of the boto3 default session which cause occasional errors like "KeyError: 'endpoint_resolver'". The fix would be to create a session per thread rather than using the default session, but this is not a pressing issue right now for us because these are very infrequent and the framework compensates. I think the very frequent errors we are having with awslimitchecker might be due to collisions with accounts running in parallel and resetting the credentials field.

I understand the concern about making the credentials an instance var. We do all awslimitchecker calls for the same account using the same instance, so it would be OK here but other implementations could have an impact. An alternative might be to replace the single instance of credentials with a dict where the key was the account name, and to add a check on Credentials.expiration to Connectable._boto3_connection_kwargs(). I think that would resolve our issue without impacting anyone else.

We can submit a PR for this change, but want to verify the solution with you before proceeding.

@jantman
Copy link
Owner

jantman commented Dec 10, 2016

@aebie

Before I got a lot farther on this, I do want to ask again explicitly... is your use case checking for internal purposes, or is this information that's somehow used for or given to clients? (i.e. you're not with CloudCheckr/CloudHealth/etc. are you?)

Hmm.... I'm not sure how I feel about the dict idea... my gut reaction is that someone else with a credential-related issue will end up being bitten by it... That being said, I'm only one person, and I don't necessarily always have the best ideas.

Without digging into the code, just from memory, I think I like the idea of adding a check on Credentials.expiration in the _boto3_connection_kwargs() method.

So... here's a thought: Maybe, what is now Connectable.credentials should really be AwsLimitChecker.credentials. If it were an instance variable on the AwsLimitChecker object itself... I guess we'd need to pull it out of the Connectable classes (i.e. the first time we construct a Connectable class, we'd need to do something like self.credentials = someConnectableInstance.get_credentials()), and then pass it in to subsequent Connectable subclass constructors, but I think it would net the same result as now (allowing all Connectable instances to share creds, especially when those are assumed role or MFA creds that we want to use as long as possible). We'd still need to do something on the boto3 side though... I guess instead of directly calling boto3.client() and boto3.resource() which use boto3.DEFAULT_SESSION, we'd need to explicitly construct a new boto3 Session per Connectable instance and store it in the instance, and call .client() and .resource() on that instead.

Even though this project is up to almost 10k downloads/month, I'm still the only consistent contributor. I certainly don't have all the answers or best ideas, and it's really unfortunate for me to be working on something like this (it's my most visible/popular personal project by orders of magnitude) with no other humans to bounce ideas off of and sanity check me. @aebie and @cstewart87 if you think my suggestion is either crazy, unnecessary or unfortunately complicated, tell me so, and I guess we'll give the dict idea a try and see if anyone has issues with it. I'd really appreciate and be happy for the level of opinion and honesty you'd give someone on your own team.

Thanks!
Jason

@aebie
Copy link

aebie commented Dec 12, 2016

@jantman I think the idea you suggest is a more correct fix for this issue, and you make it both thread safe an allow for long running connections, but it would be a more complex change. Wondering if it would be better to construct the boto3 session in the awslimitchecker as well as the credentials in the same manner. We have someone on this side who can do the change once we nail down the design.

@jantman
Copy link
Owner

jantman commented Dec 12, 2016

Constructing the session and creds in awslimitchecker itself feels a bit strange to me, as they're pretty closely tied to the rest of the logic in Connectable. I'll try and give this some more thought later today, and see if I can work out what design feels best to me.

I've gone through this with everyone who's requested support or opened issues and appears to be using awslimitchecker at scale, but can you confirm that you and/or your team and employer understand the license this code is released under, and the implications that has if you wish to use it as part of a SaaS or hosted solution?

@aebie
Copy link

aebie commented Dec 12, 2016

@jantman sorry, I didn't answer you first question. Curtis and I are using awslimitchecker for internal purposes only. We are not with with CloudCheckr/CloudHealth/etc.

@jantman
Copy link
Owner

jantman commented Dec 13, 2016

Ok, thanks for clarifying. I'd have nothing wrong with commercial services using the code provided they abide by the license... but that doesn't seem terribly likely, and unfortunately there's a lot of confusion and misunderstanding around the AGPL. Sorry for pushing the issue, but this is purely a personal-time project for me...

Ok, so as to the question at hand, I've given this a bit more thought... and I think your last suggestion about moving the creds and boto3 session construction into AwsLimitChecker itself is the right way to go. My thoughts/theory:

Right now, the TrustedAdvisor class and all of the services.* classes (via base._AwsService) utilize the Connectable mixin, and take a whole bunch of keyword arguments related to connections (profile_name, account_id, account_role, region, external_id, mfa_serial_number and mfa_token).

All of those classes are only instantiated in one place (aside from the unit tests), checker.AwsLimitChecker.init().

So, I think this would be my plan:

  1. Get rid of Connectable.credentials alltogether.
  2. Move Connectable._boto3_connection_kwargs and Connectable._get_sts_token to checker.AwsLimitChecker itself, leaving Connectable with only the .connect() and .connect_resource() methods.
  3. Refactor services.base._AwsService and trustedadvisor.TrustedAdvisor constructors to replace all of those connection-related arguments (profile_name, account_id, account_role, region, external_id, mfa_serial_number and mfa_token) with a single boto_connection_kwargs dictionary.
  4. In checker.AwsLimitChecker.__init__(), call the new (moved from Connectable) self._boto3_connection_kwargs() method, and pass the resulting dict into each of the Services constructors and the TrustedAdvisor constructor.

So the signature of base._AwsService would change from:

    def __init__(self, warning_threshold, critical_threshold,
                 profile_name=None, account_id=None, account_role=None,
                 region=None, external_id=None, mfa_serial_number=None,
                 mfa_token=None):

to:

    def __init__(self, warning_threshold, critical_threshold,
                 boto_connection_kwargs):

If you have someone who can try the implementation of this, that's great. If there are some things that I need to fix up, I can do that, but I probably won't be able to get around to it until this weekend. That being said, I can probably cut a new release this weekend.

A few things to keep in mind, though I can do some of these if need be:

  • There may be some references to this in the written documentation that need to be updated (docs/source/*.rst; the stuff here). If not, I'll probably document this.
  • This will affect quite a lot of the tests, both unit and integration. I don't believe the integration tests run for PRs, so I'll probably need to do those.
  • Luckily, I don't believe any of the service classes themselves will need to change.
  • When someone cuts the PR, please add a line to the top of the changelog referencing this issue. Please feel free to give credit in the changelog entry to your company and/or yourself and the other individuals who worked on this. However you want to do that is up to you - real names, GitHub usernames, company name, whatever, but I try to make sure contributors are recognized there.

Thanks so much for all of your help with this!!!! If you or whoever is doing the work have any questions, feel free to ask, and if there's anything that's unclear I'm happy to rework the PR a bit if needed.

@jantman jantman added this to the 0.7.0 milestone Dec 13, 2016
@aebie
Copy link

aebie commented Dec 13, 2016

Thanks @jantman. I think we have a person to do the work on this side. I have a ticket connected to this issue and will pass it to him.

@jantman
Copy link
Owner

jantman commented Dec 14, 2016

Great, thanks so much! Please ask him to sync up with me on Friday about how the work is going, as I'm probably going to be dedicating a bunch of time this weekend to working through a bunch of issues and planning a release.

@aebie
Copy link

aebie commented Dec 15, 2016

@jantman Wanted to let you know I heard from Will today and he is in the middle of a release right now so will not be able to engage until Monday.

@jantman
Copy link
Owner

jantman commented Dec 16, 2016

@aebie Ok, no worries. I'm going to try and spend some time this weekend working on the next release (the 0.7.0 milestone), depending on how busy I am with personal stuff. I doubt I'll get through all that and also tackle this myself, but there's a slim chance.

I'm also going to spend some time debating whether the next release should just be 1.0.0, as it seems like enough people have been running this long enough that it's probably time for that...

@jantman
Copy link
Owner

jantman commented Dec 16, 2016

Also, that you and your colleagues so much for working through this with me. This is my only personal project that gets much use or attention, so it's always wonderful to know that others are making use of it, and willing to contribute back!

@jantman jantman added the ready label Dec 17, 2016
@spulec
Copy link

spulec commented Dec 20, 2016

Agh, this is what I get for trying to solve a bug myself before reading into the open issues for the project...

I hit this the other day as well. I put together a patch to have the Connectable class refresh credentials if they are expired. I agree that the approach you all have been discussing is a better solution though.

Patch below in case anyone wants it in the near-term. Thanks again for your work on this project.

diff --git a/awslimitchecker/connectable.py b/awslimitchecker/connectable.py
index 7c5c300..8d768d0 100644
--- a/awslimitchecker/connectable.py
+++ b/awslimitchecker/connectable.py
@@ -37,6 +37,7 @@ Jason Antman <jason@jasonantman.com> <http://www.jasonantman.com>
 ################################################################################
 """

+import datetime
 import logging
 import boto3

@@ -98,7 +99,16 @@ class Connectable(object):
                 Connectable.credentials = self._get_sts_token()
             else:
                 if self.account_id == Connectable.credentials.account_id:
-                    logger.debug("Reusing previous STS credentials for "
+                    creds_expire = Connectable.credentials.expiration.replace(
+                        tzinfo=None)
+                    if creds_expire < datetime.datetime.utcnow():
+                        logger.debug("Refreshing expired STS credentials. "
+                            "Connecting for account %s role '%s' with STS "
+                            "(region: %s)", self.account_id, self.account_role,
+                            self.region)
+                        Connectable.credentials = self._get_sts_token()
+                    else:
+                        logger.debug("Reusing previous STS credentials for "
                                  "account %s", self.account_id)
                 else:
                     logger.debug("Previous STS credentials are for account %s; "

@jantman
Copy link
Owner

jantman commented Dec 20, 2016

@spulec :( I've done that many times myself. Sorry you were hitting this bug as well... but I think that logic you included in that patch is probably good to include anyway, even with the refactoring we've discussed.

Thanks for sharing it!

@jantman
Copy link
Owner

jantman commented Dec 20, 2016

@spulec PS - I knew your username sounded familiar! Funny it's such a small world, I was just working on #239 last night and adding a bunch of tests that use freezegun. It's an incredibly handy tool, thanks so much for it!

@jantman
Copy link
Owner

jantman commented Dec 20, 2016

@aebie

  1. Has Will been able to start on this? If not, I may be able to in the next few days. I know the holidays are coming up, and I imagine a lot of people are taking time off or working reduced schedules.
  2. If he's going to be working on it, please make sure he's basing off of the develop branch, not master. It has quite a number of changes in it, which are slated for the 0.7.0 release.

Right now, this is the last open issue before 0.7.0. Once this is fixed and complete, I'll be cutting that release.

@aebie
Copy link

aebie commented Dec 21, 2016

@jantman I've contacted Will and he should be engaging on this thread. I work in a remote location so do not know what his vacation plans are, but I will be out after today until the 29th.

@jantman
Copy link
Owner

jantman commented Dec 21, 2016

Ok. Well enjoy your time off!

@jantman
Copy link
Owner

jantman commented Jan 3, 2017

@aebie just wanted to check up on this. Everything else for the 0.7.0 milestone is staged in the develop branch and ready for release, so once this is done the release will go out.

@aebie
Copy link

aebie commented Jan 3, 2017

@jantman Will had a fist pass at the code last Friday. I need to track him down and get him to do a PR here, I will do that today.

@willusher
Copy link

@jantman - I've got a first cut at this, but I'm having difficulty getting the tests to pass. You can take a look here.

How does the 'foo' profile get setup during tests? When I run the tests, it checks my local ~/.aws/config file, and fails because that profile_name doesn't exist.
https://github.com/jantman/awslimitchecker/blob/master/awslimitchecker/tests/test_checker.py#L218

@jantman
Copy link
Owner

jantman commented Jan 5, 2017

@willusher I'm so sorry, the notifications for this only go to my personal email, and I've been so busy at work after being on vacation for a week that I haven't even checked it until now...

The "foo" profile shouldn't get setup; the unit tests should be mocking out every call to boto/boto3.

Take a look at the tests for connectable; essentially what you'll have to do is move the tests for _get_sts_token and _boto3_connection_kwargs from the connectable test class to the checker test class, and then mock out _boto3_connection_kwargs in all of the tests for Checker.__init__.

@aebie
Copy link

aebie commented Jan 6, 2017

@jantman Was talking to Will today and testing his code and functionally it looks good but the unit test changes are pretty extensive. He's churning through them but wanted to let you the status.

@jantman
Copy link
Owner

jantman commented Jan 6, 2017

@aebie ok, thanks so much for the update!

willusher pushed a commit to willusher/awslimitchecker that referenced this issue Jan 11, 2017
Move creds and boto3 session construction into AwsLimitChecker to avoid
session timeouts in long-lived sessions.

Fixes GH issue jantman#220. See thread for detailed discussion of this change.

Thanks to @jantman for all the help on this.

----

This contribution is being made under the same license as the
awslimitchecker project (AGPL, or any subsequent version of that
license as adopted by awslimitchecker), and may perpetually be
included in and distributed with awslimitchecker.
willusher pushed a commit to willusher/awslimitchecker that referenced this issue Jan 11, 2017
Move creds and boto3 session construction into AwsLimitChecker to avoid
session timeouts in long-lived sessions.

Fixes GH issue jantman#220. See thread for detailed discussion of this change.

Thanks to @jantman for all the help on this.

----

This contribution is being made under the same license as the
awslimitchecker project (AGPL, or any subsequent version of that
license as adopted by awslimitchecker), and may perpetually be
included in and distributed with awslimitchecker.
willusher pushed a commit to willusher/awslimitchecker that referenced this issue Jan 12, 2017
Move creds and boto3 session construction into AwsLimitChecker to avoid
session timeouts in long-lived sessions.

Fixes GH issue jantman#220. See thread for detailed discussion of this change.

Thanks to @jantman for all the help on this.

----

This contribution is being made under the same license as the
awslimitchecker project (AGPL, or any subsequent version of that
license as adopted by awslimitchecker), and may perpetually be
included in and distributed with awslimitchecker.
jantman added a commit that referenced this issue Jan 15, 2017
@jantman
Copy link
Owner

jantman commented Jan 15, 2017

I've merged the PR for this (with some minor changes) to develop. That was the last issue for 0.7.0, so I'm going to go through the release checklist now, and the release should be cut this morning. I'll update here when it is, to confirm.

For clarity and the record, I'll reproduce the tail end of the changelog entry here:

I have to extend my deepest gratitude to the folks who identified and fixed this issue, specifically @cstewart87 for the initial bug report and description, @aebie for the tireless and relentlessly thorough investigation and brainstorming and for coordinating work for a fix, and @willusher for the final implementation and dealing (wonderfully) with the dizzying complexity of many of the unit tests (and even matching the existing style).

@spulec A fix for the long-lived credentials is being released in 0.7.0, which should go out today. The gist of it is that connection parameters (and STS credentials) will now be an instance variable of the AwsLimitChecker class itself, so every time you construct a new one, it will get new credentials.

Can you confirm if this fixes the bug for you? The implementation does not include your patch, so there's still no explicit checking of expiration time. If this fix works for you, cool! If not, please open a new issue for that, and in the meantime you can catch ExpiredTokenException and construct a new AwsLimitChecker class.

@jantman
Copy link
Owner

jantman commented Jan 15, 2017

This has been released in 0.7.0 and is live on pypi.

@spulec
Copy link

spulec commented Jan 18, 2017

That works for our current use case. Thanks all!

@cstewart87
Copy link
Author

Awesome, glad this made it out in the latest release! This should help us out a lot. Thanks for your support @jantman!

@aebie
Copy link

aebie commented Jan 19, 2017

And thanks you @jantman for this product.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

5 participants