Cache describe_regions using lru_cache from stdlib #803

Merged
zmoog merged 8 commits into main from zmoog/cache-describe-regions
Sep 23, 2024

@zmoog zmoog commented Sep 21, 2024

What does this PR do?

Caches the response of EC2:DescribeRegions API calls.

Why is it important?

On high-volume deployments, ESF can hit the EC2:DescribeRegions API requests limit, causing throttling errors like the following:

An error occurred (RequestLimitExceeded) when calling the DescribeRegions operation (reached max retries: 4): Request limit exceeded.

ESF needs the list of existing regions to parse incoming events from the cloudwatch-logs input. Since new AWS region additions do not happen frequently, picking up and caching the list of existing regions at function startup seems adequate.

The list of existing AWS regions is available at https://aws.amazon.com/about-aws/global-infrastructure/regions_az/

Checklist

  • My code follows the style guidelines of this project
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • I have made corresponding changes to the default configuration files
  • I have added tests that prove my fix is effective or that my feature works
  • I have added an entry in CHANGELOG.md

@zmoog zmoog self-assigned this Sep 21, 2024

zmoog commented Sep 21, 2024

I can probably add cache expiration to avoid a stale region list:

import threading
import time
from functools import wraps
from typing import Any, Callable


def cache_for(seconds: int) -> Callable:
    """
    Caches the result of a function for a specified number of seconds."""
    def decorator(func: Callable) -> Callable:
        lock = threading.Lock()
        cache = {}
        hits = misses = 0

        @wraps(func)
        def wrapper(*args: Any, **kwargs: Any) -> Any:
            nonlocal hits, misses
            with lock:
                key = str(args) + str(kwargs)
                current_time = time.time()
                
                if key in cache:
                    result, timestamp = cache[key]
                    if current_time - timestamp < seconds:
                        hits += 1
                        return result
                
                misses += 1

                result = func(*args, **kwargs)
                cache[key] = (result, current_time)

                return result

        def cache_stats() -> dict:
            """
            Returns the cache statistics.

            :return: A dictionary containing the cache statistics"""
            with lock:
                return {'hits': hits, 'misses': misses}

        wrapper.cache_stats = cache_stats

        return wrapper

    return decorator    

@cache_for(seconds=60)
def describe_regions(all_regions: bool = True) -> Any:
    """
    Fetches all regions from AWS and returns the response.

    :return: The response from the describe_regions method
    """
    return get_ec2_client().describe_regions(AllRegions=all_regions)

# Example usage with AWS regions
import boto3

def get_ec2_client():
    return boto3.client('ec2')

# Example usage to access cache statistics
if __name__ == "__main__":
    describe_regions(all_regions=False)
    print(describe_regions.cache_stats())  # {'hits': 0, 'misses': 1}

    describe_regions(all_regions=False)
    print(describe_regions.cache_stats())  # {'hits': 1, 'misses': 1}
    describe_regions(all_regions=False)
    print(describe_regions.cache_stats())  # {'hits': 2, 'misses': 1}
    describe_regions(all_regions=False)
    print(describe_regions.cache_stats())  # {'hits': 3, 'misses': 1}
    describe_regions(all_regions=False)
    print(describe_regions.cache_stats())  # {'hits': 4, 'misses': 1}
    describe_regions(all_regions=False)
    print(describe_regions.cache_stats())  # {'hits': 5, 'misses': 1}
    describe_regions(all_regions=False)
    print(describe_regions.cache_stats())  # {'hits': 6, 'misses': 1}
    describe_regions(all_regions=False)
    print(describe_regions.cache_stats())  # {'hits': 7, 'misses': 1}

I get the following output:

{'hits': 0, 'misses': 1}
{'hits': 1, 'misses': 1}
{'hits': 2, 'misses': 1}
{'hits': 3, 'misses': 1}
{'hits': 4, 'misses': 1}
{'hits': 5, 'misses': 1}
{'hits': 6, 'misses': 1}
{'hits': 7, 'misses': 1}


zmoog commented Sep 21, 2024

Or we can use a third-party library like https://cachetools.readthedocs.io/en/latest/

from typing import Any

from cachetools.func import ttl_cache


@ttl_cache(ttl=1800) # 30 minutes
def describe_regions(all_regions: bool = True) -> Any:
    """
    Fetches all regions from AWS and returns the response.

    :return: The response from the describe_regions method
    """
    print("Fetching regions from AWS...")
    return get_ec2_client().describe_regions(AllRegions=all_regions)

# Example usage with AWS regions
import boto3

def get_ec2_client():
    return boto3.client('ec2')

# Example usage to access cache statistics
if __name__ == "__main__":
    
    # print(dir(describe_regions))

    describe_regions(all_regions=False)
    print(describe_regions.cache_info())  # CacheInfo(hits=0, misses=1, maxsize=128, currsize=1)

    describe_regions(all_regions=False)
    print(describe_regions.cache_info())  # CacheInfo(hits=1, misses=1, maxsize=128, currsize=1)
    describe_regions(all_regions=False)
    print(describe_regions.cache_info())  # CacheInfo(hits=2, misses=1, maxsize=128, currsize=1)
    describe_regions(all_regions=False)
    print(describe_regions.cache_info())  # CacheInfo(hits=3, misses=1, maxsize=128, currsize=1)
    describe_regions(all_regions=False)
    print(describe_regions.cache_info())  # CacheInfo(hits=4, misses=1, maxsize=128, currsize=1)
    describe_regions(all_regions=False)
    print(describe_regions.cache_info())  # CacheInfo(hits=5, misses=1, maxsize=128, currsize=1)
    describe_regions(all_regions=False)
    print(describe_regions.cache_info())  # CacheInfo(hits=6, misses=1, maxsize=128, currsize=1)
    describe_regions(all_regions=False)
    print(describe_regions.cache_info())  # CacheInfo(hits=7, misses=1, maxsize=128, currsize=1)

Output:

Fetching regions from AWS...
CacheInfo(hits=0, misses=1, maxsize=128, currsize=1)
CacheInfo(hits=1, misses=1, maxsize=128, currsize=1)
CacheInfo(hits=2, misses=1, maxsize=128, currsize=1)
CacheInfo(hits=3, misses=1, maxsize=128, currsize=1)
CacheInfo(hits=4, misses=1, maxsize=128, currsize=1)
CacheInfo(hits=5, misses=1, maxsize=128, currsize=1)
CacheInfo(hits=6, misses=1, maxsize=128, currsize=1)
CacheInfo(hits=7, misses=1, maxsize=128, currsize=1)


constanca-m commented Sep 23, 2024

I wanted to see the details of what is happening exactly in our code and why this is an issue. I wrote this while I was testing it:

The problem comes from this line:

input_id, event_input = get_input_from_log_group_subscription_data(
    config,
    cloudwatch_logs_event["owner"],
    cloudwatch_logs_event["logGroup"],
    cloudwatch_logs_event["logStream"],
)

We need input_id to know where to send the data (the destination is inside event_input). The inputs are defined in the configuration provided by the user, which is stored in the ESF bucket and loaded into ESF at startup when config.yaml is parsed. Each input_id is unique and must map to an Input specified in the configuration file.

How do we obtain the input_id for each specific trigger?

We have 4 possible triggers:

  1. cloudwatch-logs -> input_id not available in the event that triggers ESF
  2. kinesis-data-stream -> input_id available in lambda_event (this is the event that triggers ESF)
  3. s3-sqs -> input_id available in lambda_event
  4. sqs -> input_id available in lambda_event

I wanted to know what is inside lambda_event when it comes from cloudwatch logs, so I sent a message to a log stream to trigger ESF. This is the lambda_event that my ESF received:

{
   "awslogs":{
      "data":"H4sIAAAAAAAA/42QPWvDMBRF/0p4swX6lqzNUDdTJ2croTjJqyuwJaOntJSQ/17c0L3LHS6cc+HeYEGiccLD94oQ4Kk7dG8v/TB0+x4ayF8JCwSw0klvleFCaWhgztO+5OsKAc45UR3TeWQXpHOJJ2QFp5gTMaR3VpHqAxhqwXGBAJQXZHOeGD2aBuh62tC1xpye41yxEITXf6mPv+7+E1PdmBvECwRQ3gijlNbKGa9VK6y2rd3Sc6tla6zx3nLluOReWNsaL7n0HhqocUGq47JCEE467p2SngvR/B0FAcaU6weW3ZynHW7TcD/efwDlkGd8SgEAAA=="
   }
}

We decode the data field, which looks like this:

{
   "messageType":"DATA_MESSAGE",
   "owner":"627286350134",
   "logGroup":"constanca-describe-regions-esf-test",
   "logStream":"some-log-stream",
   "subscriptionFilters":[
      "constanca-describe-regions-esf-test"
   ],
   "logEvents":[
      {
         "id":"38515334437584391646961646806429565886037020816695820288",
         "timestamp":1727087328011,
         "message":"another log event"
      }
   ]
}
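
For reference, the decoding step can be sketched as follows. CloudWatch Logs delivers the data field as base64-encoded, gzip-compressed JSON. The function names are hypothetical and this is not ESF's own implementation. The encode helper exists only so the example is self-contained.

```python
import base64
import gzip
import json
from typing import Any, Dict


def decode_awslogs_data(lambda_event: Dict[str, Any]) -> Dict[str, Any]:
    """Decode the base64-encoded, gzip-compressed awslogs.data payload."""
    compressed = base64.b64decode(lambda_event["awslogs"]["data"])
    return json.loads(gzip.decompress(compressed))


def encode_awslogs_data(payload: Dict[str, Any]) -> Dict[str, Any]:
    """Hypothetical test helper: build an event shaped like a cloudwatch logs trigger."""
    raw = gzip.compress(json.dumps(payload).encode("utf-8"))
    return {"awslogs": {"data": base64.b64encode(raw).decode("ascii")}}


if __name__ == "__main__":
    sample = {
        "messageType": "DATA_MESSAGE",
        "owner": "627286350134",
        "logGroup": "constanca-describe-regions-esf-test",
        "logStream": "some-log-stream",
    }
    decoded = decode_awslogs_data(encode_awslogs_data(sample))
    print(decoded["owner"], decoded["logGroup"], decoded["logStream"])
```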

In our config.yaml file we need to provide the input_id for ESF as the cloudwatch ARN, in this format: arn:aws:logs:{region}:{account-id}:log-group:{log_group_name}:* or as arn:aws:logs:{region}:{account-id}:log-group:{log_group_name}:log-stream:{log-stream-name} (see official documentation). So what do we have in this data field that we can use?

  • region - No
  • account-id - Yes, field owner
  • log_group_name - Yes, field logGroup
  • log-stream-name - Yes, field logStream

So we are only missing region to obtain the input_id so we can then get the output to send the data to.
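
To make the gap concrete, here is a small sketch that builds both ARN forms from the decoded fields. The helper names are hypothetical, the arn:aws partition is hardcoded as in the formats above, and the region value is an assumption because the event does not carry it.

```python
def log_stream_arn(region: str, account_id: str, log_group: str, log_stream: str) -> str:
    """ARN form that pins a single log stream."""
    return f"arn:aws:logs:{region}:{account_id}:log-group:{log_group}:log-stream:{log_stream}"


def log_group_arn(region: str, account_id: str, log_group: str) -> str:
    """ARN form that matches every stream in the log group."""
    return f"arn:aws:logs:{region}:{account_id}:log-group:{log_group}:*"


if __name__ == "__main__":
    decoded = {
        "owner": "627286350134",
        "logGroup": "constanca-describe-regions-esf-test",
        "logStream": "some-log-stream",
    }
    region = "eu-west-1"  # assumption: not present in the decoded event
    print(log_stream_arn(region, decoded["owner"], decoded["logGroup"], decoded["logStream"]))
    print(log_group_arn(region, decoded["owner"], decoded["logGroup"]))
```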

How do we obtain the input_id for cloudwatch trigger then?

Currently, we call the EC2:DescribeRegions API every time an event from a cloudwatch logs group triggers ESF.

Here is a sample of the result of this call in my test.
{
   "Regions":[
      {
         "Endpoint":"ec2.ap-south-2.amazonaws.com",
         "RegionName":"ap-south-2",
         "OptInStatus":"not-opted-in"
      },
      {
         "Endpoint":"ec2.ap-south-1.amazonaws.com",
         "RegionName":"ap-south-1",
         "OptInStatus":"opt-in-not-required"
      },
      {
         "Endpoint":"ec2.eu-south-1.amazonaws.com",
         "RegionName":"eu-south-1",
         "OptInStatus":"not-opted-in"
      },
      {
         "Endpoint":"ec2.eu-south-2.amazonaws.com",
         "RegionName":"eu-south-2",
         "OptInStatus":"not-opted-in"
      },
      {
         "Endpoint":"ec2.me-central-1.amazonaws.com",
         "RegionName":"me-central-1",
         "OptInStatus":"not-opted-in"
      },
      {
         "Endpoint":"ec2.il-central-1.amazonaws.com",
         "RegionName":"il-central-1",
         "OptInStatus":"not-opted-in"
      },
      {
         "Endpoint":"ec2.ca-central-1.amazonaws.com",
         "RegionName":"ca-central-1",
         "OptInStatus":"opt-in-not-required"
      },
      {
         "Endpoint":"ec2.eu-central-1.amazonaws.com",
         "RegionName":"eu-central-1",
         "OptInStatus":"opt-in-not-required"
      },
      {
         "Endpoint":"ec2.eu-central-2.amazonaws.com",
         "RegionName":"eu-central-2",
         "OptInStatus":"not-opted-in"
      },
      {
         "Endpoint":"ec2.us-west-1.amazonaws.com",
         "RegionName":"us-west-1",
         "OptInStatus":"opt-in-not-required"
      },
      {
         "Endpoint":"ec2.us-west-2.amazonaws.com",
         "RegionName":"us-west-2",
         "OptInStatus":"opt-in-not-required"
      },
      {
         "Endpoint":"ec2.af-south-1.amazonaws.com",
         "RegionName":"af-south-1",
         "OptInStatus":"not-opted-in"
      },
      {
         "Endpoint":"ec2.eu-north-1.amazonaws.com",
         "RegionName":"eu-north-1",
         "OptInStatus":"opt-in-not-required"
      },
      {
         "Endpoint":"ec2.eu-west-3.amazonaws.com",
         "RegionName":"eu-west-3",
         "OptInStatus":"opt-in-not-required"
      },
      {
         "Endpoint":"ec2.eu-west-2.amazonaws.com",
         "RegionName":"eu-west-2",
         "OptInStatus":"opt-in-not-required"
      },
      {
         "Endpoint":"ec2.eu-west-1.amazonaws.com",
         "RegionName":"eu-west-1",
         "OptInStatus":"opt-in-not-required"
      },
      {
         "Endpoint":"ec2.ap-northeast-3.amazonaws.com",
         "RegionName":"ap-northeast-3",
         "OptInStatus":"opt-in-not-required"
      },
      {
         "Endpoint":"ec2.ap-northeast-2.amazonaws.com",
         "RegionName":"ap-northeast-2",
         "OptInStatus":"opt-in-not-required"
      },
      {
         "Endpoint":"ec2.me-south-1.amazonaws.com",
         "RegionName":"me-south-1",
         "OptInStatus":"not-opted-in"
      },
      {
         "Endpoint":"ec2.ap-northeast-1.amazonaws.com",
         "RegionName":"ap-northeast-1",
         "OptInStatus":"opt-in-not-required"
      },
      {
         "Endpoint":"ec2.sa-east-1.amazonaws.com",
         "RegionName":"sa-east-1",
         "OptInStatus":"opt-in-not-required"
      },
      {
         "Endpoint":"ec2.ap-east-1.amazonaws.com",
         "RegionName":"ap-east-1",
         "OptInStatus":"not-opted-in"
      },
      {
         "Endpoint":"ec2.ca-west-1.amazonaws.com",
         "RegionName":"ca-west-1",
         "OptInStatus":"not-opted-in"
      },
      {
         "Endpoint":"ec2.ap-southeast-1.amazonaws.com",
         "RegionName":"ap-southeast-1",
         "OptInStatus":"opt-in-not-required"
      },
      {
         "Endpoint":"ec2.ap-southeast-2.amazonaws.com",
         "RegionName":"ap-southeast-2",
         "OptInStatus":"opt-in-not-required"
      },
      {
         "Endpoint":"ec2.ap-southeast-3.amazonaws.com",
         "RegionName":"ap-southeast-3",
         "OptInStatus":"not-opted-in"
      },
      {
         "Endpoint":"ec2.ap-southeast-4.amazonaws.com",
         "RegionName":"ap-southeast-4",
         "OptInStatus":"not-opted-in"
      },
      {
         "Endpoint":"ec2.us-east-1.amazonaws.com",
         "RegionName":"us-east-1",
         "OptInStatus":"opt-in-not-required"
      },
      {
         "Endpoint":"ec2.ap-southeast-5.amazonaws.com",
         "RegionName":"ap-southeast-5",
         "OptInStatus":"not-opted-in"
      },
      {
         "Endpoint":"ec2.us-east-2.amazonaws.com",
         "RegionName":"us-east-2",
         "OptInStatus":"opt-in-not-required"
      }
   ],
   "ResponseMetadata":{
      "RequestId":"b726bdc7-7f34-4884-a1e7-1abf061593f8",
      "HTTPStatusCode":200,
      "HTTPHeaders":{
         "x-amzn-requestid":"b726bdc7-7f34-4884-a1e7-1abf061593f8",
         "cache-control":"no-cache, no-store",
         "strict-transport-security":"max-age=31536000; includeSubDomains",
         "vary":"accept-encoding",
         "content-type":"text/xml;charset=UTF-8",
         "content-length":"4846",
         "date":"Mon, 23 Sep 2024 10:45:35 GMT",
         "server":"AmazonEC2"
      },
      "RetryAttempts":0
   }
}

From this result, for every RegionName we do the following:

  • Create the ARN with the log stream specified.
    • Look for this ARN in the configuration. Is it there? If yes, return the output we want to send the data to. If not:
    • Create the ARN for the log group only (the form ending in :*).
      • Look for it in the configuration. Is it there? If yes, return the output we want to send the data to. If not, continue the cycle or error.
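
The per-region lookup can be sketched roughly as below, using the two ARN formats accepted in config.yaml (log stream first, then log group). This is an illustration and not ESF's actual function: find_input and the inputs_by_id dictionary are hypothetical stand-ins for the parsed configuration.

```python
from typing import Dict, Iterable, Optional, Tuple


def find_input(
    inputs_by_id: Dict[str, dict],
    regions: Iterable[str],
    account_id: str,
    log_group: str,
    log_stream: str,
) -> Tuple[Optional[str], Optional[dict]]:
    """Try each region until one of the two ARN forms matches a configured input."""
    for region in regions:
        base = f"arn:aws:logs:{region}:{account_id}:log-group:{log_group}"

        # First attempt: the ARN that names the specific log stream.
        stream_arn = f"{base}:log-stream:{log_stream}"
        if stream_arn in inputs_by_id:
            return stream_arn, inputs_by_id[stream_arn]

        # Second attempt: the ARN that covers the whole log group.
        group_arn = f"{base}:*"
        if group_arn in inputs_by_id:
            return group_arn, inputs_by_id[group_arn]

    # No region produced a configured input id.
    return None, None


if __name__ == "__main__":
    config = {"arn:aws:logs:eu-west-1:627286350134:log-group:my-group:*": {"type": "cloudwatch-logs"}}
    print(find_input(config, ["us-east-1", "eu-west-1"], "627286350134", "my-group", "my-stream"))
```

With a single known region the loop collapses to at most two dictionary lookups, which is why the region list matters only when the region cannot be determined otherwise.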

How to stop all these API calls?

  • Understand if a cloudwatch logs event can trigger ESF from a different region.
    • If it can:
      • Do the regions change? Then periodically make this API call to update the regions. This is what this PR does.
      • Do the regions not change? Then maybe we can just hardcode them, right @zmoog? I do not see advantages in calling the API. Or we could call it once at the start and store the result.
    • If it cannot: obtain the region of the lambda, which will be the same as that of the cloudwatch logs group.

From my understanding, the region needs to be the same. So is there any reason we would want to keep this API call, @zmoog?
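
One possible way to obtain the region of the lambda, sketched under the assumption that the code runs inside the AWS Lambda runtime, which sets the reserved AWS_REGION environment variable. The function name is hypothetical and this is not the change that was eventually made in ESF.

```python
import os
from typing import Mapping, Optional


def get_lambda_region(environ: Optional[Mapping[str, str]] = None) -> str:
    """Return the region the function runs in, read from AWS_REGION."""
    env = os.environ if environ is None else environ
    region = env.get("AWS_REGION")
    if not region:
        # Outside Lambda (for example in local tests) the variable may be missing.
        raise RuntimeError("AWS_REGION is not set; is this running inside AWS Lambda?")
    return region


if __name__ == "__main__":
    # Passing a mapping explicitly keeps the example runnable outside Lambda.
    print(get_lambda_region({"AWS_REGION": "eu-west-1"}))
```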


zmoog commented Sep 23, 2024

Thanks for the in-depth analysis.

I tested the cloudwatch lambda trigger on the AWS console and ESF. As of today, it seems cloudwatch lambda triggers only work with log groups in the same region as the lambda function. For example, if I deploy ESF on eu-west-1, I can only process log events from log groups on eu-west-1.

Given this limit, there is no reason to keep calling the EC2:DescribeRegions API on every event.

I plan to remove this API call from ESF.

Here's my two-step plan:

  1. Add the @lru_cache decorator from the standard library to reduce the number of API calls from one per event to just one at startup. This low-risk change would allow us to ship a patch release today.
  2. Go through the process of removing the API call (change the code to use the region from the function, remove the required permissions from the infrastructure, and test the whole package).

WDYT?

@zmoog zmoog marked this pull request as ready for review September 23, 2024 11:37
@constanca-m

I am fine with approving the PR as it is.

You need to change the version of ESF (I believe you need to update the changelog and version.py). After that, the release workflow will be triggered; if we push this change as it is, nothing will happen.


zmoog commented Sep 23, 2024

You need to change the version of ESF (I believe you need to update the changelog and version.py). After that, the release workflow will be triggered; if we push this change as it is, nothing will happen.

Thanks! On it.

@constanca-m constanca-m left a comment

Thanks @zmoog ! I would say we should keep #723 open. I have copied and pasted my comment above so we know where we stand now.

I doubt we will have issues like this again with the cache, but let's see!

@zmoog zmoog added the enhancement label Sep 23, 2024

zmoog commented Sep 23, 2024

I would say we should keep #723 open.

I agree!

I doubt we will have issues like this again with the cache, but let's see!

I'll work on removing the EC2:DescribeRegions API call later this week.

@zmoog zmoog merged commit 29c08f4 into main Sep 23, 2024
5 checks passed
@zmoog zmoog deleted the zmoog/cache-describe-regions branch September 23, 2024 15:43