Cache describe_regions using lru_cache from stdlib #803

zmoog · 2024-09-21T03:47:27Z

What does this PR do?

Caches the response of EC2:DescribeRegions API calls.

Why is it important?

On high-volume deployments, ESF can hit the EC2:DescribeRegions API requests limit, causing throttling errors like the following:

An error occurred (RequestLimitExceeded) when calling the DescribeRegions operation (reached max retries: 4): Request limit exceeded.

ESF needs the list of existing regions to parse incoming events from the cloudwatch-logs input. Since new AWS region additions do not happen frequently, picking up and caching the list of existing regions at function startup seems adequate.

The list of existing AWS regions is available at https://aws.amazon.com/about-aws/global-infrastructure/regions_az/
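A minimal sketch of the caching idea, using `functools.lru_cache` from the standard library. The `fetch_regions` function is a hypothetical stand-in for the real `get_ec2_client().describe_regions(...)` call, introduced only so the example runs without AWS credentials; the actual change in this PR may differ in detail.

```python
from functools import lru_cache
from typing import Any

# Counts how many times the (simulated) API is actually called.
api_calls = 0


def fetch_regions(all_regions: bool) -> dict[str, Any]:
    """Hypothetical stand-in for get_ec2_client().describe_regions(AllRegions=...)."""
    global api_calls
    api_calls += 1
    return {"Regions": [{"RegionName": "eu-west-1"}, {"RegionName": "us-east-1"}]}


@lru_cache(maxsize=None)
def describe_regions(all_regions: bool = True) -> dict[str, Any]:
    """Returns the region list, hitting the API once per distinct argument value."""
    return fetch_regions(all_regions)


# Two identical calls: the second one is served from the cache.
describe_regions(all_regions=False)
describe_regions(all_regions=False)
print(api_calls)
print(describe_regions.cache_info())
```

With this approach the cached value never expires on its own; it lives as long as the process (for a Lambda function, as long as the execution environment is reused). Callers also share the same returned dictionary, so they should treat it as read-only.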

Checklist

My code follows the style guidelines of this project
I have commented my code, particularly in hard-to-understand areas
I have made corresponding changes to the documentation
I have made corresponding changes to the default configuration files
I have added tests that prove my fix is effective or that my feature works
I have added an entry in CHANGELOG.md

zmoog · 2024-09-21T04:47:25Z

I can probably add cache expiration to avoid a stale region list:

import threading
import time
from functools import wraps
from typing import Any, Callable


def cache_for(seconds: int) -> Callable:
    """
    Caches the result of a function for a specified number of seconds."""
    def decorator(func: Callable) -> Callable:
        lock = threading.Lock()
        cache = {}
        hits = misses = 0

        @wraps(func)
        def wrapper(*args: Any, **kwargs: Any) -> Any:
            nonlocal hits, misses
            with lock:
                key = str(args) + str(kwargs)
                current_time = time.time()
                
                if key in cache:
                    result, timestamp = cache[key]
                    if current_time - timestamp < seconds:
                        hits += 1
                        return result
                
                misses += 1

                result = func(*args, **kwargs)
                cache[key] = (result, current_time)

                return result

        def cache_stats() -> dict:
            """
            Returns the cache statistics.

            :return: A dictionary containing the cache statistics"""
            with lock:
                return {'hits': hits, 'misses': misses}

        wrapper.cache_stats = cache_stats

        return wrapper

    return decorator    

@cache_for(seconds=60)
def describe_regions(all_regions: bool = True) -> Any:
    """
    Fetches all regions from AWS and returns the response.

    :return: The response from the describe_regions method
    """
    return get_ec2_client().describe_regions(AllRegions=all_regions)

# Example usage with AWS regions
import boto3

def get_ec2_client():
    return boto3.client('ec2')

# Example usage to access cache statistics
if __name__ == "__main__":
    describe_regions(all_regions=False)
    print(describe_regions.cache_stats())  # {'hits': 0, 'misses': 1}

    describe_regions(all_regions=False)
    print(describe_regions.cache_stats())  # {'hits': 1, 'misses': 1}
    describe_regions(all_regions=False)
    print(describe_regions.cache_stats())  # {'hits': 2, 'misses': 1}
    describe_regions(all_regions=False)
    print(describe_regions.cache_stats())  # {'hits': 3, 'misses': 1}
    describe_regions(all_regions=False)
    print(describe_regions.cache_stats())  # {'hits': 4, 'misses': 1}
    describe_regions(all_regions=False)
    print(describe_regions.cache_stats())  # {'hits': 5, 'misses': 1}
    describe_regions(all_regions=False)
    print(describe_regions.cache_stats())  # {'hits': 6, 'misses': 1}
    describe_regions(all_regions=False)
    print(describe_regions.cache_stats())  # {'hits': 7, 'misses': 1}

I get the following output:

{'hits': 0, 'misses': 1}
{'hits': 1, 'misses': 1}
{'hits': 2, 'misses': 1}
{'hits': 3, 'misses': 1}
{'hits': 4, 'misses': 1}
{'hits': 5, 'misses': 1}
{'hits': 6, 'misses': 1}
{'hits': 7, 'misses': 1}

zmoog · 2024-09-21T05:41:24Z

Or, we can use a 3rd party library like https://cachetools.readthedocs.io/en/latest/

from typing import Any

from cachetools.func import ttl_cache


@ttl_cache(ttl=1800) # 30 minutes
def describe_regions(all_regions: bool = True) -> Any:
    """
    Fetches all regions from AWS and returns the response.

    :return: The response from the describe_regions method
    """
    print("Fetching regions from AWS...")
    return get_ec2_client().describe_regions(AllRegions=all_regions)

# Example usage with AWS regions
import boto3

def get_ec2_client():
    return boto3.client('ec2')

# Example usage to access cache statistics
if __name__ == "__main__":
    
    # print(dir(describe_regions))

    describe_regions(all_regions=False)
    print(describe_regions.cache_info())  # CacheInfo(hits=0, misses=1, maxsize=128, currsize=1)

    describe_regions(all_regions=False)
    print(describe_regions.cache_info())  # CacheInfo(hits=1, misses=1, maxsize=128, currsize=1)
    describe_regions(all_regions=False)
    print(describe_regions.cache_info())  # CacheInfo(hits=2, misses=1, maxsize=128, currsize=1)
    describe_regions(all_regions=False)
    print(describe_regions.cache_info())  # CacheInfo(hits=3, misses=1, maxsize=128, currsize=1)
    describe_regions(all_regions=False)
    print(describe_regions.cache_info())  # CacheInfo(hits=4, misses=1, maxsize=128, currsize=1)
    describe_regions(all_regions=False)
    print(describe_regions.cache_info())  # CacheInfo(hits=5, misses=1, maxsize=128, currsize=1)
    describe_regions(all_regions=False)
    print(describe_regions.cache_info())  # CacheInfo(hits=6, misses=1, maxsize=128, currsize=1)
    describe_regions(all_regions=False)
    print(describe_regions.cache_info())  # CacheInfo(hits=7, misses=1, maxsize=128, currsize=1)

Output:

Fetching regions from AWS...
CacheInfo(hits=0, misses=1, maxsize=128, currsize=1)
CacheInfo(hits=1, misses=1, maxsize=128, currsize=1)
CacheInfo(hits=2, misses=1, maxsize=128, currsize=1)
CacheInfo(hits=3, misses=1, maxsize=128, currsize=1)
CacheInfo(hits=4, misses=1, maxsize=128, currsize=1)
CacheInfo(hits=5, misses=1, maxsize=128, currsize=1)
CacheInfo(hits=6, misses=1, maxsize=128, currsize=1)
CacheInfo(hits=7, misses=1, maxsize=128, currsize=1)

constanca-m · 2024-09-23T11:04:12Z

I wanted to see the details of what is happening exactly in our code and why this is an issue. I wrote this while I was testing it:

The problem comes from this line:

elastic-serverless-forwarder/handlers/aws/handler.py

Lines 147 to 152 in 8be4fc4

    
    input_id, event_input = get_input_from_log_group_subscription_data(
        config,
        cloudwatch_logs_event["owner"],
        cloudwatch_logs_event["logGroup"],
        cloudwatch_logs_event["logStream"],
    )

We need input_id so we know where to send the data (the destination is inside event_input). This information originally comes from the configuration provided by the user, stored in the ESF bucket, and is loaded into ESF at startup when config.yaml is parsed. Each input_id is unique and should map to an Input specified in the configuration file.

How do we obtain the `input_id` for each specific trigger?

We have 4 possible triggers:

cloudwatch-logs -> not available in the event that triggers ESF
kinesis-data-stream -> available in lambda_event (this is the event that triggers ESF)
s3-sqs -> available in lambda_event
sqs -> available in lambda_event

I wanted to know what is inside lambda_event when it comes from CloudWatch Logs. I sent a message to a log stream to trigger it. This is the lambda_event that my ESF received.

{
   "awslogs":{
      "data":"H4sIAAAAAAAA/42QPWvDMBRF/0p4swX6lqzNUDdTJ2croTjJqyuwJaOntJSQ/17c0L3LHS6cc+HeYEGiccLD94oQ4Kk7dG8v/TB0+x4ayF8JCwSw0klvleFCaWhgztO+5OsKAc45UR3TeWQXpHOJJ2QFp5gTMaR3VpHqAxhqwXGBAJQXZHOeGD2aBuh62tC1xpye41yxEITXf6mPv+7+E1PdmBvECwRQ3gijlNbKGa9VK6y2rd3Sc6tla6zx3nLluOReWNsaL7n0HhqocUGq47JCEE467p2SngvR/B0FAcaU6weW3ZynHW7TcD/efwDlkGd8SgEAAA=="
   }
}

We decode the data field, which looks like this:

{
   "messageType":"DATA_MESSAGE",
   "owner":"627286350134",
   "logGroup":"constanca-describe-regions-esf-test",
   "logStream":"some-log-stream",
   "subscriptionFilters":[
      "constanca-describe-regions-esf-test"
   ],
   "logEvents":[
      {
         "id":"38515334437584391646961646806429565886037020816695820288",
         "timestamp":1727087328011,
         "message":"another log event"
      }
   ]
}
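The decoding step can be sketched as follows. CloudWatch Logs delivers the `awslogs.data` field as base64-encoded, gzip-compressed JSON. The helper name `decode_awslogs_data` is hypothetical, and the sample event is built locally from a subset of the fields shown above so the example is self-contained.

```python
import base64
import gzip
import json
from typing import Any


def decode_awslogs_data(lambda_event: dict[str, Any]) -> dict[str, Any]:
    """Decodes the base64-encoded, gzip-compressed awslogs.data field."""
    compressed = base64.b64decode(lambda_event["awslogs"]["data"])
    return json.loads(gzip.decompress(compressed))


# Build a sample event by applying the inverse steps (JSON -> gzip -> base64).
payload = {
    "messageType": "DATA_MESSAGE",
    "owner": "627286350134",
    "logGroup": "constanca-describe-regions-esf-test",
    "logStream": "some-log-stream",
}
encoded = base64.b64encode(gzip.compress(json.dumps(payload).encode("utf-8")))
sample_event = {"awslogs": {"data": encoded.decode("ascii")}}

decoded = decode_awslogs_data(sample_event)
print(decoded["owner"], decoded["logGroup"], decoded["logStream"])
```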

In our config.yaml file we need to provide the input_id for ESF as the cloudwatch ARN, in this format: arn:aws:logs:{region}:{account-id}:log-group:{log_group_name}:* or as arn:aws:logs:{region}:{account-id}:log-group:{log_group_name}:log-stream:{log-stream-name} (see official documentation). So what do we have in this data field that we can use?

region - No
account-id - Yes, field owner
log_group_name - Yes, field logGroup
log-stream-name - Yes, field logStream

So we are only missing region to obtain the input_id so we can then get the output to send the data to.

How do we obtain the `input_id` for cloudwatch trigger then?

Currently, we call the EC2:DescribeRegions API every time an event from a CloudWatch Logs log group triggers ESF.

Here is a sample of the result of this call in my test.

{
   "Regions":[
      {
         "Endpoint":"ec2.ap-south-2.amazonaws.com",
         "RegionName":"ap-south-2",
         "OptInStatus":"not-opted-in"
      },
      {
         "Endpoint":"ec2.ap-south-1.amazonaws.com",
         "RegionName":"ap-south-1",
         "OptInStatus":"opt-in-not-required"
      },
      {
         "Endpoint":"ec2.eu-south-1.amazonaws.com",
         "RegionName":"eu-south-1",
         "OptInStatus":"not-opted-in"
      },
      {
         "Endpoint":"ec2.eu-south-2.amazonaws.com",
         "RegionName":"eu-south-2",
         "OptInStatus":"not-opted-in"
      },
      {
         "Endpoint":"ec2.me-central-1.amazonaws.com",
         "RegionName":"me-central-1",
         "OptInStatus":"not-opted-in"
      },
      {
         "Endpoint":"ec2.il-central-1.amazonaws.com",
         "RegionName":"il-central-1",
         "OptInStatus":"not-opted-in"
      },
      {
         "Endpoint":"ec2.ca-central-1.amazonaws.com",
         "RegionName":"ca-central-1",
         "OptInStatus":"opt-in-not-required"
      },
      {
         "Endpoint":"ec2.eu-central-1.amazonaws.com",
         "RegionName":"eu-central-1",
         "OptInStatus":"opt-in-not-required"
      },
      {
         "Endpoint":"ec2.eu-central-2.amazonaws.com",
         "RegionName":"eu-central-2",
         "OptInStatus":"not-opted-in"
      },
      {
         "Endpoint":"ec2.us-west-1.amazonaws.com",
         "RegionName":"us-west-1",
         "OptInStatus":"opt-in-not-required"
      },
      {
         "Endpoint":"ec2.us-west-2.amazonaws.com",
         "RegionName":"us-west-2",
         "OptInStatus":"opt-in-not-required"
      },
      {
         "Endpoint":"ec2.af-south-1.amazonaws.com",
         "RegionName":"af-south-1",
         "OptInStatus":"not-opted-in"
      },
      {
         "Endpoint":"ec2.eu-north-1.amazonaws.com",
         "RegionName":"eu-north-1",
         "OptInStatus":"opt-in-not-required"
      },
      {
         "Endpoint":"ec2.eu-west-3.amazonaws.com",
         "RegionName":"eu-west-3",
         "OptInStatus":"opt-in-not-required"
      },
      {
         "Endpoint":"ec2.eu-west-2.amazonaws.com",
         "RegionName":"eu-west-2",
         "OptInStatus":"opt-in-not-required"
      },
      {
         "Endpoint":"ec2.eu-west-1.amazonaws.com",
         "RegionName":"eu-west-1",
         "OptInStatus":"opt-in-not-required"
      },
      {
         "Endpoint":"ec2.ap-northeast-3.amazonaws.com",
         "RegionName":"ap-northeast-3",
         "OptInStatus":"opt-in-not-required"
      },
      {
         "Endpoint":"ec2.ap-northeast-2.amazonaws.com",
         "RegionName":"ap-northeast-2",
         "OptInStatus":"opt-in-not-required"
      },
      {
         "Endpoint":"ec2.me-south-1.amazonaws.com",
         "RegionName":"me-south-1",
         "OptInStatus":"not-opted-in"
      },
      {
         "Endpoint":"ec2.ap-northeast-1.amazonaws.com",
         "RegionName":"ap-northeast-1",
         "OptInStatus":"opt-in-not-required"
      },
      {
         "Endpoint":"ec2.sa-east-1.amazonaws.com",
         "RegionName":"sa-east-1",
         "OptInStatus":"opt-in-not-required"
      },
      {
         "Endpoint":"ec2.ap-east-1.amazonaws.com",
         "RegionName":"ap-east-1",
         "OptInStatus":"not-opted-in"
      },
      {
         "Endpoint":"ec2.ca-west-1.amazonaws.com",
         "RegionName":"ca-west-1",
         "OptInStatus":"not-opted-in"
      },
      {
         "Endpoint":"ec2.ap-southeast-1.amazonaws.com",
         "RegionName":"ap-southeast-1",
         "OptInStatus":"opt-in-not-required"
      },
      {
         "Endpoint":"ec2.ap-southeast-2.amazonaws.com",
         "RegionName":"ap-southeast-2",
         "OptInStatus":"opt-in-not-required"
      },
      {
         "Endpoint":"ec2.ap-southeast-3.amazonaws.com",
         "RegionName":"ap-southeast-3",
         "OptInStatus":"not-opted-in"
      },
      {
         "Endpoint":"ec2.ap-southeast-4.amazonaws.com",
         "RegionName":"ap-southeast-4",
         "OptInStatus":"not-opted-in"
      },
      {
         "Endpoint":"ec2.us-east-1.amazonaws.com",
         "RegionName":"us-east-1",
         "OptInStatus":"opt-in-not-required"
      },
      {
         "Endpoint":"ec2.ap-southeast-5.amazonaws.com",
         "RegionName":"ap-southeast-5",
         "OptInStatus":"not-opted-in"
      },
      {
         "Endpoint":"ec2.us-east-2.amazonaws.com",
         "RegionName":"us-east-2",
         "OptInStatus":"opt-in-not-required"
      }
   ],
   "ResponseMetadata":{
      "RequestId":"b726bdc7-7f34-4884-a1e7-1abf061593f8",
      "HTTPStatusCode":200,
      "HTTPHeaders":{
         "x-amzn-requestid":"b726bdc7-7f34-4884-a1e7-1abf061593f8",
         "cache-control":"no-cache, no-store",
         "strict-transport-security":"max-age=31536000; includeSubDomains",
         "vary":"accept-encoding",
         "content-type":"text/xml;charset=UTF-8",
         "content-length":"4846",
         "date":"Mon, 23 Sep 2024 10:45:35 GMT",
         "server":"AmazonEC2"
      },
      "RetryAttempts":0
   }
}

From this result, for every RegionName we do the following:

- Create the ARN with the log stream specified.
  - Look for this ARN in the configuration. Is it there? If yes, return the output we want to send the data to. If not:
- Create the ARN for the log group only (without the log stream).
  - Look for it in the configuration. Is it there? If yes, return the output we want to send the data to. If not, continue the cycle or error.
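The lookup cycle above can be sketched roughly like this. It is an illustration, not the ESF implementation: `CONFIG_INPUTS` and `find_input` are hypothetical names, the configuration is reduced to a plain dictionary keyed by input_id, and the second attempt is assumed to use the log-group ARN form (`...:log-group:{log_group_name}:*`) from the configuration formats described earlier.

```python
from typing import Optional

# Hypothetical stand-in for the parsed config.yaml: input_id -> input definition.
CONFIG_INPUTS = {
    "arn:aws:logs:eu-west-1:627286350134:log-group:constanca-describe-regions-esf-test:*": {
        "type": "cloudwatch-logs"
    },
}


def find_input(
    regions: list[str], account_id: str, log_group: str, log_stream: str
) -> tuple[Optional[str], Optional[dict]]:
    """Tries both ARN forms for every region until one matches the configuration."""
    for region in regions:
        log_group_arn = f"arn:aws:logs:{region}:{account_id}:log-group:{log_group}"
        candidates = (
            f"{log_group_arn}:log-stream:{log_stream}",  # ARN with the log stream
            f"{log_group_arn}:*",  # ARN for the whole log group
        )
        for input_id in candidates:
            event_input = CONFIG_INPUTS.get(input_id)
            if event_input is not None:
                return input_id, event_input
    # No region produced an ARN that is present in the configuration.
    return None, None


input_id, event_input = find_input(
    ["us-east-1", "eu-west-1"],
    "627286350134",
    "constanca-describe-regions-esf-test",
    "some-log-stream",
)
print(input_id)
```

The region list is the only input here that requires an API call; everything else comes from the decoded event and the configuration.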

How to stop all these API calls?

Understand if a cloudwatch logs event can trigger ESF from a different region.
- If it can:
  - Do the regions change? Then periodically make this API call to update the regions. This is what this PR does.
  - The regions do not change. Then maybe we can just hardcode it, right @zmoog? I do not see advantages in calling the API. Or we could just call once at the start and store the result.
- If it cannot: obtain the region of the lambda, which will be the same as that of the cloudwatch logs group.

From my understanding, the region needs to be the same. So is there any reason we would want to keep this API call, @zmoog?

zmoog · 2024-09-23T11:23:46Z

Thanks for the in-depth analysis.

I tested the cloudwatch lambda trigger on the AWS console and ESF. As of today, it seems cloudwatch lambda triggers only work with log groups in the same region as the lambda function. For example, if I deploy ESF on eu-west-1, I can only process log events from log groups on eu-west-1.

Given this limit, there is no reason to keep calling the EC2:DescribeRegions API on every event.

I plan to remove this API call from ESF.

Here's my two-step plan:

Add the @lru_cache decorator from the standard library to reduce the number of API calls from one per event to just one at startup. This low-risk change would allow us to ship a patch release today.
Go through the process of removing the API call (change the code to use the region from the function, remove the required permissions from the infrastructure, and test the whole package).
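As a rough sketch of the second step, the region can be taken from the function itself instead of from EC2:DescribeRegions. The Lambda runtime sets the `AWS_REGION` environment variable, and the invoked function ARN (available as `context.invoked_function_arn`) carries the region in its fourth field. The helper name `get_lambda_region` and the example ARN are hypothetical; this is not the code that was eventually merged.

```python
import os
from typing import Optional


def get_lambda_region(invoked_function_arn: Optional[str] = None) -> str:
    """Returns the region from the function ARN if given, else from AWS_REGION."""
    if invoked_function_arn:
        # ARN layout: arn:aws:lambda:{region}:{account-id}:function:{name}
        return invoked_function_arn.split(":")[3]
    return os.environ["AWS_REGION"]


# Hypothetical function ARN, as it would appear in context.invoked_function_arn.
region = get_lambda_region("arn:aws:lambda:eu-west-1:123456789012:function:esf-example")
print(region)  # eu-west-1

# With the region known, a single candidate input_id per ARN form is enough.
input_id = f"arn:aws:logs:{region}:627286350134:log-group:constanca-describe-regions-esf-test:*"
print(input_id)
```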

WDYT?

constanca-m · 2024-09-23T12:09:16Z

I am fine with approving the PR as it is.

You need to change the version of ESF (I believe you need to update the changelog and version.py). After that, the release workflow will be triggered; if we push this change as it is, nothing will happen.

zmoog · 2024-09-23T12:53:53Z

You need to change the version of ESF (I believe you need to update the changelog and version.py). After that, the release workflow will be triggered; if we push this change as it is, nothing will happen.

Thanks! On it.

constanca-m

Thanks @zmoog ! I would say we should keep #723 open. I have copied and pasted my comment above so we know where we stand now.

I doubt we will have issues like this again with the cache, but let's see!

zmoog · 2024-09-23T15:41:41Z

I would say we should keep #723 open.

I agree!

I doubt we will have issues like this again with the cache, but let's see!

I'll work on removing the EC2:DescribeRegions API call later this week.

Cache describe_regions using lru_cache from stdlib

361abbc

zmoog self-assigned this Sep 21, 2024

zmoog added 2 commits September 21, 2024 05:54

Fix linters complaints

77bdcde

Fix linters objections

f8e137d

zmoog added 3 commits September 23, 2024 12:09

Minor cleanup

451c8a7

Add logs for testing

443f816

Testing

87628e3

Add tests for describe_regions

e15e5bc

zmoog marked this pull request as ready for review September 23, 2024 11:37

Bump version and add change log entry

53aa50b

constanca-m mentioned this pull request Sep 23, 2024

Investigate number of API calls of ESF due to ec2:describe_regions usage #723

Closed

constanca-m approved these changes Sep 23, 2024


zmoog added the enhancement New feature or request label Sep 23, 2024

zmoog merged commit 29c08f4 into main Sep 23, 2024
5 checks passed

zmoog deleted the zmoog/cache-describe-regions branch September 23, 2024 15:43

kaiyan-sheng mentioned this pull request Sep 23, 2024

Removed '$' sign from region check in publish_lambda.sh #727

Merged

1 task
