Description
Describe the bug
Hey, we ran into a memory leak in our AWS Lambda functions and (after a painful search) tracked it down to awswrangler.athena.read_sql_query.
How to Reproduce
- Create an S3 path (S3_PATH) to store the following parquet file.
- Save a test parquet file to the created S3 path:
import awswrangler
import numpy as np
import pandas as pd
S3_PATH = "s3://....."
df = pd.DataFrame({"measurements": np.random.randn(10000000)})
awswrangler.s3.to_parquet(df, path=f"{S3_PATH}/df.snappy.parquet")
- Create a Glue database GLUE_DATABASE and a table GLUE_TABLE with a measurements column of type double that refers to S3_PATH.
- Create a Python 3.8 Lambda function with 1024 MB of memory, attach the AWSSDKPandas-Python38 layer, and insert the following Athena query script:
import awswrangler

GLUE_DATABASE = ...
GLUE_TABLE = ...

def lambda_handler(query, context):
    # Read the whole table into a DataFrame; only its shape is returned.
    df = awswrangler.athena.read_sql_query(
        f"select * from {GLUE_TABLE}",
        database=GLUE_DATABASE,
    )
    return df.shape
- Test/execute the Lambda function multiple times. Max Memory Used in the REPORT log lines grows with every invocation:
Memory Size: 1024 MB Max Memory Used: 433 MB Init Duration: 3375.03 ms
Memory Size: 1024 MB Max Memory Used: 518 MB
Memory Size: 1024 MB Max Memory Used: 605 MB
Memory Size: 1024 MB Max Memory Used: 701 MB
Memory Size: 1024 MB Max Memory Used: 786 MB
...
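To put a number on the growth, the reported figures above can be parsed and differenced. This is only a minimal sketch: the helper names `max_memory_used_mb` and `growth_per_invocation` are made up for illustration and are not part of awswrangler, and the input is just the five log excerpts listed above.

```python
import re

# REPORT excerpts copied from the invocations listed above.
REPORT_LINES = [
    "Memory Size: 1024 MB Max Memory Used: 433 MB Init Duration: 3375.03 ms",
    "Memory Size: 1024 MB Max Memory Used: 518 MB",
    "Memory Size: 1024 MB Max Memory Used: 605 MB",
    "Memory Size: 1024 MB Max Memory Used: 701 MB",
    "Memory Size: 1024 MB Max Memory Used: 786 MB",
]

MAX_USED = re.compile(r"Max Memory Used: (\d+) MB")


def max_memory_used_mb(lines):
    """Extract the 'Max Memory Used' value (in MB) from each matching line."""
    matches = (MAX_USED.search(line) for line in lines)
    return [int(m.group(1)) for m in matches if m is not None]


def growth_per_invocation(values):
    """Difference between each invocation and the one before it."""
    return [later - earlier for earlier, later in zip(values, values[1:])]


used = max_memory_used_mb(REPORT_LINES)
deltas = growth_per_invocation(used)
mean_growth = sum(deltas) / len(deltas)

# Raw size of the test column: 10,000,000 float64 values at 8 bytes each.
column_mb = 10_000_000 * 8 / 1e6

print(used)         # [433, 518, 605, 701, 786]
print(deltas)       # [85, 87, 96, 85]
print(mean_growth)  # 88.25
print(column_mb)    # 80.0
```

Each invocation adds roughly 85 to 96 MB, which is close to the 80 MB of raw float64 data in the test column. That may or may not be a coincidence, but it would be consistent with about one copy of the result being retained per invocation.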
Expected behavior
We expected the memory consumption of the process to stay roughly constant across invocations instead of growing with each one.
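One way to check this expectation from inside the function is to log the process's peak memory around every call. The following is a rough diagnostic sketch using only the standard library; `track_memory` and `max_rss_mb` are hypothetical helper names, not part of awswrangler or the Lambda runtime, and the unit conversion assumes Linux, where `ru_maxrss` is reported in kilobytes.

```python
import functools
import gc
import resource


def max_rss_mb():
    """Peak resident set size of this process in MB.

    Assumes Linux, where ru_maxrss is in kilobytes (macOS reports bytes).
    This is a high-water mark, so it never decreases.
    """
    return resource.getrusage(resource.RUSAGE_SELF).ru_maxrss / 1024


def track_memory(handler):
    """Wrap a handler and print peak memory before and after each call."""

    @functools.wraps(handler)
    def wrapper(event, context):
        before = max_rss_mb()
        try:
            return handler(event, context)
        finally:
            # Force a full collection so that garbage which is merely
            # uncollected can be told apart from memory that stays referenced.
            unreachable = gc.collect()
            print(
                f"max_rss_before={before:.0f}MB "
                f"max_rss_after={max_rss_mb():.0f}MB "
                f"gc_unreachable={unreachable}"
            )

    return wrapper


@track_memory
def example_handler(event, context):
    # Stand-in for the real handler; allocates and drops a list.
    data = [0.0] * 1_000_000
    return (len(data), 1)


result = example_handler({}, None)
```

Applied to the reproduction script, this would mean decorating `lambda_handler` with `@track_memory`. If `max_rss_after` keeps climbing across warm invocations even though `gc.collect()` runs each time, the memory is probably still referenced somewhere or held outside the Python heap, rather than just waiting for the garbage collector.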
Your project
No response
Screenshots
No response
OS
AWS Lambda (x86_64 and arm64)
Python version
Tested using 3.8 and 3.9 (custom container)
AWS SDK for pandas version
2.16.1 and 3.0.0b2 tested
Additional context
No response