awswrangler.athena.read_sql_query is leaking memory #1678

Closed
@davnn

Describe the bug

Hey, we ran into a memory leak in our AWS Lambda functions and, after a painful search, tracked it down to awswrangler.athena.read_sql_query.

How to Reproduce

  1. Create an S3 path (S3_PATH) to store the following Parquet file.

  2. Save a test Parquet file to the created S3 path:

import awswrangler
import numpy as np
import pandas as pd

S3_PATH = "s3://....."

df = pd.DataFrame({"measurements": np.random.randn(10000000)})
awswrangler.s3.to_parquet(df, path=f"{S3_PATH}/df.snappy.parquet")

  3. Create a Glue database GLUE_DATABASE and a table GLUE_TABLE with a measurements column of type double, with the table location pointing at S3_PATH.

  4. Create a Python 3.8 Lambda function with 1024 MB of memory, attach the AWSSDKPandas-Python38 layer, and insert the following Athena query script:

import awswrangler

GLUE_DATABASE = ...
GLUE_TABLE = ...

def lambda_handler(query, context):
    df = awswrangler.athena.read_sql_query(
        f"select * from {GLUE_TABLE}",
        database=GLUE_DATABASE
    )
    return df.shape

  5. Test/execute the Lambda function multiple times.

Memory Size: 1024 MB Max Memory Used: 433 MB Init Duration: 3375.03 ms
Memory Size: 1024 MB Max Memory Used: 518 MB
Memory Size: 1024 MB Max Memory Used: 605 MB
Memory Size: 1024 MB Max Memory Used: 701 MB
Memory Size: 1024 MB Max Memory Used: 786 MB
...
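
To put numbers on the log lines above, here is a small sketch that parses the "Max Memory Used" values and computes the growth per invocation. `REPORT_LINES` and `max_memory_used_mb` are illustrative names, not part of awswrangler or any Lambda tooling. For comparison, the test DataFrame holds 10,000,000 float64 values, about 76 MiB of raw data, so the growth per call is in the same ballpark as one copy of the query result. That is an observation, not a confirmed root cause.

```python
import re

# Log lines copied from the Lambda report output above.
REPORT_LINES = [
    "Memory Size: 1024 MB Max Memory Used: 433 MB Init Duration: 3375.03 ms",
    "Memory Size: 1024 MB Max Memory Used: 518 MB",
    "Memory Size: 1024 MB Max Memory Used: 605 MB",
    "Memory Size: 1024 MB Max Memory Used: 701 MB",
    "Memory Size: 1024 MB Max Memory Used: 786 MB",
]


def max_memory_used_mb(line):
    """Extract the 'Max Memory Used' value (in MB) from one report line."""
    match = re.search(r"Max Memory Used: (\d+) MB", line)
    if match is None:
        raise ValueError(f"no 'Max Memory Used' field in: {line!r}")
    return int(match.group(1))


used = [max_memory_used_mb(line) for line in REPORT_LINES]
# Difference between each invocation and the one before it.
deltas = [b - a for a, b in zip(used, used[1:])]
mean_growth = sum(deltas) / len(deltas)

# Raw size of the test column: 10,000,000 float64 values at 8 bytes each.
frame_mib = 10_000_000 * 8 / 2**20

print("per-invocation growth (MB):", deltas)
print("mean growth (MB):", mean_growth)
print("raw DataFrame size (MiB):", round(frame_mib, 1))
```

With the values reported above, the per-invocation growth is 85, 87, 96 and 85 MB (mean 88.25 MB), against roughly 76.3 MiB of raw column data.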

Expected behavior

We expected the memory consumption of the process to stay roughly constant across invocations instead of growing with each call.
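
One way to check this expectation outside Lambda is to call the handler repeatedly in a single process and record the peak resident set size after each call. The sketch below is a hypothetical harness (`track_peak_rss` and `fake_handler` are made-up names). It uses the standard-library `resource` module, so it assumes a Unix-like system; `ru_maxrss` is reported in kilobytes on Linux and in bytes on macOS. The forced `gc.collect()` after every call helps rule out garbage that simply has not been collected yet.

```python
import gc
import resource


def track_peak_rss(handler, invocations=5):
    """Call handler repeatedly and return the peak RSS seen after each call.

    ru_maxrss is a high-water mark for the process, so the values never
    decrease; a steady climb across calls suggests memory is being retained.
    """
    peaks = []
    for _ in range(invocations):
        handler({}, None)  # same (event, context) shape that Lambda passes in
        gc.collect()       # drop unreachable Python objects before measuring
        peaks.append(resource.getrusage(resource.RUSAGE_SELF).ru_maxrss)
    return peaks


def fake_handler(event, context):
    """Stand-in for lambda_handler: allocates a temporary buffer and drops it."""
    data = bytearray(10_000_000)
    return len(data)


if __name__ == "__main__":
    peaks = track_peak_rss(fake_handler)
    print("peak RSS after each call:", peaks)
    print("growth between calls:", [b - a for a, b in zip(peaks, peaks[1:])])
```

Swapping `fake_handler` for the real `lambda_handler` should reproduce the measurement locally. A handler that releases its memory would be expected to plateau after the first call or two, whereas the behavior described in this issue would show up as growth on every call.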

OS

AWS Lambda (x86_64 and arm64)

Python version

Tested using 3.8 and 3.9 (custom container)

AWS SDK for pandas version

2.16.1 and 3.0.0b2 tested
