
awswrangler.athena.read_sql_query is leaking memory #1678

Closed
davnn opened this issue Oct 11, 2022 · 13 comments · Fixed by #1688
Labels
bug (Something isn't working), ready to release

@davnn

davnn commented Oct 11, 2022

Describe the bug

Hey, we stumbled upon a memory leak problem with our AWS Lambda functions and (after a painful search) tracked it down to awswrangler.athena.read_sql_query.

How to Reproduce

  1. Create an S3 path (S3_PATH) to store the following parquet file.

  2. Save a test parquet file to the created S3 path

import awswrangler
import numpy as np
import pandas as pd

S3_PATH = "s3://....."

df = pd.DataFrame({"measurements": np.random.randn(10000000)})
awswrangler.s3.to_parquet(df, path=f"{S3_PATH}/df.snappy.parquet")
  3. Create a Glue database GLUE_DATABASE and table GLUE_TABLE with a measurements column of type double referring to the S3_PATH.

  4. Create a Python 3.8 Lambda function with 1024 MB of memory, attach the AWSSDKPandas-Python38 layer, and insert the following Athena query script.

import awswrangler

GLUE_DATABASE = ...
GLUE_TABLE = ...

def lambda_handler(query, context):
    df = awswrangler.athena.read_sql_query(
        f"select * from {GLUE_TABLE}",
        database=GLUE_DATABASE
    )
    return df.shape
  5. Test/execute the Lambda function multiple times.

Memory Size: 1024 MB Max Memory Used: 433 MB Init Duration: 3375.03 ms
Memory Size: 1024 MB Max Memory Used: 518 MB
Memory Size: 1024 MB Max Memory Used: 605 MB
Memory Size: 1024 MB Max Memory Used: 701 MB
Memory Size: 1024 MB Max Memory Used: 786 MB
...
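
As an optional local check (a sketch, not part of the original Lambda setup), the same growth can be watched by looping the query and printing the process's peak RSS. GLUE_DATABASE and GLUE_TABLE are the placeholders from above, and resource.getrusage is Unix-only:

import gc
import resource

import awswrangler

GLUE_DATABASE = ...
GLUE_TABLE = ...

def peak_rss_mb() -> float:
    # On Linux, ru_maxrss is reported in kilobytes.
    return resource.getrusage(resource.RUSAGE_SELF).ru_maxrss / 1024

for i in range(5):
    df = awswrangler.athena.read_sql_query(
        f"select * from {GLUE_TABLE}",
        database=GLUE_DATABASE,
    )
    del df
    gc.collect()
    # With the leak, this keeps climbing even though df is dropped each time.
    print(f"iteration {i}: peak RSS ~{peak_rss_mb():.0f} MB")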

Expected behavior

We expected that the memory consumption of the process would not grow.

Your project

No response

Screenshots

No response

OS

AWS Lambda (x86_64 and arm64)

Python version

Tested using 3.8 and 3.9 (custom container)

AWS SDK for pandas version

2.16.1 and 3.0.0b2 tested

Additional context

No response

@davnn davnn added the bug (Something isn't working) label Oct 11, 2022
@a-slice-of-py
Contributor

a-slice-of-py commented Oct 12, 2022

Same here with a very similar setting.

I also tested some possible solutions, acting directly in the lambda_handler just before its end:

  • del df (in my case, no need to return it)
  • manual (and probably redundant) gc.collect()
  • explicit creation of a boto3_session = boto3.Session(...), feeding it to awswrangler methods and then del boto3_session after usage (just in case I was hitting something similar to what is reported in Excessive memory usage on multithreading boto/boto3#1670, where very similar symptoms were found in a multithreading setup)

but none of the above actually worked (a rough sketch of the combination is below).
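
Roughly, the combined attempt looked like this (just a sketch; the Session arguments are elided and the Glue names are placeholders):

import gc

import awswrangler
import boto3

GLUE_DATABASE = ...
GLUE_TABLE = ...

def lambda_handler(event, context):
    # explicit session so it can be dropped after use
    boto3_session = boto3.Session()
    df = awswrangler.athena.read_sql_query(
        f"select * from {GLUE_TABLE}",
        database=GLUE_DATABASE,
        boto3_session=boto3_session,
    )
    shape = df.shape
    # drop references and force a collection before returning
    del df
    del boto3_session
    gc.collect()
    return shape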

Switching to AWS Fargate after a code refactoring (sequential Lambda executions -> iterations within a for loop) didn't solve the issue either, suggesting it is service-independent (it just takes more iterations to exhaust memory thanks to Fargate's better specs).

However, I found what is highlighted in this AWS Compute Blog post interesting:

When you use third-party libraries across multiple invocations in the same execution environment, be sure to check their documentation for usage in a serverless compute environment. Some database connection and logging libraries may save intermediate invocation results and other data. This causes the memory usage of these libraries to grow with subsequent warm invocations. In cases where memory grows rapidly, you may find the Lambda function runs out of memory, even if your custom code is disposing of variables correctly.
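
As a toy illustration of that paragraph (unrelated to awswrangler itself), anything held at module level survives warm invocations and accumulates:

# Minimal sketch: module-level state persists across warm invocations,
# so any library (or handler code) that appends to it grows the footprint.
_RETAINED = []

def lambda_handler(event, context):
    _RETAINED.append(bytearray(10 * 1024 * 1024))  # ~10 MB kept per warm invocation
    return len(_RETAINED)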

Thanks @davnn for the initiative!

@AlexMetsai

Not sure if this will help, but what is your pandas version, is it 1.6? I am having a similar memory leakage issue with awswrangler.athena.read_sql_query and downgrading to version 1.5.0 seems to tackle it for now. It could be that with version 1.6.0 some bug got introduced, either from pandas or awswrangler's side.

@a-slice-of-py
Contributor

what is your pandas version, is it 1.6?

I experienced the above behaviour with awswrangler-2.17.0 and pandas-1.5.0. Moreover, I cannot see pandas-1.6.0 released yet, assuming pip as the package manager.

@AlexMetsai

AlexMetsai commented Oct 14, 2022

Oops, my mistake, the issue is with pandas-1.5.0 and I specifically downgraded to pandas-1.4.4.

@a-slice-of-py
Contributor

@AlexMetsai I tried the downgrade to pandas-1.4.4 as suggested and I can confirm that the memory leaks seem to be gone (my current setup involves AWS Fargate, but as stated above I think the same holds for AWS Lambda - maybe @davnn can confirm): thanks for the hint!

Do you have any insights about the (possibly responsible) changes in pandas-1.5.0? I mean, did you go straight for the downgrade or were you following any clues?

@AlexMetsai

@a-slice-of-py You're welcome. :)

Unfortunately I don't have any insights; we just fell back to a previous configuration that didn't show the issue (which had the previous pandas version). I would like to take a closer look at it, but so far I have been busy with deployments (especially trying to allocate a no-larger-than-really-needed container), so I was mostly hoping that somebody from pandas or AWS would fix this. 😆

@davnn
Author

davnn commented Oct 14, 2022

Can confirm that a downgrade to pandas-1.4.4 solves the memory leak on AWS Lambda.

@malachi-constant
Contributor

malachi-constant commented Oct 14, 2022

Confirmed this is an issue with pandas 1.5.0 as a dependency.

Replicated on Lambda
- Memory: 1024 MB
- Python Runtime: 3.9
- SDK for pandas Version: 2.17.0


1: Duration: 14720.04 ms	Billed Duration: 14721 ms	Memory Size: 1024 MB	Max Memory Used: 436 MB	
2: Duration: 14497.47 ms	Billed Duration: 14498 ms	Memory Size: 1024 MB	Max Memory Used: 521 MB
3: Duration: 14331.70 ms	Billed Duration: 14332 ms	Memory Size: 1024 MB	Max Memory Used: 615 MB
4: Duration: 14681.11 ms	Billed Duration: 14682 ms	Memory Size: 1024 MB	Max Memory Used: 697 MB
5: Duration: 14014.48 ms	Billed Duration: 14015 ms	Memory Size: 1024 MB	Max Memory Used: 775 MB

I'll go ahead and update our deps to pandas <= 1.4.4 for now as we monitor the issue from our side.

@aws/aws-sdk-pandas We may want to consider a micro release after this update lands in main:latest.

@malachi-constant malachi-constant linked a pull request Oct 14, 2022 that will close this issue
@phofl

phofl commented Oct 14, 2022

We will need more info to fix this on the pandas side. The issue you linked is a false positive; it does not describe a memory leak.

@malachi-constant
Contributor

Thanks @phofl, I'll scratch that from the above comment. I'll see if I can dig into where exactly in the method we are encountering the memory issues.

@malachi-constant
Contributor

Working to reproduce the error using a build from the latest pandas. Possibly related: Issue

@phofl

phofl commented Oct 18, 2022

Does your code show FutureWarnings? If yes, you could wait for 1.5.1, which should come out in the next few days.
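
(For reference, a sketch of one way to surface them against the reproduction code above: escalate FutureWarnings to errors around the call so they cannot be missed. GLUE_DATABASE and GLUE_TABLE are the placeholders from the original report.)

import warnings

import awswrangler

GLUE_DATABASE = ...
GLUE_TABLE = ...

with warnings.catch_warnings():
    # Turn FutureWarnings into exceptions so any deprecation path becomes visible.
    warnings.simplefilter("error", FutureWarning)
    df = awswrangler.athena.read_sql_query(
        f"select * from {GLUE_TABLE}",
        database=GLUE_DATABASE,
    )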

@malachi-constant
Contributor

I have confirmed this is no longer an issue with pandas==1.5.1 and awswrangler==2.16.1

Lambda Invocations:
Duration: 2876.48 ms	Billed Duration: 2877 ms	Memory Size: 1048 MB	Max Memory Used: 252 MB
Duration: 3674.37 ms	Billed Duration: 3675 ms	Memory Size: 1048 MB	Max Memory Used: 258 MB
Duration: 2739.12 ms	Billed Duration: 2740 ms	Memory Size: 1048 MB	Max Memory Used: 258 MB
Duration: 3224.09 ms	Billed Duration: 3225 ms	Memory Size: 1048 MB	Max Memory Used: 258 MB
Duration: 3002.13 ms	Billed Duration: 3003 ms	Memory Size: 1048 MB	Max Memory Used: 258 MB

Keep in mind the Python version will need to satisfy python = ">=3.8.1, <3.11" with this version of pandas.
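
(If bumping pandas is not immediately possible, a hypothetical guard like the sketch below makes the leaky release fail fast at import time instead of slowly exhausting Lambda memory.)

import pandas as pd

# Hypothetical guard: pandas 1.5.0 is the release that exhibited the leak in this thread.
if pd.__version__ == "1.5.0":
    raise RuntimeError(
        "pandas 1.5.0 leaks memory with awswrangler.athena.read_sql_query; "
        "pin pandas<=1.4.4 or upgrade to >=1.5.1"
    )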

@kukushking kukushking added this to the 2.18.0 milestone Dec 2, 2022
pierresouchay pushed a commit to pierresouchay/aws-sdk-pandas that referenced this issue Mar 21, 2023
Pandas 1.5.0 had a memory leak with awswrangler,
but according to aws#1678
it is no longer the case.

The current dependency expressed in pyproject.toml locks
dependents to use only 1.5.1. Since the bug has been fixed in pandas
we can now remove the lock.
kukushking pushed a commit that referenced this issue Mar 22, 2023
Pandas 1.5.0 had a memory leak with awswrangler,
but according to #1678
it is no longer the case.

The current dependency expressed in pyproject.toml locks
dependents to use only 1.5.1. Since the bug has been fixed in pandas
we can now remove the lock.

Co-authored-by: Pierre Souchay <pierre.souchay@axaclimate.com>
Co-authored-by: jaidisido <jaidisido@gmail.com>