Memory leak using python3.8 and boto3 client #3088

Closed
saturnt opened this issue Dec 7, 2021 · 10 comments
Labels: bug (This issue is a confirmed bug.), response-requested (Waiting on additional information or feedback.)

saturnt commented Dec 7, 2021

Describe the bug
We upgraded from Python 2.7 to Python 3.8 and are using boto3 1.18.31 and botocore 1.23.20.
uname -a output: Linux 3.10.0-1160.45.1.el7.x86_64 #1 SMP Wed Oct 13 17:20:51 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux.

Steps to reproduce
Below is the snippet we are using:

import boto3


def poll_queue(client, qurl):
    # Single non-blocking receive; the response body itself is discarded.
    messages = client.receive_message(QueueUrl=qurl, MaxNumberOfMessages=1, WaitTimeSeconds=0)
    if 'ResponseMetadata' in messages:
        response_meta = messages['ResponseMetadata']
    else:
        return False


def main():
    sqs = boto3.resource('sqs', region_name='us-west-2')
    client = boto3.client('sqs', region_name='us-west-2')
    onboard_tmp = sqs.get_queue_by_name(QueueName='test1_q.fifo').url
    orch_tmp = sqs.get_queue_by_name(QueueName='test2_q.fifo').url
    inst_tmp = sqs.get_queue_by_name(QueueName='test3_q.fifo').url

    # Poll the three queues in a tight loop.
    while True:
        poll_queue(client, onboard_tmp)
        poll_queue(client, orch_tmp)
        poll_queue(client, inst_tmp)


main()

On examining the process memory usage, it continuously leaks. We are using the following command to check (note that the leak is slow, and ultimately the Linux kernel OOM-kills the process):

while true; do ps -o pid,user,%mem,command ax | sort -b -k3 -r | grep <pid> ;date ; sleep 5; done
>>>  51384 root      5.1 python3 sqs_leak_test.py

The current memory usage shown above is 5.1%. It started at 0.1% and reached 5.1% over the course of a few hours; left alone it would eventually climb to 100%.

Expected behavior
Confirmed with Python 2.7 on the same host that the issue does not happen; receive_message appears to be the culprit here.

Debug logs
Let me know if more info is needed.

saturnt added the needs-triage (This issue or PR still needs to be triaged.) label Dec 7, 2021

saturnt (Author) commented Dec 7, 2021

This is the stack trace from tracemalloc:

41292 memory blocks: 3610.9 KiB
  File "sqs_leak_test_tracemalloc.py", line 41
    main()
  File "sqs_leak_test_tracemalloc.py", line 23
    sqs = boto3.resource('sqs', region_name='us-west-2')
  File "/home/ec2-user/.local/lib/python3.8/site-packages/boto3/__init__.py", line 102
    return _get_default_session().resource(*args, **kwargs)
  File "/home/ec2-user/.local/lib/python3.8/site-packages/boto3/session.py", line 396
    client = self.client(
  File "/home/ec2-user/.local/lib/python3.8/site-packages/boto3/session.py", line 270
    return self._session.create_client(
  File "/home/ec2-user/.local/lib/python3.8/site-packages/botocore/session.py", line 834
    endpoint_resolver = self._get_internal_component('endpoint_resolver')
  File "/home/ec2-user/.local/lib/python3.8/site-packages/botocore/session.py", line 707
    return self._internal_components.get_component(name)
  File "/home/ec2-user/.local/lib/python3.8/site-packages/botocore/session.py", line 944
    self._components[name] = factory()
  File "/home/ec2-user/.local/lib/python3.8/site-packages/botocore/session.py", line 161
    endpoints = loader.load_data('endpoints')
  File "/home/ec2-user/.local/lib/python3.8/site-packages/botocore/loaders.py", line 132
    data = func(self, *args, **kwargs)
  File "/home/ec2-user/.local/lib/python3.8/site-packages/botocore/loaders.py", line 421
    found = self.file_loader.load_file(possible_path)
  File "/home/ec2-user/.local/lib/python3.8/site-packages/botocore/loaders.py", line 176
    return json.loads(payload, object_pairs_hook=OrderedDict)
  File "/opt/rh/rh-python38/root/usr/lib64/python3.8/json/__init__.py", line 370
    return cls(**kw).decode(s)
  File "/opt/rh/rh-python38/root/usr/lib64/python3.8/json/decoder.py", line 337
    obj, end = self.raw_decode(s, idx=_w(s, 0).end())
  File "/opt/rh/rh-python38/root/usr/lib64/python3.8/json/decoder.py", line 353
    obj, end = self.scan_once(s, idx)

kdaily self-assigned this Dec 8, 2021
kdaily added the bug (This issue is a confirmed bug.) and investigating (This issue is being investigated and/or work is in progress to resolve the issue.) labels and removed the needs-triage (This issue or PR still needs to be triaged.) label Dec 8, 2021

kdaily (Member) commented Dec 8, 2021

Hi @saturnt,

Thanks for your post, I'm looking into this. I'm not sure why you would experience this after switching from Python 2 to 3.

Your tracemalloc output doesn't show anything strange, though - the place it highlights is where the endpoints.json file is loaded when the SQS resource is instantiated, and that file is ~500 KB. Since that part is not inside your while loop, I don't think it's the cause.

I couldn't reproduce this with Python 3.8 and your stated version of boto3. That version is fairly dated - have you tried a more current one?

saturnt (Author) commented Dec 8, 2021

Hi @kdaily,

I also updated to the latest version (1.20.20) and still see the same issue. Let me change receive_message to some other SQS call and see if it still happens. Also, how long did you let it run? You need to wait at least 30-60 minutes to see the issue.

kdaily (Member) commented Dec 8, 2021

Thanks for checking! I'll let it run and see. I'm using pmap to look at the process size, and it stays at around 250MB, which is what I expect for the overhead of a Python interpreter, the module dependencies that boto3/botocore load, and loading the API model and endpoint JSON sources.
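
As an aside, since the host is Linux (per the uname output above), the resident set size can also be logged from inside the test script itself instead of via ps/pmap. A minimal sketch - rss_kib is just an illustrative helper name, not part of boto3:

def rss_kib():
    # Parse the VmRSS line from /proc/self/status (Linux-only); value is in kB.
    with open('/proc/self/status') as f:
        for line in f:
            if line.startswith('VmRSS:'):
                return int(line.split()[1])
    return -1

# Usage inside the polling loop, e.g. printed once every few minutes:
#     print('VmRSS (KiB):', rss_kib())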

kdaily added the response-requested (Waiting on additional information or feedback.) label and removed the investigating (This issue is being investigated and/or work is in progress to resolve the issue.) label Dec 8, 2021

saturnt (Author) commented Dec 8, 2021

OK, so I replaced client.receive_message with client.list_queues() and still see the issue. Below is the script from my test.

import boto3
import tracemalloc
tracemalloc.start(25)
import time


def poll_queue(client, qurl, take_trace):
    if take_trace:
        # Snapshot is taken before the API call; the largest entry is printed after it.
        print("Taking Trace")
        snapshot = tracemalloc.take_snapshot()
        top_stats = snapshot.statistics('traceback')

    messages = client.list_queues()  # receive_message(QueueUrl=qurl, MaxNumberOfMessages=1, WaitTimeSeconds=0)
    if take_trace:
        print("Printing Trace")
        stat = top_stats[0]
        print("%s memory blocks: %.1f KiB" % (stat.count, stat.size / 1024))
        for line in stat.traceback.format():
            print(line)
    return False


def main():
    sqs = boto3.resource('sqs', region_name='us-west-2')
    client = boto3.client('sqs', region_name='us-west-2')
    onboard_tmp = sqs.get_queue_by_name(QueueName='test1.fifo').url
    orch_tmp = sqs.get_queue_by_name(QueueName='test2.fifo').url
    inst_tmp = sqs.get_queue_by_name(QueueName='test3.fifo').url
    timestamp = int(time.time())
    min_5 = 60 * 5
    take_trace = False
    while True:
        # Dump the largest tracemalloc traceback roughly every 5 minutes.
        if int(time.time()) > (timestamp + min_5):
            take_trace = True
            timestamp = int(time.time())
        else:
            take_trace = False
        poll_queue(client, onboard_tmp, take_trace)
        poll_queue(client, orch_tmp, take_trace)
        poll_queue(client, inst_tmp, take_trace)


main()

In my case it seems that the [ anon ] mapping is growing; below is the pmap diff.

--- pmap.2      2021-12-07 17:07:56.000000000 -0800
+++ pmap.1      2021-12-07 17:07:21.000000000 -0800
@@ -5,7 +5,7 @@
 0000000000402000      4K r---- python3.8
 0000000000403000      4K r---- python3.8
 0000000000404000      4K rw--- python3.8
-0000000002367000  90320K rw---   [ anon ]
+0000000002367000  90848K rw---   [ anon ]
 00007fde9b3ad000   5124K rw---   [ anon ]
 00007fde9b8ee000   2048K rw---   [ anon ]
 00007fde9bb2e000    512K rw---   [ anon ]
@@ -317,4 +317,4 @@
 00007ffd810e9000    132K rw---   [ stack ]
 00007ffd811ab000      8K r-x--   [ anon ]
 ffffffffff600000      4K r-x--   [ anon ]
- total           298652K
+ total           299180K

This happened after changing the client API call as well, so maybe we can't refute what tracemalloc says?
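
One way to separate a one-time load (such as the endpoints.json parse the traceback points at) from allocations that actually keep growing inside the loop is to diff two tracemalloc snapshots taken some iterations apart. A rough sketch, reusing the same region and an SQS client as above rather than the full test script:

import boto3
import tracemalloc

tracemalloc.start(25)
client = boto3.client('sqs', region_name='us-west-2')

# Warm up first so one-time allocations (endpoint data, connection setup) settle.
for _ in range(50):
    client.list_queues()

baseline = tracemalloc.take_snapshot()
for _ in range(1000):  # arbitrary iteration count, just for illustration
    client.list_queues()
current = tracemalloc.take_snapshot()

# Allocation sites whose retained memory changed most since the baseline;
# anything that keeps growing per iteration should surface at the top.
for diff in current.compare_to(baseline, 'traceback')[:5]:
    print(diff)
    for line in diff.traceback.format():
        print(line)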

saturnt changed the title from "Memory leak using python3.8 and boto3 client sqs receive_message" to "Memory leak using python3.8 and boto3 client" Dec 8, 2021

kdaily (Member) commented Dec 8, 2021

Hi @saturnt,

Thanks for the updates. Running overnight on an AL2 EC2 instance, I'm static at 256224K of process memory usage. I'll update my test case with your new example. I'm not too familiar with the output of tracemalloc, so bear with me while I review.

I'm wondering if you could tell me a few more things:

  1. It looks like you're running this on an EC2 instance - can you give me some more details on that? Looks like maybe Red Hat 7?
  2. If so, is there an AMI that I could use myself to reproduce?
  3. Are you running inside a virtual image/container on the EC2 instance as well?
  4. The output of pip freeze could be useful. We did note that the version gap in your original post was quite wide: boto3 1.18.31 is five minor versions away from botocore 1.23.20, and the botocore range pinned by that boto3 release is botocore>=1.21.31,<1.22.0.
  5. Similarly, some more info about how you're installing Python and/or boto3 would be helpful.

If possible I'd like to reproduce this myself instead of giving you a barrage of things to try! That said, here are the things I would try next (in isolation; a rough sketch combining them follows this list):

  • Remove the usage of the resource. You can get the SQS queue URL using the client (client.get_queue_url(QueueName='test1.fifo')['QueueUrl'], for example).
  • Try a completely unrelated service (maybe something simple like STS get_caller_identity).
  • Explicitly garbage-collect inside the while loop, maybe on the same schedule you're using now (every 5 minutes). I don't think that's a solution, but it could point more strongly to boto3 being the culprit.
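
To make that concrete, here's a rough sketch that combines those three checks in one script (queue name and region are placeholders taken from the examples above, and the gc.collect() call is purely diagnostic - this is a test harness, not a fix):

import gc

import boto3


def main():
    # Clients only - no boto3.resource involved.
    sqs = boto3.client('sqs', region_name='us-west-2')
    sts = boto3.client('sts', region_name='us-west-2')
    # Queue URL via the client instead of the resource.
    qurl = sqs.get_queue_url(QueueName='test1.fifo')['QueueUrl']

    while True:
        sqs.receive_message(QueueUrl=qurl, MaxNumberOfMessages=1, WaitTimeSeconds=0)
        # Unrelated service call, to see whether any growth is SQS-specific.
        sts.get_caller_identity()
        # Diagnostic only - should not be needed in normal use.
        gc.collect()


main()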

Thanks for working with me on this.

kdaily added the response-requested (Waiting on additional information or feedback.) label and removed the response-requested label Dec 8, 2021

saturnt (Author) commented Dec 8, 2021

Hi @kdaily, Thanks for looking into this.

Below is the info I was able to gather from our DevOps team.
We are using a CentOS 7 image; on top of it we have installed Python 3.8:

yum install rh-python38-python.x86_64 rh-python38-python-pip.noarch

Below are the rest of the packages installed on the system using pip (note: we have already upgraded boto3 and botocore to the latest):

boto3==1.18.31
cryptography==3.4.8
geopy==2.2.0
google-api-python-client==2.17.0
google-auth==2.0.1
google-cloud-bigquery==2.25.0
google-cloud-spanner==3.10.0
jsonpickle==2.0.0
jwcrypto==1.0
influxdb==5.3.1
moto==2.2.7
mysql-connector==2.2.9
netaddr==0.8.0
ns1-python==0.16.1
pycurl==7.44.1
pycryptodome==3.10.3
PyJWT==2.1.0
PyMySQL==1.0.2
pem==21.2.0
PyYAML==5.4.1
pyOpenSSL==20.0.1
redis-py-cluster==2.1.3
retry==0.9.2
xmltodict==0.12.0

We are running inside an EC2 instance. It's not a container.

In the last few hours I did some extra tests as well:

I tried the client.get_queue_url suggestion; the problem still persists.
I tried gc.collect(); the issue is still seen. I am thinking some library is allocating and then losing references (a quick way to check that is sketched below).
I tried an STS client and it still leaks, which to me might indicate something related to the boto3 client itself. So I'm thinking tracemalloc might be right? Let me know what you think.
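
A rough way to check that from inside the process is to count live Python objects by type with the standard gc module; if these counts stay flat while the RSS keeps growing, the growth is more likely happening at the native allocator level than in leaked Python objects. top_types here is just an illustrative helper name:

import gc
from collections import Counter


def top_types(n=10):
    # Count every object currently tracked by the garbage collector, by type name.
    counts = Counter(type(obj).__name__ for obj in gc.get_objects())
    return counts.most_common(n)

# Usage inside the polling loop, e.g. alongside the 5-minute tracemalloc dump:
#     print(top_types())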

Tejas

kdaily (Member) commented Dec 9, 2021

Thanks! Knowing that it isn't limited to the SQS client is useful information.

Are you using one of the CentOS AMIs listed here? That would help me replicate better.

https://wiki.centos.org/Cloud/AWS#Official_and_current_CentOS_Public_Images

kdaily added the response-requested (Waiting on additional information or feedback.) label and removed the response-requested label Dec 9, 2021

saturnt (Author) commented Dec 9, 2021

Let me ask our DevOps team and get back to you. I created a new instance with Ubuntu and it does not seem to leak with the same program. I'll try different variations now and report back.

tim-finnigan self-assigned this and unassigned kdaily Jul 7, 2022

tim-finnigan (Contributor) commented

Since we haven't heard back for a few months I'm going to close this. If you're still experiencing the issue please let us know. Thanks!
