
When I publish data to a Pub/Sub topic, I have a memory leak on a Kubernetes pod. #395

Closed
fodrh1201 opened this issue Apr 23, 2021 · 9 comments · Fixed by #406
Labels
  • api: pubsub: Issues related to the googleapis/python-pubsub API.
  • priority: p1: Important issue which blocks shipping the next release. Will be fixed prior to next release.
  • 🚨 This issue needs some love.
  • type: bug: Error or flaw in code with unintended results or allowing sub-optimal usage patterns.

Comments


fodrh1201 commented Apr 23, 2021

Each time a publish request is made, the memory usage of the Kubernetes pod increases. Removing the publishing part stops the memory growth.
The pod is continuously restarted due to OOM.

Environment details

  • OS type and version: Debian GNU/Linux 10
  • Python version: 3.8.7
  • pip version: 21.0.1
  • google-cloud-pubsub version: 2.4.1

Code example

import asyncio
import json

from google.cloud import pubsub_v1

# NOTE: abridged; the Callback base class and watch_websocket_data() are
# defined elsewhere in the application (a fuller example is in a later comment).

class PublishToPubsubCallback(Callback):
    def __init__(self, pod_name, project_id, topic_id, preprocess=[]):
        super(PublishToPubsubCallback, self).__init__(pod_name, preprocess=preprocess)
        self.project_id = project_id
        self.topic_id = topic_id
        self.publisher = pubsub_v1.PublisherClient()
        self.topic_path = self.publisher.topic_path(project_id, topic_id)

    def __call__(self, data):
        pub_data = json.dumps(data).encode('utf-8')
        self.publisher.publish(self.topic_path, pub_data)


callback = PublishToPubsubCallback(pod_name, project_id, topic_id)

async def run():
    while True:
        data = await watch_websocket_data()
        callback(data)

loop = asyncio.get_event_loop()
loop.run_until_complete(run())

(Screenshot: 2021-04-23 4:58 PM)

plamut (Contributor) commented Apr 26, 2021

@fodrh1201 Thanks for the report. Would it be possible to provide a minimal runnable self-contained code sample? We are not aware of any publisher memory leaks, thus we would first like to exclude any code outside of the client as a possible reason.

BTW, one thing that caught my eye is the mutable default argument to PublishToPubsubCallback's __init__() method, i.e. preprocess=[]. That empty list is instantiated only once at method definition time, and is shared among all instances of PublishToPubsubCallback. If the instances append anything to that list, it will continuously grow, consuming more and more memory.

In the code sample the callback is actually instantiated only once, but it might nevertheless be worth checking whether that is also the case in the actual code.
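
To illustrate the pitfall, here is a minimal sketch (the class name is made up, not from the code above):

class Collector:
    # Anti-pattern: this default list is created once, when the method is
    # defined, and shared by every instance that does not pass its own list.
    def __init__(self, preprocess=[]):
        self.preprocess = preprocess

a = Collector()
b = Collector()
a.preprocess.append(lambda d: d)
print(len(b.preprocess))  # prints 1, because a and b share the same list object

# The usual fix is to default to None and create a fresh list per instance:
#     def __init__(self, preprocess=None):
#         self.preprocess = [] if preprocess is None else preprocess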

fodrh1201 (Author) commented:

This is the Python code:

import json
from google.cloud import pubsub_v1
import asyncio
import logging
from datetime import datetime


class Callback:
    
    def __init__(self, pod_name, is_work_at_stop=False, preprocess=[]):
        self.is_work_at_stop = is_work_at_stop
        self.pod_name = pod_name
        self.preprocess = preprocess
    
    def __call__(self, data):
        for preprocess in self.preprocess:
            data = preprocess(data)
        return data


class PublishToPubsubCallback(Callback):
    
    def __init__(self, pod_name, project_id, topic_id, preprocess=[]):
        super(PublishToPubsubCallback, self).__init__(pod_name, preprocess=preprocess)
        self.project_id = project_id
        self.topic_id = topic_id
        self.publisher = pubsub_v1.PublisherClient()
        self.topic_path = self.publisher.topic_path(project_id, topic_id)
        
    # @profile
    def __call__(self, data):
        data = super().__call__(data)
        pub_data = json.dumps(data).encode('utf-8')
        self.publisher.publish(self.topic_path, pub_data,
                            origin=self.pod_name, data_type=data['data_type'],
                            symbol=data['symbol'], client_id=data['client_id'],
                            timestamp=str(data['timestamp']),
                            market_type=data['market_type'])

data = {
    'data_type': 'orderbook',
    'client_id': 'binance',
    'symbol': 'BTC/USDT',
    'timestamp': '1619494646034',
    'market_type': 'spot',
    'asks': [[53482.36, 2e-06], [53493.34, 0.025379], [53493.64, 0.01412]],
    'bids': [[53482.35, 2.882399], [53479.18, 0.000373], [53477.2, 0.406]]
}

callback = PublishToPubsubCallback('test', 'bx-datafeed', 'test')
async def run():
    while True:
        callback(data)
        await asyncio.sleep(0.1)

loop = asyncio.get_event_loop()
loop.run_until_complete(run())

This is the result of the above code:

(Screenshot: 2021-04-28 1:18 PM)

My Dockerfile is below:

FROM python:3.8
WORKDIR /app
ADD . .
RUN pip install argparse numpy google-cloud-pubsub==2.4.1 six==1.15.0

ENV PYTHONUNBUFFERED 1

plamut (Contributor) commented Apr 29, 2021

I ran a slightly modified version of the example locally and measured memory consumption. I put the main code into the main() function (which was profiled), and changed the loop to run for 30_000 iterations. This is the graph:

(Graph: publisher_mem_usage)

It appears that the memory is indeed slowly but surely leaking. Can't yet say why, will investigate further.

A possible workaround is to modify the code to run in a subprocess (for a bounded number of iterations) and to create a new subprocess when the previous one finishes. Not ideal, of course, but at least the memory would be released at regular intervals.
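
A rough sketch of that workaround (assuming the publishing can be bounded per run; the project and topic names below are placeholders):

import multiprocessing

from google.cloud import pubsub_v1


def publish_batch(iterations):
    # Runs in a child process; whatever memory it accumulates is released
    # back to the OS when the process exits.
    publisher = pubsub_v1.PublisherClient()
    topic_path = publisher.topic_path("my-project", "my-topic")  # placeholders
    for _ in range(iterations):
        publisher.publish(topic_path, b"payload")


if __name__ == "__main__":
    while True:
        worker = multiprocessing.Process(target=publish_batch, args=(10_000,))
        worker.start()
        worker.join()  # memory used by this batch goes away with the process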

plamut (Contributor) commented Apr 29, 2021

It turns out the issue is reproducible even with a straightforward example that repeatedly publishes a trivial message in a loop.

import time

from google.cloud import pubsub_v1


PROJECT_ID = "set-project-id"
TOPIC_ID = "set-topic-id"


def main():
    publisher = pubsub_v1.PublisherClient()

    topic_path = publisher.topic_path(PROJECT_ID, TOPIC_ID)
    print("Starting the main loop...")

    for i in range(15_000):
        if not i % 1000:
            print(f"{i: 5}: publishing a message")

        publisher.publish(topic_path, b"x")
        time.sleep(0.01)

    print("Main loop complete!")


if __name__ == "__main__":
    main()

Using tracemalloc and taking a snapshot every 1000 iterations showed that by far the most memory was allocated in the threading module, but the (Python) traceback was limited:

130.9921875 new KiB, 1991.53125 total KiB, 3724 new, 56645 total memory blocks: 
  File "/usr/local/lib/python3.8/threading.py", line 890
    self._bootstrap_inner()
  File "/usr/local/lib/python3.8/threading.py", line 918
    self._set_tstate_lock()
  File "/usr/local/lib/python3.8/threading.py", line 908
    self._tstate_lock = _set_sentinel()
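
For reference, the per-1000-iteration snapshots were taken roughly along these lines (a sketch, not the exact measurement script):

import tracemalloc

tracemalloc.start(25)  # keep up to 25 frames of traceback per allocation
previous = tracemalloc.take_snapshot()

# ...inside the publish loop, every 1000 iterations:
current = tracemalloc.take_snapshot()
for stat in current.compare_to(previous, "traceback")[:3]:
    print(
        f"{stat.size_diff / 1024} new KiB, {stat.size / 1024} total KiB, "
        f"{stat.count_diff} new, {stat.count} total memory blocks:"
    )
    for line in stat.traceback.format():
        print(line)
previous = current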

It appears that these allocations are initiated by something outside of Python, maybe gRPC?

@lidizheng Do you know if there are currently any confirmed or suspected leaks in gRPC? Do you have any free cycles to take a look in the near future? Is there anything else I can provide on top of the above minimal example?

Versions:

Python 3.8
Ubuntu 18.04.4 LTS

google-cloud-pubsub==2.4.1 
grpcio==1.36.1
google-api-core==1.26.0
google-auth==1.28.0

lidizheng commented Apr 29, 2021

@plamut I will take a look today or tomorrow.

plamut (Contributor) commented Apr 29, 2021

@lidizheng This is the memory_profiler run with the @profile decorator on the main() function:

(Graph: mem_usage)

The memory usage initially jumps to almost 70 MiB and then gradually increases to almost 79 MiB. I'm attaching the raw profiling data below (renamed the .dat suffix to .txt):

mprofile_data.txt
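
For anyone wanting to reproduce the measurement, memory_profiler can be used roughly like this (a sketch; the toy script below is a stand-in for the publish loop above, not the actual benchmark):

# pip install memory_profiler matplotlib
import time

from memory_profiler import profile


@profile  # prints a line-by-line memory report when the function returns
def main():
    data = []
    for _ in range(1_000):
        data.append(b"x" * 10_000)
        time.sleep(0.001)


if __name__ == "__main__":
    main()

# For a memory-over-time graph like the one attached above, run the script
# through mprof instead:
#     mprof run my_script.py
#     mprof plot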

Update:
Hmm... an interesting observation. If I comment out the actual RPC call from the generated client, the memory still gradually increases!

diff --git google/pubsub_v1/services/publisher/client.py google/pubsub_v1/services/publisher/client.py
index f74e85a..d560307 100644
--- google/pubsub_v1/services/publisher/client.py
+++ google/pubsub_v1/services/publisher/client.py
@@ -605,7 +605,8 @@ class PublisherClient(metaclass=PublisherClientMeta):
         )

         # Send the request.
-        response = rpc(request, retry=retry, timeout=timeout, metadata=metadata,)
+        # response = rpc(request, retry=retry, timeout=timeout, metadata=metadata,)
+        response = pubsub.PublishResponse(message_ids=["123"] * len(request.messages))

         # Done; return the response.
         return response

It might be something in the hand-written layer that leaks memory, perhaps some rogue references to all the publish threads created when publishing each batch of messages... quite interesting.

plamut (Contributor) commented Apr 30, 2021

Found a possible fix! If the client commit thread is marked as a daemon, the leak disappears. Can't say why, though.

The patch:

diff --git google/cloud/pubsub_v1/publisher/_batch/thread.py google/cloud/pubsub_v1/publisher/_batch/thread.py
index 36dd3b9..d0d648e 100644
--- google/cloud/pubsub_v1/publisher/_batch/thread.py
+++ google/cloud/pubsub_v1/publisher/_batch/thread.py
@@ -207,9 +207,10 @@ class Batch(base.Batch):
 
     def _start_commit_thread(self):
         """Start a new thread to actually handle the commit."""
-
+        # NOTE: If the thread is *not* a daemon, a memory leak exists for some reason...
+        # https://github.com/googleapis/python-pubsub/issues/395#issuecomment-829910303
         commit_thread = threading.Thread(
-            name="Thread-CommitBatchPublisher", target=self._commit
+            name="Thread-CommitBatchPublisher", target=self._commit, daemon=True
         )
         commit_thread.start()
 
diff --git google/cloud/pubsub_v1/publisher/client.py google/cloud/pubsub_v1/publisher/client.py
index 7e6801d..330fecb 100644
--- google/cloud/pubsub_v1/publisher/client.py
+++ google/cloud/pubsub_v1/publisher/client.py
@@ -375,8 +375,12 @@ class Client(object):
 
     def _start_commit_thread(self):
         """Start a new thread to actually wait and commit the sequencers."""
+        # NOTE: If the thread is *not* a daemon, a memory leak exists for some reason...
+        # https://github.com/googleapis/python-pubsub/issues/395#issuecomment-829910303
         self._commit_thread = threading.Thread(
-            name="Thread-PubSubBatchCommitter", target=self._wait_and_commit_sequencers
+            name="Thread-PubSubBatchCommitter",
+            target=self._wait_and_commit_sequencers,
+            daemon=True,
         )
         self._commit_thread.start()

Graph before: (mem_before_fix)

Graph after: (mem_after_fix)

plamut (Contributor) commented Apr 30, 2021

For additional context, I found this CPython issue and the tracemalloc traceback from the reproduction looks very familiar (see one of the previous comments):

New Bytes: 36000	Total Bytes 36000	New blocks: 1000	Total blocks: 1000: 
  File "/usr/local/lib/python3.7/threading.py", line 890
    self._bootstrap_inner()
  File "/usr/local/lib/python3.7/threading.py", line 914
    self._set_tstate_lock()
  File "/usr/local/lib/python3.7/threading.py", line 904
    self._tstate_lock = _set_sentinel()

The issue has been closed as a duplicate of another issue, which is currently still open, and reportedly affects Python 3.8+.

I also did a test run with Python 3.6 and the memory usage was more or less constant - it appears that the linked CPython issue is indeed the root cause.

fodrh1201 (Author) commented:

@plamut Thank you!! I solved the problem thanks to you.
