
Python PubSub consumer high cpu usage after approx 1 hour #3965

Closed
thijsterlouw opened this issue Sep 15, 2017 · 12 comments

Labels
api: pubsub Issues related to the Pub/Sub API. performance priority: p2 Moderately-important priority. Fix may not be included in next release.

Comments

@thijsterlouw commented Sep 15, 2017

  1. OS type and version:
    Debian GNU/Linux 8 (jessie)

  2. Python version and virtual environment information (python --version):
    Python 3.6.0

  3. google-cloud-python version (pip show google-cloud, pip show google-<service>, or pip freeze):

aniso8601==1.2.0
cachetools==2.0.1
Cerberus==0.9.2
click==6.7
confluent-kafka==0.9.2
couchbase==2.2.0
coverage==4.2
dill==0.2.7.1
flake8==3.4.1
Flask==0.12
Flask-RESTful==0.3.5
future==0.16.0
google-auth==1.1.0
google-cloud-core==0.27.1
google-cloud-pubsub==0.28.3
google-gax==0.15.14
googleapis-common-protos==1.5.2
grpc-google-iam-v1==0.11.3
grpcio==1.4.0
httplib2==0.10.3
itsdangerous==0.24
Jinja2==2.9.5
jsonschema==2.5.1
MarkupSafe==0.23
mccabe==0.6.1
microservice==0.7.0
mysql-connector-python==2.2.1
oauth2client==3.0.0
pep8==1.7.0
ply==3.8
protobuf==3.4.0
psutil==5.3.1
py==1.4.32
pyasn1==0.3.4
pyasn1-modules==0.1.4
pycodestyle==2.3.1
pyflakes==1.5.0
pytest==2.8.5
pytest-mock==1.5.0
python-dateutil==2.6.0
pytz==2016.10
requests==2.13.0
rsa==3.4.2
six==1.10.0
statsd==3.2.1
uWSGI==2.0.14
Werkzeug==0.11.15
  4. Stacktrace if available:
    N/A

  5. Steps to reproduce

  • I run my Python pubsub-consumer (in uwsgi, with 1 thread) for a bit more than one hour
  • there are no special log messages about the application restarting
  • then I observe high CPU usage

Note: we have multiple consumers running. When I start, for example, 20 at the same time, they all run fine for slightly more than one hour; then, sometime between 1h and 1h20m (I was not watching the CPU usage continuously), the CPU usage of almost all 20 increases. Sometimes it does not happen on one of the consumers (but there might be another reason for that).

  6. Code example
    Summary of the relevant parts (I cannot paste the entire application):
        from google.cloud import pubsub  # import for google-cloud-pubsub 0.28.3

        subscriber = pubsub.SubscriberClient()
        project_path = subscriber.project_path(project_name)
        subscription_path = subscriber.subscription_path(project_name, subscription_name)
        topic_path = subscriber.topic_path(project_name, topic_name)

        self.pubsub_consumer_sub = subscriber.subscribe(subscription_path)
        self.pubsub_consumer_sub.on_exception = self._error_cb

        self.pubsub_consumer_sub.open(self._message_callback)

where _message_callback calls message.ack() if everything went OK in the application code, and otherwise raises.
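For context, a minimal sketch of the shape of that callback (process_payload is a hypothetical stand-in for the application code, which cannot be pasted here):

import logging

def _message_callback(message):
    try:
        # Application-specific work; raises on failure.
        process_payload(message.data)
    except Exception:
        # No ack on failure: Pub/Sub redelivers the message after the ack deadline.
        logging.exception("processing failed; message will be redelivered")
        raise
    # Acknowledge only after the work succeeded.
    message.ack()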

Relevant debugging:

 top -b -n1 -H -p 35    # process 35 shows up with high CPU in "top"; -H lists its threads

  PID USER      PR  NI    VIRT    RES    SHR S %CPU %MEM     TIME+ COMMAND
  711 root      20   0 1580028  80752  15568 R 98.6  0.3  11:36.88 uwsgi
   35 root      20   0 1580028  80752  15568 S  0.0  0.3   0:01.43 uwsgi
   37 root      20   0 1580028  80752  15568 S  0.0  0.3   0:01.24 uwsgi
   43 root      20   0 1580028  80752  15568 S  0.0  0.3   0:01.13 uwsgi

Then take the thread with the high CPU and strace it:

strace -e trace=all -p 711

(part of the output)
....
poll([{fd=15, events=POLLIN}, {fd=17, events=0}, {fd=20, events=0}, {fd=21, events=POLLIN}], 4, 81) = 2 ([{fd=17, revents=POLLHUP}, {fd=20, revents=POLLHUP}])
clock_gettime(CLOCK_MONOTONIC, {690178, 485753027}) = 0
poll([{fd=15, events=POLLIN}, {fd=17, events=0}, {fd=20, events=0}, {fd=21, events=POLLIN}], 4, 81) = 2 ([{fd=17, revents=POLLHUP}, {fd=20, revents=POLLHUP}])
clock_gettime(CLOCK_MONOTONIC, {690178, 485879159}) = 0
poll([{fd=15, events=POLLIN}, {fd=17, events=0}, {fd=20, events=0}, {fd=21, events=POLLIN}], 4, 80) = 2 ([{fd=17, revents=POLLHUP}, {fd=20, revents=POLLHUP}])
clock_gettime(CLOCK_MONOTONIC, {690178, 486030327}) = 0
poll([{fd=15, events=POLLIN}, {fd=17, events=0}, {fd=20, events=0}, {fd=21, events=POLLIN}], 4, 80) = 2 ([{fd=17, revents=POLLHUP}, {fd=20, revents=POLLHUP}])
clock_gettime(CLOCK_MONOTONIC, {690178, 486165028}) = 0
poll([{fd=15, events=POLLIN}, {fd=17, events=0}, {fd=20, events=0}, {fd=21, events=POLLIN}], 4, 80) = 2 ([{fd=17, revents=POLLHUP}, {fd=20, revents=POLLHUP}])
clock_gettime(CLOCK_MONOTONIC, {690178, 486290705}) = 0
poll([{fd=15, events=POLLIN}, {fd=17, events=0}, {fd=20, events=0}, {fd=21, events=POLLIN}], 4, 80) = 2 ([{fd=17, revents=POLLHUP}, {fd=20, revents=POLLHUP}])
clock_gettime(CLOCK_MONOTONIC, {690178, 486428764}) = 0
poll([{fd=15, events=POLLIN}, {fd=17, events=0}, {fd=20, events=0}, {fd=21, events=POLLIN}], 4, 80) = 2 ([{fd=17, revents=POLLHUP}, {fd=20, revents=POLLHUP}])
clock_gettime(CLOCK_MONOTONIC, {690178, 486548305}) = 0
poll([{fd=15, events=POLLIN}, {fd=17, events=0}, {fd=20, events=0}, {fd=21, events=POLLIN}], 4, 80) = 2 ([{fd=17, revents=POLLHUP}, {fd=20, revents=POLLHUP}])
clock_gettime(CLOCK_MONOTONIC, {690178, 486670569}) = 0
poll([{fd=15, events=POLLIN}, {fd=17, events=0}, {fd=20, events=0}, {fd=21, events=POLLIN}], 4, 80) = 2 ([{fd=17, revents=POLLHUP}, {fd=20, revents=POLLHUP}])
clock_gettime(CLOCK_MONOTONIC, {690178, 486776253}) = 0
poll([{fd=15, events=POLLIN}, {fd=17, events=0}, {fd=20, events=0}, {fd=21, events=POLLIN}], 4, 80) = 2 ([{fd=17, revents=POLLHUP}, {fd=20, revents=POLLHUP}])
clock_gettime(CLOCK_MONOTONIC, {690178, 486896929}) = 0
poll([{fd=15, events=POLLIN}, {fd=17, events=0}, {fd=20, events=0}, {fd=21, events=POLLIN}], 4, 79) = 2 ([{fd=17, revents=POLLHUP}, {fd=20, revents=POLLHUP}])
clock_gettime(CLOCK_MONOTONIC, {690178, 487011159}) = 0
....
and hundreds more of these

Taking a look at the open sockets, something jumps out:

ss -t -a
State      Recv-Q Send-Q                                                          Local Address:Port                                                              Peer Address:Port 
CLOSE-WAIT 1      0                                                                 10.0.20.188:59188                                                            74.125.206.84:https
CLOSE-WAIT 1      0                                                                 10.0.20.188:38434                                                            74.125.206.84:https
... (others removed)...

Note that these target IP addresses (74.125.206.84) are probably PubSub-related, but I cannot really confirm that. Debugging over https is obviously a bit difficult as well.
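A reverse DNS or whois lookup on the peer address would be a quick way to check whether it belongs to Google (commands only; their output is not captured here):

host 74.125.206.84
whois 74.125.206.84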

The loop of clock_gettime + poll (with POLLHUP), combined with the two sockets in CLOSE-WAIT with a Recv-Q of 1, leads me to believe there is a bug somewhere in the Google PubSub libraries: an invalid/incomplete message is hanging in the receive queue, and the library keeps trying to consume it, retrying over and over in a hot loop that takes quite a lot of CPU.

Note that the application itself keeps consuming messages, but that might simply be because a uwsgi process is doing the work (not investigated).

I also did a gdb backtrace, which points to grpc:

(gdb) bt
#0  syscall () at ../sysdeps/unix/sysv/linux/x86_64/syscall.S:38
#1  0x00007fe5ea025c7e in now_impl (clock_type=<optimized out>) at src/core/lib/support/time_posix.c:92
#2  0x00007fe5ea025cfa in gpr_now (clock_type=clock_type@entry=GPR_CLOCK_MONOTONIC) at src/core/lib/support/time_posix.c:156
#3  0x00007fe5ea054a27 in cq_next (cc=0x2771790, deadline=..., reserved=<optimized out>) at src/core/lib/surface/completion_queue.c:833
#4  0x00007fe5ea05571e in grpc_completion_queue_next (cc=<optimized out>, deadline=..., reserved=reserved@entry=0x0) at src/core/lib/surface/completion_queue.c:873
#5  0x00007fe5e9ff9261 in __pyx_pf_4grpc_7_cython_6cygrpc_15CompletionQueue_2poll (__pyx_v_deadline=<optimized out>, __pyx_v_self=0x7fe5e8f6ab70)
    at src/python/grpcio/grpc/_cython/cygrpc.c:10553
#6  __pyx_pw_4grpc_7_cython_6cygrpc_15CompletionQueue_3poll (__pyx_v_self=0x7fe5e8f6ab70, __pyx_args=<optimized out>, __pyx_kwds=<optimized out>)
    at src/python/grpcio/grpc/_cython/cygrpc.c:10413
#7  0x00007fe5f191ce8e in _PyCFunction_FastCallDict (func_obj=0x7fe5e827bfc0, args=0x7fe5e8262db0, nargs=<optimized out>, kwargs=kwargs@entry=0x0) at Objects/methodobject.c:231
#8  0x00007fe5f191d1a7 in _PyCFunction_FastCallKeywords (func=func@entry=0x7fe5e827bfc0, stack=stack@entry=0x7fe5e8262db0, nargs=<optimized out>, kwnames=kwnames@entry=0x0)
    at Objects/methodobject.c:295
#9  0x00007fe5f19b46d3 in call_function (pp_stack=pp_stack@entry=0x7fe5e1ffa5f0, oparg=<optimized out>, kwnames=kwnames@entry=0x0) at Python/ceval.c:4788
#10 0x00007fe5f19b9e61 in _PyEval_EvalFrameDefault (f=0x7fe5e8262c18, throwflag=<optimized out>) at Python/ceval.c:3275
#11 0x00007fe5f19b434a in _PyEval_EvalCodeWithName (_co=0x1, globals=0x7fe5e1ffa290, locals=0x2771820, locals@entry=0x0, args=0x7fe5f3683060, argcount=9223372036854775807, 
    kwnames=0xa7b03, kwnames@entry=0x7fe5f3683060, kwargs=0x7fe5f3683068, kwcount=0, kwstep=2, defs=0x0, defcount=0, kwdefs=0x0, closure=0x7fe5e8f64f60, name=0x0, qualname=0x0)
    at Python/ceval.c:4119
#12 0x00007fe5f19b490f in PyEval_EvalCodeEx (_co=<optimized out>, globals=<optimized out>, locals=locals@entry=0x0, args=args@entry=0x7fe5f3683060, argcount=<optimized out>, 
    kws=kws@entry=0x7fe5f3683060, kwcount=0, defs=0x0, defcount=0, kwdefs=0x0, closure=0x7fe5e8f64f60) at Python/ceval.c:4140
#13 0x00007fe5f18f645e in function_call (func=0x7fe5e82987b8, arg=0x7fe5f3683048, kw=0x7fe5e86d6708) at Objects/funcobject.c:604
#14 0x00007fe5f18c492a in PyObject_Call (func=0x7fe5e82987b8, args=<optimized out>, kwargs=<optimized out>) at Objects/abstract.c:2246
#15 0x00007fe5f19b98f6 in do_call_core (kwdict=0x7fe5e86d6708, callargs=0x7fe5f3683048, func=0x7fe5e82987b8) at Python/ceval.c:5057
#16 _PyEval_EvalFrameDefault (f=0x7fe5e86f6dd8, throwflag=<optimized out>) at Python/ceval.c:3357
#17 0x00007fe5f19b39a0 in _PyFunction_FastCall (co=<optimized out>, args=<optimized out>, nargs=1, globals=<optimized out>) at Python/ceval.c:4870
#18 0x00007fe5f19b48ad in fast_function (kwnames=0x0, nargs=<optimized out>, stack=<optimized out>, func=0x7fe5f04742f0) at Python/ceval.c:4905
#19 call_function (pp_stack=pp_stack@entry=0x7fe5e1ffaae0, oparg=<optimized out>, kwnames=kwnames@entry=0x0) at Python/ceval.c:4809
#20 0x00007fe5f19b9e61 in _PyEval_EvalFrameDefault (f=0x7fe5cc0008d8, throwflag=<optimized out>) at Python/ceval.c:3275
#21 0x00007fe5f19b39a0 in _PyFunction_FastCall (co=<optimized out>, args=<optimized out>, nargs=1, globals=<optimized out>) at Python/ceval.c:4870
#22 0x00007fe5f19b48ad in fast_function (kwnames=0x0, nargs=<optimized out>, stack=<optimized out>, func=0x7fe5f0474510) at Python/ceval.c:4905
#23 call_function (pp_stack=pp_stack@entry=0x7fe5e1ffacb0, oparg=<optimized out>, kwnames=kwnames@entry=0x0) at Python/ceval.c:4809
#24 0x00007fe5f19b9e61 in _PyEval_EvalFrameDefault (f=0x7fe5e826dbb8, throwflag=<optimized out>) at Python/ceval.c:3275
#25 0x00007fe5f19b39a0 in _PyFunction_FastCall (co=<optimized out>, args=<optimized out>, nargs=1, globals=<optimized out>) at Python/ceval.c:4870
#26 0x00007fe5f19bd210 in _PyFunction_FastCallDict (func=func@entry=0x7fe5f0474378, args=args@entry=0x7fe5e1ffae60, nargs=1, kwargs=kwargs@entry=0x0) at Python/ceval.c:4972
#27 0x00007fe5f18c4b8e in _PyObject_FastCallDict (func=func@entry=0x7fe5f0474378, args=args@entry=0x7fe5e1ffae60, nargs=nargs@entry=1, kwargs=kwargs@entry=0x0)
    at Objects/abstract.c:2295
#28 0x00007fe5f18c4c91 in _PyObject_Call_Prepend (func=0x7fe5f0474378, obj=<optimized out>, args=0x7fe5f3683048, kwargs=0x0) at Objects/abstract.c:2358
#29 0x00007fe5f18c492a in PyObject_Call (func=0x7fe5e86be988, args=<optimized out>, kwargs=<optimized out>) at Objects/abstract.c:2246
---Type <return> to continue, or q <return> to quit---
#30 0x00007fe5f19b4d91 in PyEval_CallObjectWithKeywords (func=<optimized out>, args=<optimized out>, kwargs=<optimized out>) at Python/ceval.c:4709
#31 0x00007fe5f1a07db2 in t_bootstrap (boot_raw=0x7fe5e8f1bb48) at ./Modules/_threadmodule.c:998
#32 0x00007fe5f32e8064 in start_thread (arg=0x7fe5e1ffb700) at pthread_create.c:309
#33 0x00007fe5f135862d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111

I did not yet dive into the details of grpc, but to me all clues point in that direction. We did not encounter issues like this with the old pubsub library (we used google-cloud-pubsub==0.26 before).

@dhermes dhermes added api: pubsub Issues related to the Pub/Sub API. performance labels Sep 15, 2017
@dhermes (Contributor) commented Sep 15, 2017

The 1-hour mark could also be a sign that an auth token refresh is being attempted but failing (token lifetime is 1 hour). E.g. you could have all 20 workers sharing the same Credentials object, and then they keep fighting over its state during the refresh. @jonparrott WDYT?
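If that is what's happening, one mitigation sketch would be to give each worker its own Credentials instance instead of sharing one (make_subscriber is a hypothetical helper; this assumes SubscriberClient forwards the credentials keyword to the underlying GAPIC client):

import google.auth
from google.cloud import pubsub

def make_subscriber():
    # A fresh Credentials object per worker, so token refreshes don't contend.
    credentials, _project = google.auth.default()
    return pubsub.SubscriberClient(credentials=credentials)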

@theacodes (Contributor) commented Sep 15, 2017

Hrm, in that case there would be a spike but not a sustained 20% increase in CPU, but maybe it causes a cascade of sorts?

@thijsterlouw can you turn on debug-level logging and see if there are a lot of logs from google.auth? Might as well go ahead and log your pubsub client's credentials as well, especially the expiry vs datetime.utcnow().
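Something like this would do it (a sketch assuming application-default credentials; expiry and expired are standard google.auth Credentials attributes, though expiry is None until the first refresh):

import datetime
import logging

import google.auth

credentials, _project = google.auth.default()
logging.debug(
    "credentials expiry=%s utcnow=%s expired=%s",
    credentials.expiry, datetime.datetime.utcnow(), credentials.expired)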

@dhermes (Contributor) commented Sep 15, 2017

@jonparrott Could you provide a code snippet explaining how one might "turn on debug-level logging"?

@dhermes (Contributor) commented Sep 15, 2017

@thijsterlouw Also worth noting that this is just a working theory. It's also possible that the workers are thrashing on some other task.

@theacodes (Contributor):

import logging

# DEBUG on the root logger makes google.auth and pubsub_v1 emit their diagnostics.
logging.basicConfig(level=logging.DEBUG)
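On a high-traffic service it may help to scope DEBUG to just the relevant loggers instead of the root logger (a sketch; the logger names match the log lines quoted later in this thread):

import logging

logging.basicConfig(level=logging.INFO)
logging.getLogger('google.auth').setLevel(logging.DEBUG)
logging.getLogger('google.cloud.pubsub_v1').setLevel(logging.DEBUG)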

@thijsterlouw (Author) commented Sep 18, 2017

I added debug logging, but that did not show anything useful. After going through many debug logs on production (the application does quite a lot of traffic), I only noticed this:

10:39:38.000 DEBUG:root:Message will be acknowledged. message=Message {
10:37:24.000 DEBUG - Received response: received_messages {

There were no log messages at all between 10:37:24 and 10:39:38, and I did not find any messages about auth-token refreshes. This is around the time I saw the CPU usage jump; I cannot tell exactly when it jumped.

On staging the CPU usage also jumped, even though there are almost no messages to process, so the pattern in the log messages is a bit clearer there. The release went out to staging at ~09:16:44:

09:16:44.000 = DEBUG:google.auth.transport.requests:Making request: POST https://accounts.google.com/o/oauth2/token
09:16:44.000 =  DEBUG:google.cloud.pubsub_v1.subscriber._consumer:Sending initial request: subscription: "projects/REDACTED"
09:16:44.000 = DEBUG:google.cloud.pubsub_v1.subscriber.policy.base:The current p99 value is 10 seconds.

then a steady stream of events like this:

I  DEBUG - Snoozing lease management for 4.007550 seconds. [in /usr/local/lib/python3.6/site-packages/google/cloud/pubsub_v1/subscriber/policy/base.py:315]
I  DEBUG:google.cloud.pubsub_v1.subscriber.policy.base:Snoozing lease management for 4.007550 seconds.
I  DEBUG - Renewing lease for 0 ack IDs. [in /usr/local/lib/python3.6/site-packages/google/cloud/pubsub_v1/subscriber/policy/base.py:300]
I  DEBUG:google.cloud.pubsub_v1.subscriber.policy.base:Renewing lease for 0 ack IDs.
I  DEBUG - The current p99 value is 10 seconds. [in /usr/local/lib/python3.6/site-packages/google/cloud/pubsub_v1/subscriber/policy/base.py:294]
I  DEBUG:google.cloud.pubsub_v1.subscriber.policy.base:The current p99 value is 10 seconds.

(every few seconds, a few messages like this)

Fast-forwarding to 10:16:44, I see a DEADLINE_EXCEEDED error:

10:16:46.000 = I  DEBUG - The current p99 value is 10 seconds. [in /usr/local/lib/python3.6/site-packages/google/cloud/pubsub_v1/subscriber/policy/base.py:294]
10:16:44.000 = I   [in /usr/local/lib/python3.6/site-packages/google/cloud/pubsub_v1/subscriber/_consumer.py:205]
10:16:44.000 = I  stream_ack_deadline_seconds: 10
10:16:44.000 = I  DEBUG - Sending initial request: subscription: "projects/spil-native-generic-stg/subscriptions/nativeuserconsumergroup"
10:16:44.000 = I  
10:16:44.000 = I  stream_ack_deadline_seconds: 10
10:16:44.000 = I  DEBUG - Making request: POST https://accounts.google.com/o/oauth2/token [in /usr/local/lib/python3.6/site-packages/google/auth/transport/requests.py:109]
10:16:44.000 = I  ERROR - Pubsub consumer error callback: args=(<_Rendezvous of RPC that terminated with (StatusCode.DEADLINE_EXCEEDED, Deadline Exceeded)>,) kwargs={} [in /usr/local/lib/python3.6/site-packages/microservice/pubsub/__init__.py:141]
10:16:42.000 = I  DEBUG - Snoozing lease management for 4.063891 seconds. [in /usr/local/lib/python3.6/site-packages/google/cloud/pubsub_v1/subscriber/policy/base.py:315]

Note that our _error_cb just logs all errors but doesn't handle them. I see the new PubSub code handles DEADLINE_EXCEEDED explicitly with a retry:
https://github.com/GoogleCloudPlatform/google-cloud-python/blob/6b0b256a07c55df11bcda80189cfff9b529cc5a1/pubsub/google/cloud/pubsub_v1/subscriber/policy/thread.py#L131
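Paraphrased, the handling at that line looks roughly like this (a sketch based on the linked revision, not the verbatim source):

import grpc

def on_exception(self, exception):
    # DEADLINE_EXCEEDED is expected on an idle stream; returning None
    # signals the consumer to retry (reopen the stream) silently.
    deadline_exceeded = grpc.StatusCode.DEADLINE_EXCEEDED
    if getattr(exception, 'code', lambda: None)() == deadline_exceeded:
        return

    # Any other exception is surfaced instead of being retried.
    raise exception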

Our _error_cb:

    def _error_cb(self, *args, **kwargs):
        # Log only; the error is not otherwise handled.
        logging.error("Pubsub consumer error callback: args={0} kwargs={1}".format(args, kwargs))

These errors, however, seem to occur all the time, including before the high CPU usage starts:

10:16:44.000 = I  ERROR:root:Pubsub consumer error callback: args=(<_Rendezvous of RPC that terminated with (StatusCode.DEADLINE_EXCEEDED, Deadline Exceeded)>,) kwargs={}
10:01:44.000 = I  ERROR:root:Pubsub consumer error callback: args=(<_Rendezvous of RPC that terminated with (StatusCode.DEADLINE_EXCEEDED, Deadline Exceeded)>,) kwargs={}
09:46:44.000 = I  ERROR:root:Pubsub consumer error callback: args=(<_Rendezvous of RPC that terminated with (StatusCode.DEADLINE_EXCEEDED, Deadline Exceeded)>,) kwargs={}
09:31:44.000 = I  ERROR:root:Pubsub consumer error callback: args=(<_Rendezvous of RPC that terminated with (StatusCode.DEADLINE_EXCEEDED, Deadline Exceeded)>,) kwargs={}

(apparently every 15 minutes; it is perhaps coincidence that the 4th occurrence coincides with the high CPU usage). The error logger did not log any errors besides this DEADLINE_EXCEEDED code.

Edit: I ran a small test on STG to see what happens when the CPU spike occurs. In this run it took more than one hour (again just after a DEADLINE_EXCEEDED).

Still OK:

Mon Sep 18 12:15:15 UTC 2017
top - 12:15:15 up 12 days, 30 min,  0 users,  load average: 0.31, 0.26, 0.27
Threads:  21 total,   0 running,  21 sleeping,   0 stopped,   0 zombie
%Cpu(s): 27.2 us, 39.7 sy,  0.1 ni, 32.9 id,  0.0 wa,  0.0 hi,  0.1 si,  0.0 st
KiB Mem:  15406180 total, 13491028 used,  1915152 free,  1824060 buffers
KiB Swap:        0 total,        0 used,        0 free.  6340756 cached Mem

  PID USER      PR  NI    VIRT    RES    SHR S %CPU %MEM     TIME+ COMMAND
   32 root      20   0 1727488  61868  15892 S  0.0  0.4   0:02.70 uwsgi
   34 root      20   0 1727488  61868  15892 S  0.0  0.4   0:00.52 uwsgi
   ...

And Kaboom:

Mon Sep 18 12:15:19 UTC 2017
top - 12:15:19 up 12 days, 30 min,  0 users,  load average: 0.28, 0.25, 0.27
Threads:  22 total,   1 running,  21 sleeping,   0 stopped,   0 zombie
%Cpu(s): 27.2 us, 39.7 sy,  0.1 ni, 32.9 id,  0.0 wa,  0.0 hi,  0.1 si,  0.0 st
KiB Mem:  15406180 total, 13497208 used,  1908972 free,  1824060 buffers
KiB Swap:        0 total,        0 used,        0 free.  6340756 cached Mem

  PID USER      PR  NI    VIRT    RES    SHR S %CPU %MEM     TIME+ COMMAND
 1820 root      20   0 1735684  61872  15892 R 99.4  0.4   0:00.28 uwsgi
   32 root      20   0 1735684  61872  15892 S  0.0  0.4   0:02.71 uwsgi
   34 root      20   0 1735684  61872  15892 S  0.0  0.4   0:00.52 uwsgi
   39 root      20   0 1735684  61872  15892 S  0.0  0.4   0:00.39 uwsgi
   ...

Things to note:

  • it happened just after a DEADLINE_EXCEEDED again:
14:15:18.000 = ERROR - Pubsub consumer error callback: args=(<_Rendezvous of RPC that terminated with (StatusCode.DEADLINE_EXCEEDED, Deadline Exceeded)>,)
  • PID 32 (uwsgi in ps aux) is not the cause of the high CPU; it is PID 1820
  • PID 1820 was not visible before; it appears together with PID 1821

When doing an strace on PID 1820, I see the same hot loop as before:

poll([{fd=17, events=POLLIN}, {fd=10, events=0}, {fd=20, events=POLLIN}], 3, 27) = 1 ([{fd=10, revents=POLLHUP}])
clock_gettime(CLOCK_MONOTONIC, {1039324, 819860655}) = 0
poll([{fd=17, events=POLLIN}, {fd=10, events=0}, {fd=20, events=POLLIN}], 3, 26) = 1 ([{fd=10, revents=POLLHUP}])
clock_gettime(CLOCK_MONOTONIC, {1039324, 819988550}) = 0
poll([{fd=17, events=POLLIN}, {fd=10, events=0}, {fd=20, events=POLLIN}], 3, 26) = 1 ([{fd=10, revents=POLLHUP}])
clock_gettime(CLOCK_MONOTONIC, {1039324, 820114613}) = 0
poll([{fd=17, events=POLLIN}, {fd=10, events=0}, {fd=20, events=POLLIN}], 3, 26) = 1 ([{fd=10, revents=POLLHUP}])
clock_gettime(CLOCK_MONOTONIC, {1039324, 820294351}) = 0
poll([{fd=17, events=POLLIN}, {fd=10, events=0}, {fd=20, events=POLLIN}], 3, 26) = 1 ([{fd=10, revents=POLLHUP}])
clock_gettime(CLOCK_MONOTONIC, {1039324, 820454111}) = 0
poll([{fd=17, events=POLLIN}, {fd=10, events=0}, {fd=20, events=POLLIN}], 3, 26) = 1 ([{fd=10, revents=POLLHUP}])
clock_gettime(CLOCK_MONOTONIC, {1039324, 820624857}) = 0
poll([{fd=17, events=POLLIN}, {fd=10, events=0}, {fd=20, events=POLLIN}], 3, 26) = 1 ([{fd=10, revents=POLLHUP}])
clock_gettime(CLOCK_MONOTONIC, {1039324, 820838935}) = 0
....
ls -l /proc/32/fd/

...
10 -> socket:[83678161]
...

And again the open sockets:

CLOSE-WAIT 1      0                                                                 10.88.0.192:56006                                                              66.102.1.84:https   
CLOSE-WAIT 1      0                                                                 10.88.0.192:49180                                                              66.102.1.84:https   

After a while, PID 1820 etc. were gone and PID 32 was just stuck at high CPU, even though I did not see anything special there in strace (just our uwsg_reload script being checked at one-second intervals).

@theacodes (Contributor) commented Sep 18, 2017

This doesn't seem like auth thrashing - deferring to @lukesneeringer.

@lukesneeringer (Contributor):

Another possibility would be a connection timing out, except I do not think an hour is a connection timeout anywhere.

@thijsterlouw (Author) commented Sep 20, 2017

As shown in the last example, the problem sometimes occurs after a longer duration (e.g. 1h15m). It always seems to happen just after a DEADLINE_EXCEEDED, but not every DEADLINE_EXCEEDED status code triggers the problem. I cannot tell whether the real cause is DEADLINE_EXCEEDED or some other action that simply happens in the background at the same time.

@thijsterlouw (Author):

Because we have no idea what is going on, we decided to rip out uwsgi and just run "python app.py". After a while this again resulted in problems, with a similar backtrace as before:

(gdb) bt
#0  0x00007f84c6adfaed in poll () at ../sysdeps/unix/syscall-template.S:81
#1  0x00007f84bf6d0ff2 in pollset_work (exec_ctx=0x7f846fffe370, pollset=0x23dad08, worker_hdl=0x0, now=..., deadline=...) at src/core/lib/iomgr/ev_poll_posix.c:979
#2  0x00007f84bf6f1a8f in cq_next (cc=0x23daba0, deadline=..., reserved=<optimized out>) at src/core/lib/surface/completion_queue.c:844
#3  0x00007f84bf6f271e in grpc_completion_queue_next (cc=<optimized out>, deadline=..., reserved=reserved@entry=0x0) at src/core/lib/surface/completion_queue.c:873
#4  0x00007f84bf696261 in __pyx_pf_4grpc_7_cython_6cygrpc_15CompletionQueue_2poll (__pyx_v_deadline=<optimized out>, __pyx_v_self=0x7f84be5a87b0)
    at src/python/grpcio/grpc/_cython/cygrpc.c:10553
#5  __pyx_pw_4grpc_7_cython_6cygrpc_15CompletionQueue_3poll (__pyx_v_self=0x7f84be5a87b0, __pyx_args=<optimized out>, __pyx_kwds=<optimized out>)
    at src/python/grpcio/grpc/_cython/cygrpc.c:10413
#6  0x00007f84c779ae8e in _PyCFunction_FastCallDict (func_obj=0x7f84bc6c8750, args=0x7f84bc707fa8, nargs=<optimized out>, kwargs=kwargs@entry=0x0) at Objects/methodobject.c:231
#7  0x00007f84c779b1a7 in _PyCFunction_FastCallKeywords (func=func@entry=0x7f84bc6c8750, stack=stack@entry=0x7f84bc707fa8, nargs=<optimized out>, kwnames=kwnames@entry=0x0)
    at Objects/methodobject.c:295
#8  0x00007f84c78326d3 in call_function (pp_stack=pp_stack@entry=0x7f846fffe5f0, oparg=<optimized out>, kwnames=kwnames@entry=0x0) at Python/ceval.c:4788
#9  0x00007f84c7837e61 in _PyEval_EvalFrameDefault (f=0x7f84bc707e10, throwflag=<optimized out>) at Python/ceval.c:3275
#10 0x00007f84c783234a in _PyEval_EvalCodeWithName (_co=0x7f846fffd080, globals=0x3, locals=0x5a, locals@entry=0x0, args=0x7f84c7c3c060, argcount=0, kwnames=0x7fffffffffffffff, 
    kwnames@entry=0x7f84c7c3c060, kwargs=0x7f84c7c3c068, kwcount=0, kwstep=2, defs=0x0, defcount=0, kwdefs=0x0, closure=0x7f84bc637160, name=0x0, qualname=0x0) at Python/ceval.c:4119
#11 0x00007f84c783290f in PyEval_EvalCodeEx (_co=<optimized out>, globals=<optimized out>, locals=locals@entry=0x0, args=args@entry=0x7f84c7c3c060, argcount=<optimized out>, 
    kws=kws@entry=0x7f84c7c3c060, kwcount=0, defs=0x0, defcount=0, kwdefs=0x0, closure=0x7f84bc637160) at Python/ceval.c:4140
#12 0x00007f84c777445e in function_call (func=0x7f84bc63b8c8, arg=0x7f84c7c3c048, kw=0x7f84bc6c83a8) at Objects/funcobject.c:604
#13 0x00007f84c774292a in PyObject_Call (func=0x7f84bc63b8c8, args=<optimized out>, kwargs=<optimized out>) at Objects/abstract.c:2246
#14 0x00007f84c78378f6 in do_call_core (kwdict=0x7f84bc6c83a8, callargs=0x7f84c7c3c048, func=0x7f84bc63b8c8) at Python/ceval.c:5057
#15 _PyEval_EvalFrameDefault (f=0x7f84bc6febe8, throwflag=<optimized out>) at Python/ceval.c:3357
#16 0x00007f84c78319a0 in _PyFunction_FastCall (co=<optimized out>, args=<optimized out>, nargs=1, globals=<optimized out>) at Python/ceval.c:4870
#17 0x00007f84c78328ad in fast_function (kwnames=0x0, nargs=<optimized out>, stack=<optimized out>, func=0x7f84c65882f0) at Python/ceval.c:4905
#18 call_function (pp_stack=pp_stack@entry=0x7f846fffeae0, oparg=<optimized out>, kwnames=kwnames@entry=0x0) at Python/ceval.c:4809
#19 0x00007f84c7837e61 in _PyEval_EvalFrameDefault (f=0x7f849c0008d8, throwflag=<optimized out>) at Python/ceval.c:3275
#20 0x00007f84c78319a0 in _PyFunction_FastCall (co=<optimized out>, args=<optimized out>, nargs=1, globals=<optimized out>) at Python/ceval.c:4870
---Type <return> to continue, or q <return> to quit---
#21 0x00007f84c78328ad in fast_function (kwnames=0x0, nargs=<optimized out>, stack=<optimized out>, func=0x7f84c6588510) at Python/ceval.c:4905
#22 call_function (pp_stack=pp_stack@entry=0x7f846fffecb0, oparg=<optimized out>, kwnames=kwnames@entry=0x0) at Python/ceval.c:4809
#23 0x00007f84c7837e61 in _PyEval_EvalFrameDefault (f=0x7f84bc70e9d0, throwflag=<optimized out>) at Python/ceval.c:3275
#24 0x00007f84c78319a0 in _PyFunction_FastCall (co=<optimized out>, args=<optimized out>, nargs=1, globals=<optimized out>) at Python/ceval.c:4870
#25 0x00007f84c783b210 in _PyFunction_FastCallDict (func=func@entry=0x7f84c6588378, args=args@entry=0x7f846fffee60, nargs=1, kwargs=kwargs@entry=0x0) at Python/ceval.c:4972
#26 0x00007f84c7742b8e in _PyObject_FastCallDict (func=func@entry=0x7f84c6588378, args=args@entry=0x7f846fffee60, nargs=nargs@entry=1, kwargs=kwargs@entry=0x0)
    at Objects/abstract.c:2295
#27 0x00007f84c7742c91 in _PyObject_Call_Prepend (func=0x7f84c6588378, obj=<optimized out>, args=0x7f84c7c3c048, kwargs=0x0) at Objects/abstract.c:2358
#28 0x00007f84c774292a in PyObject_Call (func=0x7f84bc6cdac8, args=<optimized out>, kwargs=<optimized out>) at Objects/abstract.c:2246
#29 0x00007f84c7832d91 in PyEval_CallObjectWithKeywords (func=<optimized out>, args=<optimized out>, kwargs=<optimized out>) at Python/ceval.c:4709
#30 0x00007f84c7885db2 in t_bootstrap (boot_raw=0x7f84be5a87d8) at ./Modules/_threadmodule.c:998
#31 0x00007f84c74bb064 in start_thread (arg=0x7f846ffff700) at pthread_create.c:309
#32 0x00007f84c6ae862d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111

@hanshengchiu commented Nov 5, 2017

I have the same issue when I use a customized policy to ignore UNAVAILABLE errors. Code snippet as follows:

import logging

import grpc
from google.cloud import pubsub
from google.cloud.pubsub_v1.subscriber.policy import thread

LOGGER = logging.getLogger(__name__)

class WorkaroundPolicy(thread.Policy):
    def on_exception(self, exception):
        # If this is UNAVAILABLE, then we want to retry.
        # That entails just returning None.
        unavailable = grpc.StatusCode.UNAVAILABLE
        if getattr(exception, 'code', lambda: None)() == unavailable:
            LOGGER.info('Ignoring grpc.StatusCode.UNAVAILABLE')
            return

        # For anything else, fall back on super.
        super(WorkaroundPolicy, self).on_exception(exception)

ps_subscriber_client = pubsub.SubscriberClient(policy_class=WorkaroundPolicy)

If we don't ignore the UNAVAILABLE errors, messages stop coming in after the error. If we do ignore them, the high CPU usage forces our cluster to scale up and prevents it from scaling down.

Will this be fixed anytime soon?

@dhermes (Contributor) commented Dec 15, 2017

I can confirm this is a real issue (and even the "about an hour" part of the phenomenon). I have created #4600 to track this CPU thrashing issue. See the description there for a potential mitigation.
