Python PubSub consumer high cpu usage after approx 1 hour #3965
Comments
The 1 hour mark could also be a sign that an auth token is trying to be refreshed but fails (token lifetime is 1 hour). E.g. you could have all 20 workers sharing the same
Hrm, in that case there would be a spike but not a sustained 20% increase in CPU, but maybe it causes a cascade of sorts? @thijsterlouw can you turn on debug-level logging and see if there's a lot of logs from
@jonparrott Could you provide a code snippet explaining how one might "turn on debug-level logging"?
@thijsterlouw Also worth noting that this is just a working theory. It's also possible that the workers are thrashing on some other task.
import logging
logging.basicConfig(level=logging.DEBUG)
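If root-level DEBUG output is too noisy to work with, the standard logging module can also be scoped to just the client libraries. A minimal sketch, assuming the libraries name their loggers after their module paths (google.cloud.pubsub_v1 and google.auth are assumed names, not confirmed from the library source):

```python
import logging

# Keep the root logger at INFO and enable DEBUG only for the (assumed)
# logger names of the Pub/Sub client and the auth library.
logging.basicConfig(level=logging.INFO)
for name in ("google.cloud.pubsub_v1", "google.auth"):
    logging.getLogger(name).setLevel(logging.DEBUG)
```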
I added debug logging, but that did not show anything useful. After going through many debug logs on production (the application does quite a lot of traffic), I only noticed this:
There were no log messages at all between 10:37:24 and 10:39:38. I did not find any messages about auth-token refreshes. This is the same time I saw the CPU usage jump; I cannot tell exactly when the CPU usage jumped. On staging the CPU usage also jumped, even though there are almost no messages to process, so there the pattern is a bit clearer in the log messages:
then a steady stream of events like this:
(every few seconds a few messages like this)
Note that our _error_cb just prints all errors, but doesn't handle them. I see the new Pubsub code handles DEADLINE_EXCEEDED explicitly with a retry. Our _error_cb:
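The original _error_cb is not reproduced in this thread; as a hypothetical stand-in, a callback that only logs and does not handle anything might look like this:

```python
import logging

def _error_cb(exception):
    # Only report the error; no retry or recovery is attempted here.
    logging.error("PubSub subscription error: %r", exception)
```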
These errors, however, seem to occur all the time, including before the high CPU usage occurs:
(apparently every 15 minutes; perhaps it is a coincidence that the 4th attempt coincides with the high CPU usage). The error logger did not log any errors besides this DEADLINE_EXCEEDED code. Edit: I did a small test on STG to see what happens when the CPU spike occurs. In this run it took more than one hour (again just after a DEADLINE_EXCEEDED). Still OK:
And Kaboom:
Things to note:
When doing an strace on PID 1820 I see the same hot loop as before:
And again the open ports:
After a while PID 1820 etc. were gone and PID 32 was just stuck at high CPU, even though I did not see anything special there in strace (just checking our uwsg_reload script at one-second intervals).
This doesn't seem like auth thrashing; deferring to @lukesneeringer.
Another possibility would be a connection timing out, except I do not think an hour is a connection timeout anywhere.
As shown in the last example, the problem sometimes occurs after a longer duration (e.g. 1h15m). It always seems to happen just after a DEADLINE_EXCEEDED, but the problem does not occur after every DEADLINE_EXCEEDED status code. I cannot tell whether the real cause is DEADLINE_EXCEEDED or some other action that simply happens in the background at the same time.
Because we have no idea what is going on, we decided to rip out uwsgi and just run "python app.py". The backtrace is similar to before:
I have the same issue when I use a customized policy to ignore UNAVAILABLE errors. Code snippet as follows:
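The commenter's snippet is not preserved here. A sketch of what such a customized policy might look like, assuming the 0.2x-era policy API (SubscriberClient(policy_class=...) and thread.Policy.on_exception, where returning without raising means the error is ignored and the subscription stays open); the class name and exact handling below are illustrative, not the commenter's actual code:

```python
import grpc
from google.cloud import pubsub_v1
from google.cloud.pubsub_v1.subscriber.policy import thread


class IgnoreUnavailablePolicy(thread.Policy):
    """Hypothetical policy that swallows UNAVAILABLE instead of raising it."""

    def on_exception(self, exception):
        # Treat UNAVAILABLE as retryable: returning here (instead of raising)
        # keeps the subscription open, mirroring how the default policy
        # treats DEADLINE_EXCEEDED.
        if getattr(exception, "code", lambda: None)() == grpc.StatusCode.UNAVAILABLE:
            return
        return super(IgnoreUnavailablePolicy, self).on_exception(exception)


subscriber = pubsub_v1.SubscriberClient(policy_class=IgnoreUnavailablePolicy)
```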
If we don't ignore the UNAVAILABLE errors, messages won't come in after the error. If we do ignore the errors, high CPU usage forces our cluster to scale up and prevents it from scaling down. Will this be fixed anytime soon?
I can confirm this is a real issue (and even the "about an hour" part of the phenomenon). I have created #4600 for tracking this CPU thrashing issue. See the description there for potential mitigation.
OS type and version:
Debian GNU/Linux 8 (jessie)
Python version and virtual environment information:
python --version
Python 3.6.0
google-cloud-python version
pip show google-cloud, pip show google-<service>, or pip freeze
Stacktrace if available:
N/A
Steps to reproduce
Note: we have multiple consumers running. When I start, for example, 20 at the same time, they all run fine for slightly more than 1 hour, and then sometime between 1h and 1h20m (I was not watching the CPU usage the whole time) the CPU usage of almost all 20 increases. Sometimes it does not happen on one of the consumers (but there might be another reason for that).
Summary of relevant parts (I cannot paste the entire application):
where _message_callback calls message.ack() if everything was OK in the application code and otherwise raises.
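As an illustration only (not the original application code), a minimal consumer with that structure, written against the newer streaming-pull API in which subscribe() takes the callback directly; the project and subscription names and the handle() helper are hypothetical, and the library version used in this report wired the callback up slightly differently:

```python
from google.cloud import pubsub_v1

PROJECT_ID = "my-project"              # hypothetical
SUBSCRIPTION_NAME = "my-subscription"  # hypothetical


def handle(message):
    """Placeholder for the application-specific processing."""


def _message_callback(message):
    handle(message)   # raises on failure, so the message is never acked
    message.ack()     # ack only when the application code succeeded


subscriber = pubsub_v1.SubscriberClient()
subscription_path = subscriber.subscription_path(PROJECT_ID, SUBSCRIPTION_NAME)
future = subscriber.subscribe(subscription_path, callback=_message_callback)
future.result()  # block the main thread while messages are consumed
```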
Relevant debugging:
Then take the process with the high CPU usage and strace it:
Taking a look at the open sockets, something jumps out:
Note that these target IP addresses (74.125.206.84) are probably PubSub-related, but I cannot really confirm that. Debugging over HTTPS is obviously a bit difficult as well.
The loop of clock_gettime + poll (with POLLHUP), combined with the two sockets in CLOSE-WAIT with a Recv-Q of 1, leads me to believe there is a bug in the Google PubSub libraries: an invalid/incomplete message is hanging in the receive queue, and the library keeps trying to consume it and then retries, over and over, causing a hot loop that takes quite a lot of CPU.
Note that the application itself keeps consuming messages, but that might simply be because a uwsgi process is doing the work (not investigated).
I also did a gdb backtrace, which points to grpc:
I have not yet dived into the details of grpc, but to me all clues point in that direction. We did not encounter issues like this with the old PubSub library (we used google-cloud-pubsub==0.26 before).