PubSub with Python 3.5 spawns threads that don't die, eventually causing the process to consume all available CPU #4563
Comments
I take that back – after leaving it running, Python 3.6 seems to exhibit the same behavior of spiking CPU, though the number of threads hasn't increased like it did on 3.5. Don't know if this helps, but here's the gdb thread printout:
Don't know if this helps, but figured I'd add it here in case something stands out.
@dmontag Thanks for this example. I'll try to reproduce today. I'm hoping that the recent releases (in particular
@dmontag It's worth noting that the
It is also present in the
@dmontag So I am running a modified version of this, but it's essentially the same. I even attached the topic to bucket change notifications, though that doesn't seem relevant to the issue.

I am going to pre-emptively close this issue (while I continue debugging). As of right now, my example has been running. Would you mind testing with
@dhermes Thanks for checking this one out. I tested on 0.29.4 yesterday, and it exhibits the same behavior of having a thread eventually spike and run at 100% sustained CPU, though it takes about 65-70 minutes before the condition presents. I updated my comment on #4274 to reflect this: #4274 (comment) For now I'll have to run on 0.27.0, which has been running OK for the past 20-or-so hours. The Stackdriver dashboard reports Pull requests for 0.27 vs. StreamingPull requests for 0.29; maybe that's a place to start looking for differences on the client side.
Though I am loosely familiar with many Google APIs, I don't have deep knowledge of most of them. Would you mind pointing me to docs for how I might create such a Stackdriver dashboard? (I have a general idea, but am overwhelmed by the sheer number of metrics for each Pub/Sub resource type in the metrics explorer.) Thanks for continuing to help debug this. I wonder if it'd be possible for us to get a shared piece of code that we are both running? I've been tracking my experiments at https://github.com/dhermes/google-cloud-pubsub-performance and have yet to reproduce a CPU spike (though 5 minutes is the longest anything has run so far).
@dmontag I have at least been able to reproduce this (in
I currently have no idea how this is happening. I'm using the Unix
Sounds exactly like what I'm seeing. Good to hear that it's a reproducible problem – that's the first step toward fixing it :-)
I think I've tracked it down. Right around the point where the "spin worker" thread's CPU usage begins to spike to 100%, there is a thread checking the credentials. This is a sign that the credentials are being refreshed, which makes sense, since an access token has a lifetime of an hour plus some wiggle room. /cc @jonparrott
Very interesting, but why does refreshing the creds cause a sustained CPU spike?
At least now it should be easier to debug; you can fake credentials expiring.
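For example, one way to simulate an early token expiry (a sketch only; it assumes google-auth application default credentials, that `credentials.expiry` can be overwritten directly, and that the subscriber client accepts a `credentials` argument):

```python
# Sketch: refresh once so the credentials carry a real token, then pull the
# expiry in so the refresh path is hit after ~2 minutes instead of ~1 hour.
import datetime

import google.auth
import google.auth.transport.requests
from google.cloud import pubsub_v1

credentials, _ = google.auth.default()
credentials.refresh(google.auth.transport.requests.Request())
credentials.expiry = datetime.datetime.utcnow() + datetime.timedelta(minutes=2)

subscriber = pubsub_v1.SubscriberClient(credentials=credentials)
```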
I think the 60 minute mark / extra credentials check was a false alarm. Seems like grpc/grpc#13665 is an attempt at a stop-gap. We may be able to address this by tracking CPU usage and killing an active subscription based on it, but that wouldn't be very fun.
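A rough sketch of that workaround idea (assuming psutil for process CPU measurement and the 0.29.x `subscribe()`/`open()`/`close()` policy API; not something we'd actually want to ship):

```python
# Watch this process's CPU usage and recycle the subscription if it pegs.
import time

import psutil
from google.cloud import pubsub_v1

def consume_with_cpu_watchdog(subscription_path, callback, threshold=90.0):
    subscriber = pubsub_v1.SubscriberClient()
    proc = psutil.Process()
    subscription = subscriber.subscribe(subscription_path)
    subscription.open(callback)
    while True:
        time.sleep(60)
        if proc.cpu_percent(interval=1.0) > threshold:
            subscription.close()  # kill the spinning consumer...
            subscription = subscriber.subscribe(subscription_path)
            subscription.open(callback)  # ...and start a fresh one
```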
Interesting indeed. Any idea why 0.27 doesn't exhibit this? (I'm not familiar with the library internals.)
It doesn't use a
Do you think it's reproducible without PubSub-specific code? I.e., could a test case be written that uses only gRPC to reproduce it?
@dmontag Yes, see grpc/grpc#9688
Hi,
I have searched for answers but haven't found anything matching this issue. The closest was #3965, but it might be a different problem.
My code first initializes a subscriber client and storage client at startup like so:
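Along these lines (a sketch rather than the exact snippet from the report; the 0.29.x client API is assumed and the project/topic/subscription names are placeholders):

```python
from google.cloud import pubsub_v1, storage

PROJECT_ID = "my-project"            # placeholder values
TOPIC_NAME = "my-topic"
SUBSCRIPTION_NAME = "my-subscription"

subscriber = pubsub_v1.SubscriberClient()
storage_client = storage.Client(project=PROJECT_ID)

topic_path = subscriber.topic_path(PROJECT_ID, TOPIC_NAME)
subscription_path = subscriber.subscription_path(PROJECT_ID, SUBSCRIPTION_NAME)
```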
After ensuring it has a subscription, it starts listening:
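Again a sketch under the same assumptions, using the 0.29.x pattern where `subscribe()` returns a policy that is opened with a callback; the callback body is a placeholder:

```python
import time

from google.api_core import exceptions

def callback(message):
    print("received:", message.data)
    message.ack()

try:
    subscriber.create_subscription(subscription_path, topic_path)
except exceptions.AlreadyExists:
    pass  # subscription already exists, which is fine

subscription = subscriber.subscribe(subscription_path)
subscription.open(callback)

while True:
    time.sleep(60)  # keep the main thread alive while the consumer threads run
```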
No messages are ever sent on the topic; this is all on a completely idle system.
Eventually the thread count starts to creep up, and after an hour or two CPU usage spikes to 100%. I left it running overnight with some code in the loop printing all thread stacktraces. First I checked how many threads were running:
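One way to produce that kind of dump with only the standard library (a sketch, not the exact code from the report):

```python
import sys
import threading
import time
import traceback

while True:
    time.sleep(60)
    print("live threads:", threading.active_count())
    for thread_id, frame in sys._current_frames().items():
        print("--- thread %d ---" % thread_id)
        traceback.print_stack(frame)
```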
Here's an excerpt of the stacktraces printed, and it goes on and on:
After checking out the google-cloud-python source code, I noticed that the thread policy (`policy_class`) makes a distinction between Python 3.5 and 3.6:
https://github.com/GoogleCloudPlatform/google-cloud-python/blob/af2e6cac62fc9d380bac780fe63b1416ef94da48/pubsub/google/cloud/pubsub_v1/subscriber/policy/thread.py#L139
I installed Python 3.6.3 and tried it out, and it doesn't seem to exhibit the same behavior – the number of threads stays limited. So in that sense I have a workaround, but I figured I would still report this in case someone else is seeing a similar problem.