
Fix 100% CPU usage #875

Merged · 3 commits · Apr 20, 2017

Conversation

@jaredsuttles

I ran into this bug using the web UI on macOS. It seems that #589 was on the right track, but its rules for unregistering a file descriptor are unnecessarily strict. We only need to check whether writers are no longer writable, not whether readers are no longer readable.
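
For illustration only, here is a minimal sketch of that rule with assumed names (poller, combined_map, unregister_writable), not the actual diff: descriptors registered for writing whose dispatcher can no longer write get unregistered, while read registrations are left alone.

def prune_stale_writables(poller, combined_map):
    # Sketch, not the real patch: drop only write registrations whose
    # dispatcher can no longer write; a dispatcher that merely isn't
    # readable at the moment keeps its read registration.
    for fd, dispatcher in combined_map.items():
        if fd in poller.writables and not dispatcher.writable():
            poller.unregister_writable(fd)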

@mnaberez (Member)

@jaredsuttles Thanks for this patch! It looks like the CI failure is unrelated (#873). There's another poller bug reported in #858. Does this patch happen to fix that one too, and if not, could you possibly take a look at it?

@mnaberez (Member)

The CI failure is unrelated and should be fixed as of fa7e3f5.

@jaredsuttles (Author)

@mnaberez I looked into the other issue and think I found a solution; please verify #885.

@mnaberez (Member)

@jaredsuttles Thanks very much for #885. I will test it and intend to apply it if there are no issues. Can you please check out #903? It's another patch that aims to fix the same issue as this one, but takes a completely different approach. I could use help figuring out the best thing to do here.

@mhariri commented Mar 11, 2017

Reading @jaredsuttles's code, I think he has a better fix than mine, so this one should go in.

@mhariri left a comment

Looks great! Can we get this in? The bug is really irritating...

@jaredsuttles (Author)

@mnaberez @mhariri Just a note regarding #903: it does indeed seem to fix the issue for the kqueue poller, but the other pollers are also broken if you force their use.

@mnaberez (Member)

"but the other pollers are also broken if you force their use."

Could you please give me some more specific details on how they are broken so that we can create tickets for them?

Do you think it would be better to just revert the poller patch entirely? It was added only recently, and prior to it supervisord was a relatively stable program. The main purpose of the poller patch is to allow supervisord to monitor a large number of file descriptors, which came up in tickets from time to time but wasn't a problem faced by most users. I didn't merge the poller patch, so I'm less familiar with all its details. I did reach out to the author of the patch (@igorsobreira) in other tickets and he hasn't responded. I'm even surprised to hear that select is broken, considering that supervisord used select before the poller patch and didn't have these problems.

@jaredsuttles (Author)

@mnaberez Sorry, I simply meant that the other pollers have the same 100% CPU usage issue, so they would also need to be dealt with if you were to use #903. You can test this by setting, for instance, Poller = SelectPoller at the end of poller.py.
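
For example, a hypothetical override appended at the very end of supervisor/poller.py (assuming the module auto-selects a Poller class above that point) would force the select()-based implementation regardless of platform:

# Hypothetical one-line override for testing only, placed after the module's
# own poller selection logic so that it wins:
Poller = SelectPoller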

I haven't really looked at the code prior to that merge so I can't really say one way or the other.

@mhariri: this comment has been minimized.

@mnaberez: this comment has been minimized.

mnaberez closed this Mar 21, 2017
mnaberez reopened this Mar 21, 2017
@mnaberez (Member)

@jaredsuttles Thanks for the clarification above on #903. I hope to keep the kqueue poller, otherwise we have to reopen the other issues it was implemented to address. I will test this patch and #885 locally soon and if there are no issues, I'll try to make a bug fix release with them.

@mnaberez (Member) left a comment

I'm unclear on the changes to PollPoller in unregister_readable() and unregister_writable().

self._poller.unregister(fd)
if fd in self.writables:
    self._poller.register(fd, self.WRITE)

@mnaberez (Member)

If I am reading this correctly, it seems like unregistering a readable causes all writables to be (re)registered(?). Could you please explain this? Same thing in unregister_writable below.

@jaredsuttles (Author)

In kqueue you can unregister a file descriptor specifically for read or for write but, as far as I can tell, you cannot do that with poll. The workaround is to re-register the descriptor for write if we just unregistered it for read, and vice versa.
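
For anyone following along, here is a small standalone sketch using the standard library select module directly (not supervisor's poller classes) of the asymmetry described above: kqueue can delete just the read filter on a descriptor, while poll's unregister() drops the descriptor entirely, so keeping write interest means registering the descriptor again with only the write mask.

import os
import select

r_fd, w_fd = os.pipe()

# kqueue (BSD/macOS): read and write interest are separate (fd, filter) pairs,
# so the READ filter can be deleted without touching a WRITE filter on the same fd.
if hasattr(select, 'kqueue'):
    kq = select.kqueue()
    kq.control([select.kevent(r_fd, select.KQ_FILTER_READ, select.KQ_EV_ADD)], 0)
    kq.control([select.kevent(r_fd, select.KQ_FILTER_READ, select.KQ_EV_DELETE)], 0)

# poll: one event mask per fd.  unregister() removes the fd completely, so
# "stop watching for reads but keep watching for writes" becomes unregister
# followed by a fresh register with only POLLOUT.
if hasattr(select, 'poll'):
    p = select.poll()
    p.register(w_fd, select.POLLIN | select.POLLOUT)
    p.unregister(w_fd)
    p.register(w_fd, select.POLLOUT)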

@mnaberez (Member)

Got it, thanks for clarifying.

@mnaberez (Member) left a comment

Looks good. I'll test this locally and will merge it if I don't find any issues.

@FAKERINHEART

I tried this branch. I found that when I used the commands 'stop program_name' and 'start program_name', it raised an 'httplib.BadStatusLine' error.

(screenshot of the httplib.BadStatusLine traceback)

My python version is 2.7.12

@jaredsuttles (Author)

@FAKERINHEART I'm not able to reproduce. This doesn't happen on master?

@mnaberez (Member)

@FAKERINHEART I'm testing this change locally and I'm also not able to see the error you posted. Can you please help us reproduce it or let us know if you're no longer seeing it?

@FAKERINHEART

The 'BadStatusLine' error was my own mistake. I'm sorry I forgot to reply about this sooner.

@mnaberez (Member)

No problem, thanks for letting us know!

@mhariri commented Apr 16, 2017

When can we get this merged, @mnaberez?

@mnaberez (Member)

I tested this patch locally and it worked for me. I plan to merge and release it next week.

@mnaberez (Member)

@plockaby tried this patch and said in #807 (comment):

That patch did not solve my problem. I almost immediately had one of the updated systems (not the same one as before) go into a poll craziness:

poll([{fd=4, events=POLLIN|POLLPRI|POLLHUP}, {fd=5, events=POLLIN|POLLPRI|POLLHUP}, {fd=8, events=POLLIN|POLLPRI|POLLHUP}, {fd=10, events=POLLIN|POLLPRI|POLLHUP}, {fd=11, events=POLLIN|POLLPRI|POLLHUP}, {fd=15, events=POLLIN|POLLPRI|POLLHUP}, {fd=16, events=POLLIN|POLLPRI|POLLHUP}, {fd=20, events=POLLIN|POLLPRI|POLLHUP}, {fd=21, events=POLLIN|POLLPRI|POLLHUP}, {fd=25, events=POLLIN|POLLPRI|POLLHUP}, {fd=26, events=POLLIN|POLLPRI|POLLHUP}, {fd=30, events=POLLIN|POLLPRI|POLLHUP}], 12, 1000) = 1 ([{fd=5, revents=POLLIN}])
gettimeofday({1492367896, 593446}, NULL) = 0
gettimeofday({1492367896, 593515}, NULL) = 0
gettimeofday({1492367896, 593584}, NULL) = 0
gettimeofday({1492367896, 593637}, NULL) = 0
gettimeofday({1492367896, 593719}, NULL) = 0
gettimeofday({1492367896, 593788}, NULL) = 0
gettimeofday({1492367896, 593826}, NULL) = 0
gettimeofday({1492367896, 593863}, NULL) = 0
gettimeofday({1492367896, 593886}, NULL) = 0
gettimeofday({1492367896, 593906}, NULL) = 0
gettimeofday({1492367896, 593948}, NULL) = 0
gettimeofday({1492367896, 593987}, NULL) = 0
gettimeofday({1492367896, 594031}, NULL) = 0
wait4(-1, 0x7ffe0f36f7e4, WNOHANG, NULL) = 0
gettimeofday({1492367896, 594103}, NULL) = 0
poll([{fd=4, events=POLLIN|POLLPRI|POLLHUP}, {fd=5, events=POLLIN|POLLPRI|POLLHUP}, {fd=8, events=POLLIN|POLLPRI|POLLHUP}, {fd=10, events=POLLIN|POLLPRI|POLLHUP}, {fd=11, events=POLLIN|POLLPRI|POLLHUP}, {fd=15, events=POLLIN|POLLPRI|POLLHUP}, {fd=16, events=POLLIN|POLLPRI|POLLHUP}, {fd=20, events=POLLIN|POLLPRI|POLLHUP}, {fd=21, events=POLLIN|POLLPRI|POLLHUP}, {fd=25, events=POLLIN|POLLPRI|POLLHUP}, {fd=26, events=POLLIN|POLLPRI|POLLHUP}, {fd=30, events=POLLIN|POLLPRI|POLLHUP}], 12, 1000) = 1 ([{fd=5, revents=POLLIN}])
gettimeofday({1492367896, 594331}, NULL) = 0
gettimeofday({1492367896, 594415}, NULL) = 0
gettimeofday({1492367896, 594491}, NULL) = 0
gettimeofday({1492367896, 594561}, NULL) = 0
gettimeofday({1492367896, 594642}, NULL) = 0
gettimeofday({1492367896, 594663}, NULL) = 0
gettimeofday({1492367896, 594699}, NULL) = 0
gettimeofday({1492367896, 594739}, NULL) = 0
gettimeofday({1492367896, 594769}, NULL) = 0
gettimeofday({1492367896, 594808}, NULL) = 0
gettimeofday({1492367896, 594836}, NULL) = 0
gettimeofday({1492367896, 594887}, NULL) = 0
gettimeofday({1492367896, 594934}, NULL) = 0
wait4(-1, 0x7ffe0f36f7e4, WNOHANG, NULL) = 0
gettimeofday({1492367896, 595005}, NULL) = 0

lr-x------ 1 root root 64 Apr 16 10:37 0 -> /dev/null
lrwx------ 1 root root 64 Apr 16 10:37 1 -> socket:[36606016]
lr-x------ 1 root root 64 Apr 16 10:37 10 -> pipe:[36606043]
lr-x------ 1 root root 64 Apr 16 10:37 11 -> pipe:[36606045]
l-wx------ 1 root root 64 Apr 16 10:37 12 -> /data/logs/supervisor/supermon.log
l-wx------ 1 root root 64 Apr 16 10:37 13 -> /data/logs/supervisor/supermon.err
l-wx------ 1 root root 64 Apr 16 10:37 14 -> pipe:[36606047]
lr-x------ 1 root root 64 Apr 16 10:37 15 -> pipe:[36606046]
lr-x------ 1 root root 64 Apr 16 10:37 16 -> pipe:[36606048]
l-wx------ 1 root root 64 Apr 16 10:37 17 -> /data/logs/supervisor/supercron.log
l-wx------ 1 root root 64 Apr 16 10:37 18 -> /data/logs/supervisor/supercron.err
l-wx------ 1 root root 64 Apr 16 10:37 19 -> pipe:[36606050]
lrwx------ 1 root root 64 Apr 16 10:37 2 -> socket:[36606016]
lr-x------ 1 root root 64 Apr 16 10:37 20 -> pipe:[36606049]
lr-x------ 1 root root 64 Apr 16 10:37 21 -> pipe:[36606051]
l-wx------ 1 root root 64 Apr 16 10:37 22 -> /data/logs/supervisor/dart-agent.log
l-wx------ 1 root root 64 Apr 16 10:37 24 -> pipe:[36606053]
lr-x------ 1 root root 64 Apr 16 10:37 25 -> pipe:[36606052]
lr-x------ 1 root root 64 Apr 16 10:37 26 -> pipe:[36606054]
l-wx------ 1 root root 64 Apr 16 10:37 27 -> /data/logs/supervisor/pgcheck.log
l-wx------ 1 root root 64 Apr 16 10:37 28 -> /data/logs/supervisor/pgcheck.err
l-wx------ 1 root root 64 Apr 16 10:37 3 -> /data/logs/supervisord.log
lr-x------ 1 root root 64 Apr 16 10:37 30 -> pipe:[36606055]
l-wx------ 1 root root 64 Apr 16 10:37 32 -> /data/logs/supervisor/pgwatch.log
l-wx------ 1 root root 64 Apr 16 10:37 33 -> /data/logs/supervisor/pgwatch.err
lrwx------ 1 root root 64 Apr 16 10:37 4 -> socket:[36606038]
l-wx------ 1 root root 64 Apr 16 10:37 5 -> /data/logs/supervisor/dart-agent.err
l-wx------ 1 root root 64 Apr 16 10:37 6 -> pipe:[36606041]
lr-x------ 1 root root 64 Apr 16 10:37 7 -> /dev/urandom
lr-x------ 1 root root 64 Apr 16 10:37 8 -> pipe:[36606042]
l-wx------ 1 root root 64 Apr 16 10:37 9 -> pipe:[36606044]

@plockaby (Contributor)

I feel it may be prudent to add to this thread that this 100% CPU problem only started happening when I added minfds and childlogdir to my configurations. Of course, I haven't tried removing those to see if the problem goes away, but I will try to do that today.

@jaredsuttles (Author)

@mnaberez @plockaby It sounds like this is a separate (but similar) issue. This PR still fixes the polling on macOS.

@plockaby (Contributor)

I agree that it is separate. Don't let my problem hold up merging this pull request.

@mnaberez (Member)

This is what I tried to reproduce the high CPU usage:

Minimal supervisord.conf:

[supervisord]
logfile = /tmp/supervisord.log
loglevel = blather

[inet_http_server]
port = 127.0.0.1:9001

[supervisorctl]
serverurl = http://127.0.0.1:9001

[rpcinterface:supervisor]
supervisor.rpcinterface_factory = supervisor.rpcinterface:make_main_rpcinterface

Start supervisord in the foreground with that config:

$ supervisord -n -c /path/to/supervisord.conf

Open a web browser to http://127.0.0.1:9001. The supervisord log in the foreground will start spewing lines like BLAT read event and won't stop. I believe this is the high CPU usage.

I was able to see this on macOS Sierra (10.12.4) with Python 2.7.13 using the current master (3d128e4), and applying this patch fixed it.

I also tried this on Ubuntu Desktop 16.04.1 LTS with Python 2.7.12. On that system, I wasn't able to see the problem before or after the patch.

mnaberez merged commit 610e832 into Supervisor:master Apr 20, 2017
@dkrieger

@plockaby I'm also seeing poll craziness (a very similar-looking strace) and 100% CPU usage on a Linux Rackspace box. Did you open a separate issue for the problem you found? I'm running 3.3.3 FWIW, Python 2.7.6.

@plockaby (Contributor)

@dkrieger I worked around my problem by making my program's logger write to syslog and not using supervisor to handle the logging from the event handler. I think I just hijacked other 100% CPU issues rather than opening my own. Look at #807.
