Unresponsive processes not restarted #1

Open
lopuhin opened this issue May 13, 2016 · 21 comments

@lopuhin

lopuhin commented May 13, 2016

Can the processes that become unresponsive and are eating 100% CPU be restarted? They are marked as DOWN by haproxy:
[Screenshot: 2016-05-13 12 17 44]

As far as I understand, haproxy just handles the requests and redirects them to live workers, and it is someone else's job to restart them. But docker cannot restart them because it does not know that they are down, right? I know that uwsgi, for example, has such a feature, but perhaps docker and haproxy can be made to behave in a similar way?

@lopuhin
Author

lopuhin commented May 13, 2016

Thinking about it more, perhaps it would be better to fix it on the splash side - this way it will benefit anyone running a single splash instance.

@kmike
Contributor

kmike commented May 13, 2016

Yeah. Even worse, even if you kill them and restart them manually, they are not picked up by haproxy, because the haproxy docker container doesn't detect the new IP addresses of linked containers; I think this is the first issue to solve. It should be possible to fix since haproxy 1.6 (http://blog.haproxy.com/2015/10/14/whats-new-in-haproxy-1-6/), which provides a way to resolve hostnames at runtime. I haven't got to it yet; help is appreciated :)

@kmike
Contributor

kmike commented May 13, 2016

Are you thinking about a builtin watchdog process?

@lopuhin
Author

lopuhin commented May 13, 2016

More like a feature of a splash instance that would check if rendering is taking too long and will try to kill the call to the browser engine and re-initialize it. I'm not sure if this is possible.

@kmike
Contributor

kmike commented May 13, 2016

There is a 'timeout' feature, but it works only if rendering yields to the event loop, which is not the case when we hit some Qt bug.
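
To illustrate why a timeout that lives in the event loop cannot cut short a call that never yields, here is a small sketch. It uses Python's asyncio purely as a stand-in for the Qt event loop that Splash runs on, and `busy_render` is a hypothetical placeholder for a render stuck inside the browser engine; none of this is Splash code.

```python
import asyncio
import time


async def busy_render(seconds: float) -> str:
    # Hypothetical stand-in for a render stuck in native code:
    # time.sleep() blocks the thread and never yields to the loop.
    time.sleep(seconds)
    return "rendered"


async def render_with_timeout(seconds: float, timeout: float) -> float:
    """Return how long the call really took despite the timeout."""
    started = time.monotonic()
    try:
        await asyncio.wait_for(busy_render(seconds), timeout=timeout)
    except asyncio.TimeoutError:
        pass  # Even if raised, it can only be raised after the block ends.
    return time.monotonic() - started


elapsed = asyncio.run(render_with_timeout(seconds=0.3, timeout=0.05))
print(f"timeout was 0.05s, call took {elapsed:.2f}s")
```

The timer callback can only run once the loop gets control back, so the call lasts as long as the blocking section regardless of the timeout value.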

@lopuhin
Author

lopuhin commented May 13, 2016

Yeah, maybe we could launch some thread or use signals here. If it is possible to kill and restart the browser engine, it should be pretty robust.

@kmike
Contributor

kmike commented May 13, 2016

It could be tricky because we're running several QWebViews in the same event loop => all concurrent renders happen in a single thread; there may be some tasks launched by Qt that are executed in different threads, but AFAIK by default most things we have control of are single-threaded. There could be a workaround (it's worth checking), but I'm a bit pessimistic. Aquarium is a Splash with multi-processing.

@lopuhin
Author

lopuhin commented May 13, 2016

I see, thanks! I'll try to check if there is a way :)

@lopuhin
Author

lopuhin commented May 13, 2016

Another option, which seems easier and also seems to solve the problem for Aquarium (with the new haproxy feature that you mentioned), is to exit if processing takes too much time, similar to the --max-rss option of splash.
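
A minimal sketch of the "exit if processing takes too long" idea, in plain Python rather than actual Splash code. The names `StallWatchdog`, `beat` and `on_stall` are hypothetical. It also assumes the stalled code releases the GIL so that the watchdog thread still gets to run, which may not hold for a render stuck inside Qt, and that something outside the process (for example a container restart policy) brings it back up after the exit.

```python
import os
import threading
import time


class StallWatchdog:
    """Call on_stall if beat() has not been called for max_silence seconds."""

    def __init__(self, max_silence, on_stall=None, poll_interval=0.05):
        self.max_silence = max_silence
        # Default action: hard exit so a supervisor can restart the process.
        self.on_stall = on_stall or (lambda: os._exit(1))
        self.poll_interval = poll_interval
        self._last_beat = time.monotonic()
        self._stop = threading.Event()
        self._thread = threading.Thread(target=self._run, daemon=True)

    def beat(self):
        # The main loop calls this whenever it makes progress.
        self._last_beat = time.monotonic()

    def start(self):
        self._thread.start()

    def stop(self):
        self._stop.set()

    def _run(self):
        # Event.wait() returns False on timeout, True once stop() is called.
        while not self._stop.wait(self.poll_interval):
            if time.monotonic() - self._last_beat > self.max_silence:
                self.on_stall()
                return
```

The server would call `beat()` from its main loop (for example on a periodic timer); if the loop stops turning for longer than `max_silence`, the process exits.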

@kmike
Contributor

kmike commented May 13, 2016

So, a watchdog process? Also, it seems these checks can be implemented in haproxy itself (it can run external scripts), but I haven't checked the details.

@lopuhin
Author

lopuhin commented May 13, 2016

What I meant was more like a watchdog thread inside the splash server, but now I understand that this is not that good - it would be complicated to do anything except for killing the process in this case, and it does not bring much benefit for the standalone splash, only complicates it.
So it seems more reasonable to do it in Aquarium.
This external-check Haproxy feature can be used to kill the server that does not respond to ping, perhaps.
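
As a rough sketch of the "kill the server that does not respond to ping" idea, done as a standalone Python script instead of a haproxy external check: the function names, the container names and the threshold of three consecutive failures are all assumptions for illustration. The only real external command used is `docker restart`. As noted earlier in the thread, haproxy may still not pick up a restarted container if its address changes.

```python
import subprocess
import time
from typing import Callable, Dict, Iterable


def restart_container(name: str) -> None:
    # `docker restart NAME` is the standard Docker CLI command.
    subprocess.run(["docker", "restart", name], check=False)


def watch_once(
    failures: Dict[str, int],
    is_alive: Callable[[str], bool],
    restart: Callable[[str], None],
    max_failures: int = 3,
) -> None:
    """Probe each container once; restart after max_failures misses in a row."""
    for name in failures:
        if is_alive(name):
            failures[name] = 0
            continue
        failures[name] += 1
        if failures[name] >= max_failures:
            restart(name)
            failures[name] = 0


def watch_forever(
    names: Iterable[str],
    is_alive: Callable[[str], bool],
    interval: float = 10.0,
) -> None:
    # names are hypothetical container names, e.g. "aquarium_splash0_1".
    failures = {name: 0 for name in names}
    while True:
        watch_once(failures, is_alive, restart_container)
        time.sleep(interval)
```

Requiring several consecutive failures avoids restarting an instance that is merely busy with one slow render.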

@Glennvd

Glennvd commented Jan 5, 2017

Any update on this?

@kmike
Contributor

kmike commented Jan 5, 2017

@Glennvd no updates, sorry.

@ale316

ale316 commented Jan 10, 2017

@Glennvd I've been having the same issue.

In my case, I discovered that Xvfb is the process failing and thus not letting splash do its job.

If you log into one of the DOWN containers with

docker-compose exec splash6 bash

and then

ps ax

you should see something like

  PID TTY      STAT   TIME COMMAND
    1 ?        Ssl    0:18 python3 /app/bin/splash --proxy-profiles-path ...
    8 ?        Z      0:00 [Xvfb] <defunct>
   48 ?        Ss     0:00 bash
   67 ?        R+     0:00 ps ax

Now, if you simply restart the Xvfb process with Xvfb :100 -screen 0 1024x768x24, it will recover.

The question is, how do we manage to make Xvfb recover automatically? I'm going to file an issue on the Splash repository as this isn't a bug in aquarium.

@ale316

ale316 commented Jan 10, 2017

Sorry, this actually seems to be a side effect of the machines crashing. If I log into one of the crashed containers immediately after it dies, Xvfb seems to still be up and running. After a while it crashes, and I guess restarting it triggers a Splash reload.

@landoncope

Lopuhin, what did you end up doing about this issue? I'm thinking of just setting a cron job to restart the docker containers a few times a day.

@cristianocca

@landoncope I did the same... a crontab entry that restarts every hour (yes, it had to be that frequent).

Hopefully there's a better fix.

@Mideen

Mideen commented Nov 13, 2018

I am also having the same issue. I have automated fetching the page source of 1.5 lakh (150,000) URLs with a 5-instance Splash Aquarium setup. It takes almost 15 hours to complete. After it completed, a few instances went down, and these instances were not restarted.

Is restarting Aquarium the only way to solve this issue? Is there any other solution?

@lopuhin
Author

lopuhin commented Nov 13, 2018

I'm not aware of any new solutions, but with splash 3.2 reliability is much better in our experience.

@Mideen

Mideen commented Nov 13, 2018

Is there any API call to check externally whether the splash instances are down, instead of viewing the HAProxy statistics? If there is, we can restart Aquarium only when an instance is down, with no need to restart periodically.
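
One possible way to check instances from outside, sketched under assumptions: Splash exposes a `/_ping` HTTP endpoint, and this sketch assumes it answers with a JSON object containing `"status": "ok"` when the instance is healthy. The function names and the example URLs are hypothetical, and each instance must be reachable directly (not only through haproxy) for the per-instance result to mean anything.

```python
import http.client
import json
import urllib.request
from typing import Callable, Iterable, List


def ping(base_url: str, timeout: float = 5.0) -> bool:
    """Return True if GET <base_url>/_ping answers "ok" within timeout."""
    try:
        with urllib.request.urlopen(f"{base_url}/_ping", timeout=timeout) as resp:
            data = json.load(resp)
    except (OSError, ValueError, http.client.HTTPException):
        # Connection refused, timeout, HTTP error or a non-JSON body.
        return False
    return isinstance(data, dict) and data.get("status") == "ok"


def find_down(
    base_urls: Iterable[str],
    check: Callable[[str], bool] = ping,
) -> List[str]:
    """Return the URLs of the instances that failed the check."""
    return [url for url in base_urls if not check(url)]


# Hypothetical usage; the host names and port are assumptions:
# down = find_down(["http://splash0:8050", "http://splash1:8050"])
# if down:
#     print("down instances:", down)
```

A short timeout matters here, because an instance stuck at 100% CPU may accept the connection and then never answer.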

@davidkong0987

What's the way to restart this, say, once every 4 hours?
