Unresponsive processes not restarted #1

Can the processes that become unresponsive and eat 100% CPU be restarted? They are marked as DOWN by haproxy.

As far as I understand, haproxy just handles the requests and redirects them to live workers, and it is someone else's job to restart them. But docker cannot restart them because it does not know that they are down, right? I know that uwsgi, for example, has such a feature; perhaps docker and haproxy can be made to behave in a similar way?

Comments
Thinking about it more, perhaps it would be better to fix it on the Splash side: this way it will benefit anyone running a single Splash instance.
Yeah. Even worse, even if you kill them and restart them manually, they are not picked up by haproxy, because the haproxy docker container doesn't detect the new IP addresses of linked containers; I think this is the first issue to solve. It should be possible to fix since haproxy 1.6 (http://blog.haproxy.com/2015/10/14/whats-new-in-haproxy-1-6/), which provides a way to resolve hostnames at runtime. I haven't got to it yet; help is appreciated :)
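For reference, a rough sketch of what that could look like with haproxy 1.6+ and Docker's embedded DNS; the resolver address, backend name and server names below are assumptions, not aquarium's actual configuration:

```
# Hypothetical haproxy.cfg fragment: re-resolve container hostnames at runtime,
# so a recreated splash container with a new IP is picked up without a reload.
resolvers docker
    nameserver dns1 127.0.0.11:53   # Docker's embedded DNS (user-defined networks)
    resolve_retries 3
    timeout retry 1s
    hold valid 10s

backend splash
    balance roundrobin
    server splash0 splash0:8050 check resolvers docker resolve-prefer ipv4
    server splash1 splash1:8050 check resolvers docker resolve-prefer ipv4
```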
Are you thinking about a built-in watchdog process?
More like a feature of a Splash instance that would check whether rendering is taking too long and, if so, try to kill the call to the browser engine and re-initialize it. I'm not sure if this is possible.
There is a 'timeout' feature, but it works only if rendering yields to the event loop, which is not the case when we hit some Qt bug.
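To illustrate the point, a minimal sketch with plain Twisted (not Splash's actual code): a timeout scheduled on the event loop with `callLater` cannot fire while the loop itself is stuck inside a blocking call.

```python
# Sketch only: why an event-loop timeout is useless against a call that never
# yields back to the reactor (e.g. a rendering call stuck in a Qt bug).
import time
from twisted.internet import reactor

def on_timeout():
    print("timeout fired")          # intended to fire after 1 s

def blocking_render():
    time.sleep(10)                  # stand-in for a hung native call

reactor.callLater(0, blocking_render)   # blocks the whole event loop for 10 s
reactor.callLater(1, on_timeout)        # can only run once the loop is free again
reactor.callLater(12, reactor.stop)
reactor.run()
# "timeout fired" is printed about 10 s in, right after blocking_render() returns,
# not at the 1 s mark: the timeout cannot interrupt the blocked loop.
```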
Yeah, maybe we could launch some thread or use signals here. If it is possible to kill and restart the browser engine, it should be pretty robust.
It could be tricky because we're running several QWebViews in the same event loop, so all concurrent renders happen in a single thread; there may be some tasks launched by Qt that run in different threads, but AFAIK by default most things we have control over are single-threaded. There could be a workaround (it's worth checking), but I'm a bit pessimistic. Aquarium is Splash with multi-processing.
I see, thanks! I'll try to check if there is a way :)
Another option that seems easier, and that also seems to solve the problem for Aquarium (with the new haproxy feature that you mentioned), is to exit if processing takes too much time.
So, a watchdog process? Also, it seems these checks can be implemented in haproxy itself (it can run external scripts), but I haven't checked the details.
What I meant was more like a watchdog thread inside the Splash server, but now I understand that this is not that good: it would be complicated to do anything except kill the process in this case, and it does not bring much benefit for standalone Splash, only complicates it.
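For what it's worth, the kind of watchdog thread discussed above might look roughly like this (a hypothetical sketch, not part of Splash; `last_progress` and the 60-second limit are assumed names and values). Since cleanly interrupting a call stuck inside Qt from another thread is doubtful, the sketch simply exits the process and relies on something external (docker, a supervisor, or haproxy plus a restart script) to bring the instance back:

```python
import os
import threading
import time

# Hypothetical bookkeeping: the rendering code would update this timestamp
# whenever it makes progress (e.g. between renders).
last_progress = time.time()
RENDER_LIMIT = 60  # seconds without progress before giving up (assumed value)

def watchdog():
    while True:
        time.sleep(5)
        if time.time() - last_progress > RENDER_LIMIT:
            # A call stuck inside Qt cannot be interrupted safely, so exit hard
            # and let whatever supervises the process restart it.
            os._exit(1)

# daemon=True so the watchdog never blocks a normal shutdown.
threading.Thread(target=watchdog, daemon=True).start()
```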
Any update on this?
@Glennvd no updates, sorry.
@Glennvd I've been having the same issue. In my case, I discovered that Xvfb is the process failing and thus not letting Splash do its job. If you log into one of the DOWN containers and look at its processes, you should see that Xvfb is no longer running. Now, simply restarting the Xvfb process brings the instance back. The question is, how do we manage to make Xvfb recover automatically? I'm going to file an issue on the Splash repository, as this isn't a bug in aquarium.
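Roughly, the check looks like this; the container name, the availability of `ps` inside the image, and the Xvfb display number are all assumptions here, not the exact commands from the original report:

```sh
# Illustrative only: adjust the container name to whatever `docker ps` shows.
docker exec -it aquarium_splash0_1 /bin/bash   # log into one of the DOWN containers
ps aux | grep -i xvfb                          # inside the container: is Xvfb still alive?
# If Xvfb is gone, restart it (the display number is an assumption):
Xvfb :99 -screen 0 1024x768x24 &
```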
Sorry, this actually seems to be a side effect of the machines crashing. If I log into one of the crashed containers immediately after it dies, Xvfb seems to still be up and running. After a while it crashes, and I guess restarting it triggers a Splash reload.
Lopuhin, what did you end up doing about this issue? I'm thinking of just setting up a cron job to restart the docker containers a few times a day.
@landoncope I did the same... a crontab entry that restarts everything every hour (yes, it had to be that frequent). Hopefully there's a better fix.
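For anyone taking the same route, the crontab entry is along these lines (the compose file location is an assumption; adjust the schedule as needed):

```sh
# Illustrative crontab entry: restart the whole Aquarium stack every hour.
0 * * * * cd /path/to/aquarium && docker-compose restart >> /var/log/aquarium-restart.log 2>&1
```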
I am also having the same issue. I have automated fetching the page source for 1.5 lakh (150,000) URLs with a 5-instance Splash Aquarium, which takes almost 15 hours to complete. After a while, a few instances went down and were not restarted. Is restarting Aquarium the only way to solve this issue? Is there any other solution?
I'm not aware of any new solutions, but with Splash 3.2 reliability is much better in our experience.
Is there any API call to check externally whether the Splash instances are down, instead of viewing the HAProxy statistics? If there is, we can restart Aquarium only when an instance is down; no need to restart periodically.
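One way to avoid eyeballing the stats page: HAProxy can serve the same statistics as CSV (append `;csv` to the stats URI), which is easy to poll from a script. A rough sketch, assuming the stats page is reachable at http://localhost:8036/ with basic-auth credentials taken from haproxy.cfg (both the address and the credentials are assumptions, check your config):

```python
# Sketch: poll HAProxy's stats CSV and report which splash servers are not UP.
import csv
import io

import requests

STATS_URL = "http://localhost:8036/;csv"   # assumption: take from your haproxy.cfg
AUTH = ("admin", "admin")                  # assumption: stats auth credentials

resp = requests.get(STATS_URL, auth=AUTH, timeout=10)
resp.raise_for_status()

# The header row starts with "# "; strip that so DictReader sees clean field names.
reader = csv.DictReader(io.StringIO(resp.text.lstrip("# ")))
down = [
    row["svname"]
    for row in reader
    if row["svname"] not in ("FRONTEND", "BACKEND")
    and not row["status"].startswith("UP")
]

if down:
    print("DOWN instances:", ", ".join(down))  # restart Aquarium only in this case
else:
    print("all instances are UP")
```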
What's the way to restart this, say, once every 4 hours? |