Unresponsive processes not restarted #1

Can the processes that become unresponsive and eat 100% CPU be restarted? They are marked as DOWN by haproxy.

As far as I understand, haproxy just handles the requests and redirects them to live workers, and it is someone else's job to restart them. But docker cannot restart them because it does not know that they are down, right? I know that uwsgi, for example, has such a feature; perhaps docker and haproxy can be made to behave in a similar way?

Comments
Thinking about it more, perhaps it would be better to fix it on the Splash side: this way it will benefit anyone running a single Splash instance.
Yeah. Even worse, even if you kill them and restart them manually, they are not picked up by haproxy, because the haproxy docker container doesn't detect the new IP addresses of linked containers; I think this is the first issue to solve. It should be possible to fix since haproxy 1.6 (http://blog.haproxy.com/2015/10/14/whats-new-in-haproxy-1-6/), which provides a way to resolve hostnames at runtime. I haven't got to it yet; help is appreciated :)
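For reference, a rough sketch of what that could look like with haproxy 1.6+ and Docker's embedded DNS; the resolver address, backend name and server names below are assumptions, not aquarium's actual configuration:

```
# Hypothetical haproxy.cfg fragment: re-resolve container hostnames at runtime,
# so a recreated splash container with a new IP is picked up without a reload.
resolvers docker
    nameserver dns1 127.0.0.11:53   # Docker's embedded DNS (user-defined networks)
    resolve_retries 3
    timeout retry 1s
    hold valid 10s

backend splash
    balance roundrobin
    server splash0 splash0:8050 check resolvers docker resolve-prefer ipv4
    server splash1 splash1:8050 check resolvers docker resolve-prefer ipv4
```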
Are you thinking about a built-in watchdog process?
More like a feature of a Splash instance that would check whether rendering is taking too long and, if so, try to kill the call to the browser engine and re-initialize it. I'm not sure if this is possible.
There is a 'timeout' feature, but it works only if rendering yields to the event loop, which is not the case when we hit some Qt bug.
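To illustrate the point, a minimal sketch with plain Twisted (not Splash's actual code): a timeout scheduled on the event loop with `callLater` cannot fire while the loop itself is stuck inside a blocking call.

```python
# Sketch only: why an event-loop timeout is useless against a call that never
# yields back to the reactor (e.g. a rendering call stuck in a Qt bug).
import time
from twisted.internet import reactor

def on_timeout():
    print("timeout fired")          # intended to fire after 1 s

def blocking_render():
    time.sleep(10)                  # stand-in for a hung native call

reactor.callLater(0, blocking_render)   # blocks the whole event loop for 10 s
reactor.callLater(1, on_timeout)        # can only run once the loop is free again
reactor.callLater(12, reactor.stop)
reactor.run()
# "timeout fired" is printed about 10 s in, right after blocking_render() returns,
# not at the 1 s mark: the timeout cannot interrupt the blocked loop.
```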
Yeah, maybe we could launch some thread or use signals here. If it is possible to kill and restart the browser engine, it should be pretty robust.
It could be tricky because we're running several QWebViews in the same event loop, so all concurrent renders happen in a single thread; there may be some tasks launched by Qt that run in different threads, but AFAIK by default most things we have control over are single-threaded. There could be a workaround (it's worth checking), but I'm a bit pessimistic. Aquarium is Splash with multi-processing.
I see, thanks! I'll try to check if there is a way :)
Another option that seems easier, and that also seems to solve the problem for Aquarium (with the new haproxy feature that you mentioned), is to exit if processing takes too much time.
So, a watchdog process? Also, it seems these checks can be implemented in haproxy itself (it can run external scripts), but I haven't checked the details.
What I meant was more like a watchdog thread inside the Splash server, but now I understand that this is not that good: it would be complicated to do anything except kill the process in this case, and it does not bring much benefit for standalone Splash, only complicates it.
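For what it's worth, the kind of watchdog thread discussed above might look roughly like this (a hypothetical sketch, not part of Splash; `last_progress` and the 60-second limit are assumed names and values). Since cleanly interrupting a call stuck inside Qt from another thread is doubtful, the sketch simply exits the process and relies on something external (docker, a supervisor, or haproxy plus a restart script) to bring the instance back:

```python
import os
import threading
import time

# Hypothetical bookkeeping: the rendering code would update this timestamp
# whenever it makes progress (e.g. between renders).
last_progress = time.time()
RENDER_LIMIT = 60  # seconds without progress before giving up (assumed value)

def watchdog():
    while True:
        time.sleep(5)
        if time.time() - last_progress > RENDER_LIMIT:
            # A call stuck inside Qt cannot be interrupted safely, so exit hard
            # and let whatever supervises the process restart it.
            os._exit(1)

# daemon=True so the watchdog never blocks a normal shutdown.
threading.Thread(target=watchdog, daemon=True).start()
```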
Any update on this?
@Glennvd no updates, sorry.
@Glennvd I've been having the same issue. In my case, I discovered that Xvfb is the process failing and thus not letting Splash do its job. If you log into one of the DOWN containers and look at its processes, you should see that Xvfb is no longer running. Now, simply restarting the Xvfb process brings the instance back. The question is, how do we manage to make Xvfb recover automatically? I'm going to file an issue on the Splash repository, as this isn't a bug in aquarium.
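Roughly, the check looks like this; the container name, the availability of `ps` inside the image, and the Xvfb display number are all assumptions here, not the exact commands from the original report:

```sh
# Illustrative only: adjust the container name to whatever `docker ps` shows.
docker exec -it aquarium_splash0_1 /bin/bash   # log into one of the DOWN containers
ps aux | grep -i xvfb                          # inside the container: is Xvfb still alive?
# If Xvfb is gone, restart it (the display number is an assumption):
Xvfb :99 -screen 0 1024x768x24 &
```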
Sorry, this actually seems to be a side effect of the machines crashing. If I log into one of the crashed containers immediately after it dies, Xvfb seems to still be up and running. After a while it crashes, and I guess restarting it triggers a Splash reload.
Lopuhin, what did you end up doing about this issue? I'm thinking of just setting up a cron job to restart the docker containers a few times a day.
@landoncope I did the same... a crontab entry that restarts everything every hour (yes, it had to be that frequent). Hopefully there's a better fix.
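For anyone taking the same route, the crontab entry is along these lines (the compose file location is an assumption; adjust the schedule as needed):

```sh
# Illustrative crontab entry: restart the whole Aquarium stack every hour.
0 * * * * cd /path/to/aquarium && docker-compose restart >> /var/log/aquarium-restart.log 2>&1
```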
I am also having the same issue. I have automated fetching the page source for 1.5 lakh (150,000) URLs with a 5-instance Splash Aquarium, which takes almost 15 hours to complete. After a while, a few instances went down and were not restarted. Is restarting Aquarium the only way to solve this issue? Is there any other solution?
I'm not aware of any new solutions, but with Splash 3.2 reliability is much better in our experience.
Is there any API call to check externally whether the Splash instances are down, instead of viewing the HAProxy statistics? If there is, we can restart Aquarium only when an instance is down; no need to restart periodically.
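One way to avoid eyeballing the stats page: HAProxy can serve the same statistics as CSV (append `;csv` to the stats URI), which is easy to poll from a script. A rough sketch, assuming the stats page is reachable at http://localhost:8036/ with basic-auth credentials taken from haproxy.cfg (both the address and the credentials are assumptions, check your config):

```python
# Sketch: poll HAProxy's stats CSV and report which splash servers are not UP.
import csv
import io

import requests

STATS_URL = "http://localhost:8036/;csv"   # assumption: take from your haproxy.cfg
AUTH = ("admin", "admin")                  # assumption: stats auth credentials

resp = requests.get(STATS_URL, auth=AUTH, timeout=10)
resp.raise_for_status()

# The header row starts with "# "; strip that so DictReader sees clean field names.
reader = csv.DictReader(io.StringIO(resp.text.lstrip("# ")))
down = [
    row["svname"]
    for row in reader
    if row["svname"] not in ("FRONTEND", "BACKEND")
    and not row["status"].startswith("UP")
]

if down:
    print("DOWN instances:", ", ".join(down))  # restart Aquarium only in this case
else:
    print("all instances are UP")
```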
What's the way to restart this, say, once every 4 hours? |