-
Notifications
You must be signed in to change notification settings - Fork 667
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[4.x] Prevent horizon:purge from killing too many processes #820
[4.x] Prevent horizon:purge from killing too many processes #820
Conversation
This change modifies the master supervisor basename to include the hash of the current directory. The process inspector that determines the processes to kill uses this hash to only look for processes that were created by the Horizon command running in the current Laravel instance's directory, not by all Laravel instances running on the server. This fixes #769.
This change modifies the master supervisor basename to include the hash of the current directory. The process inspector that determines the processes to kill uses this hash to only look for processes that were created by the Horizon command running in the current Laravel instance's directory, not by all Laravel instances running on the server. This fixes #769.
…-instances-v4' into H/rafalglowacz/horizon-purge-killing-processes-of-other-instances
I had to make a merge and a few style fixes. Should I make a rebase now or will that come at the end if the PR gets approved? |
I will note that only one master Horizon process is meant to be running per server. |
@rafalglowacz Any idea why not just the changes in this PR? #822 |
@rafalglowacz can you give us some feedback on #822 |
I'll check it out tomorrow. I thought I'd be able to have a look today, but something else came up. |
I tried the code in #822 and the effect was the same as before the changes - the Horizon processes of another instance got killed. When the process inspector in #822 looks for current processes, it first gets the processes including the path hash, but one line later it adds all masters. So it won't individually kill every worker of other Laravel apps, but it will kill their masters, so the end result is the same. I guess you want to keep it as simple as possible. I checked if I can use #822 and just add mastersWithoutSupervisors from my PR to prevent killing valid masters. That works but then rogue supervisors don't get terminated. Hence the changes in ProcessInspector@monitoring and the new method ProcessInspector@monitoredMastersWithSupervisors in my PR. |
@rafalglowacz But it won't kill other masters since these masters are stored in Redis and we have a reference to these, so even masters for other projects will be returned and thus will be considered as valid. |
Are you referring to this bit in ProcessInspector@monitoring? ->merge(
Arr::pluck(app(MasterSupervisorRepository::class)->all(), 'pid')
)->all(); Other projects will use separate Redis databases, so they won't be returned by this method called inside the current project. |
@rafalglowacz and where in this PR do you collect these processes that use a different redis database? |
Ideally each project would be able to kill its own rogue processes. The path hash takes care of that for workers and supervisors, but that leaves the master supervisors. That's why ProcessInspector@mastersWithoutSupervisors collects all masters that don't have any child processes (supervisors). So that's the only place dealing with processes of other projects, i.e. processes that use a different database. |
@rafalglowacz what about rogue masters that has supervisors. We should be able to purge those anyway. |
…s-of-other-instances
I'm not familiar with that case. How would it manifest itself? Would there be a master process with supervisor processes, but no info about them in |
You say each app may use a different redis connection/database. That's why I originally shared that purging processes 100% is not possible since people can configure each app to read from a different connection so we won't be able to collect every valid process running on the server. I keep recommending not to run multiple queue heavy apps on the same server because if redis crashes by 1 app the queue for the rest of apps will stop. That's why the purge command doesn't really care if it's killing other valid processes, they'll start again anyway but having them running on a single server is not recommended in the first place. Anyway, I think your PR isn't really fixing it completely since a rogue master with supervisors will still be running. Killing processes of other apps isn't a problem really, the supervisors will start these processes again so no harm is done. |
The problem my company originally had was only with rogue processes like in #178
So as far as I can see, if the master process is still there and the other processes are inside it, it means one of two things:
So in what case is there a regression compared to the current |
This PR breaks our horizon routine, which is as follows:
Even when using a custom nameResolver the path is still appended. For now we have reverted to v4.2.1, but I'm wondering what your thoughts on this breaking change are! |
@koenvu THANK YOU! I just spend 2 hours on the phone with a colleague trying to figure out what the hell was wrong with our horizon. "Why are we not able to terminate horizon on this app, while all the other ones are working fine." Looking at the basename method in the code and then thinking "wait, this should not actually work anywhere", to finally realize that we were running a different version to then find this. What a journey! |
Summary
This fixes the problem described in #769. To recap: if there's more than one Laravel application on one server, each with their own Horizon instance, running
php artisan horizon:purge
for one of the instances will kill valid Horizon processes of other instances.Workaround
The issue is supposed to be alleviated by using a supervisor daemon which will restart all processes after they get killed, but
Proposed fix
MasterSupervisor: the hash
My company gave me some time to work on fixing the issue. The fix is based around adding an extra bit of information to the master supervisor's base name. It's a 4-character hash, similar to the random one already appended to base name, but this one instead of being random is based on the current directory, so will remain constant for the current Laravel application.
ProcessInspector: determining current processes
Instead of just doing
pgrep -f [h]orizon
which will also find processes of other Laravel apps, we only look for processes that include the current master supervisor base name. All supervisors and workers include it, but master supervisors themselves don't, so ProcessInspector@current will not return all the results we need. That's why we now also have ProcessInspector@mastersWithoutSupervisors - it returns ids of master supervisor processes that don't have supervisor processes underneath them. It can't determine whether the masters are from current Laravel app or not, but they don't have the expected child processes, so this time it's not a problem.Testing the PR
In order to see the described behavior before and after the PR, you can use the following steps:
ps -fx
) of one of them and kill it with withkill -9
php artisan horizon:purge
on the app where you killed the master. Before the PR, it will kill not just the rouge processes of this instance, but also the valid processes of the other one. After the PR, it will just kill the rouge processes of the current instanceThat's it, let me know what you think!