How to recover in-process yet abandoned jobs? #159
Hey @wflanagan, we already have support for the situation you describe:
When you start a new supervisor, or simply periodically, the supervisor checks for processes that have died and deregisters them. Deregistering a process releases its claimed executions. You'd need to investigate why workers are getting deleted from the processes table.
@rosa Hey, thanks for the response. So, we likely don't get a SIGINT or any other signal before the process just terminates. This happens because the node or the process has violated some constraint, and the infrastructure simply kills the process and respawns it. The link you point to seems to cover a clean termination. When you're running on, for example, Spot instances on AWS, you don't get any sort of clean signal; the process is just gone. Are you saying it cleans up in this situation too, when there's no opportunity for a graceful shutdown?
Yes, exactly, that's the exact situation I designed for. This is what I copied from the link I provided, where both graceful and unclean termination are described, and also what I said in my comment:
When you start a new supervisor, or simply periodically, the supervisor checks for processes that have died and deregisters them. Deregistering a process releases its claimed executions. You'd need to investigate why workers are getting deleted from the processes table.
Ok, maybe it was the version. I had "running jobs" for 18 days that were connected to workers that were no longer present; that's what got me going on this. How often are they "reassigned"? Other than my code to query for the working/missing ones, we're running a VERY stock Solid Queue setup: 1 queue, expecting it to "just work." I'll rewatch it all and see if it resolves.
This has been like this from the first version 😕
This is what seems wrong, and I think it happened because of something external. For Solid Queue, it's not possible to delete workers without deleting their assigned jobs, because that's done in a transaction, via an after_destroy callback.
@wflanagan, so, for Solid Queue, it's not possible to delete workers without deleting their assigned jobs, because that's done in a transaction, via an after_destroy callback. If workers are deleted via destroy, no jobs would be left assigned. Is it possible for you to investigate what happened with the worker with ID 594396? I imagine not, since it's been quite a while.
Also, if you want to clear things up and wait until this happens again so you have logs and can investigate, you can do that in your Rails console as:
SolidQueue::ClaimedExecution.left_joins(:process).where(process: { id: nil }).find_each(&:release)
It's going to take a bit because you have so many, but it should release all those jobs. Alternatively, if you have more recent orphaned jobs like this, you might be able to find logs for those 🤔
So, when we run in Kubernetes, or any sort of cloud infrastructure, the infrastructure can just "go away" in a mostly unceremonious way. We used to have this often in Amazon, where we'd run on Spot instances: those just terminate and you're left with your WIP/in-process jobs hanging out there. That's what's happening here. The process goes over budget on memory or CPU requests, and when that happens, the process is killed.
It's my understanding from our previous conversation that you handle this case. But 1) I'm not seeing it, and 2) there might be a miscommunication. Any cloud infrastructure could/would have something like this happen; it's a pretty typical scenario.
But, you're saying this is supposed to be cleaned up?
The relative count of these, relative to the total jobs performed, is low. BUT, it's not 0. And, from our previous conversation, I stopped working on a way to reclaim these jobs.
Is that still the case?
Yes, it can also happen in our case, which is why we handle this via heartbeats and having the supervisor clean up both when it starts and periodically.
Yes.
The processes are being deleted, at least, because you don't have them, so I'd like to know how processes are being deleted for you without their claimed executions being released as well. The process and its claimed executions get deleted together in a DB transaction, so something must be interfering with that.
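To make the heartbeat-based cleanup described above a bit more concrete, a rough sketch of the idea from a Rails console might look like this (this is not the actual supervisor code; the threshold value and the deregister call are assumptions based on this thread):

# Rough sketch, not the actual supervisor code: treat processes whose
# heartbeat is older than some threshold as dead and deregister them,
# which releases their claimed executions. The 5-minute threshold and
# the `deregister` method name are assumptions.
threshold = 5.minutes.ago
SolidQueue::Process.where("last_heartbeat_at < ?", threshold).find_each(&:deregister)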
First, let me say I appreciate your efforts on this. I know it can be thankless. I used to work on the Cequel gem.
I don't quite understand what you're asking. We have a node running in our infrastructure that is running a worker process. That worker process is killed abruptly, without any chance to handle a signal.
What I understand is that the process 'record' and the execution are deleted together in a transaction. But the process that is executing that transaction just disappears. I am not a DB expert, but what would roll this record back? There's nothing to do it; the process that was running it was killed.
FYI, we aren't specifically deleting any jobs or any records in the other tables.
One other thought: we are going through PgBouncer. But I don't know why that would be involved.
Thank you!
Yes, that's right. The process would be killed, and the DB records that correspond to these processes, that is, the rows in the solid_queue_processes table, would remain there. [Referenced snippets: solid_queue/app/models/solid_queue/process.rb, lines 22 to 27 at 108aea9, and a second snippet that isn't rendered here.]
So, whenever a process record gets destroyed (independently of when the actual, real process was killed), its associated claimed executions get deleted too, as part of the same DB transaction. In your case, you have process records that have somehow been deleted, but you still have claimed executions associated with these records. I'm not sure how this can happen, and have never seen it happen, so this is what I'm trying to figure out, and why I'm wondering if you have any logs or any way to find out how the missing process records were deleted in the first place.
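The referenced snippet isn't rendered in this thread; paraphrased, the relevant part of the Process model looks roughly like this (a sketch based on the callback quoted later in this thread, not the exact upstream source):

# Paraphrased sketch of app/models/solid_queue/process.rb: destroying a
# process record releases its claimed executions as part of the same
# destroy transaction, so the two should never get out of sync.
class SolidQueue::Process < SolidQueue::Record
  has_many :claimed_executions

  after_destroy -> { claimed_executions.release_all }
end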
We have PG logs, which COULD be helpful or not. Tracking down the specific logs might be challenging, though.
We are seeing this same behavior pretty consistently: jobs will get stuck in processing, assigned to a worker that no longer exists, and completely restarting does not correct the job-to-worker assignments.
We are also seeing this sometimes, after killing the worker in dev. Something with the transaction seems to not be working as expected.
I believe this is basically what we are experiencing as well, @arjun810.
Hey, just a mention that we see a similar issue on production k8s. The jobs were "in_progress" and assigned to a worker that didn't exist. We had to clean them up manually.
We've never encountered this, and I haven't been able to reproduce it, so I'm a bit lost on what might be happening. If anyone could grab DB logs from the moment the worker gets deleted, that'd be really useful to find and fix the root cause.
When developing, this might be caused by workers dying when your Mac hibernates and resumes. I noticed when coming back to my Mac this morning that the workers tab was in a strange state. Again, just an idea as to what is possibly happening.
Thanks a lot @dan003400! That's indeed likely, but in that case, the worker record in the solid_queue_processes table should still be there, just with an expired heartbeat.
I'm trying to reproduce this by doing the following, using this job. First, I enqueue 10 of these, so that they take 1 minute when being performed:
>> 10.times { StoreResultJob.perform_later(42, pause: 60.seconds) }
Then I start Solid Queue with this config:
workers:
  - queues: background
    threads: 3
  - queues: default
    threads: 5
and run bin/rails solid_queue:start:
SolidQueue-0.3.1 Register Supervisor (36.0ms) pid: 63970, hostname: "Rosas-MacBook-Air.local"
SolidQueue-0.3.1 Prune dead processes (19.8ms) size: 0
SolidQueue-0.3.1 Register Dispatcher (28.7ms) pid: 63985, hostname: "Rosas-MacBook-Air.local"
SolidQueue-0.3.1 Started Dispatcher (29.7ms) pid: 63985, hostname: "Rosas-MacBook-Air.local", polling_interval: 1, batch_size: 500, concurrency_maintenance_interval: 600, recurring_schedule: nil
SolidQueue-0.3.1 Register Worker (28.6ms) pid: 63986, hostname: "Rosas-MacBook-Air.local"
SolidQueue-0.3.1 Register Worker (28.7ms) pid: 63987, hostname: "Rosas-MacBook-Air.local"
SolidQueue-0.3.1 Started Worker (29.3ms) pid: 63986, hostname: "Rosas-MacBook-Air.local", polling_interval: 0.1, queues: "background", thread_pool_size: 3
SolidQueue-0.3.1 Started Worker (29.4ms) pid: 63987, hostname: "Rosas-MacBook-Air.local", polling_interval: 0.1, queues: "default", thread_pool_size: 5
SolidQueue-0.3.1 Unblock jobs (2.0ms) limit: 500, size: 0
And now I kill the workers abruptly, without giving them a chance to shut down cleanly:
Let's check claimed jobs and registered workers:
As expected, all claimed executions have a registered process. Note that 3 of them are still claimed by one of the killed workers (process ID 7):
Now, let's kill the supervisor abruptly as well:
Workers and dispatcher realise the supervisor is gone, so they shut down in an orderly way and clean up:
We still have the 3 claimed executions by the worker that I killed individually:
But the worker record is still there as expected:
Now, if I start the supervisor again, one of the first things it does is run the process prune, releasing claimed executions and deregistering the process with ID 7, whose heartbeat has already expired:
Relevant lines:
I can't manage to leave claimed executions orphaned like this.
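For reference, the pausing job used in this reproduction isn't shown in the thread; a job along these lines would behave the same way (an assumed implementation; only the class name and the pause argument come from the enqueue call above):

# Assumed implementation of a job like the StoreResultJob used above: it
# simply sleeps for the given pause so its execution stays claimed long
# enough to observe what happens when the worker is killed.
class StoreResultJob < ApplicationJob
  queue_as :background

  def perform(value, pause: 0)
    sleep(pause)
    Rails.logger.info "StoreResultJob finished with value #{value}"
  end
end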
Coming back to my Mac today after it hibernated all night, there is a job stuck in progress assigned to worker 304, which no longer exists. The only way I have been able to correct this in the past is by deleting the offending rows manually.
@dan003400, no clue/logs about how worker 304 got deleted? 😕 Is it possible you have that in your logs?
This is running in production on Kubernetes, so it's not a machine hibernating. Good thought though.
@rosa I am going to private-message you a video of our process. I haven't done anything to clear them up yet, as I am hoping I can get some feedback and advice that can help us troubleshoot this case.
@rosa Related to this, I'm wondering: when ANY worker restarts, does it clean up any claimed executions whose process is now missing and outside of the heartbeat window? Or is there some sort of assumption that a worker has a static identity, and the "rebooted" worker would pick up the same work? Just brainstorming why it might be happening, because if any worker cleans up any old execution records, then this shouldn't be happening.
Thanks @wflanagan!
No, not really. When a worker is deleted from the DB, it releases its claimed executions. A rebooted worker will have a different ID and will know nothing about other workers' executions. The piece I'm missing here, basically, is how a worker record can get deleted without releasing its executions as well, which is what I haven't been able to reproduce in any way. The process model declares
after_destroy -> { claimed_executions.release_all }
so both releasing claimed executions and deleting the worker should happen within a transaction, which is why I haven't been able to reproduce what's happening here: a worker (process) getting deleted, leaving all its executions behind. A way this could happen is if the worker gets deregistered from the DB while some executions are still in the process of being claimed.
I'm going to make some changes regardless, to see if they help with this scenario, but it's tricky without being able to reproduce it or having an explanation for why this is happening. If someone could find some logs for the missing workers being deleted, that'd shed some light for sure.
I think #244 should help with this.
@rosa You're a trooper. Thank you. I'll try it out and let you know. |
@rosa I pointed at this repo and updated production. Two things.
So, it seems that the job.id isn't found, so it can't release the process.
I did more research. What's happening in my system is this: the worker process is actually still there, but it's "idling." In Kubernetes, the pod was present. I went into the pod and found there was no work being done; it was idle. When I killed the pod, it immediately released all the in-process jobs that were outstanding. So, this works. I guess the question becomes, and maybe it's related to my own infrastructure, why the process is still present and waiting, showing no activity, while the job doesn't perceive itself as completed. I'm still working on it, but I wanted to give you the update.
Hmmm... this doesn't seem possible, because there's a foreign key from claimed executions to jobs, so you can't have a row in solid_queue_claimed_executions that references a job that doesn't exist.
The foreign key is in the schema.
@wflanagan, and does this return any records for you? 😕
SolidQueue::ClaimedExecution.where.missing(:job)
@rosa No. Empty array.
@espenbff, thanks for writing this up 🙏 What's the size of your database connection pool? You'll find this in your database.yml.
It is set to 5
I see I overwrite it with 5 anyway. Will do a new test with 20. |
Got it! Yes, it'll depend on how you have your workers configured as well (the number of threads you're using).
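If it helps to double-check, the effective pool size can be inspected from a Rails console; it needs to comfortably cover the total number of threads your workers use (the numbers here are illustrative):

# Inspect the configured connection pool size for the current environment.
ActiveRecord::Base.connection_pool.size
# => 5  (should be at least the sum of worker threads, plus some headroom)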
I have got some load on SolidQueue in my production app.
Our workers run in an environment where a worker can be killed at any time (cloud-type infrastructure). Because of this, over time we accumulate some jobs that show as "in-process" even though the worker that was running them has died. So, they have been "in process" for 13 days, etc.
I'm able to query the jobs and find the jobs that are in process but not assigned to any current active worker.
I've built a method to try to recover the jobs. It deletes the claimed execution, then tries to find the scheduled execution and set its time to now to make the job ready again (a hypothetical reconstruction is sketched below).
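The original snippet isn't shown in this thread; a hypothetical reconstruction based on the description above might look like this (the method name is made up, and the orphan query mirrors the one suggested earlier in this thread):

# Hypothetical reconstruction of the recovery method described above; the
# method name is made up and the behavior follows the description: delete
# the stale claim, then push the job's scheduled execution to "now".
def recover_abandoned_jobs
  orphaned = SolidQueue::ClaimedExecution
               .left_joins(:process)
               .where(process: { id: nil })

  orphaned.find_each do |claimed|
    job_id = claimed.job_id
    claimed.delete # remove the stale claimed execution

    # Try to find the job's scheduled execution and make it due immediately.
    SolidQueue::ScheduledExecution
      .find_by(job_id: job_id)
      &.update!(scheduled_at: Time.current)
  end
end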
This seems to work. BUT, it throws a nasty error, and now 0 of my workers are working.
So, I'm wondering if I don't understand from reading the docs how these executions work. It COULD be that this error is somewhat unrelated to what I did above (which I did for about 5000 jobs). But, I'd guess this was related.
The documentation on the executions and stuff is pretty sparse, and I'm not sure I really "got" how this works. Any documentation help would help me get to the bottom of this.
I'd appreciate it.