-
Notifications
You must be signed in to change notification settings - Fork 0
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
✨ Kill rq jobs by id #509
✨ Kill rq jobs by id #509
Conversation
Codecov Report
@@ Coverage Diff @@
## master #509 +/- ##
==========================================
+ Coverage 91.70% 91.78% +0.07%
==========================================
Files 74 75 +1
Lines 3401 3433 +32
==========================================
+ Hits 3119 3151 +32
Misses 282 282
Continue to review full report at Codecov.
|
edca967
to
111f2dc
Compare
creator/utils.py
Outdated
|
||
# Wait until job has been moved to failed job registry to delete or | ||
# cancel job | ||
while time.time() < (time.time() + WAIT_FOR_FAIL_TIMEOUT): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How long does this operation take and where do we plan to call it from?
The while
is going to peg the CPU, possibly of a worker that might otherwise be serving requests, which could cause our autoscaler to kick up the instance count. It will also bombard redis with requests which can be a problem if we end up canceling lots of jobs at once.
Perhaps it would be better to queue another job to check up on the status ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This operation will be called from the IngestRun.cancel
method. Eh yea 30s was left from debugging. I think I'm just going to sleep for less than a second, maybe 0.1 bc I noticed during testing that's all it takes for rq to move the job to a registry.
In general submitting another job to check on status seems like it would have the same problem bc the status might not have changed by the time you're inside the job. In this case it would probably work but I'd rather just sleep. Sigh the real solution is to have rq support subscription to job status changes. This issue looks like what we want: rq/rq#717 but no ones looked at it in a while
ca7346f
to
0a57108
Compare
creator/utils.py
Outdated
|
||
# Wait until job has been moved to failed job registry to delete or | ||
# cancel job | ||
time.sleep(WAIT_FOR_FAIL_TIME) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What's the impact of never calling delete()
or cancel()
on the job? It seems likely this could happen.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If you don't call delete the killed jobs will pile up in the failed job registry. I guess it depends on if you care about that and if you want to differentiate between jobs that failed due to kill horse or something else.
For cancel, I just realized I need to change the logic bc it removes jobs from the queue that haven't started. So I don't care about waiting for the failed job reg to update
1cf2923
to
09afa61
Compare
worker.get_current_job().id == job_id | ||
): | ||
logger.info(f"Killing job {job_id}") | ||
send_kill_horse_command(queue.connection, worker.name) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
you probably want a break
here so you don't end up querying all the workers after you found the job.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oops ye
f"Delete job {job_id} by removing it from queue: " | ||
f"{queue.name} and all job registries" | ||
) | ||
redis_job.delete() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So the job will be deleted even if the send_kill_horse_command
was never sent. Is that expected?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think that can happen? If the execution gets down here it means the else block executed bc the job was running, and if its running then it has to be executing in one of the work horses.
But that being said I wonder if the user should have the option to delete the job given the other cases where it either already failed/finished or we cancel it? Right now that doesn't happen because I return at the end of the other if blocks. Or I could just remove deletion entirely and leave it to the user to do that after calling this method?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
EDIT: In my last commit I made it so that delete=False by default. But if user sets delete=True the job will always be deleted at the end of stop job. I can change that though if we don't like it
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍
## Release 1.14.7 ### Summary - Emojis: 🐛 x2, 🔥 x2, ⬆️ x1, ✨ x1 - Categories: Fixes x2, Removals x2, Ops x1, Additions x1 ### New features and changes - [#515](#515) - 🐛 Don't resolve downloadUrls if not enough info - [8dbae2a](8dbae2a) by [dankolbman](https://github.com/dankolbman) - [#514](#514) - 🐛 Fix events query for non-admins - [9f2c9f8](9f2c9f8) by [dankolbman](https://github.com/dankolbman) - [#511](#511) - 🔥 Remove ego authentication - [f803ec5](f803ec5) by [dankolbman](https://github.com/dankolbman) - [#512](#512) - 🔥 Remove hypothesis testing - [09b1a9e](09b1a9e) by [dankolbman](https://github.com/dankolbman) - [#496](#496) - ⬆️ Bump cryptography from 2.5 to 3.2 - [709450f](709450f) by [dependabot[bot]](https://github.com/apps/dependabot) - [#509](#509) - ✨ Kill rq jobs by id - [c747e00](c747e00) by [znatty22](https://github.com/znatty22)
Makes use of the new
send_kill_horse_command
to kill a job via job idNOTE: This will likely go away once rq/rq#1371 is implemented.