Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

✨ Kill rq jobs by id #509

Merged
merged 5 commits into from
Nov 16, 2020
Merged

✨ Kill rq jobs by id #509

merged 5 commits into from
Nov 16, 2020

Conversation

znatty22
Copy link
Member

@znatty22 znatty22 commented Nov 11, 2020

Makes use of the new send_kill_horse_command to kill a job via job id

NOTE: This will likely go away once rq/rq#1371 is implemented.

@znatty22 znatty22 added the feature New functionality label Nov 11, 2020
@znatty22 znatty22 self-assigned this Nov 11, 2020
@codecov
Copy link

codecov bot commented Nov 11, 2020

Codecov Report

Merging #509 (0cdf063) into master (c5ad853) will increase coverage by 0.07%.
The diff coverage is 100.00%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master     #509      +/-   ##
==========================================
+ Coverage   91.70%   91.78%   +0.07%     
==========================================
  Files          74       75       +1     
  Lines        3401     3433      +32     
==========================================
+ Hits         3119     3151      +32     
  Misses        282      282              
Impacted Files Coverage Δ
creator/utils.py 100.00% <100.00%> (ø)

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update c5ad853...3a61591. Read the comment docs.

@znatty22 znatty22 force-pushed the kill-jobs branch 2 times, most recently from edca967 to 111f2dc Compare November 11, 2020 20:30
@znatty22 znatty22 marked this pull request as ready for review November 11, 2020 20:37
@znatty22 znatty22 changed the title ✨ Kill rq jobs ✨ Kill rq jobs by id Nov 11, 2020
@znatty22 znatty22 mentioned this pull request Nov 11, 2020
creator/utils.py Outdated Show resolved Hide resolved
creator/utils.py Outdated Show resolved Hide resolved
creator/utils.py Outdated Show resolved Hide resolved
creator/utils.py Outdated

# Wait until job has been moved to failed job registry to delete or
# cancel job
while time.time() < (time.time() + WAIT_FOR_FAIL_TIMEOUT):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How long does this operation take and where do we plan to call it from?
The while is going to peg the CPU, possibly of a worker that might otherwise be serving requests, which could cause our autoscaler to kick up the instance count. It will also bombard redis with requests which can be a problem if we end up canceling lots of jobs at once.
Perhaps it would be better to queue another job to check up on the status ?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This operation will be called from the IngestRun.cancel method. Eh yea 30s was left from debugging. I think I'm just going to sleep for less than a second, maybe 0.1 bc I noticed during testing that's all it takes for rq to move the job to a registry.

In general submitting another job to check on status seems like it would have the same problem bc the status might not have changed by the time you're inside the job. In this case it would probably work but I'd rather just sleep. Sigh the real solution is to have rq support subscription to job status changes. This issue looks like what we want: rq/rq#717 but no ones looked at it in a while

creator/utils.py Outdated Show resolved Hide resolved
tests/test_utils.py Outdated Show resolved Hide resolved
tests/test_utils.py Outdated Show resolved Hide resolved
creator/utils.py Outdated

# Wait until job has been moved to failed job registry to delete or
# cancel job
time.sleep(WAIT_FOR_FAIL_TIME)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What's the impact of never calling delete() or cancel() on the job? It seems likely this could happen.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you don't call delete the killed jobs will pile up in the failed job registry. I guess it depends on if you care about that and if you want to differentiate between jobs that failed due to kill horse or something else.

For cancel, I just realized I need to change the logic bc it removes jobs from the queue that haven't started. So I don't care about waiting for the failed job reg to update

creator/utils.py Outdated Show resolved Hide resolved
creator/utils.py Outdated Show resolved Hide resolved
@znatty22 znatty22 marked this pull request as ready for review November 13, 2020 20:11
creator/utils.py Outdated Show resolved Hide resolved
worker.get_current_job().id == job_id
):
logger.info(f"Killing job {job_id}")
send_kill_horse_command(queue.connection, worker.name)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you probably want a break here so you don't end up querying all the workers after you found the job.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oops ye

f"Delete job {job_id} by removing it from queue: "
f"{queue.name} and all job registries"
)
redis_job.delete()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So the job will be deleted even if the send_kill_horse_command was never sent. Is that expected?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think that can happen? If the execution gets down here it means the else block executed bc the job was running, and if its running then it has to be executing in one of the work horses.

But that being said I wonder if the user should have the option to delete the job given the other cases where it either already failed/finished or we cancel it? Right now that doesn't happen because I return at the end of the other if blocks. Or I could just remove deletion entirely and leave it to the user to do that after calling this method?

Copy link
Member Author

@znatty22 znatty22 Nov 16, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

EDIT: In my last commit I made it so that delete=False by default. But if user sets delete=True the job will always be deleted at the end of stop job. I can change that though if we don't like it

tests/test_utils.py Outdated Show resolved Hide resolved
Copy link
Contributor

@dankolbman dankolbman left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

@znatty22 znatty22 merged commit c747e00 into master Nov 16, 2020
@znatty22 znatty22 deleted the kill-jobs branch November 16, 2020 17:43
dankolbman added a commit that referenced this pull request Nov 20, 2020
## Release 1.14.7

### Summary

- Emojis: 🐛 x2, 🔥 x2, ⬆️ x1, ✨ x1
- Categories: Fixes x2, Removals x2, Ops x1, Additions x1

### New features and changes

- [#515](#515) - 🐛 Don't resolve downloadUrls if not enough info - [8dbae2a](8dbae2a) by [dankolbman](https://github.com/dankolbman)
- [#514](#514) - 🐛 Fix events query for non-admins - [9f2c9f8](9f2c9f8) by [dankolbman](https://github.com/dankolbman)
- [#511](#511) - 🔥 Remove ego authentication - [f803ec5](f803ec5) by [dankolbman](https://github.com/dankolbman)
- [#512](#512) - 🔥 Remove hypothesis testing - [09b1a9e](09b1a9e) by [dankolbman](https://github.com/dankolbman)
- [#496](#496) - ⬆️ Bump cryptography from 2.5 to 3.2 - [709450f](709450f) by [dependabot[bot]](https://github.com/apps/dependabot)
- [#509](#509) - ✨ Kill rq jobs by id - [c747e00](c747e00) by [znatty22](https://github.com/znatty22)
@dankolbman dankolbman mentioned this pull request Nov 20, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
feature New functionality
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants