User-initiated job cancellation improvements #444

Open
tazlin opened this issue Aug 15, 2024 · 0 comments
Labels
countermeasures enhancement New feature or request

Member

tazlin commented Aug 15, 2024

While considering issue #443, I identified that job cancellations, although a corner case in normal operations with well-intentioned users, are also a potential Denial of Service (DoS) attack vector and a real, non-trivial source of wasted GPU cycles. This issue is distinct from the bug identified in #443, which pertains specifically to the submission of completed jobs by workers. To address these other concerns, I propose the following improvements to the handling of canceled jobs within the worker job dispatch system.

Proposed Changes:

  1. Job Cancellation Handling:

    • Introduce a new field jobs_cancelled in the job pop responses. This field will list the IDs of jobs that were assigned to the worker but have since been canceled by the requesting user.
  2. New Worker Notification Endpoint:

    • Create a new POST endpoint for worker notifications:
      • The endpoint will always respond with the jobs_cancelled field, providing a list of canceled job IDs.
      • It will not assign new jobs to the worker in this response.
      • The worker can send a payload containing the jobs_cancelled field to acknowledge that it has stopped working on the canceled job(s).
  3. Prorated Kudos for Canceled Jobs:

    • Implement a prorated kudos system where the amount of kudos awarded decreases based on how much time has elapsed before the worker acknowledges the job cancellation. This incentivizes workers to abandon canceled jobs quickly, thereby saving GPU cycles.
  4. Abuse Prevention Measures:

    • Recognize the potential for abuse and introduce mechanisms to mitigate it:
      • Flagging High Cancellation Pairs: Monitor and flag user/worker pairs that have a high frequency of job cancellations for review.
      • Statistical Anomalies: Identify and flag workers with abnormal or statistically unlikely cancellation rates.
      • Targeted Cancellations: Pay extra attention to workers who cancel jobs that were specifically targeted to them using the workers field.
      • Untrusted workers: Workers who are not yet trusted should trigger additional scrutiny when high volumes of cancellations occur for jobs they have been assigned.
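
To make items 1 to 3 more concrete, here is a minimal sketch of how the jobs_cancelled list and the prorated kudos could be computed. Only the jobs_cancelled field name comes from the proposal above. The function names, the linear decay, and the grace_seconds window are assumptions for illustration, not the actual horde implementation.

```python
from dataclasses import dataclass, field


@dataclass
class WorkerNotificationResponse:
    """Hypothetical response body; only `jobs_cancelled` is named in the proposal."""

    jobs_cancelled: list[str] = field(default_factory=list)


def cancelled_jobs_for_worker(assigned: set[str], cancelled: set[str]) -> list[str]:
    """Return the IDs of jobs assigned to this worker that the user has since canceled."""
    return sorted(assigned & cancelled)


def prorated_kudos(
    base_kudos: float,
    seconds_to_ack: float,
    grace_seconds: float = 30.0,  # assumed window; the proposal does not specify one
) -> float:
    """Scale kudos down linearly with the time taken to acknowledge a cancellation.

    An immediate acknowledgement earns `base_kudos`; an acknowledgement at or
    after `grace_seconds` earns nothing. Linear decay is an assumption.
    """
    if grace_seconds <= 0:
        raise ValueError("grace_seconds must be positive")
    elapsed = max(0.0, seconds_to_ack)
    remaining_fraction = max(0.0, 1.0 - elapsed / grace_seconds)
    return base_kudos * remaining_fraction


if __name__ == "__main__":
    response = WorkerNotificationResponse(
        jobs_cancelled=cancelled_jobs_for_worker({"job-a", "job-b"}, {"job-b", "job-c"})
    )
    print(response.jobs_cancelled)
    print(prorated_kudos(base_kudos=10.0, seconds_to_ack=15.0))
```

A linear decay is the simplest option; a stepped or exponential curve would fit the same interface if faster acknowledgement needs a stronger incentive.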