Expose job controller's workqueue rate limiting configs #674

roteme-runai · 2025-01-01T08:30:15Z

Issue:

The mpi-operator does not handle a large number of MPI jobs well. When many jobs are created at once (e.g., 100), the operator falls behind on job management, which leads to the following issues:

Excessive time taken to create all the pods required for the jobs.
Significant delay between a job's completion and the corresponding status update.
Delayed cleanup of pods (per CleanPodPolicy) due to the lag in status updates.

Root Cause:

The controller uses a workqueue whose rate limiter is created with default settings and cannot be adjusted via operator options. This is in contrast to other load-related settings (e.g., threadiness and QPS), which are user-configurable. The low default rate-limiting settings result in insufficient parallel processing, thereby delaying job handling.

Proposed Solution:

To address this, I propose exposing the controller's rate-limiting settings as user-configurable options. This change would allow users to adjust the rate limiter based on their specific usage requirements, expected scale, and system capabilities.
The change has been tested in a production environment, where it improved the handling of larger-scale MPI job workloads.

Backporting Request:

If this fix is approved, I kindly request its inclusion in a new release, ideally in versions from v0.6 onward.

Thank you for your time!

Signed-off-by: Rotem Elad <rotem.elad@run.ai>

google-oss-prow · 2025-01-01T08:30:24Z

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign tenzen-y for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

OWNERS

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

mlsorensen · 2025-01-02T17:28:13Z

cmd/mpi-operator/app/options/options.go

@@ -75,4 +77,7 @@ func (s *ServerOption) AddFlags(fs *flag.FlagSet) {

 	fs.IntVar(&s.QPS, "kube-api-qps", 5, "QPS indicates the maximum QPS to the master from this client.")
 	fs.IntVar(&s.Burst, "kube-api-burst", 10, "Maximum burst for throttle.")
+
+	fs.IntVar(&s.ControllerRateLimit, "controller-queue-rate-limit", 10, "Rate limit of the controller events queue .")
+	fs.IntVar(&s.ControllerBurst, "controller-queue--burst", 100, "Maximum burst of the controller events queue.")


Is the double hyphen a typo?

Expose controller workqueue config via options

44e1180

Signed-off-by: Rotem Elad <rotem.elad@run.ai>

google-oss-prow bot requested review from carmark and zw0610 January 1, 2025 08:30

google-oss-prow bot added the size/M label Jan 1, 2025

mlsorensen reviewed Jan 2, 2025
