Improve Image Builder Jobs Label and Annotation #380

Merged
merged 7 commits into from
Apr 17, 2023

ariefrahmansyah
Contributor

@ariefrahmansyah commented Apr 17, 2023

What this PR does / why we need it:

  1. The image-building jobs in Merlin are timing out. After some investigation, we found that one of the root causes is that the node pool was scaled down, which caused the image-building pods to be rescheduled. This PR adds the annotation "cluster-autoscaler.kubernetes.io/safe-to-evict": "false" to prevent the pods from being killed and rescheduled.
  2. In "Refactor deployed model labels" (#346), we propagated user labels to models and batch prediction jobs, but not to image builder jobs. This PR adds project and model version labels to the image builder job.
  3. This PR also improves the e2e test by:
    a. Removing the duplicate kserve-controller-manager statefulset, since the kserve 0.9.0 manifest contains both a statefulset and a deployment
    b. Removing kserve-controller-manager's CPU limits
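
As a rough illustration of item 2, the label set for an image builder job could be assembled as below. This is only a sketch under assumptions: the function name `buildJobLabels` and the `example.com/...` label keys are hypothetical placeholders, not the keys or helpers Merlin actually uses, and merging user labels into the job is assumed from the description of #346.

```go
package main

import "fmt"

// buildJobLabels merges user-defined labels with project and model version
// labels. The "example.com/..." keys are hypothetical placeholders and are
// not Merlin's real label keys.
func buildJobLabels(userLabels map[string]string, project, modelVersion string) map[string]string {
	labels := make(map[string]string, len(userLabels)+2)
	for k, v := range userLabels {
		labels[k] = v
	}
	// The project and model version labels are written last, so a user label
	// with the same key cannot override them.
	labels["example.com/project"] = project
	labels["example.com/model-version"] = modelVersion
	return labels
}

func main() {
	labels := buildJobLabels(map[string]string{"team": "ml-platform"}, "sample-project", "3")
	fmt.Println(labels)
}
```

Writing the project and model version labels after the user labels is a design choice in this sketch; the actual precedence in Merlin may differ.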

Does this PR introduce a user-facing change?:

NONE

Checklist

  • Added unit test, integration, and/or e2e tests
  • Tested locally
  • Updated documentation
  • Updated Swagger spec if the PR introduces API changes
  • Regenerated Golang and Python client if the PR introduces API changes

@@ -144,6 +144,7 @@ func initImageBuilder(cfg *config.Config) (webserviceBuilder imagebuilder.ImageB
Tolerations: cfg.ImageBuilderConfig.Tolerations,
NodeSelectors: cfg.ImageBuilderConfig.NodeSelectors,
MaximumRetry: cfg.ImageBuilderConfig.MaximumRetry,
JobSafeToEvict: cfg.ImageBuilderConfig.JobSafeToEvict,
Collaborator

Nit: Shall we just call this SafeToEvict rather than JobSafeToEvict? Two reasons:

  • Since this property only corresponds to imagebuilding and it's found under the ImageBuilder.Config, I feel that SafeToEvict is more concise.
  • JobSafeToEvict may cause some confusion with prediction job for the readers (like myself, the first time I read it).

Contributor Author

Got it. I understand your confusion. I have updated this part.

}

annotations := map[string]string{
"cluster-autoscaler.kubernetes.io/safe-to-evict": fmt.Sprint(c.config.JobSafeToEvict),
Collaborator

Should we set this annotation only when it's supposed to be "false"? Though we don't currently plan to use the value "true", I'm afraid that if we do so at some point in the future, it will make the pod more prone to eviction than usual (doc: https://kubernetes.io/docs/reference/labels-annotations-taints/#cluster-autoscaler-kubernetes-io-safe-to-evict).

Contributor Author

Ah, TIL. Thanks!

scripts/e2e/config/kserve/overlay.yaml (two resolved review threads)

annotations := make(map[string]string)
if !c.config.SafeToEvict {
annotations["cluster-autoscaler.kubernetes.io/safe-to-evict"] = fmt.Sprint(c.config.SafeToEvict)
Collaborator

Super nit: this can just be:

annotations["cluster-autoscaler.kubernetes.io/safe-to-evict"] = "false"

Also, perhaps we can leave an inline comment on why we add this annotation conditionally, linking to the doc?
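
Taken together, the two suggestions in this thread might look roughly like the sketch below. The helper name `jobAnnotations` and the constant name are assumptions made for illustration; only the annotation key and the `SafeToEvict` flag come from the diff under review.

```go
package main

import "fmt"

// Annotation key read by the Kubernetes cluster autoscaler.
const safeToEvictAnnotation = "cluster-autoscaler.kubernetes.io/safe-to-evict"

// jobAnnotations is a hypothetical helper that builds the annotations for an
// image builder job from the SafeToEvict config flag.
func jobAnnotations(safeToEvict bool) map[string]string {
	annotations := make(map[string]string)
	// Set the annotation only to block eviction. An explicit "true" would let
	// the cluster autoscaler evict the pod even when other rules would
	// normally prevent it, so the key is left unset in that case. See
	// https://kubernetes.io/docs/reference/labels-annotations-taints/#cluster-autoscaler-kubernetes-io-safe-to-evict
	if !safeToEvict {
		annotations[safeToEvictAnnotation] = "false"
	}
	return annotations
}

func main() {
	fmt.Println(jobAnnotations(false))
	fmt.Println(jobAnnotations(true))
}
```

Leaving the key unset when the flag is true keeps the autoscaler's default behaviour for those pods.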

Collaborator

@krithika369 left a comment

LGTM. Thanks, @ariefrahmansyah!
