Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: add volcano jobs phase metric #3650

Merged
merged 1 commit into from
Oct 22, 2024

Conversation

Prepmachine4
Copy link
Contributor

fix: #2493

I add all job phase shows in job apis.

const (
	// Pending is the phase that job is pending in the queue, waiting for scheduling decision
	Pending JobPhase = "Pending"
	// Aborting is the phase that job is aborted, waiting for releasing pods
	Aborting JobPhase = "Aborting"
	// Aborted is the phase that job is aborted by user or error handling
	Aborted JobPhase = "Aborted"
	// Running is the phase that minimal available tasks of Job are running
	Running JobPhase = "Running"
	// Restarting is the phase that the Job is restarted, waiting for pod releasing and recreating
	Restarting JobPhase = "Restarting"
	// Completing is the phase that required tasks of job are completed, job starts to clean up
	Completing JobPhase = "Completing"
	// Completed is the phase that all tasks of Job are completed
	Completed JobPhase = "Completed"
	// Terminating is the phase that the Job is terminated, waiting for releasing pods
	Terminating JobPhase = "Terminating"
	// Terminated is the phase that the job is finished unexpected, e.g. events
	Terminated JobPhase = "Terminated"
	// Failed is the phase that the job is restarted failed reached the maximum number of retries.
	Failed JobPhase = "Failed"
)

The metric record event happend in jobInformer.Informer() received event.

cc.jobInformer.Informer().AddEventHandler(cache.ResourceEventHandlerFuncs{
	AddFunc:    cc.addJob,
	UpdateFunc: cc.updateJob,
	DeleteFunc: cc.deleteJob,
})

And the processing of record metrics is not use cache lock that maybe produces some inaccurate data but will improve some performance.

But if it traverses all jobs every time during updates, I think there might be some trouble. Should we switch to incremental updates or scheduled metrics at intervals?

@volcano-sh-bot volcano-sh-bot added the size/L Denotes a PR that changes 100-499 lines, ignoring generated files. label Aug 3, 2024
@googs1025
Copy link
Member

This implementation is too complex; we just need to add metrics at the state transition points.

@googs1025
Copy link
Member

In practical scenarios, we only focus on the final states of success and failure, and we are not concerned with or do not need other states because they are not final states.

@googs1025
Copy link
Member

You just need to find the place where the controller last modified the status in these two related places and make the changes.
status.State.Phase = vcbatch.Completed
status.State.Phase = vcbatch.Failed

@googs1025
Copy link
Member

/assign

@Prepmachine4
Copy link
Contributor Author

Prepmachine4 commented Aug 3, 2024

You just need to find the place where the controller last modified the status in these two related places and make the changes. status.State.Phase = vcbatch.Completed status.State.Phase = vcbatch.Failed

ok

@volcano-sh-bot volcano-sh-bot added size/M Denotes a PR that changes 30-99 lines, ignoring generated files. and removed size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Aug 3, 2024
Copy link
Member

@hwdef hwdef left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/ok-to-test

@volcano-sh-bot volcano-sh-bot added the ok-to-test Indicates a non-member PR verified by an org member that is safe to test. label Aug 3, 2024
@Prepmachine4
Copy link
Contributor Author

image image

I tested the metrics, and they were in line with expectations.

@googs1025
Copy link
Member

squash the commit to one

@Prepmachine4
Copy link
Contributor Author

squash the commit to one

If i squash the commit to one, it seems like I need to re-create a PR, or is there another way?

@googs1025
Copy link
Member

squash the commit to one

If i squash the commit to one, it seems like I need to re-create a PR, or is there another way?

Why do we need to resubmit a PR when using git rebase or git squash? This is to update your git log view locally, and has nothing to do with the PR itself.

@Prepmachine4
Copy link
Contributor Author

squash the commit to one

If i squash the commit to one, it seems like I need to re-create a PR, or is there another way?

Why do we need to resubmit a PR when using git rebase or git squash? This is to update your git log view locally, and has nothing to do with the PR itself.

ok ,i have squashed the commit to one.

Copy link
Member

@hwdef hwdef left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/lgtm

@volcano-sh-bot volcano-sh-bot added lgtm Indicates that a PR is ready to be merged. and removed lgtm Indicates that a PR is ready to be merged. labels Aug 5, 2024
@volcano-sh-bot
Copy link
Contributor

@hwdef: Closed this PR.

In response to this:

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@hwdef
Copy link
Member

hwdef commented Oct 15, 2024

/reopen

@volcano-sh-bot
Copy link
Contributor

@hwdef: Reopened this PR.

In response to this:

/reopen

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@hwdef
Copy link
Member

hwdef commented Oct 15, 2024

Hi, thanks for your contributions! Please fix the CI

I tried to fix it, but I don't know why the 'default' queue is not open. Can you give me some advice on where I should start. .image

Encountered this error again, It may be that volcano has just been deployed and the test process has been executed. The queue of volcano has not yet been opened.

Please add

sleep 30s

in

minikube kubectl -- replace --force -f ./installer/volcano-development.yaml

target:

...
    - name: Build lastest volcano images
      run: |
        eval $(minikube docker-env)
        make TAG=latest update-development-yaml
        make TAG=latest images
        docker images | grep volcano
        cat ./installer/volcano-development.yaml  | grep image:
        minikube kubectl -- replace --force -f ./installer/volcano-development.yaml
        sleep 30s
...

@hwdef
Copy link
Member

hwdef commented Oct 16, 2024

I made a mistake, the reason why the queue has no status is because the controller failed to start. Not for the reason I mentioned above

The reason for this is mentioned below. It seems that the controller does not use the latest code.

image It seems that the issue is caused by the controller image used in the 'E2E Spark Integration Test' run not being the latest version.

@hwdef
Copy link
Member

hwdef commented Oct 16, 2024

Maybe it’s because the pull policy of the image is always.

@Prepmachine4
Copy link
Contributor Author

Maybe it’s because the pull policy of the image is always.

change to IfNotPresent?

@hwdef
Copy link
Member

hwdef commented Oct 16, 2024

Yes, But I didn't think of a better replacement method. Please try add this

sed -i 's/imagePullPolicy: Always/imagePullPolicy: IfNotPresent/g' installer/volcano-development.yaml

at

make TAG=latest update-development-yaml

target:

    - name: Build lastest volcano images
      run: |
        eval $(minikube docker-env)
        make TAG=latest update-development-yaml
        sed -i 's/imagePullPolicy: Always/imagePullPolicy: IfNotPresent/g' installer/volcano-development.yaml
        make TAG=latest images
        docker images | grep volcano
        cat ./installer/volcano-development.yaml  | grep image:
        minikube kubectl -- replace --force -f ./installer/volcano-development.yaml

Do you have any better suggestions?

@hwdef
Copy link
Member

hwdef commented Oct 16, 2024

/lgtm

@volcano-sh-bot volcano-sh-bot added the lgtm Indicates that a PR is ready to be merged. label Oct 16, 2024
@JesseStutler
Copy link
Contributor

Please see my review opinions also:

  1. I think we should use gauge, not counter.
  2. In command line, controller set the enable-metrics to false:
    fs.BoolVar(&s.EnableMetrics, "enable-metrics", false, "Enable the metrics function; it is false by default")
    but in helm chart template that --enable-metrics is written to be true, we should make this configurable, for example using {{ .Values.EnableMetrics}} and then add EnableMetrics:true in values.yaml
  3. You should not delete the metrics in scheduler, scheduler does not have these metrics.

@volcano-sh-bot volcano-sh-bot removed the lgtm Indicates that a PR is ready to be merged. label Oct 16, 2024
@Prepmachine4
Copy link
Contributor Author

Please see my review opinions also:

  1. I think we should use gauge, not counter.
  2. In command line, controller set the enable-metrics to false:
    fs.BoolVar(&s.EnableMetrics, "enable-metrics", false, "Enable the metrics function; it is false by default")
    but in helm chart template that --enable-metrics is written to be true, we should make this configurable, for example using {{ .Values.EnableMetrics}} and then add EnableMetrics:true in values.yaml
  3. You should not delete the metrics in scheduler, scheduler does not have these metrics.

Thank you for your suggestions. I have made changes to points 2 and 3, but I think using counter is more appropriate because these two metrics only have increment operations.

Signed-off-by: Prepmachine4 <prepmachine4@gmail.com>
Copy link
Member

@hwdef hwdef left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/lgtm

@volcano-sh-bot volcano-sh-bot added the lgtm Indicates that a PR is ready to be merged. label Oct 17, 2024
@Monokaix
Copy link
Member

/approve

@volcano-sh-bot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: Monokaix

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@volcano-sh-bot volcano-sh-bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Oct 22, 2024
@volcano-sh-bot volcano-sh-bot merged commit 74c51c7 into volcano-sh:master Oct 22, 2024
17 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. lgtm Indicates that a PR is ready to be merged. ok-to-test Indicates a non-member PR verified by an org member that is safe to test. size/L Denotes a PR that changes 100-499 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

metrics exposure for volcano job succeeded state and failed state
7 participants