The behavior is unexpected when replicas of job set to 0 #1709

Open
HeGaoYuan opened this issue Dec 26, 2022 · 5 comments

@HeGaoYuan
Contributor

I tested with the following YAML, in which the Worker `replicas` is set to 0. I expected that no Worker Pod would be created at all. Instead, as the screenshot below shows, the controller creates one Worker Pod, deletes it, recreates it, deletes it again, and so on.

See point 4 of #1703.

```yaml
apiVersion: kubeflow.org/v1
kind: PyTorchJob
metadata:
  name: pytorch-test-replicas
spec:
  pytorchReplicaSpecs:
    Master:
      replicas: 1
      template:
        spec:
          containers:
            - name: pytorch
              image: alpine:latest
              command: ["sleep", "365d"]
    Worker:
      replicas: 0
      template:
        spec:
          containers:
            - name: pytorch
              image: alpine:latest
              command: ["sleep", "365d"]
```

[Screenshot, 2022-12-26 11:38:36 AM]

@HeGaoYuan
Contributor Author

I traced the cause to the code linked below: the variable `size` is initialized to 0, but it should be initialized to -1.

There are other possible solutions, of course. For example, we could drop the whole Worker section when the Worker `replicas` is 0. This must only change the internal job spec, not the job spec stored in etcd.

We can discuss which solution is better.
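
As a rough illustration of the alternative mentioned above (dropping zero-replica sections from an internal copy of the spec only), here is a minimal sketch. `ReplicaSpec` and `activeReplicaSpecs` are hypothetical names invented for this example; they are not taken from kubeflow/common or the training operator.

```go
package main

import "fmt"

// ReplicaSpec is a hypothetical, pared-down stand-in for a job's replica spec.
type ReplicaSpec struct {
	Replicas int32
}

// activeReplicaSpecs returns a new map that omits every replica type whose
// replica count is 0. The input map is not modified, mirroring the idea that
// only the in-memory view changes while the stored job spec stays as-is.
func activeReplicaSpecs(specs map[string]ReplicaSpec) map[string]ReplicaSpec {
	active := make(map[string]ReplicaSpec, len(specs))
	for rtype, spec := range specs {
		if spec.Replicas > 0 {
			active[rtype] = spec
		}
	}
	return active
}

func main() {
	stored := map[string]ReplicaSpec{
		"Master": {Replicas: 1},
		"Worker": {Replicas: 0},
	}
	active := activeReplicaSpecs(stored)
	// The stored spec keeps both entries; the internal view keeps only Master.
	fmt.Println(len(stored), len(active))
}
```

With this approach the reconciler would never iterate over the Worker replica type at all, at the cost of the internal spec diverging from what users see in the stored object.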

https://github.com/kubeflow/common/blob/21910a93c4ed8d8338d9d7414067f888801dd0bc/pkg/core/pod.go#L51

https://github.com/kubeflow/common/blob/21910a93c4ed8d8338d9d7414067f888801dd0bc/pkg/core/service.go#L53
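
To make the initial-value problem concrete, here is a simplified, self-contained sketch of a slice-size calculation of the kind described in this comment. The function `sliceSize`, its `initial` parameter, and the use of plain integer indices are assumptions made for illustration; this is not the actual code behind the links above.

```go
package main

import "fmt"

// sliceSize is a hypothetical stand-in for the slice-size calculation.
// It returns how many index slots a reconciler would iterate over, given the
// replica indices of the existing Pods and the desired replica count.
// initial is the starting value of the running maximum index.
func sliceSize(indices []int, replicas, initial int) int {
	size := initial
	for _, idx := range indices {
		if idx > size {
			size = idx
		}
	}
	// size holds the highest index seen, so add 1 to turn it into a count.
	if size+1 > replicas {
		return size + 1
	}
	return replicas
}

func main() {
	// No existing Pods, replicas = 0, initial value 0: one slot is still
	// reported, so a Pod with index 0 would be created and then removed as
	// out of range.
	fmt.Println(sliceSize(nil, 0, 0))
	// Same inputs with initial value -1: no slots, so nothing is created.
	fmt.Println(sliceSize(nil, 0, -1))
	// The normal case is unaffected by starting at -1.
	fmt.Println(sliceSize([]int{0, 1}, 2, -1))
}
```

Under these assumptions, starting at -1 lets "no Pods and zero replicas" yield a size of 0, while any job with at least one Pod or replica behaves as before.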

@github-actions

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

@tenzen-y
Member

/remove-lifecycle stale

@github-actions

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

@andreyvelich
Member

/lifecycle frozen
