Don't use inline for runPolicy and register defaulter function #1330

Merged on Aug 5, 2021 (1 commit)
config/crd/bases/kubeflow.org_mxjobs.yaml: 44 additions, 37 deletions
@@ -36,20 +36,6 @@ spec:
spec:
description: MXJobSpec defines the desired state of MXJob
properties:
activeDeadlineSeconds:
description: Specifies the duration in seconds relative to the startTime
that the job may be active before the system tries to terminate
it; value must be positive integer.
format: int64
type: integer
backoffLimit:
description: Optional number of retries before marking this job failed.
format: int32
type: integer
cleanPodPolicy:
description: CleanPodPolicy defines the policy to kill pods after
the job completes. Default to Running.
type: string
jobMode:
description: JobMode specify the kind of MXjob to do. Different mode
may have different MXReplicaSpecs request
@@ -6773,34 +6759,55 @@ spec:
common.ReplicaSpec, "Server": common.ReplicaSpec, "Worker":
common.ReplicaSpec, }'
type: object
schedulingPolicy:
description: SchedulingPolicy defines the policy related to scheduling,
e.g. gang-scheduling
runPolicy:
description: RunPolicy encapsulates various runtime policies of the
distributed training job, for example how to clean up resources
and how long the job can stay active.
properties:
minAvailable:
activeDeadlineSeconds:
description: Specifies the duration in seconds relative to the
startTime that the job may be active before the system tries
to terminate it; value must be positive integer.
format: int64
type: integer
backoffLimit:
description: Optional number of retries before marking this job
failed.
format: int32
type: integer
minResources:
additionalProperties:
anyOf:
- type: integer
- type: string
pattern: ^(\+|-)?(([0-9]+(\.[0-9]*)?)|(\.[0-9]+))(([KMGTPE]i)|[numkMGTPE]|([eE](\+|-)?(([0-9]+(\.[0-9]*)?)|(\.[0-9]+))))?$
x-kubernetes-int-or-string: true
description: ResourceList is a set of (resource name, quantity)
pairs.
type: object
priorityClass:
type: string
queue:
cleanPodPolicy:
description: CleanPodPolicy defines the policy to kill pods after
the job completes. Default to Running.
type: string
schedulingPolicy:
description: SchedulingPolicy defines the policy related to scheduling,
e.g. gang-scheduling
properties:
minAvailable:
format: int32
type: integer
minResources:
additionalProperties:
anyOf:
- type: integer
- type: string
pattern: ^(\+|-)?(([0-9]+(\.[0-9]*)?)|(\.[0-9]+))(([KMGTPE]i)|[numkMGTPE]|([eE](\+|-)?(([0-9]+(\.[0-9]*)?)|(\.[0-9]+))))?$
x-kubernetes-int-or-string: true
description: ResourceList is a set of (resource name, quantity)
pairs.
type: object
priorityClass:
type: string
queue:
type: string
type: object
ttlSecondsAfterFinished:
description: TTLSecondsAfterFinished is the TTL to clean up jobs.
It may take extra ReconcilePeriod seconds for the cleanup, since
reconcile gets called periodically. Default to infinite.
format: int32
type: integer
type: object
ttlSecondsAfterFinished:
description: TTLSecondsAfterFinished is the TTL to clean up jobs.
It may take extra ReconcilePeriod seconds for the cleanup, since
reconcile gets called periodically. Default to infinite.
format: int32
type: integer
required:
- jobMode
- mxReplicaSpecs
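For illustration only (this manifest is not part of the PR): a minimal sketch of an MXJob written against the updated schema, where activeDeadlineSeconds, backoffLimit, cleanPodPolicy and ttlSecondsAfterFinished now sit under spec.runPolicy instead of directly under spec. The apiVersion `kubeflow.org/v1`, the `mxnet` container name, the image, and all names and values are assumptions.

```yaml
# Hypothetical MXJob manifest; names, image and values are placeholders.
apiVersion: kubeflow.org/v1          # assumed served version of the MXJob CRD
kind: MXJob
metadata:
  name: mxnet-train-example          # hypothetical name
spec:
  jobMode: MXTrain                   # required by the schema, stays under spec
  runPolicy:                         # job-level policies now nest here
    cleanPodPolicy: Running          # new description: defaults to Running
    backoffLimit: 3                  # int32: retries before the job is marked failed
    activeDeadlineSeconds: 3600      # int64: must be a positive integer
    ttlSecondsAfterFinished: 600     # int32: TTL before the finished job is cleaned up
  mxReplicaSpecs:                    # required by the schema, stays under spec
    Scheduler:
      replicas: 1
      restartPolicy: Never
      template:
        spec:
          containers:
            - name: mxnet            # assumed default container name for MXJob
              image: example.com/mxnet-dist:latest   # hypothetical image
    Server:
      replicas: 1
      restartPolicy: Never
      template:
        spec:
          containers:
            - name: mxnet
              image: example.com/mxnet-dist:latest
    Worker:
      replicas: 2
      restartPolicy: Never
      template:
        spec:
          containers:
            - name: mxnet
              image: example.com/mxnet-dist:latest
```

Only the policy fields move; jobMode and mxReplicaSpecs remain at the spec level, matching the schema's `required` list.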
config/crd/bases/kubeflow.org_pytorchjobs.yaml: 52 additions, 22 deletions
@@ -31,24 +31,12 @@ spec:
object represents. Servers may infer this from the endpoint the client
submits requests to. Cannot be updated. In CamelCase. More info: https://git.k8s.io/community/contributors/devel/sig-architecture/api-conventions.md#types-kinds'
type: string
metadata:
description: Standard Kubernetes object's metadata.
type: object
spec:
description: Specification of the desired state of the PyTorchJob.
properties:
activeDeadlineSeconds:
description: Specifies the duration (in seconds) since startTime during
which the job can remain active before it is terminated. Must be
a positive integer. This setting applies only to pods where restartPolicy
is OnFailure or Always.
format: int64
type: integer
backoffLimit:
description: Number of retries before marking this job as failed.
format: int32
type: integer
cleanPodPolicy:
description: Defines the policy for cleaning up pods after the PyTorchJob
completes. Defaults to None.
type: string
pytorchReplicaSpecs:
additionalProperties:
description: ReplicaSpec is a description of the replica
@@ -6767,13 +6755,55 @@ spec:
Specifies the PyTorch cluster configuration. For example, { "Master":
PyTorchReplicaSpec, "Worker": PyTorchReplicaSpec, }'
type: object
ttlSecondsAfterFinished:
description: Defines the TTL for cleaning up finished PyTorchJobs
(temporary before Kubernetes adds the cleanup controller). It may
take extra ReconcilePeriod seconds for the cleanup, since reconcile
gets called periodically. Defaults to infinite.
format: int32
type: integer
runPolicy:
description: RunPolicy encapsulates various runtime policies of the
distributed training job, for example how to clean up resources
and how long the job can stay active.
properties:
activeDeadlineSeconds:
description: Specifies the duration in seconds relative to the
startTime that the job may be active before the system tries
to terminate it; value must be positive integer.
format: int64
type: integer
backoffLimit:
description: Optional number of retries before marking this job
failed.
format: int32
type: integer
cleanPodPolicy:
description: CleanPodPolicy defines the policy to kill pods after
the job completes. Default to Running.
type: string
schedulingPolicy:
description: SchedulingPolicy defines the policy related to scheduling,
e.g. gang-scheduling
properties:
minAvailable:
format: int32
type: integer
minResources:
additionalProperties:
anyOf:
- type: integer
- type: string
pattern: ^(\+|-)?(([0-9]+(\.[0-9]*)?)|(\.[0-9]+))(([KMGTPE]i)|[numkMGTPE]|([eE](\+|-)?(([0-9]+(\.[0-9]*)?)|(\.[0-9]+))))?$
x-kubernetes-int-or-string: true
description: ResourceList is a set of (resource name, quantity)
pairs.
type: object
priorityClass:
type: string
queue:
type: string
type: object
ttlSecondsAfterFinished:
description: TTLSecondsAfterFinished is the TTL to clean up jobs.
It may take extra ReconcilePeriod seconds for the cleanup, since
reconcile gets called periodically. Default to infinite.
format: int32
type: integer
type: object
required:
- pytorchReplicaSpecs
type: object
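A hedged sketch of the same migration for a PyTorchJob (not taken from the PR): fields that were previously set directly under spec, such as cleanPodPolicy and ttlSecondsAfterFinished, move under spec.runPolicy. The removed top-level description says cleanPodPolicy defaults to None, while the new runPolicy description says it defaults to Running, so the sketch sets it explicitly rather than relying on either default. Names, image and values are placeholders; `kubeflow.org/v1` and the `pytorch` container name are assumptions.

```yaml
# Hypothetical PyTorchJob manifest; names, image and values are placeholders.
apiVersion: kubeflow.org/v1          # assumed served version of the PyTorchJob CRD
kind: PyTorchJob
metadata:
  name: pytorch-dist-example         # hypothetical name
spec:
  # Previously these two fields sat directly under spec. Under the updated
  # schema they are only declared inside runPolicy, so a top-level copy would
  # no longer match the schema (and would likely be pruned by the API server).
  runPolicy:
    cleanPodPolicy: None             # set explicitly; the documented default changed
    ttlSecondsAfterFinished: 600
  pytorchReplicaSpecs:
    Master:
      replicas: 1
      restartPolicy: OnFailure
      template:
        spec:
          containers:
            - name: pytorch          # assumed default container name
              image: example.com/pytorch-dist:latest   # hypothetical image
    Worker:
      replicas: 2
      restartPolicy: OnFailure
      template:
        spec:
          containers:
            - name: pytorch
              image: example.com/pytorch-dist:latest
```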
config/crd/bases/kubeflow.org_tfjobs.yaml: 47 additions, 37 deletions
@@ -31,47 +31,63 @@ spec:
object represents. Servers may infer this from the endpoint the client
submits requests to. Cannot be updated. In CamelCase. More info: https://git.k8s.io/community/contributors/devel/sig-architecture/api-conventions.md#types-kinds'
type: string
metadata:
description: Standard Kubernetes object's metadata.
type: object
spec:
description: Specification of the desired state of the TFJob.
properties:
activeDeadlineSeconds:
description: Specifies the duration in seconds relative to the startTime
that the job may be active before the system tries to terminate
it; value must be positive integer.
format: int64
type: integer
backoffLimit:
description: Optional number of retries before marking this job failed.
format: int32
type: integer
cleanPodPolicy:
description: CleanPodPolicy defines the policy to kill pods after
the job completes. Default to Running.
type: string
enableDynamicWorker:
description: A switch to enable dynamic worker
type: boolean
schedulingPolicy:
description: SchedulingPolicy defines the policy related to scheduling,
e.g. gang-scheduling
runPolicy:
description: RunPolicy encapsulates various runtime policies of the
distributed training job, for example how to clean up resources
and how long the job can stay active.
properties:
minAvailable:
activeDeadlineSeconds:
description: Specifies the duration in seconds relative to the
startTime that the job may be active before the system tries
to terminate it; value must be positive integer.
format: int64
type: integer
backoffLimit:
description: Optional number of retries before marking this job
failed.
format: int32
type: integer
minResources:
additionalProperties:
anyOf:
- type: integer
- type: string
pattern: ^(\+|-)?(([0-9]+(\.[0-9]*)?)|(\.[0-9]+))(([KMGTPE]i)|[numkMGTPE]|([eE](\+|-)?(([0-9]+(\.[0-9]*)?)|(\.[0-9]+))))?$
x-kubernetes-int-or-string: true
description: ResourceList is a set of (resource name, quantity)
pairs.
type: object
priorityClass:
type: string
queue:
cleanPodPolicy:
description: CleanPodPolicy defines the policy to kill pods after
the job completes. Default to Running.
type: string
schedulingPolicy:
description: SchedulingPolicy defines the policy related to scheduling,
e.g. gang-scheduling
properties:
minAvailable:
format: int32
type: integer
minResources:
additionalProperties:
anyOf:
- type: integer
- type: string
pattern: ^(\+|-)?(([0-9]+(\.[0-9]*)?)|(\.[0-9]+))(([KMGTPE]i)|[numkMGTPE]|([eE](\+|-)?(([0-9]+(\.[0-9]*)?)|(\.[0-9]+))))?$
x-kubernetes-int-or-string: true
description: ResourceList is a set of (resource name, quantity)
pairs.
type: object
priorityClass:
type: string
queue:
type: string
type: object
ttlSecondsAfterFinished:
description: TTLSecondsAfterFinished is the TTL to clean up jobs.
It may take extra ReconcilePeriod seconds for the cleanup, since
reconcile gets called periodically. Default to infinite.
format: int32
type: integer
type: object
successPolicy:
description: SuccessPolicy defines the policy to mark the TFJob as
@@ -6795,12 +6811,6 @@ spec:
Specifies the TF cluster configuration. For example, { "PS":
ReplicaSpec, "Worker": ReplicaSpec, }'
type: object
ttlSecondsAfterFinished:
description: TTLSecondsAfterFinished is the TTL to clean up jobs.
It may take extra ReconcilePeriod seconds for the cleanup, since
reconcile gets called periodically. Default to infinite.
format: int32
type: integer
required:
- tfReplicaSpecs
type: object
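A sketch (not from the PR) of a TFJob using the new layout, with gang-scheduling settings nested under spec.runPolicy.schedulingPolicy, while successPolicy and enableDynamicWorker remain directly under spec as unchanged context in the diff. The queue, PriorityClass, resource amounts, image and the `kubeflow.org/v1` version are hypothetical; whether schedulingPolicy has any effect likely depends on a gang scheduler being enabled for the operator, which this diff does not show.

```yaml
# Hypothetical TFJob manifest; names, image and values are placeholders.
apiVersion: kubeflow.org/v1          # assumed served version of the TFJob CRD
kind: TFJob
metadata:
  name: tf-dist-example              # hypothetical name
spec:
  successPolicy: AllWorkers          # stays under spec; assumed accepted value
  enableDynamicWorker: false         # stays under spec
  runPolicy:
    cleanPodPolicy: Running
    backoffLimit: 2
    schedulingPolicy:                # moved from spec to spec.runPolicy
      minAvailable: 3                # int32: 1 PS + 2 workers below
      queue: default                 # hypothetical queue name
      priorityClass: high-priority   # hypothetical PriorityClass
      minResources:                  # ResourceList: int-or-string quantities
        cpu: "6"
        memory: 12Gi
  tfReplicaSpecs:
    PS:
      replicas: 1
      restartPolicy: Never
      template:
        spec:
          containers:
            - name: tensorflow       # assumed default container name
              image: example.com/tf-dist:latest   # hypothetical image
    Worker:
      replicas: 2
      restartPolicy: Never
      template:
        spec:
          containers:
            - name: tensorflow
              image: example.com/tf-dist:latest
```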