- Support ScaleUp/ScaleDown with Strong Safety Guarantee
- Support using a whole-cluster shared etcd or a per-application dedicated etcd. If the latter is used, the etcd will be automatically cleaned up when the whole FrameworkAttempt is completed
- Common Feature
- Need to set up Kubernetes DNS
- Need to set up the Kubernetes GPU Device Plugin, with at least 4 NVIDIA GPUs available in the cluster
- Need to set up Kubernetes Cluster-Level Logging, if you need to persist and expose logs of deleted Pods
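A quick sanity check of these prerequisites can look like the sketch below; the exact component names (for example, whether cluster DNS runs as kube-dns or coredns) depend on how the cluster was provisioned, so adjust accordingly:
# Check that cluster DNS is serving (the Service name may differ on your cluster).
kubectl get svc kube-dns -n kube-system
# Check that the GPU device plugin exposes NVIDIA GPUs on the nodes.
kubectl describe nodes | grep -i "nvidia.com/gpu"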
- Create a Service for etcd as below, so that etcd can be discovered by the training workers (see the note after the manifest):
apiVersion: v1
kind: Service
metadata:
name: pet-etcd
spec:
selector:
FC_FRAMEWORK_NAME: pet
FC_TASKROLE_NAME: etcd
ports:
- targetPort: 2379
port: 2379
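The selector above matches the FC_FRAMEWORK_NAME / FC_TASKROLE_NAME labels that FrameworkController puts on the Pods it creates, so the Service will pick up the etcd Pod of the Framework defined later. A minimal sketch of creating and checking it, assuming the manifest is saved as pet-etcd-service.yaml (a hypothetical file name):
kubectl apply -f pet-etcd-service.yaml
# Once the Framework's etcd Pod is running, it should show up as an endpoint.
kubectl get endpoints pet-etcd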
- Create a headless Service for the training workers as below, so that they can discover each other (see the note after the manifest):
# See comments in ./example/framework/basic/servicestateful.yaml
apiVersion: v1
kind: Service
metadata:
name: pet-worker
spec:
clusterIP: None
publishNotReadyAddresses: true
selector:
FC_FRAMEWORK_NAME: pet
FC_TASKROLE_NAME: worker
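Because this Service is headless (clusterIP: None) and the worker Pods set matching hostname/subdomain fields in the Framework spec below, each worker gets a stable DNS name such as worker-0.pet-worker.default.svc.cluster.local, which torchelastic later reports as master_addr. A minimal sketch, assuming the manifest is saved as pet-worker-service.yaml (a hypothetical file name):
kubectl apply -f pet-worker-service.yaml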
- Create a Framework for training as below, and wait until all Tasks are AttemptRunning (a sketch of how to submit and watch it follows the manifest):
apiVersion: frameworkcontroller.microsoft.com/v1
kind: Framework
metadata:
name: pet
spec:
executionType: Start
retryPolicy:
fancyRetryPolicy: true
maxRetryCount: 2
taskRoles:
- name: etcd
taskNumber: 1
frameworkAttemptCompletionPolicy:
minFailedTaskCount: 1
minSucceededTaskCount: -1
task:
# Always retry the etcd Task if only etcd itself fails
retryPolicy:
fancyRetryPolicy: false
maxRetryCount: -2
# Use a large timeout before force deleting the Pod, since force deletion may break the stateful batch.
podGracefulDeletionTimeoutSec: 1800
pod:
spec:
restartPolicy: Always
containers:
- name: etcd
image: quay.io/coreos/etcd:v3.4.9
command: [
"sh", "-c",
"/usr/local/bin/etcd
--data-dir /var/lib/etcd --enable-v2
--listen-client-urls http://0.0.0.0:2379
--advertise-client-urls http://0.0.0.0:2379
--initial-cluster-state new"]
ports:
- containerPort: 2379
- name: worker
# Should be within the torchelastic nnodes range.
taskNumber: 4
# As the exit barrier is not yet supported in the below torchelastic image, it is
# better to wait until all workers have succeeded and only then succeed the
# whole training.
frameworkAttemptCompletionPolicy:
minFailedTaskCount: 1
minSucceededTaskCount: 4
task:
retryPolicy:
fancyRetryPolicy: true
maxRetryCount: 2
# Use a large timeout before force deleting the Pod, since force deletion may break the stateful batch.
podGracefulDeletionTimeoutSec: 1800
pod:
spec:
# See comments in ./example/framework/basic/servicestateful.yaml
hostname: "{{FC_TASKROLE_NAME}}-{{FC_TASK_INDEX}}"
subdomain: "{{FC_FRAMEWORK_NAME}}-{{FC_TASKROLE_NAME}}"
restartPolicy: Never
containers:
- name: pytorch
# Use the official torchelastic/examples image to demonstrate this example.
# The imagenet example does not require a distributed shared file system, since it
# broadcasts the checkpoint between workers:
# https://github.com/pytorch/elastic/blob/45dc33f3eca1344fe1fd84634fb0d62767822f3e/examples/imagenet/main.py#L336-L343
image: torchelastic/examples:0.2.0
command: [
"sh", "-c",
"python -m torchelastic.distributed.launch
--rdzv_backend=etcd
--rdzv_endpoint=${FC_FRAMEWORK_NAME}-etcd:2379
--rdzv_id=${FC_FRAMEWORK_NAME}
--nnodes=1:4
--nproc_per_node=1
/workspace/examples/imagenet/main.py
--arch=resnet18 --batch-size=32 --epochs=2
/workspace/data/tiny-imagenet-200"]
resources:
limits:
# Should equal the torchelastic nproc_per_node.
nvidia.com/gpu: 1
volumeMounts:
# Mount shared memory, otherwise PyTorch data loaders may run out of memory.
- name: shm-volume
mountPath: /dev/shm
volumes:
- name: shm-volume
emptyDir:
medium: Memory
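A minimal sketch of submitting and watching the Framework, assuming the manifest above is saved as pet.yaml (a hypothetical file name):
kubectl apply -f pet.yaml
# Watch the Pods come up; FrameworkController labels them with the Framework and TaskRole names.
kubectl get pods -l FC_FRAMEWORK_NAME=pet -o wide -w
# Check the per-Task states (e.g. AttemptRunning) reported in the Framework Status.
kubectl get framework pet -o yaml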
- All workers will train the model, with logs like below (a sketch of how to follow them comes after the excerpt):
[INFO] 2020-07-31 06:52:44,968 launch: Running torchelastic.distributed.launch with args: ['/opt/conda/lib/python3.7/site-packages/torchelastic/distributed/launch.py', '--rdzv_backend=etcd', '--rdzv_endpoint=pet-etcd:2379', '--rdzv_id=pet', '--nnodes=1:4', '--nproc_per_node=1', '/workspace/examples/imagenet/main.py', '--arch=resnet18', '--batch-size=32', '--epochs=2', '/workspace/data/tiny-imagenet-200']
INFO 2020-07-31 06:52:44,976 Etcd machines: ['http://0.0.0.0:2379']
[INFO] 2020-07-31 06:52:44,985 launch: Using nproc_per_node=1.
[INFO] 2020-07-31 06:52:45,715 api: [default] starting workers for function: wrapper_fn
[INFO] 2020-07-31 06:52:45,715 api: [default] Rendezvous'ing worker group
INFO 2020-07-31 06:52:45,715 Attempting to join next rendezvous
INFO 2020-07-31 06:52:45,723 New rendezvous state created: {'status': 'joinable', 'version': '1', 'participants': []}
INFO 2020-07-31 06:52:45,739 Joined rendezvous version 1 as rank 0. Full state: {'status': 'joinable', 'version': '1', 'participants': [0]}
INFO 2020-07-31 06:52:45,739 Rank 0 is responsible for join last call.
INFO 2020-07-31 06:52:46,942 Rank 0 finished join last call.
INFO 2020-07-31 06:52:46,942 Waiting for remaining peers.
INFO 2020-07-31 06:52:46,943 All peers arrived. Confirming membership.
INFO 2020-07-31 06:52:46,954 Waiting for confirmations from all peers.
INFO 2020-07-31 06:52:47,064 Rendezvous version 1 is complete. Final state: {'status': 'final', 'version': '1', 'participants': [0, 1, 2, 3], 'keep_alives': ['/torchelastic/p2p/run_pet/rdzv/v_1/rank_0', '/torchelastic/p2p/run_pet/rdzv/v_1/rank_3', '/torchelastic/p2p/run_pet/rdzv/v_1/rank_2', '/torchelastic/p2p/run_pet/rdzv/v_1/rank_1'], 'num_workers_waiting': 0}
INFO 2020-07-31 06:52:47,064 Creating EtcdStore as the c10d::Store implementation
[INFO] 2020-07-31 06:52:47,071 api: [default] Rendezvous complete for workers.
Result:
restart_count=0
group_rank=0
group_world_size=4
rank stride=1
assigned global_ranks=[0]
master_addr=worker-0.pet-worker.default.svc.cluster.local
master_port=43429
[INFO] 2020-07-31 06:52:47,071 api: [default] Starting worker group
=> set cuda device = 0
=> creating model: resnet18
=> no workers have checkpoints, starting from epoch 0
=> start_epoch: 0, best_acc1: 0
Epoch: [0][ 0/782] Time 3.613 ( 3.613) Data 0.079 ( 0.079) Loss 7.0412e+00 (7.0412e+00) Acc@1 0.00 ( 0.00) Acc@5 0.00 ( 0.00)
Epoch: [0][ 10/782] Time 1.638 ( 1.849) Data 0.086 ( 0.300) Loss 5.7640e+00 (6.3097e+00) Acc@1 0.00 ( 0.00) Acc@5 0.00 ( 1.99)
......
Test: [300/313] Time 0.122 ( 0.167) Loss 7.0159e+00 (7.0172e+00) Acc@1 0.00 ( 0.40) Acc@5 6.25 ( 1.43)
Test: [310/313] Time 0.139 ( 0.166) Loss 7.3541e+00 (7.0174e+00) Acc@1 0.00 ( 0.39) Acc@5 3.12 ( 1.43)
* Acc@1 0.390 Acc@5 1.420
=> saved checkpoint for epoch 0 at /tmp/checkpoint.pth.tar
=> best model found at epoch 0 saving to /tmp/model_best.pth.tar
Epoch: [1][ 0/782] Time 6.522 ( 6.522) Data 0.052 ( 0.052) Loss 4.4326e+00 (4.4326e+00) Acc@1 3.12 ( 3.12) Acc@5 15.62 ( 15.62)
Epoch: [1][ 10/782] Time 1.427 ( 1.703) Data 0.045 ( 0.202) Loss 4.3480e+00 (4.4527e+00) Acc@1 0.00 ( 7.67) Acc@5 34.38 ( 22.73)
......
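The excerpt above is the per-worker container log; it can be followed with standard kubectl commands, for example:
# Follow the training log of the first worker's pytorch container.
kubectl logs -f pet-worker-0 -c pytorch
# The etcd rendezvous server's log is available the same way.
kubectl logs -f pet-etcd-0 -c etcd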
- ScaleDown Framework: Decrease worker taskNumber and minSucceededTaskCount from 4 to 2 by the below patch (a sketch of how to apply it comes after the patch):
[
{
"op": "test",
"path": "/spec/taskRoles/1/name",
"value": "worker"
},
{
"op": "replace",
"path": "/spec/taskRoles/1/taskNumber",
"value": 2
},
{
"op": "replace",
"path": "/spec/taskRoles/1/frameworkAttemptCompletionPolicy/minSucceededTaskCount",
"value": 2
}
]
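A minimal sketch of applying this JSON patch, assuming it is saved as pet-scale-down.json (a hypothetical file name):
# Apply the scale-down patch to the running Framework.
kubectl patch framework pet --type=json --patch "$(cat pet-scale-down.json)"
FrameworkController then deletes pet-worker-2 and pet-worker-3, leaving pet-worker-0 and pet-worker-1 running.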
- The remaining workers pet-worker-0 and pet-worker-1 will re-rendezvous and recover from the last epoch checkpoint, with logs like below (a quick Pod check is sketched after the excerpt):
......
Epoch: [1][180/782] Time 1.186 ( 1.230) Data 0.095 ( 0.179) Loss 4.2262e+00 (4.3819e+00) Acc@1 9.38 ( 9.12) Acc@5 34.38 ( 25.57)
Epoch: [1][190/782] Time 1.580 ( 1.230) Data 0.782 ( 0.180) Loss 3.9395e+00 (4.3789e+00) Acc@1 9.38 ( 9.18) Acc@5 34.38 ( 25.59)
Traceback (most recent call last):
File "/workspace/examples/imagenet/main.py", line 603, in <module>
main()
File "/workspace/examples/imagenet/main.py", line 188, in main
train(train_loader, model, criterion, optimizer, epoch, device_id, print_freq)
File "/workspace/examples/imagenet/main.py", line 471, in train
loss.backward()
File "/opt/conda/lib/python3.7/site-packages/torch/tensor.py", line 198, in backward
torch.autograd.backward(self, gradient, retain_graph, create_graph)
File "/opt/conda/lib/python3.7/site-packages/torch/autograd/__init__.py", line 100, in backward
allow_unreachable=True) # allow_unreachable flag
RuntimeError: NCCL error: unhandled system error, NCCL version 2.4.8
[ERROR] 2020-07-31 07:15:34,783 local_elastic_agent: [default] Worker group failed
Traceback (most recent call last):
File "/opt/conda/lib/python3.7/site-packages/torchelastic/agent/server/local_elastic_agent.py", line 190, in _monitor_workers
if self._process_context.join(timeout=-1):
File "/opt/conda/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 119, in join
raise Exception(msg)
Exception:
-- Process 0 terminated with the following error:
Traceback (most recent call last):
File "/opt/conda/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 20, in _wrap
fn(i, *args)
File "/opt/conda/lib/python3.7/site-packages/torchelastic/agent/server/local_elastic_agent.py", line 79, in _wrap
ret = fn(*args)
File "/opt/conda/lib/python3.7/site-packages/torchelastic/distributed/launch.py", line 392, in wrapper_fn
raise subprocess.CalledProcessError(returncode=process.returncode, cmd=cmd)
subprocess.CalledProcessError: Command '['/opt/conda/bin/python', '-u', '/workspace/examples/imagenet/main.py', '--arch=resnet18', '--batch-size=32', '--epochs=2', '/workspace/data/tiny-imagenet-200']' returned non-zero exit status 1.
[INFO] 2020-07-31 07:15:34,785 api: [default] Worker group FAILED. 3/3 attempts left; will restart worker group
[INFO] 2020-07-31 07:15:34,785 api: [default] Stopping worker group
[INFO] 2020-07-31 07:15:34,785 api: [default] Rendezvous'ing worker group
INFO 2020-07-31 07:15:34,785 Attempting to join next rendezvous
INFO 2020-07-31 07:15:34,791 Observed existing rendezvous state: {'status': 'joinable', 'version': '2', 'participants': [0]}
INFO 2020-07-31 07:15:34,826 Joined rendezvous version 2 as rank 1. Full state: {'status': 'joinable', 'version': '2', 'participants': [0, 1]}
INFO 2020-07-31 07:15:34,826 Waiting for remaining peers.
INFO 2020-07-31 07:16:04,896 All peers arrived. Confirming membership.
INFO 2020-07-31 07:16:04,904 Waiting for confirmations from all peers.
INFO 2020-07-31 07:16:04,908 Rendezvous version 2 is complete. Final state: {'status': 'final', 'version': '2', 'participants': [0, 1], 'keep_alives': ['/torchelastic/p2p/run_pet/rdzv/v_2/rank_1', '/torchelastic/p2p/run_pet/rdzv/v_2/rank_0'], 'num_workers_waiting': 0}
INFO 2020-07-31 07:16:04,908 Creating EtcdStore as the c10d::Store implementation
[INFO] 2020-07-31 07:16:04,915 api: [default] Rendezvous complete for workers.
Result:
restart_count=1
group_rank=1
group_world_size=2
rank stride=1
assigned global_ranks=[1]
master_addr=worker-1.pet-worker.default.svc.cluster.local
master_port=55787
[INFO] 2020-07-31 07:16:04,915 api: [default] Starting worker group
=> set cuda device = 0
=> creating model: resnet18
=> loading checkpoint file: /tmp/checkpoint.pth.tar
=> loaded checkpoint file: /tmp/checkpoint.pth.tar
=> using checkpoint from rank: 1, max_epoch: 0
=> checkpoint broadcast size is: 93588276
/opt/conda/conda-bld/pytorch_1587428398394/work/torch/csrc/utils/tensor_numpy.cpp:141: UserWarning: The given NumPy array is not writeable, and PyTorch does not support non-writeable tensors. This means you can write to the underlying (supposedly non-writeable) NumPy array using the tensor. You may want to copy the array to protect its data or make it writeable before converting it to a tensor. This type of warning will be suppressed for the rest of this program.
=> done broadcasting checkpoint
=> done restoring from previous checkpoint
=> start_epoch: 1, best_acc1: 0.38999998569488525
Epoch: [1][ 0/1563] Time 2.916 ( 2.916) Data 0.095 ( 0.095) Loss 4.3512e+00 (4.3512e+00) Acc@1 15.62 ( 15.62) Acc@5 21.88 ( 21.88)
Epoch: [1][ 10/1563] Time 0.650 ( 0.833) Data 0.043 ( 0.090) Loss 4.4603e+00 (4.4707e+00) Acc@1 12.50 ( 8.52) Acc@5 21.88 ( 20.45)
......
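After the scale-down has taken effect, only pet-worker-0 and pet-worker-1 should remain, which can be confirmed with, for example:
# List this Framework's worker Pods via the labels FrameworkController sets on them.
kubectl get pods -l FC_FRAMEWORK_NAME=pet,FC_TASKROLE_NAME=worker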
- ScaleUp Framework: Increase worker taskNumber and minSucceededTaskCount from 2 to 3 by the below patch (applied as sketched after the patch):
[
{
"op": "test",
"path": "/spec/taskRoles/1/name",
"value": "worker"
},
{
"op": "replace",
"path": "/spec/taskRoles/1/taskNumber",
"value": 3
},
{
"op": "replace",
"path": "/spec/taskRoles/1/frameworkAttemptCompletionPolicy/minSucceededTaskCount",
"value": 3
}
]
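This patch is applied the same way as the scale-down patch, for example (assuming it is saved as pet-scale-up.json, a hypothetical file name):
kubectl patch framework pet --type=json --patch "$(cat pet-scale-up.json)"
FrameworkController then creates pet-worker-2 again, which joins the next rendezvous as shown below.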
- Workers pet-worker-0, pet-worker-1 and the newly added pet-worker-2 will re-rendezvous and recover from the last epoch checkpoint, with logs like below:
......
Epoch: [1][1450/1563] Time 0.563 ( 0.783) Data 0.108 ( 0.177) Loss 4.0248e+00 (4.2794e+00) Acc@1 12.50 ( 10.54) Acc@5 37.50 ( 28.52)
Epoch: [1][1460/1563] Time 0.560 ( 0.783) Data 0.074 ( 0.179) Loss 4.5901e+00 (4.2787e+00) Acc@1 6.25 ( 10.55) Acc@5 18.75 ( 28.53)
[INFO] 2020-07-31 07:35:16,348 api: [default] Detected 1 new nodes from group_rank=1; will restart worker group
[INFO] 2020-07-31 07:35:16,349 api: [default] Stopping worker group
[INFO] 2020-07-31 07:35:21,306 api: [default] Rendezvous'ing worker group
INFO 2020-07-31 07:35:21,307 Attempting to join next rendezvous
INFO 2020-07-31 07:35:21,310 Observed existing rendezvous state: {'status': 'final', 'version': '2', 'participants': [0, 1], 'keep_alives': ['/torchelastic/p2p/run_pet/rdzv/v_2/rank_1', '/torchelastic/p2p/run_pet/rdzv/v_2/rank_0'], 'num_workers_waiting': 1}
INFO 2020-07-31 07:35:21,363 Announce self as waiting CAS unsuccessful, retrying
INFO 2020-07-31 07:35:21,411 Added self to waiting list. Rendezvous full state: {"status": "final", "version": "2", "participants": [0, 1], "keep_alives": ["/torchelastic/p2p/run_pet/rdzv/v_2/rank_1", "/torchelastic/p2p/run_pet/rdzv/v_2/rank_0"], "num_workers_waiting": 3}
INFO 2020-07-31 07:35:30,806 Keep-alive key /torchelastic/p2p/run_pet/rdzv/v_2/rank_1 is not renewed.
INFO 2020-07-31 07:35:30,807 Rendevous version 2 is incomplete.
INFO 2020-07-31 07:35:30,807 Attempting to destroy it.
INFO 2020-07-31 07:35:30,808 Rendezvous attempt failed, will retry. Reason: Key not found : /torchelastic/p2p/run_pet/rdzv/active_version
INFO 2020-07-31 07:35:31,810 Attempting to join next rendezvous
INFO 2020-07-31 07:35:31,813 Observed existing rendezvous state: {'status': 'joinable', 'version': '3', 'participants': [0]}
INFO 2020-07-31 07:35:31,867 Joined rendezvous version 3 as rank 1. Full state: {'status': 'joinable', 'version': '3', 'participants': [0, 1]}
INFO 2020-07-31 07:35:31,868 Waiting for remaining peers.
INFO 2020-07-31 07:36:01,831 All peers arrived. Confirming membership.
INFO 2020-07-31 07:36:01,882 Waiting for confirmations from all peers.
INFO 2020-07-31 07:36:01,915 Rendezvous version 3 is complete. Final state: {'status': 'final', 'version': '3', 'participants': [0, 1, 2], 'keep_alives': ['/torchelastic/p2p/run_pet/rdzv/v_3/rank_0', '/torchelastic/p2p/run_pet/rdzv/v_3/rank_1', '/torchelastic/p2p/run_pet/rdzv/v_3/rank_2'], 'num_workers_waiting': 0}
INFO 2020-07-31 07:36:01,915 Creating EtcdStore as the c10d::Store implementation
[INFO] 2020-07-31 07:36:01,919 api: [default] Rendezvous complete for workers.
Result:
restart_count=1
group_rank=1
group_world_size=3
rank stride=1
assigned global_ranks=[1]
master_addr=worker-1.pet-worker.default.svc.cluster.local
master_port=44823
[INFO] 2020-07-31 07:36:01,919 api: [default] Starting worker group
=> set cuda device = 0
=> creating model: resnet18
=> loading checkpoint file: /tmp/checkpoint.pth.tar
=> loaded checkpoint file: /tmp/checkpoint.pth.tar
=> using checkpoint from rank: 1, max_epoch: 0
=> checkpoint broadcast size is: 93588276
/opt/conda/conda-bld/pytorch_1587428398394/work/torch/csrc/utils/tensor_numpy.cpp:141: UserWarning: The given NumPy array is not writeable, and PyTorch does not support non-writeable tensors. This means you can write to the underlying (supposedly non-writeable) NumPy array using the tensor. You may want to copy the array to protect its data or make it writeable before converting it to a tensor. This type of warning will be suppressed for the rest of this program.
=> done broadcasting checkpoint
=> done restoring from previous checkpoint
=> start_epoch: 1, best_acc1: 0.38999998569488525
Epoch: [1][ 0/1042] Time 2.741 ( 2.741) Data 0.073 ( 0.073) Loss 4.7370e+00 (4.7370e+00) Acc@1 0.00 ( 0.00) Acc@5 15.62 ( 15.62)
Epoch: [1][ 10/1042] Time 0.732 ( 1.339) Data 0.045 ( 0.292) Loss 4.5558e+00 (4.4728e+00) Acc@1 3.12 ( 7.39) Acc@5 12.50 ( 23.86)
......
- When the training completes, all workers will succeed, with logs like below:
......
Test: [300/313] Time 0.118 ( 0.179) Loss 8.1714e+00 (8.1763e+00) Acc@1 3.12 ( 1.15) Acc@5 6.25 ( 3.87)
Test: [310/313] Time 0.281 ( 0.177) Loss 8.7371e+00 (8.1767e+00) Acc@1 0.00 ( 1.14) Acc@5 6.25 ( 3.86)
* Acc@1 1.130 Acc@5 3.850
=> saved checkpoint for epoch 1 at /tmp/checkpoint.pth.tar
=> best model found at epoch 1 saving to /tmp/model_best.pth.tar
[INFO] 2020-07-31 07:55:16,413 api: [default] All workers successfully finished.
- Finally, the whole Framework will succeed, with Status like below:
apiVersion: frameworkcontroller.microsoft.com/v1
kind: Framework
metadata:
name: pet
namespace: default
creationTimestamp: '2020-07-31T06:52:25Z'
generation: 43
resourceVersion: '35492058'
selfLink: "/apis/frameworkcontroller.microsoft.com/v1/namespaces/default/frameworks/pet"
uid: c3f9e3a1-314d-4b5f-94b9-9287d15ac5d6
spec:
executionType: Start
retryPolicy:
fancyRetryPolicy: true
maxRetryCount: 2
taskRoles:
- name: etcd
taskNumber: 1
frameworkAttemptCompletionPolicy:
minFailedTaskCount: 1
minSucceededTaskCount: -1
task:
retryPolicy:
fancyRetryPolicy: false
maxRetryCount: -2
podGracefulDeletionTimeoutSec: 1800
pod:
spec:
restartPolicy: Always
containers:
- name: etcd
image: quay.io/coreos/etcd:v3.4.9
command:
- sh
- "-c"
- "/usr/local/bin/etcd --data-dir /var/lib/etcd --enable-v2 --listen-client-urls
http://0.0.0.0:2379 --advertise-client-urls http://0.0.0.0:2379 --initial-cluster-state
new"
ports:
- containerPort: 2379
- name: worker
taskNumber: 3
frameworkAttemptCompletionPolicy:
minFailedTaskCount: 1
minSucceededTaskCount: 3
task:
retryPolicy:
fancyRetryPolicy: true
maxRetryCount: 2
podGracefulDeletionTimeoutSec: 1800
pod:
spec:
hostname: "{{FC_TASKROLE_NAME}}-{{FC_TASK_INDEX}}"
subdomain: "{{FC_FRAMEWORK_NAME}}-{{FC_TASKROLE_NAME}}"
restartPolicy: Never
containers:
- name: pytorch
image: torchelastic/examples:0.2.0
command:
- sh
- "-c"
- python -m torchelastic.distributed.launch --rdzv_backend=etcd --rdzv_endpoint=${FC_FRAMEWORK_NAME}-etcd:2379
--rdzv_id=${FC_FRAMEWORK_NAME} --nnodes=1:4 --nproc_per_node=1 /workspace/examples/imagenet/main.py
--arch=resnet18 --batch-size=32 --epochs=2 /workspace/data/tiny-imagenet-200
resources:
limits:
nvidia.com/gpu: '1'
volumeMounts:
- name: shm-volume
mountPath: "/dev/shm"
volumes:
- name: shm-volume
emptyDir:
medium: Memory
status:
startTime: '2020-07-31T06:52:25Z'
completionTime: '2020-07-31T07:56:14Z'
state: Completed
transitionTime: '2020-07-31T07:56:14Z'
retryPolicyStatus:
accountableRetriedCount: 0
retryDelaySec:
totalRetriedCount: 0
attemptStatus:
id: 0
startTime: '2020-07-31T06:52:25Z'
runTime: '2020-07-31T06:52:30Z'
completionTime: '2020-07-31T07:56:14Z'
instanceUID: 0_da3b7866-d1b2-45be-8222-ff85ac67ef23
configMapName: pet-attempt
configMapUID: da3b7866-d1b2-45be-8222-ff85ac67ef23
completionStatus:
code: 0
phrase: Succeeded
type:
attributes: []
name: Succeeded
diagnostics: Pod succeeded
trigger:
message: SucceededTaskCount 3 has reached MinSucceededTaskCount 3 in the TaskRole
taskRoleName: worker
taskIndex: 1
taskRoleStatuses:
- name: etcd
podGracefulDeletionTimeoutSec: 1800
taskStatuses:
- index: 0
startTime: '2020-07-31T06:52:25Z'
completionTime: '2020-07-31T07:56:14Z'
state: Completed
transitionTime: '2020-07-31T07:56:14Z'
deletionPending: false
retryPolicyStatus:
accountableRetriedCount: 0
retryDelaySec:
totalRetriedCount: 0
attemptStatus:
id: 0
startTime: '2020-07-31T06:52:25Z'
runTime: '2020-07-31T06:52:30Z'
completionTime: '2020-07-31T07:56:13Z'
instanceUID: 0_7870a450-4eb4-4596-a0d2-be5c6727de03
podName: pet-etcd-0
podUID: 7870a450-4eb4-4596-a0d2-be5c6727de03
podNodeName: node11
podIP: 10.207.128.5
podHostIP: 10.151.40.232
completionStatus:
code: -220
phrase: FrameworkAttemptCompletion
type:
attributes:
- Permanent
name: Failed
diagnostics: Stop to complete current FrameworkAttempt
- name: worker
podGracefulDeletionTimeoutSec: 1800
taskStatuses:
- index: 0
startTime: '2020-07-31T06:52:25Z'
completionTime: '2020-07-31T07:55:18Z'
state: Completed
transitionTime: '2020-07-31T07:55:18Z'
deletionPending: false
retryPolicyStatus:
accountableRetriedCount: 0
retryDelaySec:
totalRetriedCount: 0
attemptStatus:
id: 0
startTime: '2020-07-31T06:52:25Z'
runTime: '2020-07-31T06:52:30Z'
completionTime: '2020-07-31T07:55:18Z'
instanceUID: 0_a550a22a-6371-4ac3-991b-fc5957fb0dac
podName: pet-worker-0
podUID: a550a22a-6371-4ac3-991b-fc5957fb0dac
podNodeName: node9
podIP: 10.204.128.1
podHostIP: 10.151.40.230
completionStatus:
code: 0
phrase: Succeeded
type:
attributes: []
name: Succeeded
diagnostics: Pod succeeded
pod:
containers:
- code: 0
name: pytorch
reason: Completed
- index: 1
startTime: '2020-07-31T06:52:25Z'
completionTime: '2020-07-31T07:55:30Z'
state: Completed
transitionTime: '2020-07-31T07:55:30Z'
deletionPending: false
retryPolicyStatus:
accountableRetriedCount: 0
retryDelaySec:
totalRetriedCount: 0
attemptStatus:
id: 0
startTime: '2020-07-31T06:52:25Z'
runTime: '2020-07-31T06:52:31Z'
completionTime: '2020-07-31T07:55:29Z'
instanceUID: 0_6213facb-d11d-416d-9530-32adeb708439
podName: pet-worker-1
podUID: 6213facb-d11d-416d-9530-32adeb708439
podNodeName: node10
podIP: 10.201.0.2
podHostIP: 10.151.40.231
completionStatus:
code: 0
phrase: Succeeded
type:
attributes: []
name: Succeeded
diagnostics: Pod succeeded
pod:
containers:
- code: 0
name: pytorch
reason: Completed
- index: 2
startTime: '2020-07-31T07:34:55Z'
completionTime: '2020-07-31T07:55:22Z'
state: Completed
transitionTime: '2020-07-31T07:55:22Z'
deletionPending: false
retryPolicyStatus:
accountableRetriedCount: 0
retryDelaySec:
totalRetriedCount: 0
attemptStatus:
id: 0
startTime: '2020-07-31T07:34:55Z'
runTime: '2020-07-31T07:34:58Z'
completionTime: '2020-07-31T07:55:22Z'
instanceUID: 0_d12681fc-986d-4f36-95f3-f0edde5750ca
podName: pet-worker-2
podUID: d12681fc-986d-4f36-95f3-f0edde5750ca
podNodeName: node6
podIP: 10.202.128.6
podHostIP: 10.151.40.227
completionStatus:
code: 0
phrase: Succeeded
type:
attributes: []
name: Succeeded
diagnostics: Pod succeeded
pod:
containers:
- code: 0
name: pytorch
reason: Completed
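Once done with the example, the created resources can be cleaned up with, for example:
# Delete the Framework; its Pods and the attempt ConfigMap are removed along with it.
kubectl delete framework pet
# Delete the two Services created for this example.
kubectl delete service pet-etcd pet-worker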