The Job condition's transition is not clear and has bugs #1711
AFAIK, it was a conscious decision to mark the job Running when at least one pod is in the Running state. Others can add more context. /cc @kubeflow/wg-training-leads @gaocegege @zw0610
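For reference, the rule described above boils down to something like the sketch below. This is only my reading of it expressed in Python (the real controller is written in Go and its types differ; the names here are illustrative, not the operator's actual API):

```python
from typing import List, NamedTuple


class Pod(NamedTuple):
    name: str
    phase: str  # "Pending", "Running", "Succeeded", "Failed"


def job_is_running(pods: List[Pod]) -> bool:
    # Behavior as described above: a single Running pod is enough to mark
    # the whole job Running, even if other replicas never start.
    return any(pod.phase == "Running" for pod in pods)


# A master that is up and a worker stuck in Pending still count as "Running".
print(job_is_running([Pod("master-0", "Running"), Pod("worker-0", "Pending")]))  # True
```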
First of all, this discussion is long overdue and quite important. Here are my thoughts:
@zw0610 👍 I agree with you. We should define the meaning of the job conditions.
I used the following PyTorch code to run the test:

```python
import argparse
import os
import torch
import timeit
import numpy as np
import torch.distributed as dist
import torch.optim as optim
import torch.nn.functional as F
from torchvision import models
from torch.nn.parallel import DistributedDataParallel as DDP


def spmd_main(args, data, target):
    # (1) call dist.init_process_group()
    # These are the parameters used to initialize the process group
    env_dict = {
        key: os.environ[key]
        for key in ("MASTER_ADDR", "MASTER_PORT", "RANK", "WORLD_SIZE", "LOCAL_RANK")
    }
    print(f"[{os.getpid()}] Initializing process group with: {env_dict}")
    dist.init_process_group(backend="nccl")
    print(
        f"[{os.getpid()}]: world_size = {dist.get_world_size()}, "
        + f"local_rank = {args.local_rank}, "
        + f"rank = {dist.get_rank()}, backend={dist.get_backend()} \n", end=''
    )

    # (2) get local_rank
    if 'LOCAL_RANK' in os.environ:
        local_rank = int(os.environ["LOCAL_RANK"])
        log('LOCAL_RANK env is not none, use env, the local_rank is %d' % local_rank)
    else:
        local_rank = args.local_rank
        log('LOCAL_RANK env is none, use args, the local_rank is %d' % local_rank)
    world_size = dist.get_world_size()
    rank = dist.get_rank()

    # (3) construct model, ddp_model, optimizer
    model = getattr(models, 'resnet50')().to(local_rank)
    ddp_model = DDP(model, device_ids=[local_rank],
                    output_device=local_rank)
    optimizer = optim.SGD(ddp_model.parameters(), lr=0.001)
    target = target.to(local_rank)

    # (4) call benchmark_step for warm-up
    time = timeit.timeit(lambda: benchmark_step(ddp_model, optimizer, data, target),
                         number=args.num_batches_per_iter)

    # (5) benchmark testing
    img_secs = []
    for x in range(args.num_iters):
        time = timeit.timeit(lambda: benchmark_step(ddp_model, optimizer, data, target),
                             number=args.num_batches_per_iter)
        img_sec = args.batch_size * args.num_batches_per_iter / time
        log('Iter #%d, rank %s: %.1f img/sec per %s' % (x, rank, img_sec, 'gpu'))
        img_secs.append(img_sec)

    # (6) calculate the result of step (5)
    img_sec_mean = np.mean(img_secs)
    img_sec_conf = 1.96 * np.std(img_secs)
    log('Img/sec per %s: %.1f +-%.1f' % ('gpu', img_sec_mean, img_sec_conf))
    log('Total img/sec on %d %s(s): %.1f +-%.1f'
        % (world_size, 'gpu', world_size * img_sec_mean,
           world_size * img_sec_conf))

    # (7) destroy the process group
    dist.destroy_process_group()


def benchmark_step(ddp_model, optimizer, data, target):
    optimizer.zero_grad()
    output = ddp_model(data)
    loss = F.cross_entropy(output, target)
    loss.backward()
    optimizer.step()


def log(s, nl=True):
    print(s, end='\n' if nl else '')


if __name__ == "__main__":
    parser = argparse.ArgumentParser(description='PyTorch Synthetic Benchmark',
                                     formatter_class=argparse.ArgumentDefaultsHelpFormatter)
    parser.add_argument("--local_rank", type=int, default=0)
    parser.add_argument("--local_world_size", type=int, default=1)
    parser.add_argument('--batch-size', type=int, default=32,
                        help='input batch size')
    parser.add_argument('--num-iters', type=int, default=10,
                        help='number of benchmark iterations')
    parser.add_argument('--num-batches-per-iter', type=int, default=10,
                        help='number of batches per benchmark iteration')
    args = parser.parse_args()

    data = torch.randn(args.batch_size, 3, 224, 224)
    target = torch.LongTensor(args.batch_size).random_() % 1000

    # The main entry point is called directly without using subprocess
    spmd_main(args, data, target)
```
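One detail worth noting about the script above: `dist.init_process_group()` is a blocking rendezvous, so with `WORLD_SIZE` set to 3 no rank proceeds to training until all three processes have joined. A minimal local sketch of that behavior (my own illustration using the `gloo` backend and `torch.multiprocessing`, not part of the original test):

```python
import os
import time
import torch.distributed as dist
import torch.multiprocessing as mp


def worker(rank, world_size):
    os.environ["MASTER_ADDR"] = "127.0.0.1"
    os.environ["MASTER_PORT"] = "29500"
    if rank == world_size - 1:
        time.sleep(5)  # simulate one pod/process that is slow to start
    start = time.time()
    # Blocks until all `world_size` ranks have called init_process_group()
    dist.init_process_group("gloo", rank=rank, world_size=world_size)
    print(f"rank {rank} finished rendezvous after {time.time() - start:.1f}s")
    dist.destroy_process_group()


if __name__ == "__main__":
    mp.spawn(worker, args=(3,), nprocs=3)
```

Ranks 0 and 1 report roughly a 5-second wait, i.e. nobody trains until the slowest process is up, which is consistent with the observations below.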
These two pictures are about a PyTorchJob with 1 master and 2 workers, running a PyTorch 1.8.1 image. I think I can draw a conclusion: with PyTorch 1.8.1, the PyTorchJob only does actual training when all pods/processes are running, and when any pod/process fails, the training stops.
These two pictures are about a PyTorchJob with 1 master and 2 workers, running a PyTorch 1.12.1 image in non-elastic mode. I think I can draw a conclusion: with PyTorch 1.12.1 in non-elastic mode, the PyTorchJob only does actual training when all pods/processes are running, and when any pod/process fails, the training stops. This matches the behavior with PyTorch 1.8.1.
These pictures are about a PyTorchJob with 1 master and 2 workers, running a PyTorch 1.12.1 image in elastic mode. I think I can draw a conclusion: with PyTorch 1.12.1 in elastic mode, the PyTorchJob can do actual training once the minimum pod/node requirement is met, and a single failed pod does not cause the task to fail.
Besides the tests above, I did some other tests, and I think they support the following conclusions. @johnugeorge @zw0610, what do you think?
Thanks @HeGaoYuan for the tests. From the list, I see that these items are missing now. Anything else?
Yes, you are right, nothing more for now. But I think the most important thing is that we discuss the definition of the status conditions, as @zw0610 said, and make the definitions correct and clear. Then we could revise the code to follow the definitions. The list I have written may not be 100% correct. For example, for PyTorchJob, could the definition of ... And I missed the definition of ... For the definition of ... For the definition of ...
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
/remove-lifecycle stale
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
There are 5 JobConditionTypes (Created, Running, Restarting, Succeeded, Failed) now; together they behave like a state machine. The flow of these conditions is roughly shown below, and there are some questions we need to make clear. For example:
Besides, users always want to know the conditions under which each state transition happens (yes, it is the state transition table). We also need to make the state transition table clear. It can help users understand what is going on and guide developers in writing correct code.
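To make the idea concrete, here is one possible shape such a table could take, written as a small Python sketch. The allowed transitions below are only my assumption of what the discussion might converge on, not the operator's current or intended behavior:

```python
# Hypothetical transition table -- the exact rules are what this issue asks to define.
ALLOWED_TRANSITIONS = {
    "Created":    {"Running", "Failed"},
    "Running":    {"Restarting", "Succeeded", "Failed"},
    "Restarting": {"Running", "Failed"},
    "Succeeded":  set(),  # terminal
    "Failed":     set(),  # terminal (unless the restart policy allows recovery?)
}


def can_transition(src: str, dst: str) -> bool:
    return dst in ALLOWED_TRANSITIONS.get(src, set())


assert can_transition("Created", "Running")
assert not can_transition("Succeeded", "Running")
```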
Why this is important, shown by a real case: using the following YAML to create a PyTorchJob (the Worker's image is wrong), the worker pod will not run at all, but the PyTorchJob will still have the JobRunning condition.
Referring to point 3 of #1703.