Allow 0 worker in pytorch plugins & Add objectMeta to PyTorchJob #348
base: master
Conversation
Signed-off-by: byhsu <byhsu@linkedin.com>
Codecov Report
@@            Coverage Diff             @@
##           master     #348      +/-   ##
==========================================
+ Coverage   62.76%   64.06%   +1.30%
==========================================
  Files         148      148
  Lines       12444    10080    -2364
==========================================
- Hits         7810     6458    -1352
+ Misses       4038     3026    -1012
  Partials      596      596
Flags with carried forward coverage won't be shown.
Is there any benefit to using only one master node in the PyTorch CRD? People can just run PyTorch in a regular Python task for single-node training, right?
I can only imagine the env vars set by the operator, like world size, rank, etc. In the torch elastic task we opted to run single-worker trainings in a normal Python task/single pod so that users don't need the training operator.
In our case, we start with 0 workers since it's easier to debug, and then scale up to multiple workers. Although a Python task can achieve the same thing, we shouldn't error out on 0 workers in the PyTorch plugin, because that's what the PyTorch operator allows.
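To make the 0-worker case concrete, here is a minimal sketch of how a plugin could build the replica spec map so that zero workers simply means a master-only job rather than a validation error. The helper `buildReplicaSpec`, the `workers` parameter, and the exact Kubeflow import paths and field names are assumptions for illustration, not the plugin's actual implementation.

```go
package sketch

import (
	commonOp "github.com/kubeflow/common/pkg/apis/common/v1"
	kubeflowv1 "github.com/kubeflow/training-operator/pkg/apis/kubeflow.org/v1"
	corev1 "k8s.io/api/core/v1"
)

// buildReplicaSpec is a hypothetical helper standing in for however the plugin
// turns a pod spec and replica count into a Kubeflow ReplicaSpec.
func buildReplicaSpec(podSpec corev1.PodSpec, replicas int32) *commonOp.ReplicaSpec {
	return &commonOp.ReplicaSpec{
		Replicas: &replicas,
		Template: corev1.PodTemplateSpec{Spec: podSpec},
	}
}

// buildPyTorchJobSpec treats workers == 0 as valid: the Worker entry is simply
// omitted from the replica spec map instead of being rejected as an error.
func buildPyTorchJobSpec(podSpec corev1.PodSpec, workers int32) kubeflowv1.PyTorchJobSpec {
	replicaSpecs := map[commonOp.ReplicaType]*commonOp.ReplicaSpec{
		"Master": buildReplicaSpec(podSpec, 1),
	}
	if workers > 0 {
		replicaSpecs["Worker"] = buildReplicaSpec(podSpec, workers)
	}
	return kubeflowv1.PyTorchJobSpec{PyTorchReplicaSpecs: replicaSpecs}
}
```

The resulting job spec would then be embedded into the PyTorchJob, as in the snippet below.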
}

job := &kubeflowv1.PyTorchJob{
	TypeMeta: metav1.TypeMeta{
		Kind:       kubeflowv1.PytorchJobKind,
		APIVersion: kubeflowv1.SchemeGroupVersion.String(),
	},
	Spec:       jobSpec,
	ObjectMeta: *objectMeta,
}
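For context on the objectMeta being passed in above, here is a rough sketch of how the CR-level metadata might be assembled; the helper name and its parameters are illustrative assumptions rather than the plugin's actual code.

```go
package sketch

import (
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// buildObjectMeta is an illustrative stand-in for however the caller derives
// the CR-level metadata; the intent is that the PyTorchJob CR itself can carry
// the same labels and annotations that are applied to the replica pods.
func buildObjectMeta(name, namespace string, labels, annotations map[string]string) *metav1.ObjectMeta {
	return &metav1.ObjectMeta{
		Name:        name,
		Namespace:   namespace,
		Labels:      labels,
		Annotations: annotations,
	}
}
```

The returned *metav1.ObjectMeta would then be dereferenced into the PyTorchJob as in the snippet above, which is what the question below about labels/annotations on the CR refers to.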
Is this just to set labels/annotations on the CR the same as on the replicas? If this is the route we want to go, this should probably be done for all Kubeflow operator plugins.
Fair point. Users can still switch to a Python task if they prefer by removing the … The only thing I wonder: in the new pytorch elastic task we decided that with …
I actually want to get rid of the required dependency on PytorchOperator for simple single-node training, which can suffice in many cases. This also makes scaling really nice: you start with one node and simply scale to more nodes (to scale you may need to deploy the operator). This is why, when nnodes=1, we just change the task type itself. WDYT? @fg91 and @ByronHsu
Could you elaborate on the drawbacks of using the PyTorch Operator for single-node training?
@ByronHsu - FlytePropeller is far more efficient at allocating resources, retrying, and finishing sooner. For a single node this is also faster: it runs without needing an operator and does not require a CRD to be created.
@kumare3 Skipping the CRD part can definitely be faster. Thanks. I will raise a corresponding PR in flytekit.
Will merge with #345 and do an integration test.
Is this still supposed to be pushed over the finish line, or shall we close it?