[RFC] Future for `accelerator` and `devices` default values #10606

Comments
We shouldn't have defaults for this. Option 2: auto-assigning only one device for the users doesn't sound good to me; we should just crash it if the user does. And the last two options would be a breaking change. Option 1 ftw!
I believe it should be more granular:

- CPU -> 1
- GPUs:
- IPUs -> request
- TPUs -> request
I would like to propose a different behavior. We don't change the default, but if the user was to do

@awaelchli @carmocca @ananthsub Thoughts?
The reason why I voted for `accelerator="auto", devices=1` is that this should in very high likelihood be safe: we can fall back to a single-device strategy while preserving most of today's behavior. If `devices="auto"` and the accelerator supports multiple devices, we also need to automatically select the strategy if one wasn't already specified. That means, depending on device availability, setting `Trainer(devices="auto")` could lead to very different behaviors:

Some concerns with this approach:

@four4fish has been going through this part of the codebase in much more detail and I'd love to hear what she thinks. A rough sketch of how this auto-selection could play out is below.
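For illustration only, here is a minimal sketch of how `devices=1` versus `devices="auto"` could resolve, assuming plain `torch` availability checks and a hypothetical `auto_select` helper; none of this is Lightning's actual implementation:

```python
# Hypothetical sketch of accelerator/device auto-selection (not Lightning's real code).
import torch


def auto_select(accelerator="auto", devices=1):
    """Resolve 'auto' values into a concrete (accelerator, devices, strategy) triple."""
    if accelerator == "auto":
        # Prefer a GPU when one is visible, otherwise fall back to CPU.
        accelerator = "gpu" if torch.cuda.is_available() else "cpu"

    if devices == "auto":
        # "As many as possible" on GPU, a single process on CPU.
        devices = torch.cuda.device_count() if accelerator == "gpu" else 1

    # devices=1 keeps the familiar single-device behavior; more than one
    # device forces us to also pick a distributed strategy automatically.
    strategy = "ddp" if (accelerator == "gpu" and devices > 1) else None
    return accelerator, devices, strategy


print(auto_select())                # e.g. ('gpu', 1, None) on a single-GPU machine
print(auto_select(devices="auto"))  # e.g. ('gpu', 8, 'ddp') on an 8-GPU machine
```

The point of the sketch is the divergence in the last line: the same `Trainer(devices="auto")` call silently switches between single-device and distributed execution depending on the hardware it happens to run on.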
I would suggest that each person writes their table of preferences; these are mine:

| User passed | We set |
| --- | --- |

This adheres to:

Then there's the question of how many devices are used with "auto" depending on the accelerator used ("x"pu). I think the answer to that is "as many as possible", except for CPU. Then there are the questions of which strategy and accelerator to pick when there are multiple options, but we can easily define the priority list (see the sketch below).
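A minimal sketch of such a priority list, assuming a hypothetical `resolve_auto` helper and placeholder availability checks and counts (not the project's real API):

```python
# Hypothetical priority list for resolving accelerator="auto" and devices="auto".
from typing import Callable, List, Tuple

# Highest priority first; each entry maps a name to (is_available, device_count).
ACCELERATOR_PRIORITY: List[Tuple[str, Callable[[], bool], Callable[[], int]]] = [
    ("tpu", lambda: False, lambda: 8),  # placeholder availability checks
    ("ipu", lambda: False, lambda: 4),
    ("gpu", lambda: True, lambda: 2),
    ("cpu", lambda: True, lambda: 1),   # CPU: pin the count to 1
]


def resolve_auto() -> Tuple[str, int]:
    """Pick the highest-priority available accelerator and use all of its devices."""
    for name, is_available, device_count in ACCELERATOR_PRIORITY:
        if is_available():
            # "As many as possible", except for CPU where the count stays at 1.
            return name, device_count()
    return "cpu", 1


print(resolve_auto())  # ('gpu', 2) with the placeholder checks above
```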
I wanted to add my thoughts to this, since they actually differ from what's being discussed above.

IMO, the point of Lightning is to be as effortless as possible for users who don't want to get involved in the details of training, such as managing devices and accelerators, and to provide escape hatches for those people who want to use more complex behaviors. This understanding would lead me to vote for the "auto"/"auto" default, since it will automatically make use of the computational power available on the machine for somebody who doesn't really care about the devices and just wants to make their code run as quickly as possible given the hardware that is available.

I disagree that such a UX is an overreach: Lightning already enforces relatively strong requirements on the users (such as code structure), and this approach abstracts away one more detail about training: which device is in use. For people who care about specifying devices, having the option to do so still remains (just not as a default).

That being said, it is significantly more development work to get auto/auto right. Is it worth it? I'm not entirely sure. I really get the points above, particularly from a legacy-use and implementation-complexity standpoint. This is a relatively dangerous default to change, since it would greatly impact anyone who doesn't already explicitly specify the devices. However, I would also note that most people who care about their devices have already changed this option to use an accelerator of some kind, so it probably wouldn't change how many people's code behaves out of the box.

Some more comments:
In an ideal world, this is something that is handled by Lightning itself; training on one GPU vs. 4 GPUs vs. a TPU pod shouldn't impact the user workflow.

This is a challenging problem to solve, and I agree that this would be the main barrier to implementing an auto/auto default. That being said, if the paths are well documented, anyone who runs into issues with an auto/auto default should be able to (1) quickly identify the issue, and (2) explore how to define their own choice of the parameters to solve the problem. An explicit override along those lines is sketched below.
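A small usage sketch of that escape hatch, assuming a Lightning version where the `accelerator`, `devices`, and `strategy` arguments are accepted as shown; the exact values are placeholders:

```python
from pytorch_lightning import Trainer

# Rely on whatever the defaults resolve to (auto or otherwise).
trainer = Trainer()

# Escape hatch: pin the hardware explicitly so device selection is fully
# deterministic, regardless of what the defaults become.
# (Assumes a machine with at least 2 GPUs.)
trainer = Trainer(accelerator="gpu", devices=2, strategy="ddp")
```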
Thank you everyone for the votes and the discussions. Great community thinking! It looks like most people (counting all the polls) voted for changing the defaults. We have carefully evaluated the scenarios and decided how to proceed. Here's the thinking:
From the points above, it follows that:
Given the points above about multi-GPU users: alright! I'm sure we can't please everyone, but I'm confident we made an informed decision. Way to go team ⚡️!
Discussion
Currently, the default for both the
accelerator
anddevices
flags is None. There has been discussions about updating the defaults. #10192 #10410 (comment)We have three options to consider and vote for! Comments appreciated.
accelerator="auto"
anddevices=1
as defaults. (Jax like)accelerator="auto"
anddevices="auto"
as defaults.Also, to note with the last two changes, it will take priority over
gpus
,tpu_cores
, etc.cc @Borda @justusschock @kaushikb11 @awaelchli @rohitgr7 @akihironitta
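For concreteness, here is roughly what each proposal would correspond to in user code, assuming a `Trainer` that accepts these values; this is a sketch of the proposals, not a statement of current behavior:

```python
from pytorch_lightning import Trainer

# Option 1: keep the current behavior, i.e. both flags default to None.
trainer = Trainer(accelerator=None, devices=None)

# Option 2: auto-detect the accelerator but stay on a single device (JAX-like).
trainer = Trainer(accelerator="auto", devices=1)

# Option 3: auto-detect the accelerator and use all available devices.
trainer = Trainer(accelerator="auto", devices="auto")
```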