[RFC] Future for accelerator and devices default values #10606

Closed
kaushikb11 opened this issue Nov 18, 2021 · 9 comments · Fixed by #16847
Assignees: carmocca
Labels: accelerator, discussion, feature, trainer: argument
Milestone: 2.0

Comments

@kaushikb11
Contributor

kaushikb11 commented Nov 18, 2021

Discussion

Currently, the default for both the accelerator and devices flags is None. There have been discussions about updating the defaults: #10192, #10410 (comment).

We have three options to consider and vote for! Comments appreciated.

  1. 😄 None sounds good for both the flags. Let's keep it the same way.
  2. 🚀 Have accelerator="auto" and devices=1 as defaults. (Jax like)
  3. 👀 Have accelerator="auto" and devices="auto" as defaults.

Also note that with either of the last two options, the accelerator/devices defaults would take priority over gpus, tpu_cores, etc.
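
For concreteness, here is roughly how the three options map onto explicit Trainer arguments (an illustrative sketch using the public accelerator/devices flags; not a statement about current behavior):

```python
from pytorch_lightning import Trainer

# Option 1: keep today's defaults, i.e. a bare Trainer() resolves accelerator=None, devices=None.
trainer_option_1 = Trainer()

# Option 2: what a bare Trainer() would mean under the JAX-like proposal.
trainer_option_2 = Trainer(accelerator="auto", devices=1)

# Option 3: what a bare Trainer() would mean under the fully automatic proposal.
trainer_option_3 = Trainer(accelerator="auto", devices="auto")
```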

cc @Borda @justusschock @kaushikb11 @awaelchli @rohitgr7 @akihironitta

@kaushikb11 kaushikb11 added the feature, discussion and accelerator labels Nov 18, 2021
@kaushikb11
Contributor Author

kaushikb11 commented Nov 18, 2021

Why I don't agree with the last two options:

We shouldn't have defaults for devices. Option 3 is a definite no for me, as there are many cases where a user wouldn't want to use all the available GPUs on the node. Having this behavior as a default would be bad UX and a very strong imposition on the user.

Option 2: auto-assigning only one device for the user doesn't sound good to me either. We should just raise an error if the user does Trainer(accelerator="auto") without setting the devices flag, and tell them to define it. For that particular reason, we support devices="auto".

And the last two options would be a breaking change.

Option 1 ftw!

@tchaton
Contributor

tchaton commented Nov 18, 2021

🚀 Have accelerator="auto" and devices=1 as defaults. (Jax like)

I believe it should be more granular (a rough sketch follows):

CPU -> 1

GPUs:

  • If only 1 GPU is available -> default to 1
  • If multiple GPUs are available -> ask the user to specify devices

IPUs -> ask the user

TPUs -> ask the user
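
A minimal sketch of what this granular defaulting could look like; the function name, the use of torch.cuda.device_count(), and raising ValueError to "ask the user" are assumptions for illustration, not Lightning's actual logic:

```python
import torch


def resolve_default_devices(accelerator: str) -> int:
    """Hypothetical per-accelerator defaulting of `devices`, mirroring the proposal above."""
    if accelerator == "cpu":
        return 1  # CPU -> 1
    if accelerator == "gpu":
        if torch.cuda.device_count() <= 1:
            return 1  # a single visible GPU is an unambiguous default
        # Multiple GPUs: refuse to guess and ask the user to choose.
        raise ValueError("Multiple GPUs detected; please pass `devices=...` explicitly.")
    # IPU/TPU: always ask the user rather than picking a default.
    raise ValueError(f"Please pass `devices=...` explicitly for accelerator={accelerator!r}.")
```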

@stale

stale bot commented Dec 20, 2021

This issue has been automatically marked as stale because it hasn't had any recent activity. This issue will be closed in 7 days if no further activity occurs. Thank you for your contributions, Pytorch Lightning Team!

@stale stale bot added the "won't fix" label Dec 20, 2021
@stale stale bot closed this as completed Dec 28, 2021
@kaushikb11 kaushikb11 removed the "won't fix" label Dec 28, 2021
@kaushikb11 kaushikb11 reopened this Dec 28, 2021
@tchaton
Contributor

tchaton commented Jan 20, 2022

@kaushikb11

I would like to propose a different behavior.

We don't change the default, but if the user were to do Trainer(devices="auto"), this would also set accelerator="auto".
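
A tiny sketch of the proposed coupling, using a hypothetical helper rather than the actual accelerator connector:

```python
def couple_flags(accelerator, devices):
    """Hypothetical coupling: devices='auto' also pulls accelerator to 'auto'."""
    if devices == "auto" and accelerator is None:
        accelerator = "auto"
    return accelerator, devices


# Under this proposal, Trainer(devices="auto") would behave like
# Trainer(accelerator="auto", devices="auto"):
assert couple_flags(None, "auto") == ("auto", "auto")
```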

@awaelchli @carmocca @ananthsub Thoughts?

@ananthsub
Contributor

ananthsub commented Jan 20, 2022

The reason why I voted for accelerator="auto", devices=1 is that it should, in very high likelihood, be safe: we can fall back to a single-device strategy while preserving most of today's behavior.

If devices="auto" and the accelerator supports multiple devices, we also need to automatically select the strategy if one wasn't already specified.

That means that, depending on device availability, setting Trainer(devices="auto") could lead to very different behaviors (a rough sketch follows the list):

  • Is a GPU available? Do DDP.
  • Is a TPU available? Do TPU spawn.
  • Is an IPU available? Do IPU.
  • Is an HPU available? Do HPU.
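
A rough sketch of the kind of selection being described, assuming hypothetical strategy names and covering only the GPU branch; this is not the accelerator connector's actual logic:

```python
import torch


def auto_select_strategy() -> str:
    """Hypothetical strategy choice when devices='auto' (illustration only)."""
    if torch.cuda.is_available() and torch.cuda.device_count() > 1:
        return "ddp"  # multi-GPU node -> distributed data parallel
    # TPU/IPU/HPU availability checks would slot in here in the same way.
    return "single_device"  # safe fallback when at most one device is usable
```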

Some concerns with this approach:

  • Where do we log/output what configuration is finally used? This is critical for the debuggability and reproducibility of experiments.
  • Automagically figuring this out will add even more initialization paths that the trainer supports. This all ends up in the accelerator connector, which is already incredibly hard to read & debug. Which flags take priority over which? How can we validate that the passed-in flags actually make sense and don't fail much later, or worse, don't fail but silently produce incorrect results?

@four4fish has been going through this part of the codebase in much more detail, and I'd love to hear what she thinks.

@carmocca
Contributor

carmocca commented Jan 20, 2022

I would suggest that each person write out their table of preferences; here is mine:

Case | accelerator (user passed) | devices (user passed) | accelerator (we set) | devices (we set)
-----|---------------------------|-----------------------|----------------------|-----------------
1    | None                      | None                  | auto                 | 1
2    | xpu                       | None                  | xpu                  | 1
3    | auto                      | None                  | auto                 | 1
4    | None                      | N                     | auto                 | N
5    | xpu                       | N                     | xpu                  | N
6    | auto                      | N                     | auto                 | N
7    | None                      | auto                  | auto                 | auto
8    | xpu                       | auto                  | xpu                  | auto
9    | auto                      | auto                  | auto                 | auto

This adheres to:

🚀 Have accelerator="auto" and devices=1 as defaults. (Jax like)

We don't change the default, but if the user was to do Trainer(devices="auto"), this would set accelerator="auto" by the same occasion.

Then there's the question of how many devices are used with "auto", depending on the accelerator used ("x"pu). I think the answer to that is "as many as possible", except for CPU.

Then there are the questions of which strategy and accelerator to pick when there are multiple options, but we can easily define the priority list.
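
A minimal sketch of the resolution encoded in the table above; the function name is hypothetical and "xpu" stands for any concrete accelerator:

```python
def resolve_defaults(accelerator=None, devices=None):
    """Hypothetical (accelerator, devices) resolution following the table above."""
    if accelerator is None:
        accelerator = "auto"  # cases 1, 4, 7: default the accelerator to "auto"
    if devices is None:
        devices = 1           # cases 1-3: default to a single device
    return accelerator, devices


# Spot checks against the table:
assert resolve_defaults() == ("auto", 1)                     # case 1
assert resolve_defaults("gpu") == ("gpu", 1)                 # case 2, with "gpu" as the "xpu"
assert resolve_defaults(devices="auto") == ("auto", "auto")  # case 7
```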

@stale

stale bot commented Feb 26, 2022

This issue has been automatically marked as stale because it hasn't had any recent activity. This issue will be closed in 7 days if no further activity occurs. Thank you for your contributions, Pytorch Lightning Team!

@stale stale bot added the "won't fix" label Feb 26, 2022
@carmocca carmocca added this to the 1.7 milestone Feb 26, 2022
@stale stale bot removed the "won't fix" label Feb 26, 2022
@carmocca carmocca modified the milestones: pl:1.7, pl:future Jul 19, 2022
@carmocca carmocca moved this to Waiting in Lightning RFCs Sep 11, 2022
@Borda Borda pinned this issue Nov 7, 2022
@tchaton tchaton unpinned this issue Dec 18, 2022
@DavidMChan
Contributor

DavidMChan commented Feb 21, 2023

I wanted to add my thoughts to this, since they actually differ from what's being discussed above. IMO, the point of Lightning is to be as effortless as possible for users who don't want to get involved in the details of training, such as managing devices and accelerators, and to provide escape hatches for those who want more complex behaviors. This understanding leads me to vote for the "auto"/"auto" default, since it automatically makes use of the computational power available on the machine for somebody who doesn't really care about the devices and just wants their code to run as quickly as possible on the hardware that is available.

I disagree that such a UX is an overreach - Lightning already enforces relatively strong requirements on its users (such as code structure), and this approach abstracts away one more detail about training: which device is in use. For people who care about specifying devices, the option to do so still remains (just not as a default).

That being said, it is significantly more development work to get auto/auto right - but is it worth it?

I'm not entirely sure. I really get the points above, particularly from a legacy-use and implementation-complexity standpoint. This is a relatively dangerous default to change, since it would greatly impact anyone who doesn't already explicitly specify devices. However, I would also note that most people who care about their devices have already changed this option to use an accelerator of some kind, so it probably wouldn't change the out-of-the-box behavior for many people.

Some more comments:

Where do we log/output what configuration is finally used? This is critical for the debuggability and reproducibility of experiments.

In an ideal world, this is something that is handled by Lightning itself; training on one GPU vs. 4 GPUs vs. a TPU pod shouldn't impact the user's workflow.

Automagically figuring this out will add even more initialization paths that the trainer supports. This all ends up in the accelerator connector, which is already incredibly hard to read & debug. Which flags take priority over which? How can we validate that the passed-in flags actually make sense and don't fail much later, or worse, don't fail but silently produce incorrect results?

This is a challenging problem to solve, and I agree that it would be the main barrier to implementing an auto/auto default. That being said, if the paths are well documented, I believe that anyone running into issues with an auto/auto default should be able to (1) quickly identify the issue, and (2) explore how to define their own choice of parameters to solve the problem.

@carmocca carmocca modified the milestones: future, 2.0 Feb 22, 2023
@carmocca carmocca self-assigned this Feb 22, 2023
@lantiga
Collaborator

lantiga commented Feb 22, 2023

Thank you everyone for the votes and the discussions. Great community thinking!

It looks like most people (counting all the polls) voted for having accelerator="auto" as the default, either with devices=1 or devices="auto".

We have carefully evaluated the scenarios, and we decided to proceed with accelerator="auto" and devices="auto" as the new defaults.

Here's the thinking:

  • the defaults today are accelerator=None and devices="auto" (this wasn't the case when this issue was first created)
  • this means that if you set accelerator="gpu" and you have 4 GPUs on your node, you already get to automatically run on all 4 GPUs with a DDP strategy
  • we know that a lot of users already just set accelerator="gpu" and leave devices at its current default ("auto")
  • we know of no user that leaves accelerator at its default for any meaningful use

From the points above, it follows that:

  • changing the default for accelerator will not impact anyone making meaningful use of it, because they already override accelerator anyway (which argues for option 2 or 3)
  • changing the default for devices would impact everyone running on multi-GPU who today just sets accelerator="gpu", which is a problem because these are users with sizable workloads (which argues against option 2)

Given that multi-GPU users setting accelerator="gpu" today are not complaining about the strategy being chosen for them, we are confident this will continue not to be an issue, even if we switch to accelerator="auto" by default.

Alright! I'm sure we can't please everyone, but I'm confident we made an informed decision. Way to go team ⚡️!
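
A small usage sketch of the decided behavior (assuming the release that implements this RFC, see the linked PR #16847):

```python
import pytorch_lightning as pl

# With accelerator="auto" and devices="auto" as the defaults,
# these two constructions are intended to be equivalent:
trainer_implicit = pl.Trainer()
trainer_explicit = pl.Trainer(accelerator="auto", devices="auto")

# Users who want a specific setup can still be explicit,
# e.g. a single GPU (requires a GPU to actually run):
trainer_single_gpu = pl.Trainer(accelerator="gpu", devices=1)
```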
