-
Notifications
You must be signed in to change notification settings - Fork 3.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[RFC] Future of gpus/ipus/tpu_cores
with respect to devices
#10410
Comments
gpus/ipus/tpus
with respect to devices
gpus/ipus/tpu_cores
with respect to devices
IMO we should follow the contributing guidelines: https://github.com/PyTorchLightning/pytorch-lightning/blob/master/.github/CONTRIBUTING.md#main-core-value-one-less-thing-to-remember Having multiple options in the public API to do the same thing is really confusing. I'm in favor of |
+1, totally agree Current device related flags are confusing. Multiple flags partially overlap and interfere each other. When multiple flags passed in, we define prioritize and ignore some of the flags. For example: I prefer option 2, drop With this option: |
I think going from this:
to
is a major step backwards in usability. now users have to dig into docs to understand how to use things. it definitely violates the "one-less-thing-to-remember" part of the API. I guess, I'm just wondering why we're exploring this? I thought we were already pretty stable on the device API stuff |
@williamFalcon The more kinds of accelerators we get, the more flags we will also have. Switching from Also I suspect, we would likely have the accelerator defaulting to |
@williamFalcon As @justusschock shared, the previous approach doesn't scale well and makes discoverability harder. Furthermore, the new API provides an Trainer(devices="auto", accelerator="auto") which would make the code runnable on every hardware without any code changes. Which isn't possible with the previous API. And we could even support Trainer(devices="auto", accelerator="auto", num_nodes="auto") |
To address the discoverability issue, isn't it common to import the I opened the issue as I felt it was important as a community we come to an agreement as the idea was floating around a few PRs (with inconsistent agreement). It's important to have one single direction here (especially as we introduce other accelerators). I do strongly disagree with removing I think it would be beneficial to try get community votes on this, so maybe a post on our General slack channel is warranted? |
Even something like But this isn't supported at all with Lightning because when specifying In my head the accelerator in Lightning maps to the torch device being used. By using the same sematics PyTorch offers, Lightning can keep parity more easily and smoothen the transition for users coming from vanilla PyTorch |
Adding to the conversation and @ananthsub comment. In this issue, a user is requesting to adding MIG for A100-like a machine: #10529. This is another example of arguments like gpus, tpu-cores can grow out of control, and the need for a single |
@tchaton do we have a consensus to move forward with this issue? |
Maybe thought that is a bit different there. |
To take the pro-Accelerator argument to the extreme (also with the "fractional" devices), how about not splitting If instantiating Accelerator all the time is too much of a hassle for @williamFalcon 's taste (I never liked the configuration part of tf sessions, either, and there is a good reason why PyTorch doesn't force you to do Trainer(devices=2) # I want two of whatever is available (so GPUs > CPUs in preference, but only the same kind. Occasions where "casual users" will have TPU GPU and IPU in the same box will be rare enough... For more elaborate configs, one could have Trainer(devices=Accelerator("cuda", 2)) My apologies for adding another color of shed, but to my mind, there are these cases we want to cater to:
Best regards Thomas |
To add to @t-vi comment, I believe the accelerator could be set to 'auto' by default as it is quite unlikely there is an overlapping machine with both GPUs and TPUs available. So the hardware is totally abstracted and this provides an experience closer to Jax with their auto platform detection Trainer(gpus=8) or Trainer(tpu_cores=8) or Trainer(cpu=8) or Trainer(hpus=8) or Trainer(ipus=8) ... would be replaced directly with: Trainer(devices=8) If a user has a machine with a GPU and wants to debug on CPU, he would simply add the accelerator to force the decision-making. Trainer(devices=2, accelerator="cpu") By the most critical point is:
I believe this API would need to provide a smarter hardware detection mechanism for MIG hardware. |
Coming from both the High-Performance and embedded spaces, I'll weigh in here with some general thoughts.
To the above points, I have a couple of suggestions:
I am very interested to see where this discussion goes, and I apologize for the ramble. |
This discussion has extended to other related points, but to give my opinion the original question, I fully agree with @tchaton's API vision here: #10410 (comment). Where the original I don't think adding new options
Keep in mind that the |
A lot of great inputs! Let me start off by summarizing: The current API was built when only GPUs were supported. Then TPUs were added. And now a few years later, we live in a world where more alternatives are starting to emerge. This is the current API. Trainer(gpus=2)
Trainer(tpus=2) But now, we live in a world where more than GPU|TPU devices are coming out (HPU, etc...). In this case, the proposal is to modify the API like so: Trainer(devices=2, accelerator='tpu') Well... we also introduced the 'auto' flag, so the actual default call would look like this: Trainer(devices=2)
# because Trainer(devices=2, accelerator='auto') is the default @t-vi also brought up the alternative that there could be a class in the event that configs get unwieldy Trainer(accelerator=Accelerator("cuda", 2)) @dlangerm also brought up that in certain complex scenarios:
DecisionSo, with all that said, if there's broader community support for moving from: Trainer(gpus=2) to: Trainer(devices=2, accelerator='tpu')
# default is auto
Trainer(devices=2) Then I'm happy to back this option as it is more scalable and my only concern is "having to only remember one thing"... if there are no major quelms about this and everyone's excited, let's roll it out for 2.0 |
I have a question about this; if we want to roll this out for 2.0 when can we start working on it? Could we start working on it now for example? |
@tchaton @awaelchli @ananthsub What's you guys' thought on when will be the right time for 2.0? Should Accelerator Refactor and stable accelerator be part of the 2.0? |
Hey @daniellepintz @four4fish. Yes, I agree with you both. I don't believe this change requires a Lightning 2.0 as this is a natural evolution of Lightning becoming hardware-agnostic directly at the Trainer level. IMO, I would like to action this change for v1.6. @ananthsub @awaelchli @carmocca @kaushikb11 @justusschock Any thoughts on this ? If we are all positive to make this change, I will mark this issue as let's do it. |
Agreed. We could go ahead with this change for v1.6, along with major Accelerator refactor. |
Given the above decisions, is there a consensus on renaming If these changes are towards a hardware-agnostic API that can be used for either training or inference, |
@dlangerm this is not really related to this issue, so I won't go into much detail here. Feel free to open a new issue for this discussion. From my POV, we shouldn't rename the |
Hey @kaushikb11 I saw you assigned yourself to this issue. I was planning on working on the accelerator_connector refactor (#10422) which was blocked by this issue. Am I okay to proceed with accelerator_connector refactor or is that something you were planning on doing? |
I am working on this in #11040 - do we also want to deprecate |
I think |
Got it, thanks! |
Proposed refactoring or deprecation
Currently we have two methods to specifying devices. Let's take GPUs for example:
accelerator='tpu'
we automatically know to use 2 TPU cores.Recently, it has come up in #10404 (comment) that we may want to deprecate and prevent further device specific names from appearing in the Trainer (such as
hpus
).Related conversation #9053 (comment)
I see two options:
🚀 We keep both device specific arguments (
gpus
tpu_cores
ipus
for the Trainer) anddevices
👀 We drop
gpus
tpu_cores
ipus
in the future and fully rely ondevices
. (Potentially this would likely be done in Lightning 2.0, instead of after 2 minor releases)cc @kaushikb11 @justusschock @ananthsub @awaelchli
The text was updated successfully, but these errors were encountered: