Optimizer CPU offload for single GPU training #584

gau-nernst · 2024-08-01T03:36:47Z

Background

Currently there is no simple way to do optimizer CPU offload for single GPU training, although such a feature exists for FSDP. DeepSpeed ZeRO-Offload can work with a single GPU, but it requires installing DeepSpeed, which can be complicated, and major changes to the training loop (it is not convenient to switch between DeepSpeed and non-DeepSpeed).

Optimizer memory footprint is the largest in a training setup (2x model size for plain Adam), thus offloading optimizer to CPU would be greatly beneficial.

Below is a copy of optimizer CPU offload README

Optimizer CPU Offload

This folder also implements optimizer CPU offload (i.e. ZeRO-Offload) for single GPU training. For multi-GPU training, you can use FSDP's built-in CPU offload.

import torch
from torchao.prototype.low_bit_optim import CPUOffloadOptimizer

model = ...

# only offload optimizer state
optim = CPUOffloadOptimizer(model.parameters(), torch.optim.AdamW, fused=True)

# offload optimizer state AND gradients
optim = CPUOffloadOptimizer(model.parameters(), torch.optim.AdamW, offload_gradients=True, fused=True)

This will reduce GPU memory usage by optimizer state size, and additionally gradient size if offload_gradients=True. CPUOffloadOptimizer can wrap any base optimizer.

For saving and loading CPUOffloadOptimizer, it is important that you load model's weights BEFORE creating the optimizer, since we create a CPU copy of the parameters inside CPUOffloadOptimizer.__init__(). (TODO: we might want to have a method to synchronize CUDA and CPU params in either direction CPU->CUDA and CUDA->CPU, in case they are out of sync.)

ckpt = torch.load("checkpoint.pth")

model = ...
model.load_state_dict(ckpt["model"])

optim = CPUOffloadOptimizer(model.parameters(), torch.optim.AdamW, fused=True)
optim.load_state_dict(ckpt["optim"])

NOTE:

- Since the optimizer step is done on CPU, it is highly recommended to use a fast CPU optimizer, such as torch.optim.AdamW(fused=True) (requires PyTorch 2.4). For other optimizers, you can try torch.compile() their optimizer step.
- To minimize the amount of CPU<->GPU data transfer, we keep a copy of parameters and pre-allocate gradients memory on CPU. Therefore, expect your RAM usage to increase by 2x model size + optimizer state (which is 2x model size for Adam).
- It is recommended NOT to torch.compile() your whole model when CPUOffloadOptimizer is used, as it prevents us from interleaving gradient device-to-host transfer with backward pass. To minimize such impact, you can compile parts of your model separately. See #584 for more information.
- CPU optimizer step is often the bottleneck when optimizer CPU offload is used. To minimize the slowdown, it is recommended to (1) do full BF16 training (instead of AMP), so that parameters, gradients, and optimizer states are in BF16; and (2) give GPU more work per optimizer step (e.g. larger batch size with activation checkpointing, gradient accumulation).
- offload_gradients=True is not compatible with gradient accumulation, since we clear gradients on GPU every backward pass.
- Gradient clipping is currently not supported.
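
To make the memory notes above concrete, here is a back-of-the-envelope sketch. The helper `estimate_offload_memory` and its parameters are hypothetical (they are not part of torchao). It only encodes the rules of thumb stated above: GPU memory drops by the optimizer state size (plus the gradient size with offload_gradients=True), and CPU RAM grows by 2x model size plus optimizer state. Measured peak memory also depends on activations and allocator behavior, so treat the results as rough estimates.

```python
def estimate_offload_memory(n_params, bytes_per_param=2, optim_state_factor=2,
                            offload_gradients=False):
    """Rough memory estimate in bytes (hypothetical helper, not a torchao API).

    bytes_per_param=2 assumes full BF16 training.
    optim_state_factor=2 assumes Adam-style state (two tensors per parameter).
    """
    model_bytes = n_params * bytes_per_param
    optim_bytes = optim_state_factor * model_bytes
    # GPU no longer holds optimizer state; gradients too if they are offloaded.
    gpu_saved = optim_bytes + (model_bytes if offload_gradients else 0)
    # CPU holds a parameter copy, pre-allocated gradients, and optimizer state.
    cpu_added = 2 * model_bytes + optim_bytes
    return {"gpu_saved_bytes": gpu_saved, "cpu_added_bytes": cpu_added}


# Example: a 1.1B-parameter model trained in full BF16 with Adam.
est = estimate_offload_memory(1_100_000_000, offload_gradients=True)
print(f"GPU memory saved: ~{est['gpu_saved_bytes'] / 1e9:.1f} GB")
print(f"Extra CPU RAM:    ~{est['cpu_added_bytes'] / 1e9:.1f} GB")
```

Under these assumptions, the extra CPU RAM is about 4x the model size for Adam, which is worth checking against available system memory before enabling offload.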

Benchmark done for timm/vit_giant_patch14_dinov2.lvd142m (1.1B params), eager mode, full BF16 training, activation checkpointing, batch size 32, on 4070Ti SUPER (16GB VRAM), Ryzen 5600, DDR4 RAM. DeepSpeed is untuned.

Adam offload	Time per step	Max memory
None	1.27s/it	9.82 GB
DeepSpeed ZeRO-Offload	3.13s/it	6.85 GB
ao	1.52s/it	5.24 GB
ao (offload gradients)	1.53s/it	4.01 GB

Ablations on AMP and torch.compile()

Training config	Adam offload	Time per step	Max memory
Full BF16, compiled	None	1.18s/it	9.90 GB
Full BF16, compiled	ao	1.75s/it	5.33 GB
BF16 AMP, eager	None	OOM	OOM
BF16 AMP, eager	ao	2.18s/it	9.90 GB

Implementation details

Keep a copy of params on CPU. After backward pass, copy gradients from GPU to CPU (optionally deallocate GPU gradients). Do optimizer step on CPU. Copy updated params from CPU to GPU.

To hide CPU <-> GPU data movements, interleave grad device->host with backward, and interleave param host->device with CPU optim step. Also start CPU optim step as soon as CPU is free (i.e. after launching all backward kernels) -> some interleaving of backward and CPU optim step. The following trace illustrates the strategy.

Two interesting observations:

1. torch.compile() prevents overlapping grad D2H with backward. Probably because compiled backward will launch/queue all backward kernels at once, so waiting for current stream will last until all backwards finish. Trace with torch.compile()

One way to mitigate this is to compile parts of the model separately, so on the host side, backwards are launched as K groups of kernels, then we can start grad D2H in-between. (haven't tried it, just an idea. maybe still not possible). This would also reduce kernel launch overhead, which helps CPU Adam to start even earlier.

2. Fused CPU Adam is much faster in BF16 than in FP32. Trace with BF16 AMP (params, grads, optimizer states are in FP32), optim step time increases from ~700ms -> 1200ms.

Time for CUDA optim: CUDA forward time + CUDA backward time + CUDA optim time.
Time with CPU offload optim: CUDA forward time + CPU time to launch backward + CPU optim time.

If CPU optim time is super fast (assume zero), the best-case step time will be CUDA forward time + CUDA backward time + param H2D time (though we can further hide H2D time behind forward). So optim CPU offload MAY even be faster than optim on GPU.
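
The timing formulas above can be written out as a small sketch. The function names and all timings below are made up for illustration (they are not measurements from this PR). The offload formula follows the text as stated, i.e. it assumes the CPU optimizer step is the bottleneck and starts as soon as the host has launched all backward kernels.

```python
def step_time_gpu_optim(fwd, bwd, optim_gpu):
    """Everything runs on the GPU, one phase after another."""
    return fwd + bwd + optim_gpu


def step_time_cpu_offload(fwd, launch_bwd, optim_cpu):
    """CPU optim starts once the host has launched all backward kernels."""
    return fwd + launch_bwd + optim_cpu


def best_case_cpu_offload(fwd, bwd, param_h2d):
    """Limit when the CPU optimizer step takes (almost) no time."""
    return fwd + bwd + param_h2d


# Hypothetical timings in milliseconds, chosen only to illustrate the formulas.
fwd, bwd, launch_bwd = 300, 600, 100
optim_gpu, optim_cpu, param_h2d = 50, 700, 40

gpu_ms = step_time_gpu_optim(fwd, bwd, optim_gpu)
offload_ms = step_time_cpu_offload(fwd, launch_bwd, optim_cpu)
best_ms = best_case_cpu_offload(fwd, bwd, param_h2d)
print(gpu_ms, offload_ms, best_ms)
```

With these invented numbers, the best case is slightly below the GPU-optimizer time, which is the situation the paragraph above describes as "MAY even be faster".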

My setup is Ryzen 5600 with dual-channel DDR4. Using CPUs with AVX-512 support (Zen 4 and later, or server CPUs) and DDR5 (or 6-channel DDR4 on servers) would probably be faster.

pytorch-bot · 2024-08-01T03:36:50Z

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/ao/584

📄 Preview Python docs built from this PR

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit 231a6ef with merge base de4a1fb:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

awgu

In this approach, it looks like you are okay with having only the optimizer states be solely on CPU and are not targeting any parameter/gradient memory, which is fine. However, you would still see throughput improvements if you tried to overlap gradient D2H copies with backward and H2D copies with forward, but this leads to some complexity.

For example, with this kind of overlapping (taken from FSDP2 backward), it is possible to mostly hide the copies:

awgu · 2024-08-01T12:38:45Z

torchao/prototype/low_bit_optim/cpu_offload.py

+ # copy gradients from CUDA to CPU
+ for p_cpu, p_cuda in self.param_cpu2cuda_map.items():
+     if p_cuda.grad is not None:
+         p_cpu.grad = p_cuda.grad.to("cpu", non_blocking=True)


To check that we are on the same page, the non_blocking=True here means that the host (CPU) is not blocked on this D2H copy. However, there is nothing for these D2H copies to overlap with, so the main benefit you are getting here is that copying D2H with non_blocking=True will copy directly to pinned memory.

Otherwise, the CPU side should look like issuing D2H copy for each gradient and then blocking via the torch.cuda.synchronize() for all D2H copies to finish.

awgu · 2024-08-01T12:40:19Z

torchao/prototype/low_bit_optim/cpu_offload.py

+
+ # copy updated param from CPU to CUDA
+ for p_cpu, p_cuda in self.param_cpu2cuda_map.items():
+     p_cuda.copy_(p_cpu, non_blocking=True)


For these H2D copies, the non_blocking=True here only means that the CPU will not be blocked. The p_cpu is already in pinned memory, so there is no further pinned memory consideration.

Calling non_blocking=True allows the CPU to proceed into the next logic whether that is logging, the next iteration data loading, etc. or whatever.

However, subsequent CUDA kernels issued in the default stream will still serialize with the H2D copies.

I will still mention that this non_blocking is still beneficial, as it allows the CPU to enqueue all the copies and much better saturate the bandwidth even if there is no overlap with compute.

@albanD I wanted to understand this point better.

If you call non_blocking=False, then there is a cudaDeviceSynchronize after each copy, blocking the CPU until the copy finishes. After that, the CPU will proceed to issue the next copy, so there may be some slight gaps between each H2D copy.

The part that I am not clear on is, are you suggesting that these gaps are exactly what would hurt the overall copy bandwidth, or do you mean that if you issue back-to-back H2D Memcpys, then there is some kind of batching effect across copies that improves bandwidth? (The latter would be non-intuitive to me, so I wanted to check.)

I guess for non_blocking=False, the additional cudaDeviceSynchronize is coupled with having to copy to paged memory as well, so that also is slower than copying to pinned memory.

gau-nernst · 2024-08-01T13:13:06Z

@awgu Thank you for your feedback.

For "overlap gradient D2H copies with backward", it probably cannot be done without intrusive change to the training code? Perhaps something like this can help with the overlapping (i.e. once we finish accumulating gradient for a param, we start moving it to CPU, while still doing backward for other params. Since optim step on CPU is blocking, we can only do optim step once all gradients are copied to CPU?).

For "H2D copies with forward", how can I do this? I have been reading this, and it says that I need to use a separate CUDA stream for data transfer to overlap with computation. So it means that I have to somehow synchronize the CPU->CUDA transfer (which will be in a separate CUDA stream) before that param is needed for forward? Perhaps some kind of forward hook? (again, intrusive changes to the training code T.T)

awgu · 2024-08-01T13:18:56Z

I agree once you want to overlap, it becomes quite intrusive 😢 .

The post-accumulate-grad hook could be a good point to run the D2H copy for gradients, but again like you said, without a separate CUDA stream, that copy is not going to overlap. You will mainly be moving the same kernel that would happen in optimizer.step() into the backward.

I do not have a good solution for how to do the overlap in a non-intrusive way. From what I have seen, it is hard to do this kind of overlap without some kind of nn.Module level API since that gives you good points to hook into the forward and backward.

gau-nernst · 2024-08-01T14:25:28Z

Add a poor man's attempt at interleaving grad D2H with backward. seems to work! (you can check the latest changes). speed improves from 1.2s/it to 1.0s/it.

Before (blue is backward kernels, red is copy kernels)

After

In the 2nd image, one thing that concerns me is that the backward kernels (blue) end after the copy kernels (red). Maybe some bug in the torch profiler? (I'm using torch.profiler.profile; the trace is obtained from the benchmarks/benchmark_low_bit_adam.py script.) Loss curve seems ok.

The profiling trace also reveals that CPU optimizer step is the bottleneck, which we won't be able to hide.

So I think in this workload, I should try to feed more work to GPU (e.g. larger batch size, gradient accumulation...).

awgu · 2024-08-01T14:30:14Z

@gau-nernst Is there any way to share the trace file? The backward kernels ending after the D2H copies is pretty interesting 😆 .

gau-nernst · 2024-08-01T14:35:02Z

the file size is quite big, even after gzip (26MB). I will re-run with fewer number of training steps (currently 20, maybe I reduce to 5). is sharing directly here ok? or you prefer some other channels, possibly for security reasons. I can also share via CUDA-MODE discord.

awgu · 2024-08-01T14:38:50Z

Oh, I did not realize that the profiler was profiling so many steps. I think it is okay to just profile 1 step? I am okay with any way for you to share it.

gau-nernst · 2024-08-01T14:47:17Z

Here it is

optim_cpu_offload_d2h_overlap.tar.gz

For interleaving H2D param copy with forward, I'm thinking of using nn.Module.register_forward_pre_hook(). But even with the forward hook, it is tricky to know which params under that module should be synchronized. Maybe we only synchronize the immediate nn.Parameter (there is no direct API for this I think, but using .named_parameters() and check for no prefix should be ok).

awgu · 2024-08-01T14:53:54Z

torchao/prototype/low_bit_optim/cpu_offload.py

+
+ def copy_grad_hook(p_cuda):
+     if p_cuda.grad is not None:
+         p_cpu = self.param_cuda2cpu_map[p_cuda]


I think we need a self.d2h_grad_stream.wait_stream(torch.cuda.current_stream()) or else the D2H copy may not see the correct values in p_cuda.grad. This should be why your backward kernels are finishing after your D2H copies.

It is interesting that loss looks good 😃

I'm testing with ViT fine-tuning, so I guess it's quite forgiving to bugs 🤣

awgu

SGTM!

By the way, in pre-training, people really like to do clip_grad_norm_. If we do that on CPU, I imagine it will also be super slow, so it might be better to leave gradients on GPU, clip on GPU, and then incur an exposed D2H copy to CPU. (just something to think about)

Also, I would be pretty curious to see what DeepSpeed's trace looks like to understand the perf difference.

awgu · 2024-08-05T21:54:32Z

torchao/prototype/low_bit_optim/cpu_offload.py

+
+ # deallocate CUDA gradients once D2H transfer finishes.
+ if offload_gradients:
+     p_cuda.grad.record_stream(self.stream)


Note that record_stream will have non-deterministic memory behavior (namely, when the CUDA tensor gets freed depends on when its last GPU kernel finishes, which is difficult to reason about precisely).

We really want to move away from using record_stream, but it can make your implementation more complicated.

Just curious, what are the alternatives?

The crux of the issue is that p_cuda.grad is allocated in the default stream but has ops on it in a different stream, so there is a producer/consumer stream relationship.

In such cases, you need to make sure that the consumer stream's kernels (in this case, the D2H copy) finish before any kernels in the producer stream reuse that memory.

The idea then is to hold a reference to p_cuda.grad until the CPU has issued the ops with which you want the D2H copy to overlap, and then you do torch.cuda.current_stream().wait_event(event) where event was recorded in self.stream right after the D2H copy and current_stream() is the default/producer stream. That way, any subsequent ops in the producer stream will run after the D2H copy has finished and can safely reuse the p_cuda.grad address.

The challenge can be that you do not know how many / which ops to overlap with, so it is not convenient to sync back (torch.cuda.current_stream().wait_event(d2h_event)).

However, for cases like FSDP, we do have a good point at which to sync: e.g. the previous reduce-scatter must finish before the next reduce-scatter, so let us wait for the previous reduce-scatter (doing this "sync back") right before the next reduce-scatter.

torch.cuda.current_stream().wait_event(event) means that the next backward op cannot overlap with D2H?

I think another option is to delete p_cuda.grad reference inside optim.step(), but it means we only start deallocating CUDA grad when we iterate over self.queue -> might not reduce much peak memory.

> torch.cuda.current_stream().wait_event(event) means that the next backward op cannot overlap with D2H?

Yes, the next backward op after the wait_event call cannot overlap with the D2H right before the recorded event.

One way to think about it is to think about the actual CUDA address. The GPU gradient must have some address A. We need to make sure that no other op uses A until the D2H finishes. We can reserve A and make sure no other backward ops use it as long as we keep a reference to A (as a PyTorch implementation fact). At some point, we have overlapped enough ops, and we can then free A, requiring the aforementioned sync back.

Note that with record_stream, the address A will be reserved until the D2H copy finishes on GPU, at which point maybe many or even all backward ops were issued (in the most extreme case). In that case, none of the backward ops can actually reuse A. This memory reuse depends on the relative timing of CPU and GPU, which makes it difficult to reason about precisely.

> I think another option is to delete p_cuda.grad reference inside optim.step(), but it means we only start deallocating CUDA grad when we iterate over self.queue -> might not reduce much peak memory.

I think calling record_stream probably dominates that approach because you will have to block CPU until the D2H finishes anyway.

I see. Thank you for your detailed explanation!

awgu · 2024-08-05T21:55:27Z

torchao/prototype/low_bit_optim/cpu_offload.py

+ params = param_group.pop("params")
+
+ for p_cuda in params:
+     p_cpu = p_cuda.detach().cpu().pin_memory()


nit: If you want this init to be slightly faster, you can probably pre-allocate the pinned memory and copy to it directly so that you do not have the intermediate copy to CPU paged memory.

awgu · 2024-08-05T21:56:21Z

torchao/prototype/low_bit_optim/cpu_offload.py

+
+ for p_cuda, grad_d2h_event in self.queue.items():
+     grad_d2h_event.synchronize()
+     self.optim_dict[p_cuda].step()


ah, so fused optimizer only fuses vertically? (or is there a potential perf hit here by running per-parameter fused optimizer step?)

There is some perf hit due to calling fused Adam on each parameter separately (my current approach) instead of all (or some) parameters (650ms -> 750ms iirc). I couldn't figure out a way to call fused Adam on more than one parameter because of synchronization: in __init__(), we don't know which params will have their grads' D2H finish first, so we can't statically schedule and group the params.

Technically it's still possible if we use functional Adam (i.e. wait for a few items in self.queue, then call functional Adam on them), but then it would require writing optim-specific code, instead of treating base optimizer as a black box.
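
A minimal sketch of the "wait for a few items in self.queue, then update them together" idea from the reply above. Everything here is hypothetical: `FakeEvent` stands in for a CUDA event, `step_in_groups` and `group_size` are invented names, and `step_fn` represents the optimizer-specific functional update that the reply says would have to be written. It is not how the PR is implemented (the PR calls one base optimizer per parameter).

```python
from collections import OrderedDict


class FakeEvent:
    """Stand-in for a CUDA event that marks the end of one gradient D2H copy."""

    def synchronize(self):
        # A real event would block the host here until the copy has finished.
        pass


def step_in_groups(queue, step_fn, group_size):
    """Wait for gradients in queue order and update parameters in small groups."""
    ready = []
    for param, d2h_event in queue.items():
        d2h_event.synchronize()  # this parameter's gradient is now on CPU
        ready.append(param)
        if len(ready) == group_size:
            step_fn(ready)  # one optimizer call for the whole group
            ready = []
    if ready:
        step_fn(ready)  # leftover parameters that did not fill a group


# Parameters are represented by names here; the queue keeps insertion order.
queue = OrderedDict((f"p{i}", FakeEvent()) for i in range(5))
calls = []
step_in_groups(queue, calls.append, group_size=2)
print(calls)
```

The trade-off is the one described above: larger groups amortize per-call overhead, but the first update starts later and `step_fn` can no longer treat the base optimizer as a black box.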

gau-nernst · 2024-08-05T23:05:33Z

I sent the DeepSpeed trace on Discord. Two main reasons (1) DeepSpeed CPU Adam is slower than PyTorch fused Adam, (2) They don't interleave data transfer as well as CPU optim step (might be because I didn't set the config correctly).

gau-nernst · 2024-08-05T23:16:47Z

Regarding gradient clipping, I thought about it too. The biggest improvement is actually from overlapping CPU Adam with backward (i.e. starting CPU Adam as soon as the host finishes launching all backward kernels). We can still move grads D2H during backward (which helps with hiding data transfer and offloading gradients), but CPU Adam can only start when all gradients are present on CPU to do gradient clipping. Even if we do gradient clipping on GPU, CPU Adam still needs to wait for all gradients to be available (i.e. for backward to finish).

Probably good to add a note that this CPU offload optimizer doesn't support gradient clipping at the moment. We can add support for it in a future PR.

Edit: another thing I haven't considered is gradient clipping speed. It is memory-bound. Gradient clipping on CPU would probably be much slower than on GPU.
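
To illustrate why clipping blocks the overlap described above, here is a plain-Python sketch of clipping by global norm. The function name `clip_by_global_norm` is hypothetical and gradients are modeled as lists of floats; the scaling rule (scale by max_norm / (total_norm + eps), capped at 1) mirrors the usual global-norm clipping recipe. The point is that the scale factor depends on every gradient, so no parameter can be updated until all gradients have arrived.

```python
import math


def clip_by_global_norm(grads, max_norm, eps=1e-6):
    """Scale all gradients by a single factor derived from their combined L2 norm."""
    # The total norm needs every element of every gradient.
    total_norm = math.sqrt(sum(g * g for grad in grads for g in grad))
    # Never scale up; only shrink when the total norm exceeds max_norm.
    scale = min(1.0, max_norm / (total_norm + eps))
    clipped = [[g * scale for g in grad] for grad in grads]
    return clipped, total_norm


# Two "parameters" with one gradient element each.
clipped, total_norm = clip_by_global_norm([[3.0], [4.0]], max_norm=1.0)
print(total_norm, clipped)
```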

msaroufim · 2024-08-06T05:31:48Z

torchao/prototype/low_bit_optim/README.md

+optim = CPUOffloadOptimizer(model.parameters(), torch.optim.AdamW, fused=True)
+optim.load_state_dict(ckpt["optim"])
+```
+
 ## Credits

 Credits to Tim Dettmers for creating the wonderful [bitsandbytes](https://github.com/TimDettmers/bitsandbytes) library, and [lpmm](https://github.com/thu-ml/low-bit-optimizers) authors for their work on 4-bit optimizers.


credit deepspeed as well

msaroufim · 2024-08-06T05:32:30Z

torchao/prototype/low_bit_optim/README.md

+NOTE:
+- Since the optimizer step is done on CPU, it is highly recommended to use a fast CPU optimizer, such as `torch.optim.AdamW(fused=True)` (requires PyTorch 2.4). For other optimizers, you can try `torch.compile()` their optimizer step.
+- To minimize the amount of CPU<->GPU data transfer, we keep a copy of parameters and pre-allocate gradients memory on CPU. Therefore, expect your RAM usage to increase by 2x model size + optimizer state (which is 2x model size for Adam).
+- It is recommended NOT to `torch.compile()` your whole model when `CPUOffloadOptimizer` is used, as it prevents us from interleaving gradient device-to-host transfer with backward pass. To minimize such impact, you can compile parts of your model separately.


it's not clear to me from the test or benchmark when this specific point is relevant

Are you referring to the one about torch.compile? (sorry not sure which line you are referring to from GitHub UI). If it is, I can add benchmark for this config in README, and also its trace in the PR description (to compare with eager).

msaroufim · 2024-08-06T05:33:35Z

torchao/prototype/low_bit_optim/README.md

+- Since the optimizer step is done on CPU, it is highly recommended to use a fast CPU optimizer, such as `torch.optim.AdamW(fused=True)` (requires PyTorch 2.4). For other optimizers, you can try `torch.compile()` their optimizer step.
+- To minimize the amount of CPU<->GPU data transfer, we keep a copy of parameters and pre-allocate gradients memory on CPU. Therefore, expect your RAM usage to increase by 2x model size + optimizer state (which is 2x model size for Adam).
+- It is recommended NOT to `torch.compile()` your whole model when `CPUOffloadOptimizer` is used, as it prevents us from interleaving gradient device-to-host transfer with backward pass. To minimize such impact, you can compile parts of your model separately.
+- CPU optimizer step is often the bottleneck when optimizer CPU offload is used. To minimize the slowdown, it is recommended to (1) do full BF16 training (instead of AMP), so that parameters, gradients, and optimizer states are in BF16; and (2) give GPU more work per optimizer step (e.g. larger batch size with activation checkpointing, gradient accumulation).


full bf16 training can be tricky fwiw, i believe we'll likely run into convergence issues at larger model sizes but this is fine for now

I can add benchmarks for BF16 AMP.

msaroufim

some minor nits but this is very nice

* initial commit
* use fused=True by default for PyTorch adam
* detach param
* try overlap D2H grad copy with backward
* add customizable profile num steps
* add v2
* fix various bugs
* fix v1 impl
* add full BF16 option
* change n_profile_steps to 5
* add v3
* fix gradient accumulation
* add note
* add deepspeed offload
* update deepspeed config
* add some notes
* update instructions. make some packages optional. change to AdamW
* add last updated ordered dict
* update deepspeed version
* remove old versions
* update docs
* say deepspeed is untuned
* add test
* add test for offload_gradients. update benchmark script
* update benchmark results. fix test. fix benchmark script
* fix language
* add save and load
* pre-allocate CPU params. add note about gradient clipping
* update README and remove unused imports

Theodotus1243 · 2024-09-21T01:50:29Z

How do I use CPUOffloadOptimizer with LRScheduler? LRScheduler has this check:

# Attach optimizer
if not isinstance(optimizer, Optimizer):
    raise TypeError(f'{type(optimizer).__name__} is not an Optimizer')

And

class CPUOffloadOptimizer:
    def __init__(

gau-nernst · 2024-09-21T01:55:48Z

Hi @Theodotus1243, you have to set the LR manually, since PyTorch's built-in LRScheduler requires the optimizer to be a torch.optim.Optimizer subclass, as you have discovered. Something like this:

ao/benchmarks/benchmark_low_bit_adam.py

Lines 256 to 261 in 0bdde92

lr = lr_schedule.get_lr(step)
for param_group in optim.param_groups:
    if isinstance(param_group["lr"], torch.Tensor):
        param_group["lr"].fill_(lr)
    else:
        param_group["lr"] = lr

The reason I don't want to make CPUOffloadOptimizer a torch.optim.Optimizer subclass is that it doesn't seem right: CPUOffloadOptimizer itself doesn't hold the params and buffers; it delegates to the base optimizer class and only holds a list of base optimizers.

Hope it clarifies the problem. I think we can add this caveat to doc.
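
A self-contained sketch of the manual approach described above. The names `warmup_cosine_lr` and `set_lr` are hypothetical helpers, and the warmup-plus-cosine schedule is just one example. To keep the snippet free of a torch import, `set_lr` checks for a `fill_` method instead of using `isinstance(..., torch.Tensor)` as the benchmark script does; the intent is the same, updating a tensor LR in place and replacing a plain float.

```python
import math


def warmup_cosine_lr(step, base_lr, warmup_steps, total_steps):
    """Linear warmup from 0 to base_lr, then cosine decay to 0 (example schedule)."""
    if step < warmup_steps:
        return base_lr * step / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * base_lr * (1.0 + math.cos(math.pi * progress))


def set_lr(param_groups, lr):
    """Write lr into every param group, in place when the LR is tensor-like."""
    for group in param_groups:
        current = group["lr"]
        if hasattr(current, "fill_"):  # tensor-like LR: update in place
            current.fill_(lr)
        else:  # plain Python number: replace the value
            group["lr"] = lr


# Plain dicts stand in for optimizer param groups in this demo.
param_groups = [{"lr": 0.0}, {"lr": 0.0}]
for step in range(3):
    set_lr(param_groups, warmup_cosine_lr(step, 1e-3, warmup_steps=10, total_steps=100))
print(param_groups)
```

In a training loop this would be called once per step, before the optimizer step, with `optim.param_groups` in place of the demo list.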

bghira · 2024-09-26T23:32:10Z

i think it should not be referred to as a drop-in replacement then @gau-nernst and as it is, having a method that sets the lr isn't too much to ask for, i hope?

initial commit

3cd42d2

facebook-github-bot added the CLA Signed label Aug 1, 2024

msaroufim requested review from awgu, albanD and msaroufim August 1, 2024 05:10

msaroufim reviewed Aug 1, 2024

use fused=True by default for PyTorch adam

c044d88

msaroufim reviewed Aug 1, 2024

detach param

d85b172

awgu reviewed Aug 1, 2024

try overlap D2H grad copy with backward

d468e6f

add customizable profile num steps

d7a07eb

awgu reviewed Aug 1, 2024

gau-nernst added 9 commits August 2, 2024 01:53

add v2

fe653e9

fix various bugs

8ae42c3

fix v1 impl

b2c00e5

add full BF16 option

68835e3

change n_profile_steps to 5

b5393cb

add v3

3069b23

fix gradient accumulation

7af8518

add note

5ff2e5a

add deepspeed offload

a8a7b5a

gau-nernst added 7 commits August 4, 2024 11:34

update docs

c514dba

say deepspeed is untuned

cfdfe5d

add test

c4ea68b

add test for offload_gradients. update benchmark script

6478be9

update benchmark results. fix test. fix benchmark script

03cf0ad

fix language

a144b22

add save and load

d344817

gau-nernst marked this pull request as ready for review August 4, 2024 08:48

Merge branch 'pytorch:main' into optim_cpu_offload

fc358b1

gau-nernst requested review from awgu and msaroufim August 5, 2024 21:36

awgu approved these changes Aug 5, 2024

gau-nernst added 2 commits August 6, 2024 07:19

Merge branch 'main' into optim_cpu_offload

5a5253e

pre-allocate CPU params. add note about gradient clipping

7aa31eb

msaroufim reviewed Aug 6, 2024

msaroufim self-requested a review August 6, 2024 05:38

msaroufim approved these changes Aug 6, 2024

update README and remove unused imports

231a6ef

msaroufim merged commit 1b1e94c into pytorch:main Aug 6, 2024
13 checks passed

gau-nernst deleted the optim_cpu_offload branch August 6, 2024 23:57

gau-nernst mentioned this pull request Aug 7, 2024

[RFC] Optimizer CPU offload from torchao for single GPU low memory config pytorch/torchtune#1278

Open

gau-nernst mentioned this pull request Sep 26, 2024

CPUOffloadOptimizer incompatible with learning rate schedulers #959

Open
