# Finetuning

Notes on the plans for finetuning the pre-trained model.

# Large Model on a smaller hardware setup

- fine-tuning a 150-200B parameter model with fewer GPUs than the pre-training setup
## a. Fine-Tuning requiring only the pre-trained model weights, with freshly initialized optimizer states

Solution: This can be done using ZeRO-Infinity.

Hardware Requirements: This would require about 2.5-5 TB of aggregate memory for a 100-200B parameter model. It can be either CPU memory or NVMe memory, and it can be within a single node or across nodes. A single-node server with enough CPU or NVMe memory can work, if speed is not an issue.

Estimated Work: We can do this with ZeRO-Infinity. It seems that @Shaden Smith already has the code to load model parameter checkpoints from Megatron+DeepSpeed 3D into Megatron+DeepSpeed ZeRO-Infinity.
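To make option (a) concrete, here is a minimal sketch of a DeepSpeed config for ZeRO-Infinity fine-tuning, assuming ZeRO stage 3 with parameter and optimizer offload to NVMe. The batch sizes, learning rate, NVMe path, and the tiny stand-in model are illustrative assumptions, not values from this repo; the real run would load the converted Megatron+DeepSpeed 3D checkpoint into the full model.

```python
# Sketch only: ZeRO-Infinity (ZeRO stage 3 + NVMe offload) fine-tuning that needs
# just the pre-trained weights; optimizer states are created fresh and offloaded.
import deepspeed
import torch.nn as nn

ds_config = {
    "train_micro_batch_size_per_gpu": 1,   # illustrative
    "gradient_accumulation_steps": 16,     # illustrative
    "fp16": {"enabled": True},
    "optimizer": {"type": "Adam", "params": {"lr": 1e-5}},
    "zero_optimization": {
        "stage": 3,
        "offload_param":     {"device": "nvme", "nvme_path": "/local_nvme", "pin_memory": True},
        "offload_optimizer": {"device": "nvme", "nvme_path": "/local_nvme", "pin_memory": True},
        "overlap_comm": True,
        "contiguous_gradients": True,
    },
}

# Tiny stand-in for the real 150-200B Megatron model, whose pre-trained weights
# would be loaded from the converted Megatron+DeepSpeed 3D checkpoint.
model = nn.Sequential(nn.Linear(1024, 4096), nn.GELU(), nn.Linear(4096, 1024))

engine, optimizer, _, _ = deepspeed.initialize(
    model=model,
    model_parameters=model.parameters(),
    config=ds_config,
)
# Only the weights come from pre-training; ZeRO-Infinity builds and partitions
# the new optimizer states across CPU/NVMe.
```

Launched with the `deepspeed` launcher, this would fit on a single node with enough NVMe, just slowly, which matches the hardware note above.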
## b. Continued-Training requiring both the model weights and optimizer states from pre-training

Solution: This can be done using Megatron+DeepSpeed 3D with ZeRO CPU Offload.
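For reference, a sketch of the `zero_optimization` block this option implies, assuming optimizer-state offload to CPU; the ZeRO stage is an assumption (pipeline parallelism in the 3D setup is typically paired with ZeRO-1, though offload support per stage depends on the DeepSpeed version), and the rest of the Megatron+DeepSpeed config would stay as in pre-training.

```python
# Sketch only: ZeRO CPU Offload for continued training under 3D parallelism.
# Stage 1 is assumed (pipeline parallelism usually pairs with ZeRO-1); depending
# on the DeepSpeed version, offload may instead require stage 2. The Adam states
# move to CPU RAM while fp16 parameters and gradients stay on the GPUs.
zero_cpu_offload = {
    "zero_optimization": {
        "stage": 1,
        "offload_optimizer": {"device": "cpu", "pin_memory": True},
    },
    "fp16": {"enabled": True},
}
```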
Hardware Requirements: This option will require 2-4 TB of aggregate CPU memory to store the optimizer states, and 600-1200 GB of aggregate GPU memory to store parameters, gradients and activations for a 100-200B parameter model.

This reduces the number of GPUs required by 4x: it will run on 32-64 GPUs across 4-8 nodes, each with 8x V100 GPUs and 768 GB of RAM.
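A rough back-of-envelope check of those numbers, under the usual mixed-precision Adam assumption: roughly 12 bytes/param of fp32 optimizer state offloaded to CPU, and 4 bytes/param of fp16 parameters plus gradients on GPU, before activations. The byte counts are assumptions, not measurements from this setup.

```python
# Back-of-envelope memory estimate for option (b); byte counts are assumptions
# (fp32 Adam states offloaded to CPU, fp16 params + grads resident on GPU).
def estimate_memory(n_params_billion):
    n = n_params_billion * 1e9
    cpu_tb = n * 12 / 1e12   # fp32 master weights + Adam momentum + variance
    gpu_gb = n * 4 / 1e9     # fp16 parameters + fp16 gradients (activations extra)
    return cpu_tb, gpu_gb

for billions in (100, 200):
    cpu_tb, gpu_gb = estimate_memory(billions)
    print(f"{billions}B params: ~{cpu_tb:.1f} TB CPU, ~{gpu_gb:.0f} GB GPU before activations")
# ~1.2-2.4 TB CPU and ~400-800 GB GPU, i.e. the same order of magnitude as the
# 2-4 TB / 600-1200 GB ranges above once activations and working buffers are added.
```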
Estimated Work: The current code already supports it.
# Inference

Notes on the plans for doing inference with the pre-trained model.

# Large Model on limited hardware

- running inference and tinkering on a single host (150-200B parameter model)
Solution: We can do this with ZeRO-Infinity. It seems that @Shaden Smith already has the code to load model parameter checkpoints from Megatron+DeepSpeed 3D into Megatron+DeepSpeed ZeRO-Infinity. The remaining work is to add an inference-only mode to ZeRO-Infinity that drops all the non-parameter states.
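Since that inference-only mode is still to be written, the snippet below only sketches the intended flow, assuming the engine can be built without an optimizer and with parameters offloaded to CPU (or NVMe). The stand-in model, tensor sizes, and offload device are illustrative, not from this repo.

```python
# Sketch only: intended single-host inference flow with ZeRO stage 3 parameter
# offload. No optimizer is created, so no optimizer or gradient state is kept;
# the inference-only mode described above would formalize this path.
import deepspeed
import torch
import torch.nn as nn

ds_config = {
    "train_micro_batch_size_per_gpu": 1,
    "fp16": {"enabled": True},
    "zero_optimization": {
        "stage": 3,
        "offload_param": {"device": "cpu", "pin_memory": True},  # or "nvme" plus an nvme_path
    },
}

# Tiny stand-in for the real 150-200B model loaded from the converted checkpoint.
model = nn.Sequential(nn.Linear(1024, 4096), nn.GELU(), nn.Linear(4096, 1024))

engine, _, _, _ = deepspeed.initialize(model=model, config=ds_config)
engine.module.eval()

with torch.no_grad():
    x = torch.randn(1, 1024, dtype=torch.half, device=engine.device)
    y = engine(x)  # parameters are fetched/offloaded as needed; nothing else is stored
```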
Hardware Requirements: This would require about 500-1000 GB of memory (which can be CPU, GPU or NVMe). A single node with enough CPU or NVMe memory should work here.
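A quick sanity check of that figure, counting parameters only (no gradients or optimizer states); the per-parameter byte counts are assumptions.

```python
# Parameters-only memory for a 150-200B model at common precisions; assumptions,
# not measurements. Working buffers and headroom push the total toward the
# 500-1000 GB range quoted above.
for n_billion in (150, 200):
    for label, bytes_per_param in (("fp16", 2), ("fp32", 4)):
        print(f"{n_billion}B @ {label}: ~{n_billion * bytes_per_param:.0f} GB")
```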
Estimated Work: If all works as expected, 1-3 weeks depending on bandwidth availability. Tuning for the best performance might take another week or so, but that won't block the availability of the functionality.