Conversation

@santacml
No description provided.

tjruwase and others added 27 commits September 13, 2023 15:33
This PR updates the Llama check in the DS-Chat Step 3 PPO trainer to access the configuration through the actor's module object rather than through model. This is necessary because not all model types expose the configuration consistently through model; the BLOOM model family in particular does not.
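A minimal sketch of the corrected access pattern, using stand-in objects in place of a real DeepSpeed engine (the helper name `actor_is_llama` is hypothetical; the `module.config.model_type` path follows the description above):

```python
from types import SimpleNamespace

def actor_is_llama(actor_engine):
    # Read the config from the wrapped module object rather than from the
    # top-level model, which some families (e.g. BLOOM) do not expose
    # consistently.
    return "llama" in actor_engine.module.config.model_type.lower()

# Stand-in for a DeepSpeed engine wrapping a Llama actor.
actor = SimpleNamespace(
    module=SimpleNamespace(config=SimpleNamespace(model_type="llama")))
```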
Performance metrics in DeepSpeed-Chat were broken with transformers version 4.33.2, so that version is now explicitly excluded in the DeepSpeed-Chat requirements.txt. This fixes the currently broken nv-ds-chat workflow.

The issue was recently fixed in a HuggingFace transformers PR, so it will not be a problem in the next transformers release.
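A version exclusion of this kind can be expressed in requirements.txt with pip's `!=` specifier (the exact line used by the PR may differ; this is illustrative):

```
transformers!=4.33.2
```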
Currently, the chatbot assumes an OPTForCausalLM model.
Modify it to use the model class required by the checkpoint.

Change-Id: I04cbc28f87c7be4fc89a3fac39a3e5634b151b32

Signed-off-by: Moshe Island <misland@habana.ai>
Co-authored-by: Moshe Island <misland@habana.ai>
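One way to discover the required class is to read the architecture that transformers' save_pretrained() records in the checkpoint's config.json; the class can then be fetched via AutoModelForCausalLM. The helper name `resolve_architecture` below is hypothetical, and only the name lookup is shown:

```python
import json
import os
import tempfile

def resolve_architecture(checkpoint_dir):
    """Return the model class name recorded in the checkpoint's
    config.json, e.g. "BloomForCausalLM" or "OPTForCausalLM"."""
    with open(os.path.join(checkpoint_dir, "config.json")) as f:
        return json.load(f)["architectures"][0]

# Toy checkpoint directory for illustration.
ckpt = tempfile.mkdtemp()
with open(os.path.join(ckpt, "config.json"), "w") as f:
    json.dump({"architectures": ["BloomForCausalLM"]}, f)
```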
* FlexGen reference

* Fix DS version and opt issue

* Fix script

* Fix type and padding

* loop option

* PR feedback

* Bug fix

* Format fixes
DeepSpeed's bf16_optimizer does not have an overflow attribute.
This is fine, since the bf16 dtype has the same dynamic range as fp32 and is
not expected to overflow.
Therefore, for bf16, always report no overflow.

Change-Id: I66a2204f3af81e52e7fa8d024afafdbbc7494327

Signed-off-by: Moshe Island <misland@habana.ai>
Co-authored-by: Moshe Island <misland@habana.ai>
…i#746)

Currently, only the disable_dropout configuration is supported.
However, some models (e.g. Bloom) default to dropout=0 in their model config.
Therefore, modify the code to support an explicit dropout configuration,
and update the existing training scripts accordingly.

Change-Id: I5ee96a77ca2b58d9787573a48009e2af36a270b0

Signed-off-by: Moshe Island <misland@habana.ai>
Co-authored-by: Moshe Island <misland@habana.ai>
Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
When using LoRA only, get_optimizer_grouped_parameters() returns a list of
three parameter groups, of which only the second is non-empty.
DeepSpeed then removes the empty parameter groups
[ref: DeepSpeedEngine._configure_optimizer(), deepspeed v0.10.3].
However, the lr_scheduler still holds all three groups, which causes it to
update the LoRA params with the wrong lr.

Fix it by removing all empty groups in get_optimizer_grouped_parameters().

Change-Id: I520841312bdedd6a572cf4c827e0bbf06f983575

Signed-off-by: Moshe Island <misland@habana.ai>
Co-authored-by: Moshe Island <misland@habana.ai>
Co-authored-by: Ammar Ahmad Awan <ammar.awan@microsoft.com>
Co-authored-by: Michael Wyatt <mrwyattii@gmail.com>
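A simplified sketch of the fix (the grouping criteria here are illustrative, not DeepSpeed-Chat's exact rules; the key line is the final filter):

```python
def get_optimizer_grouped_parameters(named_params, weight_decay, lora_lr):
    # Simplified grouping: decayed params, LoRA params (own lr),
    # no-decay params.
    groups = [
        {"params": [p for n, p in named_params
                    if "lora" not in n and not n.endswith("bias")],
         "weight_decay": weight_decay},
        {"params": [p for n, p in named_params if "lora" in n],
         "weight_decay": weight_decay, "lr": lora_lr},
        {"params": [p for n, p in named_params
                    if "lora" not in n and n.endswith("bias")],
         "weight_decay": 0.0},
    ]
    # Drop empty groups here: DeepSpeed removes them when building the
    # optimizer, so a scheduler built from the original list would step
    # mismatched groups with the wrong lr.
    return [g for g in groups if g["params"]]
```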
* support trust_remote_code

* make trust_remote_code an argument

---------

Co-authored-by: Ammar Ahmad Awan <ammar.awan@microsoft.com>
Co-authored-by: Michael Wyatt <mrwyattii@gmail.com>
Co-authored-by: Conglong Li <conglong.li@gmail.com>
Co-authored-by: xiaoxiawu-microsoft <xiaoxiawu@microsoft.com>
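A minimal sketch of the flag, off by default since trusting remote code executes arbitrary model code (the exact help text and wiring are assumptions):

```python
import argparse

parser = argparse.ArgumentParser()
# Off by default: loading remote modeling code executes arbitrary code,
# so it must be opted into explicitly.
parser.add_argument("--trust_remote_code", action="store_true")
args = parser.parse_args(["--trust_remote_code"])
# The flag would then be forwarded to transformers, e.g.:
# AutoModelForCausalLM.from_pretrained(path,
#                                      trust_remote_code=args.trust_remote_code)
```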
Computing the loss in fp32 can improve training accuracy in all 3 stages.
This was tested with the Bloom model using the bf16 dtype.

While at it, fix stage2 reward model creation: pass zero_stage to
create_critic_model.

Also, in stage3, when using bf16 with tensorboard enabled, we record the actor
and critic loss. Tensorboard accepts a scalar bf16 loss tensor and converts it
to numpy; this fails, since numpy does not support bf16. Fix it by logging
loss.item() to tensorboard instead.

Change-Id: I9c8e95d4886cdb44aaa6c14c4aee738e133ae405

Signed-off-by: Moshe Island <misland@habana.ai>
Co-authored-by: Moshe Island <misland@habana.ai>
Add support for periodic evaluation during reward model (rm) training,
configurable via the added --eval_interval and --eval_iters arguments.
The default configuration is backward compatible.

In addition, also display the score of the rejected predictions.

Change-Id: Ib377fd731fe676c01114c087581a30777a3f3f49

Signed-off-by: Moshe Island <misland@habana.ai>
Co-authored-by: Moshe Island <misland@habana.ai>
Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
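The two arguments might look as follows; the defaults and their semantics ("0 disables periodic evaluation") are assumptions chosen to match the backward-compatible behavior described above:

```python
import argparse

parser = argparse.ArgumentParser()
parser.add_argument("--eval_interval", type=int, default=0,
                    help="Evaluate every N training steps; 0 keeps the "
                         "old behaviour of no periodic evaluation.")
parser.add_argument("--eval_iters", type=int, default=100,
                    help="Number of batches per evaluation pass.")
args = parser.parse_args(["--eval_interval", "50", "--eval_iters", "20"])
```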
* Fix typo

* Fix precommit check

* Format fix

---------

Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
Co-authored-by: Michael Wyatt <mrwyattii@gmail.com>
When using only_optimize_lora, we still need to train the v_head parameter.

Change-Id: I252c3ee69819997bf336482c6779b070f2e76df8

Signed-off-by: Moshe Island <misland@habana.ai>
Co-authored-by: Moshe Island <misland@habana.ai>
Co-authored-by: Lev Kurilenko <113481193+lekurile@users.noreply.github.com>
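A sketch of the freezing rule, with a stand-in for torch parameters so it stays self-contained (the name-matching substrings are an assumption; real code would iterate model.named_parameters()):

```python
def freeze_for_lora(named_parameters):
    """Freeze everything except LoRA weights and the reward head v_head."""
    for name, param in named_parameters:
        param.requires_grad = "lora" in name or "v_head" in name

class _P:                       # stand-in for torch.nn.Parameter
    def __init__(self):
        self.requires_grad = True

params = {"transformer.h.0.attn.lora_A": _P(),
          "transformer.h.0.attn.weight": _P(),
          "v_head.weight": _P()}
freeze_for_lora(params.items())
```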
Current default name used to detect LN layers is "LayerNorm.weight".
This does not work for the following models:
- opt: uses "layer_norm"
- llama: uses "norm" and "layernorm"
- bloom: uses "layernorm" and "ln_f"

Therefore, modify the default names to accommodate the above.
Also, compare names in lowercase to capture models with different
capitalization.

Change-Id: I5b805df2663c62daf3d9c8a31a973742e344e76b

Signed-off-by: Moshe Island <misland@habana.ai>
Co-authored-by: Moshe Island <misland@habana.ai>
Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
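A sketch of lowercase substring matching over the expanded name list (the exact default list and helper name are assumptions mirroring the names quoted above; "bias" is included since LN detection typically feeds the no-weight-decay grouping):

```python
# Default substrings are an assumption covering the names listed above.
NO_DECAY_NAME_LIST = ["bias", "layer_norm", "layernorm", "ln_f", "norm"]

def skip_weight_decay(param_name, name_list=NO_DECAY_NAME_LIST):
    # Compare in lowercase so "LayerNorm", "layernorm", and "layer_norm"
    # are all captured regardless of a model's capitalization.
    lowered = param_name.lower()
    return any(n in lowered for n in name_list)
```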

The Bloom-560m model has high variance in its last LN layer weight.
This causes accuracy issues in bf16 stage2 training.
Therefore, reset the parameters of the last LN layer before training.
This is good practice in any case where we replace the classifier that
follows the LN.

In addition, when using only_optimize_lora, we need to force training of
the LN parameters that were reset.

Note that the current fix uses plain initialization of the final LN.
A separate commit will add support for zero3 initialization.

Change-Id: I323d8947907eb4a1cc0fa6354bdaf0cbbf33a68d

Signed-off-by: Moshe Island <misland@habana.ai>
Co-authored-by: Moshe Island <misland@habana.ai>
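The plain initialization amounts to resetting the LN to the identity transform (weight 1, bias 0). A torch-free sketch, written to work in place on any mutable sequence (the helper name is illustrative; real code would use torch.nn.init on the layer's tensors):

```python
def reset_layer_norm(weight, bias):
    """Plain re-initialization of a LayerNorm to the identity transform:
    weight -> 1.0, bias -> 0.0, mutated in place."""
    for i in range(len(weight)):
        weight[i] = 1.0
    for i in range(len(bias)):
        bias[i] = 0.0

w, b = [3.7, -2.1, 0.9], [0.5, 0.5, 0.5]   # high-variance toy weights
reset_layer_norm(w, b)
```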
Currently, ppl is calculated per local worker and then averaged over data
parallel workers. Fix it by first averaging the loss over the data parallel
workers and then calculating ppl from the averaged loss.

While at it, print the loss in evaluate.

Change-Id: Ic4108ca48a18b326677d80c1eee81c535b3a27a9

Signed-off-by: Moshe Island <misland@habana.ai>
Co-authored-by: Moshe Island <misland@habana.ai>
Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
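The two orders of operations differ because exp is convex: averaging per-worker perplexities overstates the true value (Jensen's inequality), while exponentiating the averaged loss gives the correct corpus-level ppl. A toy two-worker illustration:

```python
import math

def perplexity(mean_loss):
    return math.exp(mean_loss)

# Per-worker mean losses on a 2-way data parallel run.
losses = [1.0, 3.0]

wrong = sum(perplexity(l) for l in losses) / len(losses)  # mean of ppls
right = perplexity(sum(losses) / len(losses))             # ppl of mean loss
```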
Stages 1 & 2 append the '<|endoftext|>' text marker to all samples.
However, some tokenizers (e.g. OPT, Bloom) encode this marker as a sequence
of subword tokens rather than as a single special token.

This commit adds optional support for registering the EOT marker as a special
token, forcing the tokenizer to encode it as a single token.

Note that using the EOT special token may change the dynamics of stage3
training. Therefore, to remain backward compatible, this commit makes it
optional.

Change-Id: If98d348fcaa7d6685e755aabe305e23e7649c367

Signed-off-by: Moshe Island <misland@habana.ai>
Co-authored-by: Moshe Island <misland@habana.ai>
Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
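A toy illustration of the encoding difference (real code would call the tokenizer's add_special_tokens and resize the model embeddings; this stand-in only shows the token-count effect):

```python
END_OF_TEXT = "<|endoftext|>"

class ToyTokenizer:
    """Toy tokenizer with an optional special-token table; text not in
    the table falls back to per-character ids, standing in for a
    subword split of the marker."""
    def __init__(self):
        self.special = {}
    def add_special_token(self, token):
        self.special[token] = 50000 + len(self.special)
    def encode(self, text):
        if text in self.special:
            return [self.special[text]]      # one id for the whole marker
        return [ord(c) for c in text]        # many ids, like subword pieces

tok = ToyTokenizer()
before = len(tok.encode(END_OF_TEXT))        # marker splits into pieces
tok.add_special_token(END_OF_TEXT)
after = len(tok.encode(END_OF_TEXT))         # now a single token
```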
@santacml santacml merged commit f14540e into python_package Oct 18, 2023