support pp accuracy calculation #9379
Conversation
Thanks for your contribution!
paddlenlp/trainer/trainer.py
Outdated
if pp_group.nranks > 1:
    logit_shape = [[]]
    if "pp_logits" in infohub:
        logits = paddle.concat(infohub["pp_logits"], axis=0)
Why is concat used here? I don't quite understand.
Codecov Report
Attention: Patch coverage is

Additional details and impacted files

@@           Coverage Diff            @@
##           develop    #9379   +/-   ##
===========================================
+ Coverage    52.93%   53.10%   +0.17%
===========================================
  Files          688      694       +6
  Lines       109379   110966    +1587
===========================================
+ Hits         57899    58930    +1031
- Misses       51480    52036     +556

View full report in Codecov by Sentry.
paddlenlp/trainer/trainer.py
Outdated
# evaluation doesn't support drop last,
# so set `accumulate_steps` to the actual
# eval batch size.
model_config_backup = model.accumulate_steps
Isn't this naming a bit off? This clearly isn't a model config.
paddlenlp/trainer/trainer.py
Outdated
logits = None
if "pp_logits" in infohub:
    logits = paddle.concat(infohub["pp_logits"], axis=0)
    logits = logits._copy_to(paddle.framework._current_expected_place(), False)
Is the copy here because pp_logits is kept in CPU memory or CUDA pinned memory?
Yes. If it weren't kept in CPU or pinned memory, the concat would add a peak GPU memory overhead of twice the logits' size and cause an OOM.
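The memory-saving pattern discussed above can be sketched as follows (numpy arrays standing in for paddle tensors, since the device placement itself can't be shown host-side): each micro-batch's logits are moved off the GPU as soon as they are produced, so the device never holds both the chunk list and the 2x-sized concatenated result at once. In paddle terms this corresponds to a `_copy_to(paddle.CUDAPinnedPlace(), False)` per chunk before the concat, and a single `_copy_to(paddle.framework._current_expected_place(), False)` on the result afterwards. The function name here is illustrative, not PaddleNLP API.

```python
import numpy as np

def gather_pp_logits(host_chunks):
    """Concatenate per-micro-batch logits that already live on the host.

    host_chunks: list of arrays of shape (micro_batch, ...) that were each
    copied off the device as they were produced, keeping the peak memory of
    the concat on the host rather than on the GPU.
    """
    return np.concatenate(host_chunks, axis=0)
```

In the PR itself, the chunks are accumulated under the "pp_logits" key of infohub, and only the final concatenated tensor is copied back to the current expected place.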
paddlenlp/trainer/trainer.py
Outdated
@@ -3312,6 +3347,8 @@ def prediction_step(
    if self.args.pipeline_parallel_degree > 1:
        # hack for pipeline mode
        inputs = self._prepare_inputs(inputs)
        if self.args.metric_for_best_model == "accuracy":
I'd suggest not putting this in Trainer; it would be more appropriate in SFTTrainer.
paddlenlp/trainer/trainer.py
Outdated
# evaluation doesn't support drop last,
# so set `accumulate_steps` to the actual
# eval batch size.
model_config_backup = model.accumulate_steps
paddlenlp/trainer/trainer.py
Outdated
else:
    input_ids = inputs

model.accumulate_steps = input_ids.shape[0]
Or else just set model.micro_batch_size directly to 1.
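The backup/restore dance around accumulate_steps that the diff implements can be sketched as a context manager. This is a hedged illustration, not PaddleNLP API: `model` is any object exposing `accumulate_steps`, and the function name is invented for the example.

```python
from contextlib import contextmanager

@contextmanager
def eval_accumulate_steps(model, eval_batch_size):
    # Evaluation can't drop the last (possibly smaller) batch, so the
    # pipeline must run exactly `eval_batch_size` micro-batches this step.
    backup = model.accumulate_steps
    model.accumulate_steps = eval_batch_size
    try:
        yield model
    finally:
        # Restore the training-time value even if evaluation raises.
        model.accumulate_steps = backup
```

The reviewer's alternative, pinning model.micro_batch_size to 1, would make accumulate_steps equal the batch size by construction and avoid this per-batch bookkeeping.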
…nto support-pp-acc Conflicts: paddlenlp/trainer/trainer.py
@@ -81,6 +81,7 @@
    "fp16_opt_level": "O2",
    "max_grad_norm": 1.0,
    "dataloader_num_workers": 0,
    "metric_for_best_model": "accuracy",
We'll adapt this for the open-source models later as well.
LGTM
lgtm
LGTM
* support pp accuracy calculation
* add pp accuracy ci
* add comment
* update
* mv logits accumulation to cpu
* refactor code
* code refactor
* remove ci, not support yet
* update