
text classification bug fix & support ernie m #3184

Merged: 3 commits merged into PaddlePaddle:develop on Sep 5, 2022

Conversation

@lugimzzz (Contributor) commented on Sep 2, 2022:

PR types

Others

PR changes

Others

Description

Fixes bugs in text classification and adds support for the ernie-m model (pruning and deployment are not supported yet).

@@ -376,7 +384,7 @@ python prune.py \
* `per_device_eval_batch_size`: batch size used when evaluating on the dev set; adjust it according to available GPU memory and lower it if you run out of memory; defaults to 32.
* `learning_rate`: maximum learning rate for training; defaults to 3e-5.
* `num_train_epochs`: number of training epochs; when using early stopping you can set it to 100; defaults to 10.
-* `logging_steps`: number of steps between log prints during training; defaults to 5.
+* `logging_steps`: number of steps between log prints during training; defaults to 100.
Collaborator:

Isn't this logging_steps value too large for most CPU users?

Contributor Author:

That is because the Trainer's built-in default for this parameter is 100; when I ran training I still set `--logging_steps 5` on the command line.
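
For illustration, a hedged example of that kind of invocation (the script name is taken from the hunk context above, and the exact flag combination is an assumption rather than a command copied from the PR):

```bash
# Override the Trainer's built-in default of logging_steps=100 from the
# command line, as the author describes; the other values mirror the
# documented defaults in the parameter list above.
python prune.py \
    --per_device_eval_batch_size 32 \
    --learning_rate 3e-5 \
    --num_train_epochs 10 \
    --logging_steps 5
```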

-    'token_type_ids'], batch['labels']
-    logits = model(input_ids, token_type_ids)
+    label = batch.pop("labels")
+    logits = model(**batch)
Collaborator:

What is the reason for not unpacking the inputs explicitly here?

Contributor Author:

This is to support the ernie-m model: neither the data produced by the ernie-m tokenizer nor the model's inputs include token_type_ids.
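
To make the point concrete, here is a minimal sketch (not the PR's actual script; the model names and the helper function are assumptions) of why `model(**batch)` serves both model families:

```python
# Sketch: pop the labels off the batch and forward the remaining tokenizer
# outputs with **batch, so the same code path serves ernie (input_ids +
# token_type_ids) and ernie-m (no token_type_ids, per the author's explanation).
from paddlenlp.transformers import AutoTokenizer

def run_model(model, batch):
    label = batch.pop("labels")   # labels must not be passed to the model
    logits = model(**batch)       # forward only the keys the tokenizer produced
    return logits, label

# Compare what each tokenizer actually emits (model names assumed available):
for name in ["ernie-3.0-medium-zh", "ernie-m-base"]:
    tokenizer = AutoTokenizer.from_pretrained(name)
    encoded = tokenizer("这是一条测试文本")
    print(name, sorted(encoded.keys()))
# Expected per the discussion: the ernie tokenizer includes token_type_ids,
# the ernie-m tokenizer does not, so indexing batch['token_type_ids'] would fail.
```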

-    'token_type_ids'], batch['labels']
-    logits = model(input_ids, token_type_ids)
+    label = batch.pop("labels")
+    logits = model(**batch)
Collaborator:

Same question as above.

Contributor Author:

Same as above.

@@ -264,16 +267,15 @@ checkpoint/

* To resume training from a checkpoint, set `init_from_ckpt`, e.g. `init_from_ckpt=checkpoint/model_state.pdparams`.
* For English text classification tasks, simply switch the pretrained model parameter `model_name`. "ernie-2.0-base-en" is recommended for English tasks; more available models are listed under [Transformer pretrained models](https://paddlenlp.readthedocs.io/zh/latest/model_zoo/index.html#transformer).

* For text classification in languages other than Chinese and English, the multilingual pretrained models "ernie-m-base" and "ernie-m-large" are recommended. Deployment of text classification models is not yet supported for the multilingual models; this feature is under active development.
Collaborator:

The ernie-m version is currently being adapted to the compression API; what is the main issue?

Contributor Author:

Neither the data produced by the ernie-m tokenizer nor the model's inputs include token_type_ids, but during training the compression API assumes by default that the model inputs include token_type_ids:

https://github.com/PaddlePaddle/PaddleNLP/blob/develop/paddlenlp/trainer/trainer_compress.py#L386
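
A hedged illustration of the mismatch being described (this is not the code at the linked line, only the shape of the assumption): a code path that reads `token_type_ids` unconditionally breaks on an ernie-m batch, while a tolerant lookup does not.

```python
# Assumed reproduction, not the actual trainer_compress.py logic:
# the ernie-m tokenizer output carries no token_type_ids key.
from paddlenlp.transformers import AutoTokenizer

batch = dict(AutoTokenizer.from_pretrained("ernie-m-base")("多语言文本分类"))

try:
    token_type_ids = batch["token_type_ids"]   # what a token_type_ids-first code path does
except KeyError:
    print("ernie-m batch keys:", sorted(batch.keys()))  # no token_type_ids present

# A model-agnostic alternative simply tolerates the missing key:
token_type_ids = batch.get("token_type_ids")   # None for ernie-m, present for ernie
```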

@wawltor (Collaborator) left a comment:

LGTM

@lugimzzz merged commit ab2bd21 into PaddlePaddle:develop on Sep 5, 2022
@lugimzzz deleted the erniem branch on Sep 5, 2022 at 07:16