
Some questions about FP16 training #841

Closed
Shuweis opened this issue Apr 12, 2022 · 8 comments

Labels: kind/bug (something isn't working), priority/P0 (highest priority), status/need more info (need more information from the creator)

Comments


Shuweis commented Apr 12, 2022

Hi! I see that MMEditing supports FP16 training. How can I use it?

wangruohui (Member) commented Apr 12, 2022

Some of the code is ready, but we haven't fully implemented this feature. You need to modify some source code and possibly fix some potential bugs.

Currently, MMEditing supports FP16 based on mmcv's auto_fp16: if a model's forward function is wrapped with @auto_fp16, like this, it potentially supports this feature.

To enable FP16, you need to set the fp16_enabled attribute to True in __init__, like here. Then you can try training the model with the corresponding config files.

There was a previous PR, #320, but it is somewhat out of date.
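
For reference, here is a minimal sketch (not MMEditing source; ExampleRestorer and the lq argument are made-up names) of how mmcv's auto_fp16 decorator and the fp16_enabled flag described above fit together:

```python
import torch
import torch.nn as nn
from mmcv.runner import auto_fp16


class ExampleRestorer(nn.Module):
    """Toy model, made up for illustration; not an MMEditing class."""

    def __init__(self):
        super().__init__()
        self.conv = nn.Conv2d(3, 3, 3, padding=1)
        # The switch mentioned above: @auto_fp16 only casts inputs when this is True.
        self.fp16_enabled = True

    @auto_fp16(apply_to=('lq', ))
    def forward(self, lq):
        # `lq` is cast to FP16 before reaching this point when fp16_enabled is True.
        return self.conv(lq)


if torch.cuda.is_available():
    model = ExampleRestorer().cuda()
    out = model(torch.rand(1, 3, 64, 64).cuda())
    print(out.dtype)  # torch.float16 on recent PyTorch/mmcv
```

On recent PyTorch (>= 1.6) and mmcv versions the decorated forward also runs under torch.cuda.amp autocast, so FP32 weights and FP16 inputs can be mixed.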

Shuweis (Author) commented Apr 12, 2022

I set fp16_enabled to True, but nothing happened. GPU memory usage during training did not decrease.

wangruohui changed the title from "Some questions about training" to "Some questions about FP16 training" Apr 12, 2022
wangruohui (Member) commented Apr 12, 2022

Oh, my mistake.
I think you also need an Fp16OptimizerHook like this.

Depending on whether you are using distributed or non-distributed training, you need to register the hook around line 179 or 304 in mmedit/apis/train.py.
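
A hedged sketch of the kind of change suggested here, modeled on how mmdetection's training script registers the hook (the exact code in mmedit/apis/train.py may differ; cfg, runner, and distributed are assumed to be the usual variables in that function):

```python
from mmcv.runner import Fp16OptimizerHook

# If the config defines `fp16 = dict(loss_scale=512.)`, swap the plain
# optimizer hook for the FP16-aware one before registering training hooks.
fp16_cfg = cfg.get('fp16', None)
if fp16_cfg is not None:
    optimizer_config = Fp16OptimizerHook(
        **cfg.optimizer_config, **fp16_cfg, distributed=distributed)
else:
    optimizer_config = cfg.optimizer_config

runner.register_training_hooks(cfg.lr_config, optimizer_config,
                               cfg.checkpoint_config, cfg.log_config)
```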

Shuweis (Author) commented Apr 12, 2022

wangruohui (Member) commented

It's a dictionary like fp16 = dict(loss_scale=512.). The loss_scale value is used to scale the loss so that gradients will not underflow in FP16; typical values are 128 to 512.

At the current stage, mmdetection may be a good reference. A useful trick is to search for these keywords in that repo.
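
To illustrate what that loss_scale value does, here is a small conceptual sketch in plain PyTorch (not mmcv's actual Fp16OptimizerHook internals):

```python
import torch
import torch.nn as nn

model = nn.Linear(4, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
loss_scale = 512.0

x = torch.randn(8, 4)
loss = model(x).pow(2).mean()

# Scale the loss up before backward so small gradients stay representable
# in FP16, then divide the gradients back down before the optimizer step.
(loss * loss_scale).backward()
for p in model.parameters():
    p.grad.div_(loss_scale)
optimizer.step()
optimizer.zero_grad()
```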

ArchipLab-LinfengZhang commented

> It's a dictionary like fp16 = dict(loss_scale=512.). The loss_scale value is used to scale the loss so that gradients will not underflow in FP16; typical values are 128 to 512.
>
> At the current stage, mmdetection may be a good reference. A useful trick is to search for these keywords in that repo.

Hello. I have tried to use FP16 but run into a NaN loss problem after around 300~500 iterations. I have tried loss_scale=512/128/64/32, but none of them worked. I have also tried gradient clipping, by the way. Do you have any ideas about how to solve this problem?

zengyh1900 (Collaborator) commented

@LeoXing1996
Please check this issue.

zengyh1900 added the kind/bug (something isn't working), awaiting response, and priority/P0 (highest priority) labels and removed the in-progress label Oct 11, 2022
zengyh1900 added this to the 0.16.0 milestone Oct 11, 2022
LeoXing1996 (Collaborator) commented

Hey @Shuweis and @ArchipLab-LinfengZhang, MMEdit 1.x now supports auto-FP16 training, and you are welcome to give it a try.
If you still have any problems with FP16 training, please paste your config and training log so we can help you better.
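
For anyone trying the 1.x route, a hedged sketch of the config change (this assumes the MMEngine-style AmpOptimWrapper; the optimizer settings below are placeholders, not taken from a real MMEdit config):

```python
# Enable mixed-precision training by swapping the optimizer wrapper.
optim_wrapper = dict(
    type='AmpOptimWrapper',        # MMEngine's automatic mixed precision wrapper
    loss_scale='dynamic',          # let AMP adjust the loss scale automatically
    optimizer=dict(type='Adam', lr=1e-4))  # placeholder optimizer settings
```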

LeoXing1996 added the status/WIP (work in progress normally) label and removed the awaiting response label Oct 12, 2022
zengyh1900 added the status/need more info (need more information from the creator) label and removed the status/WIP (work in progress normally) label Oct 14, 2022
zengyh1900 modified the milestones: 0.16.0, Backlog Oct 17, 2022
open-mmlab locked and limited conversation to collaborators Oct 25, 2022
zengyh1900 converted this issue into discussion #1354 Oct 25, 2022

This issue was moved to a discussion.

You can continue the conversation there. Go to discussion →
