
Some questions about FP16 training #841

Closed
Shuweis opened this issue Apr 12, 2022 · 8 comments

Labels: kind/bug (something isn't working), priority/P0 (highest priority), status/need more info (need more information from the creator)

Comments


Shuweis commented Apr 12, 2022

Hi! I see that MMEditing supports FP16 training. How can I use it?

wangruohui (Member) commented Apr 12, 2022

Some of the code is ready, but we haven't fully implemented this feature. You need to modify some source code and possibly fix some potential bugs.

Currently, MMEditing supports FP16 based on mmcv's auto_fp16: if a model's forward function is wrapped with @auto_fp16, like this, it potentially supports this feature.

To enable FP16, you need to set the fp16_enabled attribute to True in __init__, like here. Then you can try training the model with the corresponding config files.

There was a previous PR, #320, but it is somewhat out of date.
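
For reference, here is a minimal sketch (not MMEditing source; ExampleRestorer and the lq argument are made-up names) of how mmcv's auto_fp16 decorator and the fp16_enabled flag described above fit together:

```python
import torch
import torch.nn as nn
from mmcv.runner import auto_fp16


class ExampleRestorer(nn.Module):
    """Toy model, made up for illustration; not an MMEditing class."""

    def __init__(self):
        super().__init__()
        self.conv = nn.Conv2d(3, 3, 3, padding=1)
        # The switch mentioned above: @auto_fp16 only casts inputs when this is True.
        self.fp16_enabled = True

    @auto_fp16(apply_to=('lq', ))
    def forward(self, lq):
        # `lq` is cast to FP16 before reaching this point when fp16_enabled is True.
        return self.conv(lq)


if torch.cuda.is_available():
    model = ExampleRestorer().cuda()
    out = model(torch.rand(1, 3, 64, 64).cuda())
    print(out.dtype)  # torch.float16 on recent PyTorch/mmcv
```

On recent PyTorch (>= 1.6) and mmcv versions the decorated forward also runs under torch.cuda.amp autocast, so FP32 weights and FP16 inputs can be mixed.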

Shuweis (Author) commented Apr 12, 2022

I set fp16_enabled to True, but nothing happened. GPU memory usage during training did not decrease.

wangruohui changed the title from "Some questions about training" to "Some questions about FP16 training" Apr 12, 2022
wangruohui (Member) commented Apr 12, 2022

Oh, my mistake.
I think you also need an Fp16OptimizerHook like this.

Depending on whether you are using distributed or non-distributed training, you need to register the hook around line 179 or 304 in mmedit/apis/train.py.
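
A hedged sketch of the kind of change suggested here, modeled on how mmdetection's training script registers the hook (the exact code in mmedit/apis/train.py may differ; cfg, runner, and distributed are assumed to be the usual variables in that function):

```python
from mmcv.runner import Fp16OptimizerHook

# If the config defines `fp16 = dict(loss_scale=512.)`, swap the plain
# optimizer hook for the FP16-aware one before registering training hooks.
fp16_cfg = cfg.get('fp16', None)
if fp16_cfg is not None:
    optimizer_config = Fp16OptimizerHook(
        **cfg.optimizer_config, **fp16_cfg, distributed=distributed)
else:
    optimizer_config = cfg.optimizer_config

runner.register_training_hooks(cfg.lr_config, optimizer_config,
                               cfg.checkpoint_config, cfg.log_config)
```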

Shuweis (Author) commented Apr 12, 2022

wangruohui (Member) commented

It's a dictionary like fp16 = dict(loss_scale=512.). The loss_scale value is used to scale the loss so that gradients will not underflow in FP16; typical values are 128 to 512.

At the current stage, mmdetection may be a good reference. A useful trick is to search for these keywords in that repo.
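
To illustrate what that loss_scale value does, here is a small conceptual sketch in plain PyTorch (not mmcv's actual Fp16OptimizerHook internals):

```python
import torch
import torch.nn as nn

model = nn.Linear(4, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
loss_scale = 512.0

x = torch.randn(8, 4)
loss = model(x).pow(2).mean()

# Scale the loss up before backward so small gradients stay representable
# in FP16, then divide the gradients back down before the optimizer step.
(loss * loss_scale).backward()
for p in model.parameters():
    p.grad.div_(loss_scale)
optimizer.step()
optimizer.zero_grad()
```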

ArchipLab-LinfengZhang commented

> It's a dictionary like fp16 = dict(loss_scale=512.). The loss_scale value is used to scale the loss so that gradients will not underflow in FP16; typical values are 128 to 512.
>
> At the current stage, mmdetection may be a good reference. A useful trick is to search for these keywords in that repo.

Hello. I have tried to use FP16 but run into a NaN loss problem after around 300~500 iterations. I have tried loss_scale=512/128/64/32, but none of them worked. I have also tried gradient clipping, by the way. Do you have any ideas about how to solve this problem?

zengyh1900 (Collaborator) commented

@LeoXing1996
Please check this issue.

zengyh1900 added the kind/bug (something isn't working), awaiting response, and priority/P0 (highest priority) labels and removed the in-progress label Oct 11, 2022
zengyh1900 added this to the 0.16.0 milestone Oct 11, 2022
LeoXing1996 (Collaborator) commented

Hey @Shuweis and @ArchipLab-LinfengZhang, MMEdit 1.x now supports auto-FP16 training, and you are welcome to give it a try.
If you still have any problems with FP16 training, please paste your config and training log so we can help you better.
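
For anyone trying the 1.x route, a hedged sketch of the config change (this assumes the MMEngine-style AmpOptimWrapper; the optimizer settings below are placeholders, not taken from a real MMEdit config):

```python
# Enable mixed-precision training by swapping the optimizer wrapper.
optim_wrapper = dict(
    type='AmpOptimWrapper',        # MMEngine's automatic mixed precision wrapper
    loss_scale='dynamic',          # let AMP adjust the loss scale automatically
    optimizer=dict(type='Adam', lr=1e-4))  # placeholder optimizer settings
```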

LeoXing1996 added the status/WIP (work in progress normally) label and removed the awaiting response label Oct 12, 2022
zengyh1900 added the status/need more info (need more information from the creator) label and removed the status/WIP (work in progress normally) label Oct 14, 2022
zengyh1900 modified the milestones: 0.16.0, Backlog Oct 17, 2022
open-mmlab locked and limited conversation to collaborators Oct 25, 2022
zengyh1900 converted this issue into discussion #1354 Oct 25, 2022

This issue was moved to a discussion.

You can continue the conversation there. Go to discussion →
