Error during training #33

Open
huang-chenhai opened this issue May 6, 2022 · 10 comments

Comments

@huang-chenhai

[2022-05-05 23:18:34,655][        main.py][line: 280][    INFO] Epoch [1]       Iter [53880/184378]     Time 0.238 (0.412)   Data 0.000 (0.192)      Loss 0.0788 (0.0725)
[2022-05-05 23:18:39,059][        main.py][line: 280][    INFO] Epoch [1]       Iter [53900/184378]     Time 0.296 (0.220)   Data 0.000 (0.000)      Loss 0.0805 (0.0908)
/opt/conda/conda-bld/pytorch_1646756402876/work/aten/src/ATen/native/cuda/Loss.cu:115: operator(): block: [0,0,0], thread: [96,0,0] Assertion `input_val >= zero && input_val <= one` failed.
/opt/conda/conda-bld/pytorch_1646756402876/work/aten/src/ATen/native/cuda/Loss.cu:115: operator(): block: [0,0,0], thread: [97,0,0] Assertion `input_val >= zero && input_val <= one` failed.

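Hello, I am training with the config you provided, SLOWFAST_R101_ACAR_HR2O_DEPTH1.yaml, with nproc_per_node=1 and everything else left at the defaults. The dataset frames were also extracted with the tool you provided, and the GPU is a 3080 Ti. The error output is shown above. I debugged the training loop and found that in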

        ret = model(data)
        num_rois = ret['num_rois']
        outputs = ret['outputs']
        targets = ret['targets']

the values in `outputs` are all [nan, nan, nan, ...].

With the SLOWFAST_R50_ACAR_HR2O.yaml config, training seems to run normally. I don't know where the problem is and look forward to your reply. Thank you!
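The assertion in the log and the NaN values in `outputs` are consistent with each other: NaN fails every ordered comparison, so a NaN input cannot pass a range check of the form `input_val >= zero && input_val <= one`. The following is a minimal pure-Python sketch of that logic. The helper names `in_unit_interval` and `first_bad_index` are hypothetical and are not part of PyTorch or ACAR-Net; the code only mirrors the condition printed in the log.

```python
import math


def in_unit_interval(value: float) -> bool:
    """Mirror the condition from the log: value >= 0 and value <= 1."""
    return value >= 0.0 and value <= 1.0


def first_bad_index(values):
    """Return the index of the first value outside [0, 1] (NaN included), or None."""
    for index, value in enumerate(values):
        if not in_unit_interval(value):
            return index
    return None


# NaN is neither >= 0 nor <= 1, so it fails the range check.
print(in_unit_interval(math.nan))
print(first_bad_index([0.1, math.nan, 0.3]))
```

In a PyTorch training loop, a comparable guard would be to test `torch.isfinite(outputs).all()` right after `outputs = ret['outputs']`, so that the failing iteration is reported before the loss kernel aborts.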

@xiaozhucj

Did you run into this problem?
image

@xiaozhucj

If not, what did you change to get it running normally?

@huang-chenhai
Author

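The problem is solved. Single-GPU training needs a lower learning rate; the author's default learning rate is meant for 8 GPUs. Also, if you run experiments on a single GPU, the `--nproc_per_node` argument must be set to 1 (`--nproc_per_node 1`). Check whether you forgot to change it.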
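The thread does not say by how much the learning rate was lowered. One common heuristic is to scale it in proportion to the number of GPUs (the linear scaling rule); that rule is an assumption here, not something the author or the repository confirms. In the sketch below, `scale_learning_rate` is a hypothetical helper and `example_base_lr` is an illustrative value, not the value in the ACAR-Net config.

```python
def scale_learning_rate(base_lr: float, base_gpus: int, actual_gpus: int) -> float:
    """Scale a learning rate in proportion to the number of GPUs used."""
    if base_gpus <= 0 or actual_gpus <= 0:
        raise ValueError("GPU counts must be positive")
    return base_lr * actual_gpus / base_gpus


# Hypothetical base value for illustration only; read the real one from the YAML config.
example_base_lr = 0.008
single_gpu_lr = scale_learning_rate(example_base_lr, base_gpus=8, actual_gpus=1)
print(single_gpu_lr)
```

The scaled value would then replace the learning rate in the config used for the single-GPU run. If NaN outputs still appear, lowering it further is a reasonable next step.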

@huang-chenhai
Author

huang-chenhai commented Jul 7, 2022 via email

@xiaozhucj

Thanks, this problem is now solved.

@yan-ctrl

> [2022-05-05 23:18:34,655][        main.py][line: 280][    INFO] Epoch [1]       Iter [53880/184378]     Time 0.238 (0.412)   Data 0.000 (0.192)      Loss 0.0788 (0.0725)
> [2022-05-05 23:18:39,059][        main.py][line: 280][    INFO] Epoch [1]       Iter [53900/184378]     Time 0.296 (0.220)   Data 0.000 (0.000)      Loss 0.0805 (0.0908)
> /opt/conda/conda-bld/pytorch_1646756402876/work/aten/src/ATen/native/cuda/Loss.cu:115: operator(): block: [0,0,0], thread: [96,0,0] Assertion `input_val >= zero && input_val <= one` failed.
> /opt/conda/conda-bld/pytorch_1646756402876/work/aten/src/ATen/native/cuda/Loss.cu:115: operator(): block: [0,0,0], thread: [97,0,0] Assertion `input_val >= zero && input_val <= one` failed.
>
> Hello, I am training with the config you provided, SLOWFAST_R101_ACAR_HR2O_DEPTH1.yaml, with nproc_per_node=1 and everything else left at the defaults. The dataset frames were also extracted with the tool you provided, and the GPU is a 3080 Ti. The error output is shown above. I debugged the training loop and found that in
>
>         ret = model(data)
>         num_rois = ret['num_rois']
>         outputs = ret['outputs']
>         targets = ret['targets']
>
> the values in `outputs` are all [nan, nan, nan, ...].
>
> With the SLOWFAST_R50_ACAR_HR2O.yaml config, training seems to run normally. I don't know where the problem is and look forward to your reply. Thank you!

How did you solve this? Have you applied this network to your own dataset?

@yan-ctrl

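> The problem is solved. Single-GPU training needs a lower learning rate; the author's default learning rate is meant for 8 GPUs. Also, if you run experiments on a single GPU, the `--nproc_per_node` argument must be set to 1 (`--nproc_per_node 1`). Check whether you forgot to change it.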

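Hello, when you were reproducing this paper, did you look at the file [ava_train_v2.2_with_fair_0.9.pkl](https://drive.google.com/file/d/1CsCUVxdxVyZ5vUM2eGzzV42wzKxPa7bK/view?usp=sharing) listed under https://github.com/Siyu-C/ACAR-Net/tree/master/annotations? In an entry such as 'labels': [{'bounding_box': [0.093, 0.033, 0.988, 0.978], 'label': [9, 14, 58], 'person_id': [52, 52, 52]}]}, does `label` hold the action labels? I compared it against ava_train_v2.2.csv from the official dataset and found that the coordinates and person_id match, but the label values differ, even though they look like action labels. I am a bit confused and hope you can reply.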
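One way to investigate the mismatch is to collect every distinct value that appears under `label` in the pkl and compare that set with the action IDs in the CSV. The sketch below assumes the loaded pkl is an iterable of records shaped like the snippet quoted above; the actual top-level layout of the file is not shown in the thread, and `count_labels` is a hypothetical helper.

```python
from collections import Counter


def count_labels(records):
    """Count how often each value appears in the 'label' lists of the records."""
    counts = Counter()
    for record in records:
        for box in record.get("labels", []):
            counts.update(box.get("label", []))
    return counts


# One record shaped like the snippet quoted in the question (other keys omitted).
sample = [
    {
        "labels": [
            {
                "bounding_box": [0.093, 0.033, 0.988, 0.978],
                "label": [9, 14, 58],
                "person_id": [52, 52, 52],
            }
        ]
    }
]

# Sorted distinct label values, ready to compare with the CSV's action IDs.
print(sorted(count_labels(sample)))
```

If the values in the pkl turn out to span a smaller, contiguous range than the action IDs in the CSV, the labels may have been re-indexed during preprocessing. The thread does not confirm this.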

@huang-chenhai
Author

huang-chenhai commented Aug 29, 2022 via email

@yan-ctrl

yan-ctrl commented Oct 11, 2022 via email

@zhangweibin970807

> For single-GPU experiments, changing that parameter to 1 is enough to get it running.

Hello, did you only change this parameter to 1, or does the learning rate need to be lowered at the same time?
