Cannot start evaluation after running Train_MambaBCD.py #37

Closed
lin-ovvo1111 opened this issue Jun 12, 2024 · 12 comments

Comments

@lin-ovvo1111

After 500 iterations, training prints "starting evaluation" and then hangs there. The source code appears to evaluate on the test set, and I can't tell where the problem is.

@ChenHongruixuan
Owner

Hi, thank you for your question. Did you solve your issue? The evaluation stage takes time. If it still doesn't work, you can try lowering the batch size.

@NUAAZJY

NUAAZJY commented Jun 15, 2024

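Similar problem here: training starts normally, but at the first "starting evaluation" it reports a CUDA out-of-memory error, and adjusting the batch size has no effect. It doesn't look like the GPU is actually short of memory. Does the code need some additional CUDA memory management for this case?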

CUDA out of memory. Tried to allocate 24.00 MiB (GPU 0; 11.76 GiB total capacity; 10.47 GiB already allocated; 33.62 MiB free; 10.49 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF

@XiaoBaiHhy

Have you solved this problem? I'm running into it too.

@NUAAZJY

NUAAZJY commented Jun 16, 2024

Not yet. I added the setting to clean up 128 MB fragments and also tried the tiny version of the model; the result is the same.
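The 128 MB fragmentation setting mentioned here is presumably the max_split_size_mb option suggested in the error message. A minimal sketch of how that option is typically applied from Python, assuming it is set at the very top of the training script: PYTORCH_CUDA_ALLOC_CONF is read by PyTorch's CUDA caching allocator, so it has to be in place before CUDA is first used, and setting it before `import torch` is the safest choice. As this exchange shows, it only reduces fragmentation; it cannot help when the evaluation pass genuinely needs more memory than the card has.

```python
import os

# Must be set before PyTorch initializes CUDA; doing it before `import torch` is safest.
# max_split_size_mb:128 stops the caching allocator from splitting cached blocks
# larger than 128 MB, which can reduce fragmentation (it does not free memory).
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "max_split_size_mb:128"

# import torch  # import PyTorch only after the variable is set
```

The equivalent from the shell is to prefix the training command with `PYTORCH_CUDA_ALLOC_CONF=max_split_size_mb:128`.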

@ChenHongruixuan
Owner

Hi guys,

Thank you so much for your question. May I ask which dataset you are running?

Best,

@NUAAZJY

NUAAZJY commented Jun 17, 2024 via email

@ChenHongruixuan
Owner

Are you running into this issue on both datasets, or just LEVIR-CD+?

@NUAAZJY

NUAAZJY commented Jun 17, 2024

both datasets

@ChenHongruixuan
Owner

Hi,

That's quite weird. The images in the LEVIR-CD+ dataset are 1024x1024, so this problem may occur there; you may need to crop them into smaller tiles yourself. Evaluation on the SYSU dataset, however, should not have this problem. We have updated the code; please try training again with the current version.
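Cropping has to be applied identically to both temporal images and to the label map so that the tiles stay aligned. A minimal sketch of the tiling arithmetic, assuming non-overlapping 256x256 tiles (the size mentioned later in this thread) and image sides that are exact multiples of the tile size; the helper name `tile_boxes` is hypothetical. Each box uses the (left, upper, right, lower) order expected by Pillow's `Image.crop`.

```python
from typing import List, Tuple

Box = Tuple[int, int, int, int]  # (left, upper, right, lower)


def tile_boxes(width: int, height: int, tile: int = 256) -> List[Box]:
    """Split a width x height image into non-overlapping tile x tile boxes.

    Boxes are returned in row-major order. Assumes both sides are exact
    multiples of `tile` (true for 1024x1024 images with tile=256).
    """
    if width % tile or height % tile:
        raise ValueError(f"{width}x{height} is not divisible by tile size {tile}")
    return [
        (left, upper, left + tile, upper + tile)
        for upper in range(0, height, tile)  # rows, top to bottom
        for left in range(0, width, tile)    # columns, left to right
    ]
```

For each box, calling `img.crop(box)` on the pre-change image, the post-change image and the label yields one aligned sample; a 1024x1024 image gives 16 tiles.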

Best,

@NUAAZJY

NUAAZJY commented Jun 17, 2024 via email

@NUAAZJY

NUAAZJY commented Jun 18, 2024

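Hi, thank you very much for your help. I have successfully reproduced the BCD code on the SYSU dataset, and the accuracy matches the paper. For the LEVIR-CD dataset, I will crop the images to 256x256 and try again.
However, during reproduction I found two possible small issues in your code:
1. In train_MambaCD.py, the authors, perhaps for speed, placed lines 149-156 outside the for loop. This makes the evaluation look as if it is stuck (it is actually just slow).
2. In infer_MambaCD.py, the feature_map_saved_path parameter in lines 69-70 is not defined, which makes the inference script raise an error. The feature_map_saved_path parameter also does not seem to be used anywhere; after deleting lines 69-70 the script runs normally.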
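Regarding point 1: the evaluation pass prints nothing until it finishes, so a slow run is indistinguishable from a hang. A minimal sketch of a hypothetical `with_progress` helper that wraps any iterable (for example the test DataLoader) and logs a line every `every` batches; neither the helper nor where it would be inserted in train_MambaCD.py comes from the repository.

```python
import time
from typing import Callable, Iterable, Iterator, Optional, TypeVar

T = TypeVar("T")


def _print_flush(msg: str) -> None:
    # flush=True so the line appears immediately even when stdout is redirected.
    print(msg, flush=True)


def with_progress(items: Iterable[T], total: Optional[int] = None, every: int = 100,
                  log: Callable[[str], None] = _print_flush) -> Iterator[T]:
    """Yield `items` unchanged, logging a progress line every `every` items."""
    start = time.perf_counter()
    count = 0
    for count, item in enumerate(items, start=1):
        yield item
        if count % every == 0:
            elapsed = time.perf_counter() - start
            of_total = f"/{total}" if total is not None else ""
            log(f"evaluated {count}{of_total} batches in {elapsed:.1f}s")
    log(f"evaluation done: {count} batches")
```

Usage would look like `for batch in with_progress(test_loader, total=len(test_loader), every=50):`, where `test_loader` stands in for whatever loader the evaluation loop iterates over. The `tqdm` package provides the same effect as a progress bar.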

@ChenHongruixuan
Owner

ChenHongruixuan commented Jun 18, 2024

Hi,

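Hi, thank you very much for your help. I have successfully reproduced the BCD code on the SYSU dataset, and the accuracy matches the paper. For the LEVIR-CD dataset, I will crop the images to 256x256 and try again.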

Glad to hear that!

In train_MambaCD.py, the authors, perhaps for speed, placed lines 149-156 outside the for loop. This makes the evaluation look as if it is stuck (it is actually just slow).

The evaluation code is placed outside the loop to obtain the final accuracy. To speed up evaluation, increase eval_batch_size; the current setting is 1.
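Raising eval_batch_size should affect only speed and memory, not the reported accuracy, because pixel-level metrics such as F1 and IoU are computed from counts accumulated over the whole test set, and those counts do not depend on how images are grouped into batches (assuming the model is in eval mode). A minimal pure-Python sketch of that accumulate-then-compute pattern for binary change maps, using a hypothetical `BinaryCDMeter` class; the repository's own evaluator may be organized differently and may report other metrics.

```python
from dataclasses import dataclass
from typing import Dict, Sequence


@dataclass
class BinaryCDMeter:
    """Accumulates pixel counts for binary change maps (1 = change, 0 = no change)."""
    tp: int = 0
    fp: int = 0
    fn: int = 0
    tn: int = 0

    def update(self, pred: Sequence[int], label: Sequence[int]) -> None:
        # pred and label are flattened 0/1 maps for one image or one whole batch.
        if len(pred) != len(label):
            raise ValueError("prediction and label must have the same number of pixels")
        for p, t in zip(pred, label):
            if p and t:
                self.tp += 1
            elif p:
                self.fp += 1
            elif t:
                self.fn += 1
            else:
                self.tn += 1

    def scores(self) -> Dict[str, float]:
        eps = 1e-10  # avoids division by zero when a class never occurs
        precision = self.tp / (self.tp + self.fp + eps)
        recall = self.tp / (self.tp + self.fn + eps)
        f1 = 2 * precision * recall / (precision + recall + eps)
        iou = self.tp / (self.tp + self.fp + self.fn + eps)
        oa = (self.tp + self.tn) / (self.tp + self.fp + self.fn + self.tn + eps)
        return {"precision": precision, "recall": recall, "f1": f1, "iou": iou, "oa": oa}
```

In practice the counts would be computed with vectorized tensor operations; the Python loop only illustrates the bookkeeping.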

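In infer_MambaCD.py, the feature_map_saved_path parameter in lines 69-70 is not defined, which makes the inference script raise an error. The feature_map_saved_path parameter also does not seem to be used anywhere; after deleting lines 69-70 the script runs normally.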

Thank you for pointing this out. We will fix this error soon.
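Besides deleting the two lines, a defensive workaround is to declare the option with a default of None and only act on it when it is set. A minimal sketch, assuming the script parses its options with argparse and that the offending lines create an output directory from this argument; the flag name comes from the thread, but the helper functions are hypothetical and how infer_MambaCD.py actually uses the path is not shown here.

```python
import argparse
import os
from typing import Optional


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(description="change detection inference (sketch)")
    # Declared with default=None so code reading args.feature_map_saved_path
    # no longer fails when the flag is omitted.
    parser.add_argument("--feature_map_saved_path", type=str, default=None,
                        help="optional directory for saved feature maps (assumed usage)")
    return parser


def prepare_feature_map_dir(args: argparse.Namespace) -> Optional[str]:
    """Create the feature-map directory only if a path was actually given."""
    path = getattr(args, "feature_map_saved_path", None)
    if path:
        os.makedirs(path, exist_ok=True)
    return path
```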

Best,

@ChenHongruixuan ChenHongruixuan pinned this issue Jun 18, 2024
@ChenHongruixuan ChenHongruixuan changed the title from "Train_MambaBCD.py fails to proceed to starting evaluation after running" to "Cannot start evaluation after running Train_MambaBCD.py" on Sep 10, 2024