RuntimeError: CUDA error: invalid argument #5

Open
Morgansgun opened this issue Dec 16, 2023 · 6 comments

Comments

@Morgansgun

Hello again! Thanks for your earlier reply; I finally finished the training process. But recently, when I tried to run test.sh, some errors came up again.
I ran test.sh as your README describes. The file is:

SR3D_GPT='/home/sd/Harddisk/sba/BS/ViewRefer3D-main/referit3d/data/Sr3D_release.csv'
PATH_OF_SCANNET_FILE='/home/sd/Harddisk/sba/BS/ViewRefer3D-main/referit3d/data/scanresult/keep_all_points_with_global_scan_alignment/keep_all_points_with_global_scan_alignment.pkl'
PATH_OF_REFERIT3D_FILE=${SR3D_GPT}
PATH_OF_BERT='/home/sd/Harddisk/sba/BS/ViewRefer3D-main/referit3d/data/bert'

VIEW_NUM=4
EPOCH=100
DATA_NAME=SR3D
EXT=ViewRefer_test
DECODER=4
NAME=${DATA_NAME}_${VIEW_NUM}view_${EPOCH}ep_${EXT}
TRAIN_FILE=train_referit3d

TYPE=reserved
python -u /home/sd/Harddisk/sba/BS/ViewRefer3D-main/referit3d/scripts/${TRAIN_FILE}.py \
--mode evaluate \
-scannet-file ${PATH_OF_SCANNET_FILE} \
-referit3D-file ${PATH_OF_REFERIT3D_FILE} \
--bert-pretrain-path ${PATH_OF_BERT} \
--log-dir logs/results/${NAME} \
--resume-path '/home/sd/Harddisk/sba/BS/ViewRefer3D-main/referit3d/logs/results/SR3D_4view_100ep_ViewRefer/12-07-2023-11-19-59/checkpoints/best_model.pth' \
--model 'referIt3DNet_transformer' \
--unit-sphere-norm True \
--batch-size 6 \
--n-workers 4 \
--max-train-epochs ${EPOCH} \
--encoder-layer-num 3 \
--decoder-layer-num ${DECODER} \
--decoder-nhead-num 8 \
--view_number ${VIEW_NUM} \
--rotate_number 4 \
--label-lang-sup True > ./logs/results/${NAME}.log 2>&1 &

The script runs for a while, then breaks at the same place every time:
100%|█████████▉| 1476/1478 [04:02<00:00, 6.34it/s]
100%|█████████▉| 1477/1478 [04:02<00:00, 6.32it/s]
100%|██████████| 1478/1478 [04:02<00:00, 7.01it/s]
100%|██████████| 1478/1478 [04:02<00:00, 6.09it/s]

0%| | 0/1478 [00:00<?, ?it/s]
0%| | 0/1478 [00:01<?, ?it/s]

The error is:
Traceback (most recent call last):
File "/home/sd/Harddisk/sba/BS/ViewRefer3D-main/referit3d/scripts/train_referit3d.py", line 291, in <module>
args, out_file=out_file,tokenizer=tokenizer)
File "/home/sd/Harddisk/sba/BS/ViewRefer3D-main/referit3d/analysis/deepnet_predictions.py", line 42, in analyze_predictions
net_stats = detailed_predictions_on_dataset(model, d_loader, args=args, device=device, FOR_VISUALIZATION=True,tokenizer=tokenizer)
File "/home/sd/anaconda3/envs/viewrefer/lib/python3.7/site-packages/torch/autograd/grad_mode.py", line 26, in decorate_context
return func(*args, **kwargs)
File "/home/sd/Harddisk/sba/BS/ViewRefer3D-main/referit3d/models/referit3d_net_utils.py", line 205, in detailed_predictions_on_dataset
batch[k] = batch[k].to(device)
RuntimeError: CUDA error: invalid argument

Why does training run smoothly, while this error occurs during testing?

@Ivan-Tang-3D
Owner

The reason is that the analyze_predictions function in deepnet_predictions.py is called only during testing, not during training. My advice is to set an ipdb breakpoint at line 205 of referit3d_net_utils.py. I have been busy recently, so I will look into the failing case over the next few days.
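
A non-interactive alternative to the breakpoint, as a minimal sketch: wrap the transfer so that a failure reports which key was being moved. The helper names `describe` and `move_batch_to_device` are hypothetical and not part of the repository. The code only assumes that the batch is a dict and that tensor-like values expose `.to(device)`, as in the traceback line `batch[k] = batch[k].to(device)`.

```python
def describe(value):
    """Summarise a batch entry without assuming it is a tensor."""
    dtype = getattr(value, "dtype", None)
    shape = getattr(value, "shape", None)
    return f"type={type(value).__name__}, dtype={dtype}, shape={shape}"


def move_batch_to_device(batch, device):
    """Move every entry that has a .to() method; name the key if a move fails.

    Hypothetical helper: a stand-in for the loop around line 205, not a copy of it.
    """
    for k in list(batch.keys()):
        value = batch[k]
        if not hasattr(value, "to"):
            # Plain Python values (lists of strings, ints, ...) stay on the host.
            continue
        try:
            batch[k] = value.to(device)
        except RuntimeError as err:
            # Re-raise with the key and a short description of the entry.
            raise RuntimeError(
                f"moving batch[{k!r}] failed ({describe(value)}): {err}"
            ) from err
    return batch
```

Swapping this in for the plain loop should make the log name the offending key, which can then be inspected more closely in ipdb.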

@Morgansgun
Author

> The reason is that the analyze_predictions function in deepnet_predictions.py is called only during testing, not during training. My advice is to set an ipdb breakpoint at line 205 of referit3d_net_utils.py. I have been busy recently, so I will look into the failing case over the next few days.

Thanks for your reply! I will try your advice and see if it works.
Here I found another problem: line 14 of "prepare_referential_data.py" has "from referit3d.in_out.sr3d import load_sr3d_raw_data", but I can't find a definition of load_sr3d_raw_data in sr3d.py, so the import fails.

@Ivan-Tang-3D
Owner

Sorry, I pushed the wrong file; its content was actually that of three_d_object.py. You can refer to this link instead: https://github.com/sega-hsj/MVT-3DVG/blob/main/referit3d/in_out/sr3d.py

@Ivan-Tang-3D
Owner

I have revised the content of sr3d.py
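
A quick way to confirm that a module exposes the expected name before running the full pipeline, as a minimal sketch. The helper `module_exposes` is hypothetical; it is demonstrated on the standard library so it runs anywhere, with the repository check shown only as a comment.

```python
import importlib


def module_exposes(module_name, attr_name):
    """Return True if module_name imports cleanly and defines attr_name."""
    try:
        module = importlib.import_module(module_name)
    except ImportError:
        return False
    return hasattr(module, attr_name)


# Demonstrated on the standard library so the snippet is self-contained.
print(module_exposes("json", "loads"))
print(module_exposes("json", "load_sr3d_raw_data"))

# Inside the repository environment the equivalent check would be:
# module_exposes("referit3d.in_out.sr3d", "load_sr3d_raw_data")
```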

@Morgansgun
Author

> I have revised the content of sr3d.py

OK! But I haven't found a way to solve the first error yet. I retrained a model, and that didn't work.

@Ivan-Tang-3D
Owner

It might be related to the key k in batch[k].
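
One way to narrow this down, as a rough sketch rather than a confirmed diagnosis: list every entry of the batch on the host before the transfer, and make CUDA report errors synchronously so the traceback points at the call that actually failed. `CUDA_LAUNCH_BLOCKING` is a standard CUDA environment variable; the helper `summarize_batch` is hypothetical, and flagging zero-sized entries is only one thing worth looking for, not a known cause of this error.

```python
import os

# Must be set before the first CUDA call, so place it at the very top of the script.
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"


def summarize_batch(batch):
    """Return (key, type name, dtype, shape, has_zero_dim) for each batch entry."""
    rows = []
    for k, v in batch.items():
        shape = getattr(v, "shape", None)
        dtype = getattr(v, "dtype", None)
        # An entry with a zero-length dimension holds no data at all.
        has_zero_dim = shape is not None and any(dim == 0 for dim in shape)
        rows.append((
            k,
            type(v).__name__,
            str(dtype),
            tuple(shape) if shape is not None else None,
            has_zero_dim,
        ))
    return rows
```

Printing `summarize_batch(batch)` for the first evaluation batch, just before the loop over its keys, shows the type, dtype, and shape of each entry so they can be compared against what the failing `.to(device)` call receives.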
