Training is too slow #83

Open
bonnie-cbw opened this issue Jun 28, 2021 · 5 comments

Comments

@bonnie-cbw

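Help needed!
Hello, I am training on the DOTA dataset cropped to 800×800 with ResNet50. I followed your tutorial step by step and the environment setup seems fine, but training takes about 2.8 seconds per image. Isn't that too slow? Does anyone know what might be causing it?
Thank you!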

@yangxue0827
Member

The GPU is probably not being used; it needs to be set in cfgs.

@bonnie-cbw
Author

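Thank you very much for your answer!
Here is part of my cfgs settings.
I believe I have already set GPU_GROUP; is there something wrong with how it is set?
The server I am using has two 2080 Ti cards.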

# ---------------------------------------- System_config

ROOT_PATH = os.path.abspath('../')
print(3*"++--")
print(ROOT_PATH)
GPU_GROUP = "0,1"
NUM_GPU = len(GPU_GROUP.strip().split(','))
SHOW_TRAIN_INFO_INTE = 20
SMRY_ITER = 2000
SAVE_WEIGHTS_INTE = 4000

SUMMARY_PATH = ROOT_PATH + '/output/summary'
TEST_SAVE_PATH = ROOT_PATH + '/tools/test_result'

if NET_NAME.startswith("resnet"):
    weights_name = NET_NAME
elif NET_NAME.startswith("MobilenetV2"):
    weights_name = "mobilenet/mobilenet_v2_1.0_224"
else:
    raise Exception('net name must in [resnet_v1_101, resnet_v1_50, MobilenetV2]')

PRETRAINED_CKPT = ROOT_PATH + '/data/pretrained_weights/' + weights_name + '.ckpt'
TRAINED_CKPT = os.path.join(ROOT_PATH, 'output/trained_weights')

# EVALUATE_DIR = ROOT_PATH + '/output/evaluate_result_pickle/'
# /home/xianyun/cbw/RetinaNet_Tensorflow_Rotation-master/tools/test_dota

EVALUATE_DIR = ROOT_PATH + '/tools/test_dota/'
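For illustration, here is a minimal sketch of how a `GPU_GROUP` string like the one above can be turned into a device count. The helper `parse_gpu_group` is hypothetical, and exporting `CUDA_VISIBLE_DEVICES` is an assumption about how one might restrict device visibility; the excerpt does not show whether the training script does this itself. Note that this setting only names device IDs and does not by itself guarantee that a GPU build of TensorFlow is installed.

```python
import os


def parse_gpu_group(gpu_group):
    """Split a string such as "0,1" into a list of GPU id strings, ignoring blanks."""
    return [gpu_id.strip() for gpu_id in gpu_group.split(",") if gpu_id.strip()]


GPU_GROUP = "0,1"
gpu_ids = parse_gpu_group(GPU_GROUP)
NUM_GPU = len(gpu_ids)

# Assumption: device visibility is restricted through CUDA_VISIBLE_DEVICES.
# This has to be set before TensorFlow initializes CUDA to have any effect.
os.environ["CUDA_VISIBLE_DEVICES"] = ",".join(gpu_ids)
print(NUM_GPU, os.environ["CUDA_VISIBLE_DEVICES"])
```

Unlike `len(GPU_GROUP.strip().split(','))`, this variant does not count an empty entry when the string has a trailing comma.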

@yangxue0827
Member

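Then you have probably installed the CPU build of TensorFlow; you should install tensorflow-gpu. Check whether the GPU is actually being used, and which TensorFlow version is installed.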
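One possible way to run that check is sketched below. The function names `summarize_devices` and `report_tensorflow_gpu` are made up for this example; the TensorFlow calls used (`tf.test.is_built_with_cuda` and `device_lib.list_local_devices`) are available in TensorFlow 1.x. The import is guarded so the snippet also runs where TensorFlow is missing.

```python
def summarize_devices(devices):
    """Group device names by type, given objects with .device_type and .name."""
    summary = {}
    for dev in devices:
        summary.setdefault(dev.device_type, []).append(dev.name)
    return summary


def report_tensorflow_gpu():
    """Print the TensorFlow version, CUDA build flag, and visible devices."""
    try:
        import tensorflow as tf
        from tensorflow.python.client import device_lib
    except ImportError:
        print("TensorFlow is not installed in this environment")
        return None
    print("TensorFlow version:", tf.__version__)
    print("Built with CUDA:", tf.test.is_built_with_cuda())
    summary = summarize_devices(device_lib.list_local_devices())
    print("Visible devices:", summary)
    return summary
```

Calling `report_tensorflow_gpu()` in the training environment should list at least one `GPU` entry when a GPU build is installed and the driver is visible. If only `CPU` appears, the CPU-only package is a likely culprit. Watching `nvidia-smi` during training is a complementary check.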

@bonnie-cbw
Author

OK, I'll check. Thank you very much!

@bonnie-cbw
Author

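It was indeed because I had installed the CPU version of TF! That problem is now solved. Thank you very much for your answer!
I have one more question. During training I frequently get the following error: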

2021-06-28 15:16:43.519534: W tensorflow/core/framework/op_kernel.cc:1389] Unknown: IndexError: index -9223372036853286991 is out of bounds for axis 0 with size 1681218
Traceback (most recent call last):

File "/home/xianyun/anaconda3/envs/rretinanet/lib/python3.5/site-packages/tensorflow/python/ops/script_ops.py", line 207, in __call__
ret = func(*args)

File "../libs/detection_oprations/anchor_target_layer_without_boxweight.py", line 38, in anchor_target_layer
max_overlaps = overlaps[np.arange(overlaps.shape[0]), argmax_overlaps_inds]

IndexError: index -9223372036853286991 is out of bounds for axis 0 with size 1681218

2021-06-28 15:16:43.684349: W tensorflow/core/kernels/queue_base.cc:277] _0_get_batch/input_producer: Skipping cancelled enqueue attempt with queue not closed
2021-06-28 15:16:43.684390: W tensorflow/core/kernels/queue_base.cc:277] _1_get_batch/batch/padding_fifo_queue: Skipping cancelled enqueue attempt with queue not closed
2021-06-28 15:16:43.684484: W tensorflow/core/kernels/queue_base.cc:277] _1_get_batch/batch/padding_fifo_queue: Skipping cancelled enqueue attempt with queue not closed
2021-06-28 15:16:43.684499: W tensorflow/core/kernels/queue_base.cc:277] _1_get_batch/batch/padding_fifo_queue: Skipping cancelled enqueue attempt with queue not closed
2021-06-28 15:16:43.684528: W tensorflow/core/kernels/queue_base.cc:277] _1_get_batch/batch/padding_fifo_queue: Skipping cancelled enqueue attempt with queue not closed
2021-06-28 15:16:43.684546: W tensorflow/core/kernels/queue_base.cc:277] _1_get_batch/batch/padding_fifo_queue: Skipping cancelled enqueue attempt with queue not closed
2021-06-28 15:16:43.684563: W tensorflow/core/kernels/queue_base.cc:277] _1_get_batch/batch/padding_fifo_queue: Skipping cancelled enqueue attempt with queue not closed
2021-06-28 15:16:43.684580: W tensorflow/core/kernels/queue_base.cc:277] _1_get_batch/batch/padding_fifo_queue: Skipping cancelled enqueue attempt with queue not closed
2021-06-28 15:16:43.684598: W tensorflow/core/kernels/queue_base.cc:277] _1_get_batch/batch/padding_fifo_queue: Skipping cancelled enqueue attempt with queue not closed
2021-06-28 15:16:43.684607: W tensorflow/core/kernels/queue_base.cc:277] _1_get_batch/batch/padding_fifo_queue: Skipping cancelled enqueue attempt with queue not closed
2021-06-28 15:16:43.684612: W tensorflow/core/kernels/queue_base.cc:277] _1_get_batch/batch/padding_fifo_queue: Skipping cancelled enqueue attempt with queue not closed
2021-06-28 15:16:43.684632: W tensorflow/core/kernels/queue_base.cc:277] _1_get_batch/batch/padding_fifo_queue: Skipping cancelled enqueue attempt with queue not closed
2021-06-28 15:16:43.684638: W tensorflow/core/kernels/queue_base.cc:277] _1_get_batch/batch/padding_fifo_queue: Skipping cancelled enqueue attempt with queue not closed
2021-06-28 15:16:43.684655: W tensorflow/core/kernels/queue_base.cc:277] _1_get_batch/batch/padding_fifo_queue: Skipping cancelled enqueue attempt with queue not closed
2021-06-28 15:16:43.684664: W tensorflow/core/kernels/queue_base.cc:277] _1_get_batch/batch/padding_fifo_queue: Skipping cancelled enqueue attempt with queue not closed
2021-06-28 15:16:43.684671: W tensorflow/core/kernels/queue_base.cc:277] _1_get_batch/batch/padding_fifo_queue: Skipping cancelled enqueue attempt with queue not closed
2021-06-28 15:16:43.684688: W tensorflow/core/kernels/queue_base.cc:277] _1_get_batch/batch/padding_fifo_queue: Skipping cancelled enqueue attempt with queue not closed
Traceback (most recent call last):
File "/home/xianyun/anaconda3/envs/rretinanet/lib/python3.5/site-packages/tensorflow/python/client/session.py", line 1334, in _do_call
return fn(*args)
File "/home/xianyun/anaconda3/envs/rretinanet/lib/python3.5/site-packages/tensorflow/python/client/session.py", line 1319, in _run_fn
options, feed_dict, fetch_list, target_list, run_metadata)
File "/home/xianyun/anaconda3/envs/rretinanet/lib/python3.5/site-packages/tensorflow/python/client/session.py", line 1407, in _call_tf_sessionrun
run_metadata)
tensorflow.python.framework.errors_impl.UnknownError: IndexError: index -9223372036853286991 is out of bounds for axis 0 with size 1681218
Traceback (most recent call last):

File "/home/xianyun/anaconda3/envs/rretinanet/lib/python3.5/site-packages/tensorflow/python/ops/script_ops.py", line 207, in __call__
ret = func(*args)

File "../libs/detection_oprations/anchor_target_layer_without_boxweight.py", line 38, in anchor_target_layer
max_overlaps = overlaps[np.arange(overlaps.shape[0]), argmax_overlaps_inds]

IndexError: index -9223372036853286991 is out of bounds for axis 0 with size 1681218

 [[{{node tower_1/build_loss/PyFunc}}]]
 [[{{node tower_0/add_3}}]]

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "multi_gpu_train.py", line 354, in <module>
train()
File "multi_gpu_train.py", line 317, in train
sess.run([train_op, global_step, total_loss_dict])
File "/home/xianyun/anaconda3/envs/rretinanet/lib/python3.5/site-packages/tensorflow/python/client/session.py", line 929, in run
run_metadata_ptr)
File "/home/xianyun/anaconda3/envs/rretinanet/lib/python3.5/site-packages/tensorflow/python/client/session.py", line 1152, in _run
feed_dict_tensor, options, run_metadata)
File "/home/xianyun/anaconda3/envs/rretinanet/lib/python3.5/site-packages/tensorflow/python/client/session.py", line 1328, in _do_run
run_metadata)
File "/home/xianyun/anaconda3/envs/rretinanet/lib/python3.5/site-packages/tensorflow/python/client/session.py", line 1348, in _do_call
raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.UnknownError: IndexError: index -9223372036853286991 is out of bounds for axis 0 with size 1681218
Traceback (most recent call last):

File "/home/xianyun/anaconda3/envs/rretinanet/lib/python3.5/site-packages/tensorflow/python/ops/script_ops.py", line 207, in __call__
ret = func(*args)

File "../libs/detection_oprations/anchor_target_layer_without_boxweight.py", line 38, in anchor_target_layer
max_overlaps = overlaps[np.arange(overlaps.shape[0]), argmax_overlaps_inds]

IndexError: index -9223372036853286991 is out of bounds for axis 0 with size 1681218

 [[node tower_1/build_loss/PyFunc (defined at ../libs/networks/build_whole_network.py:233) ]]
 [[node tower_0/add_3 (defined at multi_gpu_train.py:232) ]]

Caused by op 'tower_1/build_loss/PyFunc', defined at:
File "multi_gpu_train.py", line 354, in <module>
train()
File "multi_gpu_train.py", line 206, in train
gpu_id=i)
File "../libs/networks/build_whole_network.py", line 233, in build_whole_detection_network
tf.float32])
File "/home/xianyun/anaconda3/envs/rretinanet/lib/python3.5/site-packages/tensorflow/python/util/deprecation.py", line 324, in new_func
return func(*args, **kwargs)
File "/home/xianyun/anaconda3/envs/rretinanet/lib/python3.5/site-packages/tensorflow/python/ops/script_ops.py", line 468, in py_func
func=func, inp=inp, Tout=Tout, stateful=stateful, eager=False, name=name)
File "/home/xianyun/anaconda3/envs/rretinanet/lib/python3.5/site-packages/tensorflow/python/ops/script_ops.py", line 282, in _internal_py_func
input=inp, token=token, Tout=Tout, name=name)
File "/home/xianyun/anaconda3/envs/rretinanet/lib/python3.5/site-packages/tensorflow/python/ops/gen_script_ops.py", line 151, in py_func
"PyFunc", input=input, token=token, Tout=Tout, name=name)
File "/home/xianyun/anaconda3/envs/rretinanet/lib/python3.5/site-packages/tensorflow/python/framework/op_def_library.py", line 788, in _apply_op_helper
op_def=op_def)
File "/home/xianyun/anaconda3/envs/rretinanet/lib/python3.5/site-packages/tensorflow/python/util/deprecation.py", line 507, in new_func
return func(*args, **kwargs)
File "/home/xianyun/anaconda3/envs/rretinanet/lib/python3.5/site-packages/tensorflow/python/framework/ops.py", line 3300, in create_op
op_def=op_def)
File "/home/xianyun/anaconda3/envs/rretinanet/lib/python3.5/site-packages/tensorflow/python/framework/ops.py", line 1801, in __init__
self._traceback = tf_stack.extract_stack()

UnknownError (see above for traceback): IndexError: index -9223372036853286991 is out of bounds for axis 0 with size 1681218
Traceback (most recent call last):

File "/home/xianyun/anaconda3/envs/rretinanet/lib/python3.5/site-packages/tensorflow/python/ops/script_ops.py", line 207, in __call__
ret = func(*args)

File "../libs/detection_oprations/anchor_target_layer_without_boxweight.py", line 38, in anchor_target_layer
max_overlaps = overlaps[np.arange(overlaps.shape[0]), argmax_overlaps_inds]

IndexError: index -9223372036853286991 is out of bounds for axis 0 with size 1681218

 [[node tower_1/build_loss/PyFunc (defined at ../libs/networks/build_whole_network.py:233) ]]
 [[node tower_0/add_3 (defined at multi_gpu_train.py:232) ]]

Have you seen this error before? Do you know of any way to fix it?
Thanks again!
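As a diagnostic aid only (not a confirmed fix), the sketch below shows two checks one could run on the values from the traceback. The helper names are hypothetical. The first check shows that the reported index lies just above the minimum signed 64-bit integer, which may hint that `argmax_overlaps_inds` holds a corrupted value rather than a genuine argmax result. The second finds the positions of indices that array indexing would reject, so the offending anchors can be inspected before the indexing line runs.

```python
INT64_MIN = -2 ** 63


def offset_from_int64_min(index):
    """Return how far an index lies above the smallest signed 64-bit integer."""
    return index - INT64_MIN


def find_out_of_range(indices, size):
    """Return (position, value) pairs for indices outside the range [-size, size)."""
    return [
        (pos, int(idx))
        for pos, idx in enumerate(indices)
        if not -size <= int(idx) < size
    ]


bad_index = -9223372036853286991   # value reported in the IndexError
axis_size = 1681218                # axis size reported in the IndexError

print(offset_from_int64_min(bad_index))
print(find_out_of_range([0, 5, bad_index, -1], axis_size))
```

`find_out_of_range` accepts any iterable of integers, so it could be called on `argmax_overlaps_inds` inside `anchor_target_layer` just before the indexing line. Checking the overlaps array and the ground-truth boxes for NaN or degenerate values at the reported positions would be a natural next step, on the assumption that the bad index originates in the inputs.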
