
Error when training on a specified single GPU in a dual-GPU environment #77

Closed
wz940216 opened this issue Nov 1, 2019 · 2 comments

wz940216 commented Nov 1, 2019

Environment: Ubuntu 18.04 with dual RTX 2080 Ti GPUs. nvidia-smi reports driver 418.88 and CUDA 10.1, but nvcc -V reports CUDA 10.0. Paddle version is 1.5.1.
The code runs perfectly on another single-GPU V100 machine. On the dual-GPU machine, the second card is occupied by someone else, so I set export CUDA_VISIBLE_DEVICES=0 to use only one card. Running train.py then fails with:
```
Traceback (most recent call last):
  File "pdseg/train.py", line 467, in <module>
    main(args)
  File "pdseg/train.py", line 454, in main
    train(cfg)
  File "pdseg/train.py", line 235, in train
    exe.run(startup_prog)
  File "/home/yangjing/anaconda3/envs/paddle/lib/python3.6/site-packages/paddle/fluid/executor.py", line 651, in run
    use_program_cache=use_program_cache)
  File "/home/yangjing/anaconda3/envs/paddle/lib/python3.6/site-packages/paddle/fluid/executor.py", line 749, in run
    exe.run(program.desc, scope, 0, True, True, fetch_var_name)
RuntimeError: function_attributes(): after cudaFuncGetAttributes: invalid device function
```
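As background: `CUDA_VISIBLE_DEVICES` masks GPUs at process startup, before the framework initializes CUDA, and the remaining devices are renumbered from 0 inside the process. A minimal, framework-independent sketch of this (the environment-variable name is the standard CUDA one; setting it from Python only works if done before the first CUDA initialization):

```python
import os

# Must be set before the process (or framework) initializes CUDA;
# changing it afterwards has no effect on an already-initialized context.
os.environ["CUDA_VISIBLE_DEVICES"] = "0"  # expose only physical GPU 0

# After masking, the framework sees one device, numbered 0, which maps
# to physical GPU 0 on this machine.
print(os.environ["CUDA_VISIBLE_DEVICES"])
```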
With the other card still occupied, running on both cards via export CUDA_VISIBLE_DEVICES=0,1 fails as follows:
```
Traceback (most recent call last):
  File "pdseg/train.py", line 467, in <module>
    main(args)
  File "pdseg/train.py", line 454, in main
    train(cfg)
  File "pdseg/train.py", line 235, in train
    exe.run(startup_prog)
  File "/home/yangjing/anaconda3/envs/paddle/lib/python3.6/site-packages/paddle/fluid/executor.py", line 651, in run
    use_program_cache=use_program_cache)
  File "/home/yangjing/anaconda3/envs/paddle/lib/python3.6/site-packages/paddle/fluid/executor.py", line 749, in run
    exe.run(program.desc, scope, 0, True, True, fetch_var_name)
paddle.fluid.core_avx.EnforceNotMet: Invoke operator fill_constant error.
Python Callstacks:
  File "/home/yangjing/anaconda3/envs/paddle/lib/python3.6/site-packages/paddle/fluid/framework.py", line 1842, in prepend_op
    attrs=kwargs.get("attrs", None))
  File "/home/yangjing/anaconda3/envs/paddle/lib/python3.6/site-packages/paddle/fluid/initializer.py", line 189, in __call__
    stop_gradient=True)
  File "/home/yangjing/anaconda3/envs/paddle/lib/python3.6/site-packages/paddle/fluid/framework.py", line 1625, in create_var
    kwargs['initializer'](var, self)
  File "/home/yangjing/anaconda3/envs/paddle/lib/python3.6/site-packages/paddle/fluid/layer_helper_base.py", line 383, in set_variable_initializer
    initializer=initializer)
  File "/home/yangjing/anaconda3/envs/paddle/lib/python3.6/site-packages/paddle/fluid/optimizer.py", line 317, in _add_accumulator
    var, initializer=Constant(value=float(fill_value)))
  File "/home/yangjing/anaconda3/envs/paddle/lib/python3.6/site-packages/paddle/fluid/optimizer.py", line 760, in _create_accumulators
    self._add_accumulator(self._velocity_acc_str, p)
  File "/home/yangjing/anaconda3/envs/paddle/lib/python3.6/site-packages/paddle/fluid/optimizer.py", line 364, in _create_optimization_pass
    [p[0] for p in parameters_and_grads])
  File "/home/yangjing/anaconda3/envs/paddle/lib/python3.6/site-packages/paddle/fluid/optimizer.py", line 532, in apply_gradients
    optimize_ops = self._create_optimization_pass(params_grads)
  File "/home/yangjing/anaconda3/envs/paddle/lib/python3.6/site-packages/paddle/fluid/optimizer.py", line 562, in apply_optimize
    optimize_ops = self.apply_gradients(params_grads)
  File "/home/yangjing/anaconda3/envs/paddle/lib/python3.6/site-packages/paddle/fluid/optimizer.py", line 601, in minimize
    loss, startup_program=startup_program, params_grads=params_grads)
  File "/home/yangjing/anaconda3/envs/paddle/lib/python3.6/site-packages/paddle/fluid/dygraph/base.py", line 87, in __impl__
    return func(*args, **kwargs)
  File "/home/yangjing/anaconda3/envs/paddle/lib/python3.6/site-packages/paddle/fluid/wrapped_decorator.py", line 25, in __impl__
    return wrapped_func(*args, **kwargs)
  File "</home/yangjing/anaconda3/envs/paddle/lib/python3.6/site-packages/decorator.py:decorator-gen-20>", line 2, in minimize
  File "/home/yangjing/PaddleSeg/pdseg/solver.py", line 85, in sgd_optimizer
    optimizer.minimize(loss)
  File "/home/yangjing/PaddleSeg/pdseg/solver.py", line 107, in optimise
    return self.sgd_optimizer(lr_policy, loss)
  File "/home/yangjing/PaddleSeg/pdseg/models/model_builder.py", line 182, in build_model
    decayed_lr = optimizer.optimise(avg_loss)
  File "pdseg/train.py", line 230, in train
    train_prog, startup_prog, phase=ModelPhase.TRAIN)
  File "pdseg/train.py", line 454, in main
    train(cfg)
  File "pdseg/train.py", line 467, in <module>
    main(args)
C++ Callstacks:
Enforce failed. Expected allocating <= available, but received allocating:10068465874 > available:8840675072.
Insufficient GPU memory to allocation. at [/paddle/paddle/fluid/platform/gpu_info.cc:262]
PaddlePaddle Call Stacks:
0   0x7fb492147438p  void paddle::platform::EnforceNotMet::Init<std::string>(std::string, char const*, int) + 360
1   0x7fb492147787p  paddle::platform::EnforceNotMet::EnforceNotMet(std::string const&, char const*, int) + 87
2   0x7fb4942ba8c6p  paddle::platform::GpuMaxChunkSize() + 630
3   0x7fb49428eadap
4   0x7fb5178b9827p
5   0x7fb49428e17dp  paddle::memory::legacy::GetGPUBuddyAllocator(int) + 109
6   0x7fb49428efc5p  void* paddle::memory::legacy::Alloc<paddle::platform::CUDAPlace>(paddle::platform::CUDAPlace const&, unsigned long) + 37
7   0x7fb49428f505p  paddle::memory::allocation::LegacyAllocator::AllocateImpl(unsigned long) + 421
8   0x7fb494283625p  paddle::memory::allocation::AllocatorFacade::Alloc(boost::variant<paddle::platform::CUDAPlace, paddle::platform::CPUPlace, paddle::platform::CUDAPinnedPlace, ...> const&, unsigned long) + 181
9   0x7fb4942837aap  paddle::memory::allocation::AllocatorFacade::AllocShared(boost::variant<...> const&, unsigned long) + 26
10  0x7fb493e55e2cp  paddle::memory::AllocShared(boost::variant<...> const&, unsigned long) + 44
11  0x7fb494256434p  paddle::framework::Tensor::mutable_data(boost::variant<...>, paddle::framework::proto::VarType_Type, unsigned long) + 148
12  0x7fb49256363ep  paddle::operators::FillConstantKernel<...>::Compute(paddle::framework::ExecutionContext const&) const + 494
13  0x7fb492566753p  std::_Function_handler<void (paddle::framework::ExecutionContext const&), paddle::framework::OpKernelRegistrarFunctor<paddle::platform::CUDAPlace, false, 0ul, paddle::operators::FillConstantKernel<...>, ..., paddle::operators::FillConstantKernel<paddle::platform::float16>>::operator()(char const*, char const*, int) const::{lambda(paddle::framework::ExecutionContext const&)#1}>::_M_invoke(std::_Any_data const&, paddle::framework::ExecutionContext const&) + 35
14  0x7fb4941c61e7p  paddle::framework::OperatorWithKernel::RunImpl(paddle::framework::Scope const&, boost::variant<...> const&, paddle::framework::RuntimeContext*) const + 375
15  0x7fb4941c65c1p  paddle::framework::OperatorWithKernel::RunImpl(paddle::framework::Scope const&, boost::variant<...> const&) const + 529
16  0x7fb4941c3bbcp  paddle::framework::OperatorBase::Run(paddle::framework::Scope const&, boost::variant<...> const&) + 332
17  0x7fb4922d1deep  paddle::framework::Executor::RunPreparedContext(paddle::framework::ExecutorPrepareContext*, paddle::framework::Scope*, bool, bool, bool) + 606
18  0x7fb4922d4dafp  paddle::framework::Executor::Run(paddle::framework::ProgramDesc const&, paddle::framework::Scope*, int, bool, bool, std::vector<std::string, std::allocator<std::string>> const&, bool) + 143
19  0x7fb49213859dp
20  0x7fb492179826p
21  0x559fab5e1c54p  _PyCFunction_FastCallDict + 340
22  0x559fab669c0ep
23  0x559fab68c75ap  _PyEval_EvalFrameDefault + 778
24  0x559fab662e66p
25  0x559fab663ed6p
26  0x559fab669b95p
27  0x559fab68d51cp  _PyEval_EvalFrameDefault + 4300
28  0x559fab662e66p
29  0x559fab663e73p
30  0x559fab669b95p
31  0x559fab68c75ap  _PyEval_EvalFrameDefault + 778
32  0x559fab66329ep
33  0x559fab663ed6p
34  0x559fab669b95p
35  0x559fab68c75ap  _PyEval_EvalFrameDefault + 778
36  0x559fab663c5bp
37  0x559fab669b95p
38  0x559fab68c75ap  _PyEval_EvalFrameDefault + 778
39  0x559fab6649b9p  PyEval_EvalCodeEx + 809
40  0x559fab66575cp  PyEval_EvalCode + 28
41  0x559fab6e5744p
42  0x559fab6e5b41p  PyRun_FileExFlags + 161
43  0x559fab6e5d43p  PyRun_SimpleFileExFlags + 451
44  0x559fab6e9833p  Py_Main + 1555
45  0x559fab5b388ep  main + 238
46  0x7fb5174dab97p  __libc_start_main + 231
47  0x559fab693160p
```

(The long `boost::variant<...>` template argument lists in the C++ frames are elided as `<...>`.)
How can I solve this? Many thanks!

Collaborator

nepeplwu commented Nov 1, 2019

@wz940216, thanks for the report. Judging from the log, the single-GPU error is most likely an environment problem; you can refer to the related issue to find a solution.
The dual-GPU error is out-of-memory. From your description, one of the cards is already occupied by someone else, so an out-of-memory error is to be expected.
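The figures in the "Enforce failed" message above make the mismatch concrete: Paddle tried to reserve roughly 9.4 GiB while only about 8.2 GiB was free on the visible card. A quick check using the byte counts copied from the log:

```python
# Byte counts taken verbatim from the "Enforce failed" line in the log.
allocating = 10068465874   # bytes Paddle tried to reserve
available = 8840675072     # bytes actually free on the GPU

GiB = 2 ** 30
print(f"requested: {allocating / GiB:.2f} GiB")  # ~9.38 GiB
print(f"available: {available / GiB:.2f} GiB")   # ~8.23 GiB
print(allocating > available)                    # True, so the allocation must fail
```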

nepeplwu self-assigned this Nov 4, 2019

nepeplwu commented Sep 7, 2020

This issue has had no updates in the last three months, so we are closing it. If needed, please reopen it.

nepeplwu closed this as completed Sep 7, 2020