Memory allocation failed #54
Thank you for creating this issue; let me confirm the basics first.

Thank you for reporting; please let us check it.
I have a GPU with 11 GB of memory, so I tried to run NVCNet in this environment. Now I'm looking into g_loss_con.
@TomonobuTsujikawa I get this warning (printed once by each of the two processes):

```
No communicator found. Running with a single process. If you run this with MPI processes, all processes will perform totally same.
```

I also tested the environment with `python -c "import nnabla_ext.cuda, nnabla_ext.cudnn"`:
Please provide the results of the following command:

```
pip list | grep -e pip -e nnabla
```

You can import nnabla correctly in a single-GPU environment, so I think it is a setup issue for multiple GPUs.
@TomonobuTsujikawa
Hmm, it seems to be OK. Do you still get the same error if you do the following?

```
pip uninstall nnabla nnabla-ext-cuda110-nccl2-mpi3-1-6
pip install nnabla nnabla-ext-cuda110-nccl2-mpi3-1-6
mpirun -n 2 python main.py -c cudnn -d 0,1 --output_path log_new/baseline --batch_size 8
```

I will also check.
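For context on where that warning comes from: nnabla data-parallel scripts typically create the communicator at startup and fall back to a single process if creation fails. A minimal sketch of the usual pattern, following nnabla's distributed-training examples (an assumption about what main.py does, not the repo's actual code):

```python
import nnabla as nn
import nnabla.communicators as C
from nnabla.ext_utils import get_extension_context

# Create a cuDNN context, then the multi-process communicator.
ctx = get_extension_context("cudnn", device_id="0")
comm = C.MultiProcessDataParallelCommunicator(ctx)
comm.init()  # the step that needs a working MPI/NCCL installation

# Bind each MPI process to its own GPU.
ctx.device_id = str(comm.local_rank)
nn.set_default_context(ctx)
print(f"rank {comm.rank}/{comm.size} on GPU {ctx.device_id}")
```

When `MultiProcessDataParallelCommunicator` cannot be created (for example, libmpi fails to load), wrappers around this pattern commonly print the "No communicator found" warning and continue single-process, which matches what you are seeing.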
@TomonobuTsujikawa I still get:

```
No communicator found. Running with a single process. If you run this with MPI processes, all processes will perform totally same.
```
@15755841658 I set up many environments to reproduce this error today, but I could not reproduce it. Could you share the output of the following commands?

```
cat /etc/os-release
dpkg -l | grep ^ii
conda --version
conda list
pip --version
pip list
nvidia-smi
set | grep -e LD_LIBRARY -e LD_PRELOAD
find /usr -name libmpi.so\*
```

I think this is the minimum command to check whether the issue has been resolved:

```
mpirun -n 2 python -c "import nnabla_ext.cudnn; from nnabla.ext_utils import get_extension_context; import nnabla.communicators as C; ctx = get_extension_context('cudnn', device_id='0'); C.MultiProcessDataParallelCommunicator(ctx)"
```
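The same one-liner unrolled into a script, for readability; the code is identical, only the added print and the file name `check_comm.py` are mine:

```python
# check_comm.py -- same check as the one-liner above.
import nnabla_ext.cudnn  # loads the CUDA/cuDNN extension
from nnabla.ext_utils import get_extension_context
import nnabla.communicators as C

ctx = get_extension_context('cudnn', device_id='0')
# Creating the communicator is what fails when MPI/NCCL is broken.
C.MultiProcessDataParallelCommunicator(ctx)
print("communicator created OK")
```

Run it as `mpirun -n 2 python check_comm.py`; on a healthy setup each rank prints the OK line.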
Yes, your environment has an issue, so the minimum command fails.
@TomonobuTsujikawa
I checked your environment information; here is the list of problems that need to be solved: I cannot find the NVIDIA driver/CUDA/cuDNN packages in your dpkg list. Did you install them manually?

Hmm, if you cannot upgrade the OS environment, I think it is better to use a docker container:

```
docker pull nnabla/nnabla-ext-cuda-multi-gpu:py37-cuda110-mpi3.1.6-v1.29.0
docker run --rm -it -u $(id -u):$(id -g) --gpus all nnabla/nnabla-ext-cuda-multi-gpu:py37-cuda110-mpi3.1.6-v1.29.0
mpirun -n 2 python3 -c "import nnabla_ext.cudnn; from nnabla.ext_utils import get_extension_context; import nnabla.communicators as C; ctx = get_extension_context('cudnn', device_id='0'); C.MultiProcessDataParallelCommunicator(ctx)"
```

If you cannot install docker, you need to build OpenMPI yourself. Also, please refer to the nnabla install page: https://nnabla.org/install/
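Before building anything, a quick generic check (plain Python, not an nnabla API) of whether the dynamic loader can resolve an MPI library at all:

```python
# Print the libmpi the dynamic loader can resolve, if any.
import ctypes.util

name = ctypes.util.find_library("mpi")
print("libmpi resolvable by loader:", name or "NOT FOUND")
```

If this prints NOT FOUND, the `find /usr -name libmpi.so\*` result above and your LD_LIBRARY_PATH settings are the places to look.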
I had the same trouble when I tried to set up another code repo. env:

After I installed:

Hope this helps.
I tried to train with 2 GPUs using docker, but after one epoch, memory allocation errors occur. I am not sure what to check or what could possibly be wrong.
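One generic way to narrow this down (my suggestion, not from this repo; it assumes `nvidia-smi` is available inside the container) is to watch per-GPU memory while training runs and see whether usage climbs steadily or jumps right at the epoch boundary:

```python
# Standalone monitor: print per-GPU memory usage every 10 seconds.
import subprocess
import time

while True:
    out = subprocess.check_output(
        ["nvidia-smi",
         "--query-gpu=index,memory.used,memory.total",
         "--format=csv,noheader"],
        text=True,
    )
    print(out.strip(), flush=True)
    time.sleep(10)
```

Steady growth per iteration suggests a leak; a jump when the second epoch starts points at something epoch-scoped such as validation or checkpointing. With 11 GB per GPU, lowering `--batch_size` from 8 is also a cheap first test.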