This repository has been archived by the owner on Nov 17, 2023. It is now read-only.

CD Release Pipeline libmxnet.so Symbol Error #19917

Closed
Zha0q1 opened this issue Feb 18, 2021 · 4 comments

@Zha0q1
Contributor

Zha0q1 commented Feb 18, 2021

After #19870, the master CD cu112 pipeline was able to build. However, we now hit this symbol error in the test stage:

[2021-02-18T20:46:02.546Z] ImportError while loading conftest '/work/mxnet/tests/python/conftest.py'.
[2021-02-18T20:46:02.547Z] tests/python/conftest.py:22: in <module>
[2021-02-18T20:46:02.547Z]     import mxnet as mx
[2021-02-18T20:46:02.547Z] python/mxnet/__init__.py:23: in <module>
[2021-02-18T20:46:02.547Z]     from .context import Context, current_context, cpu, gpu, cpu_pinned
[2021-02-18T20:46:02.547Z] python/mxnet/context.py:20: in <module>
[2021-02-18T20:46:02.547Z]     from .base import _LIB
[2021-02-18T20:46:02.547Z] python/mxnet/base.py:293: in <module>
[2021-02-18T20:46:02.547Z]     _LIB = _load_lib()
[2021-02-18T20:46:02.547Z] python/mxnet/base.py:284: in _load_lib
[2021-02-18T20:46:02.547Z]     lib = ctypes.CDLL(lib_path[0], ctypes.RTLD_LOCAL)
[2021-02-18T20:46:02.547Z] /opt/rh/rh-python36/root/usr/lib64/python3.6/ctypes/__init__.py:343: in __init__
[2021-02-18T20:46:02.547Z]     self._handle = _dlopen(self._name, mode)
[2021-02-18T20:46:02.547Z] E   OSError: /work/mxnet/python/mxnet/../../lib/libmxnet.so: undefined symbol: nvmlDeviceGetComputeRunningProcesses_v2

https://jenkins.mxnet-ci.amazon-ml.com/blue/organizations/jenkins/restricted-mxnet-cd%2Fmxnet-cd-release-job/detail/mxnet-cd-release-job/2523/pipeline/401

@ptrendx
Member

ptrendx commented Feb 19, 2021

Yeah, this comes from the fact that the CUDA 11.2 image has the 11.2 version of nvml.h, while the actual libnvml library is part of the driver (and the driver on that machine is probably older and does not have that version of the function). If you look at the nvml.h version history here: https://github.com/NVIDIA/nvidia-settings/blob/master/src/nvml.h, the _v2 version of that function was added in driver 450.66. Which driver version do you use in the CI?

As a workaround you can just add the option -DUSE_NVML=0 to your cmake build to disable nvml.

@ptrendx
Member

ptrendx commented Feb 19, 2021

The other approach (I believe the recommended one) is to use dlopen to load the nvml library at runtime, so that those additional symbols from nvml.h and the libnvidia-ml stub library in the build image do not contaminate the resulting binary (since that function is not even used by mxnet).
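To illustrate the runtime-loading idea, here is a minimal Python sketch using ctypes, the same mechanism that appears in the traceback above. It is not the change that went into MXNet; the helper names `load_nvml` and `resolve_first_symbol` are hypothetical, and the candidate library names assume a Linux driver install. It only checks which entry point the installed driver exports and does not call it, since the argument layouts of the two versions may differ.

```python
import ctypes

# Assumption: on Linux the driver ships NVML under one of these names.
NVML_LIBRARY_NAMES = ("libnvidia-ml.so.1", "libnvidia-ml.so")

# Prefer the newer entry point, then fall back to the older unversioned one.
RUNNING_PROCESSES_SYMBOLS = (
    "nvmlDeviceGetComputeRunningProcesses_v2",
    "nvmlDeviceGetComputeRunningProcesses",
)


def load_nvml(names=NVML_LIBRARY_NAMES):
    """Return a ctypes handle to NVML, or None if no candidate can be loaded."""
    for name in names:
        try:
            return ctypes.CDLL(name)
        except OSError:
            # Library not installed or not loadable; try the next candidate.
            continue
    return None


def resolve_first_symbol(lib, symbols):
    """Return (name, function) for the first symbol lib exports, else None."""
    if lib is None:
        return None
    for symbol in symbols:
        try:
            # ctypes raises AttributeError when the symbol is not exported.
            return symbol, getattr(lib, symbol)
        except AttributeError:
            continue
    return None


nvml = load_nvml()
resolved = resolve_first_symbol(nvml, RUNNING_PROCESSES_SYMBOLS)
if resolved is None:
    print("NVML entry point not available; NVML-based features would be disabled")
else:
    print("Using NVML entry point:", resolved[0])
```

Because the lookup happens at runtime, a missing symbol turns into a handled condition instead of an import-time failure of the whole library.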

@Zha0q1
Contributor Author

Zha0q1 commented Feb 19, 2021

@ptrendx Thanks for the info! I believe we use 450.51.05, while cu111 and cu112 require 450.80.02 according to https://docs.nvidia.com/deploy/cuda-compatibility/index.html. I think the best fix might be to update the NVIDIA driver on the GPU machines. This error probably only happens in CD because we set -DUSE_NVML=OFF in CI.
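The version comparison behind this can be sketched in a few lines of Python. The version strings are the ones quoted in this thread (450.51.05 installed, 450.66 for the _v2 symbol, 450.80.02 from the compatibility page); the helper names are hypothetical and the parsing assumes purely numeric, dot-separated versions.

```python
def parse_driver_version(version):
    """Turn a dotted driver version such as '450.51.05' into a tuple of ints."""
    return tuple(int(part) for part in version.split("."))


def driver_meets_minimum(installed, required):
    """Return True if the installed driver is at least the required version."""
    return parse_driver_version(installed) >= parse_driver_version(required)


installed = "450.51.05"  # driver version reported for the CD machines in this thread

# Minimum quoted for the _v2 NVML symbol.
print(driver_meets_minimum(installed, "450.66"))     # False

# Minimum quoted from the CUDA compatibility page for cu111/cu112.
print(driver_meets_minimum(installed, "450.80.02"))  # False
```

Both checks fail for 450.51.05, which is consistent with the undefined symbol seen at import time.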

@Zha0q1
Contributor Author

Zha0q1 commented Mar 2, 2021

Fixed by #19939.

@Zha0q1 Zha0q1 closed this as completed Mar 2, 2021