Python package looks for library in wrong path #5106

Open
david-cortes opened this issue Mar 30, 2022 · 17 comments

@david-cortes
Contributor

Trying to build the python package from source without installing it will somehow try to pick the wrong path for system libraries. I get an error about being unable to import scipy.sparse, even though I can import that library in the same session (this is after successfully building lib_lightgbm through the cmake system):

>>> import scipy.sparse
>>> import lightgbm
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/david/del/py_lightgbm/python-package/lightgbm/__init__.py", line 8, in <module>
    from .basic import Booster, Dataset, Sequence, register_logger
  File "/home/david/del/py_lightgbm/python-package/lightgbm/basic.py", line 126, in <module>
    _LIB = _load_lib()
  File "/home/david/del/py_lightgbm/python-package/lightgbm/basic.py", line 117, in _load_lib
    lib = ctypes.cdll.LoadLibrary(lib_path[0])
  File "/home/david/anaconda3/envs/py3/lib/python3.9/ctypes/__init__.py", line 460, in LoadLibrary
    return self._dlltype(name)
  File "/home/david/anaconda3/envs/py3/lib/python3.9/ctypes/__init__.py", line 382, in __init__
    self._handle = _dlopen(self._name, mode)
OSError: /home/david/anaconda3/envs/py3/lib/python3.9/site-packages/scipy/sparse/../../../../libstdc++.so.6: version `GLIBCXX_3.4.30' not found (required by /home/david/del/py_lightgbm/lib_lightgbm.so)

@jmoralez
Collaborator

Hi @david-cortes, thank you for your report. Can you provide some more details on how you installed the package? I've had this happen to me when I installed lightgbm from conda and then from pip, and somehow the DLLs got mixed. If you've already compiled the package yourself, you can try python setup.py install --precompile, which uses the produced artifacts. If you can provide some more details on your installation procedure, we might be able to help a bit more.

@david-cortes
Contributor Author

I didn't install it, I compiled it from source from the current master branch using the cmake system and tried to import it in python from its folder. Doing that imports it correctly in other setups.

@jmoralez
Collaborator

I see. Did you have it installed already in your environment? I just tried:

conda create -n lgb_test -y python=3.9 scipy
conda activate lgb_test
python -c 'import lightgbm'

and it works correctly.

@david-cortes
Contributor Author

I did have it installed from pip, and it works correctly if I import the pip-installed version, but it doesn't work in this setup if I import the non-installed version that's compiled from source.

@jmoralez
Collaborator

I guess if you pip uninstall lightgbm you might be able to import the compiled version from the project folder.

@david-cortes
Contributor Author

> I guess if you pip uninstall lightgbm you might be able to import the compiled version from the project folder.

Still the same error after uninstalling the pip version in this conda environment.

@jmoralez
Collaborator

jmoralez commented Apr 5, 2022

I think either of the two options I described (creating a new environment or using python setup.py install --precompile) should work. Is there a reason why you don't want to install it?

@david-cortes
Contributor Author

> I think either of the two options I described (creating a new environment or using python setup.py install --precompile) should work. Is there a reason why you don't want to install it?

Tried in different conda environments and got the same issue. Tried also installing it with python setup.py install --precompile, and still the same problem. I did not want to install it like that because python setup.py install does not register a pip install, doesn't have an uninstall option (although for lightgbm that's easy to do after the fact), and is currently deprecated AFAIK.

@StrikerRUS
Collaborator

> I did not want to install it like that because python setup.py install does not register a pip install, doesn't have an uninstall option

You can build a wheel file first and after that install it via pip

cd $BUILD_DIRECTORY/python-package && python setup.py bdist_wheel --plat-name=macosx --python-tag py3 || exit -1

pip install --user $BUILD_DIRECTORY/python-package/dist/*.whl || exit -1

@david-cortes
Contributor Author

Still the same error from building a binary wheel and then installing it with pip.

jameslamb added the bug label Apr 23, 2022

@jameslamb
Collaborator

While working on #5169, I ran into this issue and have been doing some investigation. I think I've identified a fully reproducible example (using a container) and a strategy for fixing this, based on what I've learned about the way that ctypes loads libraries.

Will post those details here in the next few days, when I have time. Just wanted to post to let others here know I'm actively looking into this.

@jameslamb
Collaborator

jameslamb commented May 10, 2022

Ok, so! I've been able to collect my thoughts on this.


Reproducible Example

Consider the following Dockerfile, pinned to the ubuntu:latest (Ubuntu 22.04) image as of two weeks ago.

Dockerfile
# pinning to specific version of ubuntu:22.04
FROM ubuntu@sha256:2a7dffab37165e8b4f206f61cfd984f8bb279843b070217f6ad310c9c31c9c7c

ENV CONDA=/root/miniforge \
    DEBIAN_FRONTEND=noninteractive \
    LANG="en_US.UTF-8" \
    LGB_COMMIT=416ecd5a8de1b2b9225ded3c919cb0d40ec0d9bd \
    LGB_SOURCE_DIR=/usr/local/src/LightGBM \
    PATH="/root/miniforge/bin:${PATH}" \
    PYTHON_VERSION=3.10

RUN apt-get update && \
    apt-get install \
        --no-install-recommends \
        -y \
            sudo && \
    sudo apt-get install \
        --no-install-recommends \
        -y \
            locales \
            software-properties-common && \
    sudo locale-gen ${LANG} && \
    sudo update-locale LANG=${LANG} && \
    sudo apt-get install \
        --no-install-recommends \
        -y \
            apt-utils \
            build-essential \
            ca-certificates \
            cmake \
            curl \
            git \
            iputils-ping \
            jq \
            libicu-dev \
            libcurl4 \
            libssl-dev \
            libunwind8 \
            locales \
            netcat \
            unzip \
            zip && \
    # install conda
    curl \
        -sL \
        -o miniforge.sh \
        https://github.com/conda-forge/miniforge/releases/latest/download/Miniforge3-Linux-$(uname -m).sh && \
    sh miniforge.sh -b -p ${CONDA} && \
    conda config --set always_yes yes --set changeps1 no && \
    conda update -q -y conda && \
    git clone \
        --recursive \
        https://github.com/microsoft/LightGBM.git \
        "${LGB_SOURCE_DIR}" && \
    cd "${LGB_SOURCE_DIR}" && \
    git checkout ${LGB_COMMIT}

WORKDIR "${LGB_SOURCE_DIR}"

Build the image:

docker build \
    --no-cache \
    -t lgb-glibc-demo:local \
    - < ./Dockerfile

Installing lightgbm from source and loading it works without issue.

docker run \
    --rm \
    --workdir /usr/local/src/LightGBM/python-package \
    -it lgb-glibc-demo:local \
    /bin/bash -c "pip install . && python -c 'import lightgbm'"

But if you try conda install-ing libstdcxx-ng first, it will produce the error mentioned in this issue.

docker run \
    --rm \
    --workdir /usr/local/src/LightGBM/python-package \
    -it lgb-glibc-demo:local \
    /bin/bash -c "conda install -y -n base libstdcxx-ng && pip install . && python -c 'import lightgbm'"

OSError: /root/miniforge/bin/../lib/libstdc++.so.6: version `GLIBCXX_3.4.30' not found (required by /usr/local/src/LightGBM/python-package/compile/lib_lightgbm.so)

The combination of a conda Python distribution and a libstdc++.so.6 anywhere in conda's library paths can cause this error to be thrown.


Root Cause (short description)

When lightgbm is compiled, it uses the system gcc / g++ and links against /usr/lib/x86_64-linux-gnu/libstdc++.so.6, which contains symbols from versions of GLIBCXX as new as GLIBCXX_3.4.30.

When lib_lightgbm.so is later loaded in a conda distribution of Python, a libstdc++.so.6 in conda's lib/ directory is found first, and it only contains GLIBCXX symbols up to GLIBCXX_3.4.29.
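
One way to check that mismatch is to compare the newest GLIBCXX version tag embedded in each copy of libstdc++.so.6. The following is a minimal sketch, roughly equivalent to running `strings` on the file and filtering for GLIBCXX; the helper names `glibcxx_versions` and `max_glibcxx` are hypothetical and not part of LightGBM.

```python
import re

# Matches version tags such as b"GLIBCXX_3.4.30" embedded in the library file.
_GLIBCXX_TAG = re.compile(rb"GLIBCXX_(\d+(?:\.\d+)+)")


def glibcxx_versions(data: bytes) -> list:
    """Return the sorted, de-duplicated GLIBCXX versions found in raw bytes."""
    found = {
        tuple(int(part) for part in match.group(1).decode().split("."))
        for match in _GLIBCXX_TAG.finditer(data)
    }
    # Tuples of ints sort numerically, so (3, 4, 9) comes before (3, 4, 29).
    return sorted(found)


def max_glibcxx(path: str) -> tuple:
    """Read a libstdc++.so.6 file and return its newest GLIBCXX version."""
    with open(path, "rb") as f:
        return glibcxx_versions(f.read())[-1]
```

In the container described above, calling `max_glibcxx()` on /usr/lib/x86_64-linux-gnu/libstdc++.so.6 and on /root/miniforge/lib/libstdc++.so.6 should, if the diagnosis here is right, show the system copy reporting a newer version than the conda copy.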


Workarounds with no changes to LightGBM

1. Use conda's CMake and compilers to build LightGBM from source

From https://conda.io/projects/conda-build/en/latest/resources/compiler-tools.html#using-the-compiler-packages

Instead of gcc, the executable name of the compiler you use will be something like x86_64-conda_cos6-linux-gnu-gcc.

Many build tools such as make and CMake search by default for a compiler named simply gcc, so we set environment variables to point these tools to the correct compiler.

We set these variables in conda activate.d scripts, so any environment in which you will use the compilers must first be activated so the scripts will run. Conda-build does this activation for you using activation hooks installed with the compiler packages in CONDA_PREFIX/etc/conda/activate.d.

# install the problematic library
conda install -y -n base \
    libstdcxx-ng

# confirm that it results in a `libstdc++.so.6` being added in conda env
find / -name 'libstdc++.so.6'
# /root/miniforge/lib/libstdc++.so.6
# /root/miniforge/pkgs/libstdcxx-ng-11.2.0-he4da1e4_16/lib/libstdc++.so.6
# /usr/lib/x86_64-linux-gnu/libstdc++.so.6

# get conda compilers
conda install -y -n base \
    cmake \
    gcc_linux-64 \
    gxx_linux-64

# it's important to activate the target conda env, to set
# the relevant environment variables pointing to conda's compilers
source activate base

# you can see the effect of this by checking env variables
echo $CC
# /root/miniforge/bin/x86_64-conda-linux-gnu-cc

echo $CXX
# /root/miniforge/bin/x86_64-conda-linux-gnu-c++

cd /usr/local/src/LightGBM
pip uninstall -y lightgbm
rm -rf ./build
rm -f ./lib_lightgbm.so

cd ./python-package
pip install .

# confirm that importing works
python -c "import lightgbm; print(lightgbm.__version__)"
# 3.3.2.99

# confirm that the maximum GLIBCXX version is less than
# the one from the error message, and that the libstdc++.so.6 linked
# is the one from /root/miniforge
LIB_LIGHTGBM_IN_CONDA=$(
    find /root/miniforge -name 'lib_lightgbm.so' \
    | head -1
)
ldd -v \
    "${LIB_LIGHTGBM_IN_CONDA}"

2. Point LD_PRELOAD at the non-conda libstdc++.so.6 prior to starting python

# install the problematic library
conda install -y -n base \
    libstdcxx-ng

# confirm that it resulted in a `libstdc++.so.6` being added in conda env
find / -name 'libstdc++.so.6'
# /root/miniforge/lib/libstdc++.so.6
# /root/miniforge/pkgs/libstdcxx-ng-11.2.0-he4da1e4_16/lib/libstdc++.so.6
# /usr/lib/x86_64-linux-gnu/libstdc++.so.6

# build LightGBM from source
cd /usr/local/src/LightGBM
pip uninstall -y lightgbm
rm -rf ./build
rm -f ./lib_lightgbm.so
cd ./python-package
pip install .

# try loading lightgbm (this will fail)
python -c "import lightgbm; print(lightgbm.__version__)"

# try loading lightgbm with LD_PRELOAD pointing at the same
# libstdc++.so.6 that lib_lightgbm.so was linked against
LD_PRELOAD="${LD_PRELOAD}:/usr/lib/x86_64-linux-gnu/libstdc++.so.6" \
python -c "import lightgbm; print(lightgbm.__version__)"

NOTE: this cannot be done from inside Python. The following code will fail.

import os
os.environ["LD_PRELOAD"] = "/usr/lib/x86_64-linux-gnu/libstdc++.so.6"
import lightgbm
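
The in-process assignment has no effect because the dynamic loader reads LD_PRELOAD once, when the process starts. A child interpreter started with the variable already in its environment does see it. The sketch below illustrates that approach; `env_with_preload` and `import_in_child` are hypothetical helper names, not part of LightGBM.

```python
import os
import subprocess
import sys


def env_with_preload(lib_path, base_env=None):
    """Return a copy of the environment with lib_path first in LD_PRELOAD."""
    env = dict(os.environ if base_env is None else base_env)
    existing = env.get("LD_PRELOAD", "")
    # Keep any libraries that were already being preloaded.
    env["LD_PRELOAD"] = f"{lib_path}:{existing}" if existing else lib_path
    return env


def import_in_child(module, lib_path):
    """Try importing `module` in a fresh interpreter started with LD_PRELOAD set."""
    return subprocess.run(
        [sys.executable, "-c", f"import {module}"],
        env=env_with_preload(lib_path),
        capture_output=True,
        text=True,
    )
```

For example, `import_in_child("lightgbm", "/usr/lib/x86_64-linux-gnu/libstdc++.so.6").returncode` would be 0 if the import succeeds in the child process.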

3. Modify lib_lightgbm.so's DT_RPATH tag so that it points at the place where it found libstdc++.so.6

See https://man7.org/linux/man-pages/man3/dlopen.3.html and https://stackoverflow.com/a/20333550/3986677.

rpath is a way to embed, in a shared object, a hint about which directories to search for the shared libraries it depends on at load time.

cd /root/miniforge/lib/python3.9/site-packages/lightgbm/
cp lib_lightgbm.so lib_lightgbm2.so

# shows no rpath
chrpath -l lib_lightgbm2.so

# fails
python -c \
    "import ctypes; ctypes.cdll.LoadLibrary('/root/miniforge/lib/python3.9/site-packages/lightgbm/lib_lightgbm.so')"

# patch the rpath
patchelf --set-rpath '/usr/lib/x86_64-linux-gnu' lib_lightgbm2.so

# shows rpath
chrpath -l lib_lightgbm2.so

# succeeds!
python -c \
    "import ctypes; ctypes.cdll.LoadLibrary('/root/miniforge/lib/python3.9/site-packages/lightgbm/lib_lightgbm2.so')"

Root Cause (longer description)

I've found this topic very complicated (or at least, new to me), so have been capturing my running notes and example code snippets at https://github.com/jameslamb/lgb-glibc-demo.

Below is a summary of the issue that is more detailed than Root Cause (short description) but less detailed than my notes in that repo.

Whenever lightgbm is loaded with import lightgbm, it uses ctypes.cdll.LoadLibrary() to load its compiled library, lib_lightgbm.so.

lib = ctypes.cdll.LoadLibrary(lib_path[0])

The ctypes documentation describes this process in detail.

From "Finding shared libraries" in the ctypes docs (link)

When programming in a compiled language, shared libraries are accessed when compiling/linking a program, and when the program is run.

...the ctypes library loaders act like when a program is run, and call the runtime loader directly.

And from "loading shared libraries" (doc)

If you have an existing handle to an already loaded shared library, it can be passed as the handle named parameter, otherwise the underlying platform's dlopen or LoadLibrary function is used to load the library into the process, and to get a handle to it.

"underlying platform's dlopen" here refers to a standard C interface available on POSIX operating systems such as Linux.

For example, see https://man7.org/linux/man-pages/man3/dlopen.3.html for Linux.

From those docs, when searching for a library, the following are checked in order:

(ELF only) If the calling object (i.e., the shared library or executable from which dlopen() is called) contains a DT_RPATH tag, and does not contain a DT_RUNPATH tag, then the directories listed in the DT_RPATH tag are searched.

If, at the time that the program was started, the environment variable LD_LIBRARY_PATH was defined to contain a colon-separated list of directories, then these are searched.

(ELF only) If the calling object contains a DT_RUNPATH tag, then the directories listed in that tag are searched.

The cache file /etc/ld.so.cache (maintained by ldconfig(8)) is checked to see whether it contains an entry for filename.

The directories /lib and /usr/lib are searched (in that order).
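
To make that ordering concrete, here is a simplified model of it in Python. This is only a sketch under stated assumptions: `find_library` is a hypothetical name, the /etc/ld.so.cache step is left out, and the real loader applies further rules (for example around dependencies of an already-loaded object) that are not modeled here.

```python
import os
import posixpath


def find_library(name, rpath=(), ld_library_path=(), runpath=(),
                 default_dirs=("/lib", "/usr/lib"), exists=os.path.isfile):
    """Return the first path for `name` following a simplified dlopen() order.

    `exists` is injectable so the order can be exercised without touching
    the real filesystem.
    """
    search = []
    # DT_RPATH is only consulted when there is no DT_RUNPATH.
    if rpath and not runpath:
        search.extend(rpath)
    search.extend(ld_library_path)
    search.extend(runpath)
    search.extend(default_dirs)
    for directory in search:
        candidate = posixpath.join(directory, name)
        if exists(candidate):
            return candidate
    return None
```

With an rpath entry pointing at a conda lib/ directory that contains libstdc++.so.6, this model returns the conda copy even when a system copy exists in the default directories.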

conda tries very hard to ensure that its directories are searched first when dlopen() tries to load a library. One mechanism it uses for this is setting the DT_RPATH on its distribution of python.

Try this, using the container image built higher up in this description.

docker run \
    --rm \
    --workdir /usr/local/src/LightGBM/python-package \
    -it lgb-glibc-demo:local \
    /bin/bash

readelf -d /root/miniforge/bin/python \
| grep RPATH

Which yields the following.

 0x000000000000000f (RPATH)              Library rpath: [$ORIGIN/../lib]

That says "look in /root/miniforge/lib/ first when loading libraries"!
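
The $ORIGIN token stands for the directory containing the binary that carries the tag. As a rough illustration, the sketch below pulls the entries out of `readelf -d` output and expands $ORIGIN for a given binary path. The helper names are hypothetical, and the path normalization is purely lexical, which is an assumption that ignores symlinks.

```python
import posixpath
import re

# Matches readelf -d lines such as:
#  0x000000000000000f (RPATH)              Library rpath: [$ORIGIN/../lib]
_RPATH_LINE = re.compile(
    r"\((RPATH|RUNPATH)\)\s+Library (?:rpath|runpath): \[(.*)\]"
)


def rpath_entries(readelf_output):
    """Return (tag, [raw entries]) pairs found in `readelf -d` output."""
    entries = []
    for line in readelf_output.splitlines():
        match = _RPATH_LINE.search(line)
        if match:
            # Multiple directories are separated by colons.
            entries.append((match.group(1), match.group(2).split(":")))
    return entries


def expand_origin(entry, binary_path):
    """Replace $ORIGIN with the directory that contains binary_path."""
    origin = posixpath.dirname(binary_path)
    expanded = entry.replace("${ORIGIN}", origin).replace("$ORIGIN", origin)
    return posixpath.normpath(expanded)
```

For instance, `expand_origin("$ORIGIN/../lib", "/root/miniforge/bin/python")` gives `/root/miniforge/lib`.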

If you look at the copy of python in a specific conda environment, you'll see something similar.

conda create --name test-env python=3.9
readelf -d /root/miniforge/envs/test-env/bin/python \
| grep RPATH

That shows the same output.

 0x000000000000000f (RPATH)              Library rpath: [$ORIGIN/../lib]

Which this time means "first look in /root/miniforge/envs/test-env/lib/ when loading libraries".

If you pip install lightgbm inside that environment, you'll see it gets linked against a libstdc++.so.6 outside of conda's lib/ directories.

source activate test-env
cd /usr/local/src/LightGBM/python-package
pip install .
LIB_LIGHTGBM_IN_CONDA=$(
    find /root/miniforge -name 'lib_lightgbm.so' \
    | head -1
)
# /root/miniforge/envs/test-env/lib/python3.9/site-packages/lightgbm/lib_lightgbm.so

ldd -v ${LIB_LIGHTGBM_IN_CONDA}

That output contains a lot of information, including the following key line:

libstdc++.so.6 => /lib/x86_64-linux-gnu/libstdc++.so.6 (0x00007f9ee66e2000)

That says "lib_lightgbm.so was linked against /lib/x86_64-linux-gnu/libstdc++.so.6". That directory is outside of the ones conda looks in first!!
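
Scanning `ldd` output by eye gets tedious, so here is a small sketch that maps each library to the path it resolved to and lists the ones that fall outside a given prefix. The function names are hypothetical, and the parsing assumes the common `name => path (address)` line shape shown above.

```python
import re

# Matches ldd lines such as:
#   libstdc++.so.6 => /lib/x86_64-linux-gnu/libstdc++.so.6 (0x00007f9ee66e2000)
_LDD_LINE = re.compile(r"^\s*(\S+) => (\S+) \(0x[0-9a-fA-F]+\)\s*$")


def resolved_libraries(ldd_output):
    """Map each library name in ldd output to the path it resolved to."""
    resolved = {}
    for line in ldd_output.splitlines():
        match = _LDD_LINE.match(line)
        if match:
            resolved[match.group(1)] = match.group(2)
    return resolved


def outside_prefix(resolved, prefix):
    """Return the libraries whose resolved path is not under prefix."""
    root = prefix.rstrip("/") + "/"
    return {
        name: path
        for name, path in resolved.items()
        if not path.startswith(root)
    }
```

Feeding it the output of `ldd` on lib_lightgbm.so with the prefix `/root/miniforge` would list libstdc++.so.6 in the situation described here.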


Changes LightGBM could make to mitigate this

I think the most reliable, portable way for LightGBM to handle this is to attach a DT_RPATH to lib_lightgbm.so when it's compiled. That way, when dlopen() loads lib_lightgbm.so, it will first look in the same directories that the linker chose when compiling lib_lightgbm.so.

See CMake's docs on this at https://gitlab.kitware.com/cmake/community/-/wikis/doc/cmake/RPATH-handling#default-rpath-settings.

By default if you don't change any RPATH related settings, CMake will link the executables and shared libraries with full RPATH to all used libraries in the build tree. When installing, it will clear the RPATH of these targets so they are installed with an empty RPATH.

And https://gitlab.kitware.com/cmake/community/-/wikis/doc/cmake/RPATH-handling#always-full-rpath

CMAKE_INSTALL_RPATH_USE_LINK_PATH...If this option is enabled, all these directories except those which are also in the build tree will be added to the install RPATH automatically.

I haven't tested that yet, but I think it's worth exploring to try to mitigate this issue.

@david-cortes
Contributor Author

david-cortes commented May 10, 2022

Both lib_lightgbm.so and libstdc++ are shared libraries. Why would you load them with ctypes instead of letting the linker do the job?

EDIT: Actually from a look at the code, using the linker from python's configured compiler would imply some large changes in the setup and compilation logic.

@StrikerRUS
Collaborator

@jameslamb Wow, brilliant investigation, thanks a lot!

If attaching DT_RPATH to lib_lightgbm.so is just a tip and not a strict rule, I think we can investigate this approach.

I remember this conda behavior was the reason why we statically link libstdc++ on Windows during compiling with MinGW:
#899

LightGBM/CMakeLists.txt

Lines 323 to 325 in 6de9baf

if(WIN32 AND MINGW)
set(CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS} -static-libstdc++")
endif()

@jameslamb
Collaborator

@david-cortes more details on "letting the linker do the job" and "using the linker from python's configured compiler" (like a link to an example or relevant documentation) would be greatly appreciated. I don't know what pattern you're referring to.

@david-cortes
Contributor Author

The current logic is to use a .py file to load a cmake-generated shared object (.so / .lib) through ctypes. A potentially better approach would be to make the python extension contain a python-importable shared object (.so / .pyd) that would be compiled by the python setup script and linked to the lib_lightgbm.so shared object that's generated by the cmake build system.

That way, if e.g. running it on windows, there would be a lib_lightgbm.lib configured to link to a given msvc .dll, then an importable .pyd file that's linked to that lib_lightgbm.lib and to a potentially different c++ standard library dll, and both would be able to load symbols despite this difference because the linker will take care of that when generating the compiled files (assuming that it receives all the correct arguments and paths).

@jameslamb
Collaborator

Thanks for that information. I'm not familiar with that pattern for Python projects, will look around for some examples and documentation on it. As discussed in #5061 , I think it's possible that the package's strategy for compiling lib_lightgbm might need to change substantially in the future.

BUT...I also think, based on my investigation above, that setting DT_RPATH on lib_lightgbm.so to point to the locations where the linker found libraries at compile time might be a quick way to make the issue reported here less likely.
