
fix the issue missing dependencies in the Dockerfile and pip #2240

Merged
merged 31 commits
Aug 19, 2024

Conversation

ColorfulDick
Contributor

Motivation

Fix the Dockerfile missing the flash-attn, timm, and nccl dependencies.

@@ -1,3 +1,6 @@
gradio
protobuf
tritonclient[grpc]
timm
Collaborator

Please don't add them to serve.txt.
They are dependencies of VLM models, not LLMs. What's more, they are not even common dependencies across VLM models.

@@ -1,3 +1,6 @@
gradio
protobuf
tritonclient[grpc]
timm
flash-attn
openai
Collaborator

We can accept openai

Collaborator

Let's put openai to runtime.txt

Collaborator

@ColorfulDick THIS comment is not resolved.

services:
lmdeploy:
container_name: lmdeploy
image: openmmlab/lmdeploy-builder:cuda12.2
Collaborator

openmmlab/lmdeploy-builder is ONLY for building whl packages. It is not an image to deploy a model

Contributor Author

I will change "openmmlab/lmdeploy-builder:cuda12.2" to "openmmlab/lmdeploy:latest".

@lvhan028 lvhan028 requested a review from RunningLeon August 6, 2024 06:32
@@ -18,3 +18,4 @@ torchvision<=0.18.1,>=0.15.0
transformers
triton>=2.1.0,<=2.3.1; sys_platform == "linux"
uvicorn
flash-attn
Collaborator

@lvhan028 lvhan028 Aug 6, 2024

Not acceptable. Please remove it.
runtime.txt is restricted to the inference of LLM models.

…he minimum supported CUDA version is 12.2; if the NVIDIA driver version on the host machine is newer than the image's, torch inside the image will not work
@ColorfulDick ColorfulDick requested a review from lvhan028 August 8, 2024 09:08
@@ -0,0 +1,64 @@
ARG CUDA_VERSION=cu12

FROM nvidia/cuda:12.2.0-devel-ubuntu22.04 AS cu12
Collaborator

@lvhan028 lvhan028 Aug 8, 2024

Please use openmmlab/lmdeploy:latest-cu11 and openmmlab/lmdeploy:latest-cu12, respectively, so that we don't need to build lmdeploy and its dependencies any more.

Contributor Author

Please use openmmlab/lmdeploy:latest-cu11 and openmmlab/lmdeploy:latest-cu12, respectively, so that we don't need to build lmdeploy and its dependencies any more.

I have changed the base image in InternVL_Dockerfile.

@ColorfulDick ColorfulDick requested a review from lvhan028 August 8, 2024 14:54
@lvhan028 lvhan028 added the Bug:P1 label Aug 8, 2024
@@ -46,7 +54,8 @@ RUN cd /opt/lmdeploy &&\
ninja -j$(nproc) && ninja install &&\
cd .. &&\
python3 -m pip install -e . &&\
rm -rf build
rm -rf build &&\
rm -rf ~/.cache
Collaborator

Why remove "~/.cache" again?

Contributor Author

Why remove "~/.cache" again?

To remove the pip cache and reduce the image size.

Collaborator

Can't we keep only one "rm -rf ~/.cache"?

Contributor Author

Can't we keep only one "rm -rf ~/.cache"?

Because of Docker's union file system, every image layer (each RUN command or file change in the Dockerfile) is independent; deleting files in a later layer hides them but does not actually remove them from the earlier layer. So if I only execute "rm -rf ~/.cache" in the fifth RUN, it will not delete the cache files created by the second RUN.
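The layering behavior described above can be sketched in a minimal, hypothetical Dockerfile (not the PR's actual file; `some-package` is a placeholder): cleanup only reclaims space when it happens in the same RUN that created the files.

```dockerfile
# Illustrative sketch only: how layer independence affects image size.
FROM ubuntu:22.04

# BAD: the pip cache produced here is committed into this layer...
RUN pip install some-package

# ...so deleting it in a later layer hides the files but does not shrink
# the image; the earlier layer still physically contains them.
RUN rm -rf ~/.cache

# GOOD: create and delete the cache within a single RUN, so the cache
# files never land in any committed layer.
RUN pip install some-package && rm -rf ~/.cache
```

This is why the PR adds `rm -rf ~/.cache` at the end of each RUN that invokes pip, rather than once at the end of the Dockerfile.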

@ColorfulDick ColorfulDick requested a review from lvhan028 August 14, 2024 08:02
RUN wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64/cuda-keyring_1.0-1_all.deb && \
dpkg -i cuda-keyring_1.0-1_all.deb && \
apt update -y && \
apt install libnccl2 libnccl-dev
Collaborator

nccl is already installed as you can see

docker run -it --rm nvidia/cuda:12.4.1-devel-ubuntu22.04
root@b40146dae1b8:/# apt list --installed |  grep nccl

WARNING: apt does not have a stable CLI interface. Use with caution in scripts.

libnccl-dev/now 2.21.5-1+cuda12.4 amd64 [installed,local]
libnccl2/now 2.21.5-1+cuda12.4 amd64 [installed,local]

Contributor Author

nccl is already installed as you can see

docker run -it --rm nvidia/cuda:12.4.1-devel-ubuntu22.04
root@b40146dae1b8:/# apt list --installed |  grep nccl

WARNING: apt does not have a stable CLI interface. Use with caution in scripts.

libnccl-dev/now 2.21.5-1+cuda12.4 amd64 [installed,local]
libnccl2/now 2.21.5-1+cuda12.4 amd64 [installed,local]

If we don't reinstall nccl, some bugs will happen. I have added these environment variables:

RUN export PATH=/usr/local/cuda/bin${PATH:+:${PATH}} && \
    export LD_LIBRARY_PATH=/usr/local/cuda/lib64${LD_LIBRARY_PATH:+:${LD_LIBRARY_PATH}}

but it doesn't help:

#0 491.3 ../generate.sh: 6: [: unexpected operator
#0 492.7 -- The CXX compiler identification is GNU 11.4.0
#0 496.0 -- The CUDA compiler identification is NVIDIA 12.2.140
#0 496.2 -- Detecting CXX compiler ABI info
#0 496.4 -- Detecting CXX compiler ABI info - done
#0 496.5 -- Check for working CXX compiler: /usr/bin/c++ - skipped
#0 496.5 -- Detecting CXX compile features
#0 496.5 -- Detecting CXX compile features - done
#0 496.5 -- Detecting CUDA compiler ABI info
#0 497.9 -- Detecting CUDA compiler ABI info - done
#0 498.1 -- Check for working CUDA compiler: /usr/local/cuda/bin/nvcc - skipped
#0 498.1 -- Detecting CUDA compile features
#0 498.1 -- Detecting CUDA compile features - done
#0 498.1 CMake Warning (dev) at CMakeLists.txt:18 (find_package):
#0 498.1   Policy CMP0146 is not set: The FindCUDA module is removed.  Run "cmake
#0 498.1   --help-policy CMP0146" for policy details.  Use the cmake_policy command to
#0 498.1   set the policy and suppress this warning.
#0 498.1 
#0 498.1 This warning is for project developers.  Use -Wno-dev to suppress it.
#0 498.1 
#0 498.1 -- Performing Test CMAKE_HAVE_LIBC_PTHREAD
#0 498.3 -- Performing Test CMAKE_HAVE_LIBC_PTHREAD - Success
#0 498.3 -- Found Threads: TRUE
#0 498.3 -- Found CUDA: /usr/local/cuda (found suitable version "12.2", minimum required is "10.2")
#0 498.4 CUDA_VERSION 12.2 is greater or equal than 11.0, enable -DENABLE_BF16 flag
#0 498.4 -- Add DBUILD_MULTI_GPU, requires MPI and NCCL
#0 499.0 -- Found MPI_CXX: /usr/local/openmpi/lib/libmpi.so (found version "3.1")
#0 499.0 -- Found MPI: TRUE (found version "3.1")
#0 499.0 CMake Error at /opt/py3/lib/python3.10/site-packages/cmake/data/share/cmake-3.30/Modules/FindPackageHandleStandardArgs.cmake:233 (message):
#0 499.0   Could NOT find NCCL (missing: NCCL_INCLUDE_DIRS NCCL_LIBRARIES)
#0 499.0 Call Stack (most recent call first):
#0 499.0   /opt/py3/lib/python3.10/site-packages/cmake/data/share/cmake-3.30/Modules/FindPackageHandleStandardArgs.cmake:603 (_FPHSA_FAILURE_MESSAGE)
#0 499.0   cmake/Modules/FindNCCL.cmake:126 (find_package_handle_standard_args)
#0 499.0   CMakeLists.txt:87 (find_package)
#0 499.0 
#0 499.0 
#0 499.0 -- Configuring incomplete, errors occurred!

Collaborator

@ColorfulDick
Maybe you can cp the headers and libs into the CUDA directory as a workaround. Anyway, we cannot reproduce this, and we may need to remove the libnccl installation from the Dockerfile.

cp /usr/include/nccl.h /usr/local/cuda/include/nccl.h
cp /usr/lib/x86_64-linux-gnu/libnccl* /usr/local/cuda/lib64/

Collaborator

Following @RunningLeon's instructions, nccl is indeed already installed:

/usr/lib/x86_64-linux-gnu/libnccl.so
/usr/lib/x86_64-linux-gnu/libnccl_static.a
/usr/lib/x86_64-linux-gnu/libnccl.so.2
/usr/lib/x86_64-linux-gnu/libnccl.so.2.21.5

And find_package(NCCL) succeeds.

I also pulled docker.io/nvidia/cuda:12.2.2-devel-ubuntu22.04, which also contains NCCL

/usr/lib/x86_64-linux-gnu/libnccl.so
/usr/lib/x86_64-linux-gnu/libnccl_static.a
/usr/lib/x86_64-linux-gnu/libnccl.so.2
/usr/lib/x86_64-linux-gnu/libnccl.so.2.19.3

Contributor Author

@ColorfulDick maybe you can cp the headers and libs into the CUDA directory as a workaround. Anyway, we cannot reproduce this, and we may need to remove the libnccl installation from the Dockerfile.

cp /usr/include/nccl.h /usr/local/cuda/include/nccl.h
cp /usr/lib/x86_64-linux-gnu/libnccl* /usr/local/cuda/lib64/

Now I know what happened: some tags of the nvidia/cuda image never included nccl. For example:

docker run nvidia/cuda:12.2.0-devel-ubuntu22.04 find / -name "nccl.h"

finds nothing,
but when I run the 12.4.1-devel-ubuntu22.04 tag:

docker run nvidia/cuda:12.4.1-devel-ubuntu22.04 find / -name "nccl.h"

it returns:

/usr/include/nccl.h

So I suggest keeping the nccl installation in the Dockerfile, because we don't know whether future versions of the nvidia/cuda image will include nccl.

Collaborator

I don't think that's a good idea.
On one hand, it might cause a version mismatch for the docker images that have already pre-installed NCCL.
On the other, it burdens our maintenance.

If users' cuda version is lower than 12.4, we suggest they use the cu11 docker image.
If users decide to change the base image, like cu12.2.0, I believe they can maintain it on their own.

Contributor Author

I don't think that's a good idea. On one hand, it might cause a version mismatch for the docker images that have already pre-installed NCCL. On the other, it burdens our maintenance.

If users' cuda version is lower than 12.4, we suggest they use the cu11 docker image. If users decide to change the base image, like cu12.2.0, I believe they can maintain it on their own

I have removed the nccl installation from the Dockerfile.

But I suggest changing the base image tag to nvidia/cuda:12.0.0-devel-ubuntu22.04, which would support all CUDA 12 subversions on the host machine.

Some pip packages, such as flash-attn, depend on a specific version of the CUDA runtime. If a flash-attn wheel is compiled in a cu11 image, it links against libcudart.so.11.0; if the host machine runs CUDA 12 and therefore only has libcudart.so.12.0, flash-attn will not work.

All in all, if the image's CUDA toolkit version is 11, some packages will not work on a host machine that has CUDA 12 installed.
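The soname mismatch described above can be inspected with ldd, which lists the exact shared-library names a compiled binary requires. For flash-attn you would point ldd at its compiled extension under site-packages and look for a libcudart entry (that path varies per install, so this sketch demonstrates on /bin/ls, which exists on every Linux system):

```shell
# Sketch: ldd prints the sonames a binary was linked against. A flash-attn
# extension built in a cu11 image would list libcudart.so.11.0 here, which a
# CUDA-12-only host cannot provide, so the import fails at load time.
ldd /bin/ls | grep -o 'libc\.so\.[0-9]*' | head -n1
```

For a real check, replace /bin/ls with the extension module, e.g. `ldd .../site-packages/flash_attn_2_cuda*.so | grep libcudart` (path illustrative, not from this PR).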

ColorfulDick and others added 2 commits August 16, 2024 18:00
remove nccl installation since it is already in the docker image
Collaborator

@RunningLeon RunningLeon left a comment

LGTM

@lvhan028 lvhan028 merged commit 26c00ab into InternLM:main Aug 19, 2024
5 checks passed
3 participants