Fix missing dependencies in the Dockerfile and pip requirements #2240
Conversation
requirements/serve.txt
Outdated
@@ -1,3 +1,6 @@
 gradio
 protobuf
 tritonclient[grpc]
+timm
Please don't add them to serve.txt.
They are dependencies of VLM models, not LLMs. Moreover, they are not even common dependencies among VLM models.
requirements/serve.txt
Outdated
@@ -1,3 +1,6 @@
 gradio
 protobuf
 tritonclient[grpc]
+timm
+flash-attn
+openai
We can accept openai
Let's put openai in runtime.txt.
@ColorfulDick this comment is not resolved.
docker-compose.yml
Outdated
services:
  lmdeploy:
    container_name: lmdeploy
    image: openmmlab/lmdeploy-builder:cuda12.2
openmmlab/lmdeploy-builder is ONLY for building whl packages. It is not an image for deploying a model.
I will change "openmmlab/lmdeploy-builder:cuda12.2" to "openmmlab/lmdeploy:latest".
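For reference, a minimal sketch of what the updated docker-compose.yml service could look like with the deploy image. The port, volume path, and model path are illustrative assumptions, not taken from this PR; `23333` is lmdeploy's default API server port.

```yaml
services:
  lmdeploy:
    container_name: lmdeploy
    image: openmmlab/lmdeploy:latest      # deploy image, not the -builder image
    ports:
      - "23333:23333"                     # lmdeploy's default api_server port
    volumes:
      - ./models:/models                  # hypothetical host model directory
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]
    command: lmdeploy serve api_server /models/my-model --server-port 23333
```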
requirements/runtime.txt
Outdated
@@ -18,3 +18,4 @@ torchvision<=0.18.1,>=0.15.0
 transformers
 triton>=2.1.0,<=2.3.1; sys_platform == "linux"
 uvicorn
+flash-attn
Not acceptable. Please remove it.
runtime.txt is restricted to dependencies for LLM inference.
…the minimum supported CUDA version is 12.2; if the NVIDIA driver version on the host machine is higher than the image's, torch in the image will not work.
docker/InternVL_Dockerfile
Outdated
@@ -0,0 +1,64 @@
+ARG CUDA_VERSION=cu12
+
+FROM nvidia/cuda:12.2.0-devel-ubuntu22.04 AS cu12
Please use openmmlab/lmdeploy:latest-cu11 and openmmlab/lmdeploy:latest-cu12, respectively, so that we don't need to build lmdeploy and its dependencies any more.
I have changed the base image in InternVL_Dockerfile.
docker/Dockerfile
Outdated
@@ -46,7 +54,8 @@ RUN cd /opt/lmdeploy &&\
     ninja -j$(nproc) && ninja install &&\
     cd .. &&\
     python3 -m pip install -e . &&\
-    rm -rf build
+    rm -rf build &&\
+    rm -rf ~/.cache
Why remove "~/.cache" again?
To remove the pip cache and reduce the image size.
Can't we keep only one "rm -rf ~/.cache"?
Because of Docker's union file system, every image layer (each RUN instruction or file change in the Dockerfile) is independent; changes in a later layer do not physically affect the files of an earlier layer. If I only execute "rm -rf ~/.cache" in the fifth RUN, it will not delete the cache files written by the second RUN.
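The usual way to avoid this layering issue is to run the cleanup inside the same RUN instruction that creates the cache, so the cache never lands in any layer at all. A minimal sketch (the package name is illustrative):

```dockerfile
# Each RUN produces its own immutable layer, so clean up within the same RUN.
RUN python3 -m pip install some-package && \
    rm -rf ~/.cache
# A later, separate "RUN rm -rf ~/.cache" cannot shrink the layer above:
# the deleted files are merely masked in the new layer, and the image size
# stays the same.
```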
merge latest main and minor fix
docker/Dockerfile
Outdated
RUN wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64/cuda-keyring_1.0-1_all.deb && \
    dpkg -i cuda-keyring_1.0-1_all.deb && \
    apt update -y && \
    apt install libnccl2 libnccl-dev
nccl is already installed as you can see
docker run -it --rm nvidia/cuda:12.4.1-devel-ubuntu22.04
root@b40146dae1b8:/# apt list --installed | grep nccl
WARNING: apt does not have a stable CLI interface. Use with caution in scripts.
libnccl-dev/now 2.21.5-1+cuda12.4 amd64 [installed,local]
libnccl2/now 2.21.5-1+cuda12.4 amd64 [installed,local]
If we don't reinstall nccl, some bugs will happen. I have added environment variables:
RUN export PATH=/usr/local/cuda/bin${PATH:+:${PATH}} && \
    export LD_LIBRARY_PATH=/usr/local/cuda/lib64${LD_LIBRARY_PATH:+:${LD_LIBRARY_PATH}}
but they aren't useful:
#0 491.3 ../generate.sh: 6: [: unexpected operator
#0 492.7 -- The CXX compiler identification is GNU 11.4.0
#0 496.0 -- The CUDA compiler identification is NVIDIA 12.2.140
#0 496.2 -- Detecting CXX compiler ABI info
#0 496.4 -- Detecting CXX compiler ABI info - done
#0 496.5 -- Check for working CXX compiler: /usr/bin/c++ - skipped
#0 496.5 -- Detecting CXX compile features
#0 496.5 -- Detecting CXX compile features - done
#0 496.5 -- Detecting CUDA compiler ABI info
#0 497.9 -- Detecting CUDA compiler ABI info - done
#0 498.1 -- Check for working CUDA compiler: /usr/local/cuda/bin/nvcc - skipped
#0 498.1 -- Detecting CUDA compile features
#0 498.1 -- Detecting CUDA compile features - done
#0 498.1 CMake Warning (dev) at CMakeLists.txt:18 (find_package):
#0 498.1 Policy CMP0146 is not set: The FindCUDA module is removed. Run "cmake
#0 498.1 --help-policy CMP0146" for policy details. Use the cmake_policy command to
#0 498.1 set the policy and suppress this warning.
#0 498.1
#0 498.1 This warning is for project developers. Use -Wno-dev to suppress it.
#0 498.1
#0 498.1 -- Performing Test CMAKE_HAVE_LIBC_PTHREAD
#0 498.3 -- Performing Test CMAKE_HAVE_LIBC_PTHREAD - Success
#0 498.3 -- Found Threads: TRUE
#0 498.3 -- Found CUDA: /usr/local/cuda (found suitable version "12.2", minimum required is "10.2")
#0 498.4 CUDA_VERSION 12.2 is greater or equal than 11.0, enable -DENABLE_BF16 flag
#0 498.4 -- Add DBUILD_MULTI_GPU, requires MPI and NCCL
#0 499.0 -- Found MPI_CXX: /usr/local/openmpi/lib/libmpi.so (found version "3.1")
#0 499.0 -- Found MPI: TRUE (found version "3.1")
#0 499.0 CMake Error at /opt/py3/lib/python3.10/site-packages/cmake/data/share/cmake-3.30/Modules/FindPackageHandleStandardArgs.cmake:233 (message):
#0 499.0 Could NOT find NCCL (missing: NCCL_INCLUDE_DIRS NCCL_LIBRARIES)
#0 499.0 Call Stack (most recent call first):
#0 499.0 /opt/py3/lib/python3.10/site-packages/cmake/data/share/cmake-3.30/Modules/FindPackageHandleStandardArgs.cmake:603 (_FPHSA_FAILURE_MESSAGE)
#0 499.0 cmake/Modules/FindNCCL.cmake:126 (find_package_handle_standard_args)
#0 499.0 CMakeLists.txt:87 (find_package)
#0 499.0
#0 499.0
#0 499.0 -- Configuring incomplete, errors occurred!
@ColorfulDick maybe you can cp the headers and libs into the CUDA directory as a workaround. Anyway, we cannot reproduce this, and we may need to remove installing libnccl in the Dockerfile:
cp /usr/include/nccl.h /usr/local/cuda/include/nccl.h
cp /usr/lib/x86_64-linux-gnu/libnccl* /usr/local/cuda/lib64/
Following @RunningLeon's instructions, nccl is already installed:
/usr/lib/x86_64-linux-gnu/libnccl.so
/usr/lib/x86_64-linux-gnu/libnccl_static.a
/usr/lib/x86_64-linux-gnu/libnccl.so.2
/usr/lib/x86_64-linux-gnu/libnccl.so.2.21.5
And find_package(NCCL) can succeed.
I also pulled docker.io/nvidia/cuda:12.2.2-devel-ubuntu22.04, which also contains NCCL:
/usr/lib/x86_64-linux-gnu/libnccl.so
/usr/lib/x86_64-linux-gnu/libnccl_static.a
/usr/lib/x86_64-linux-gnu/libnccl.so.2
/usr/lib/x86_64-linux-gnu/libnccl.so.2.19.3
I know what happened now: some tags of the nvidia/cuda image have never had nccl. For example:
docker run nvidia/cuda:12.2.0-devel-ubuntu22.04 find / -name "nccl.h"
finds nothing, but when I run tag 12.4.1-devel-ubuntu22.04:
docker run nvidia/cuda:12.4.1-devel-ubuntu22.04 find / -name "nccl.h"
it returns:
/usr/include/nccl.h
So I suggest adding the nccl installation to the Dockerfile, because we don't know whether future versions of the nvidia/cuda image will include nccl.
I don't think that's a good idea.
On one hand, it might cause a version mismatch for the docker images that have already pre-installed NCCL.
On the other, it burdens our maintenance.
If users' CUDA version is lower than 12.4, we suggest they use the cu11 docker image.
If users decide to change the base image, e.g. to cu12.2.0, I believe they can maintain it on their own.
I removed the nccl installation from the Dockerfile.
But I suggest changing the base image tag to nvidia/cuda:12.0.0-devel-ubuntu22.04, which will support all CUDA 12 sub-versions on the host machine.
Some pip packages like flash-attn depend on a CUDA runtime of a specific version: if a flash-attn wheel was compiled in a CUDA 11 image, it will depend on libcudart.so.11.0 at runtime, and if the host machine has CUDA 12 and therefore only libcudart.so.12.0, flash-attn will not work.
All in all, if the image's cudatoolkit version is CUDA 11, some packages will not work on a host machine that has CUDA 12 installed.
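The mismatch described above can be checked mechanically by comparing the CUDA runtime soname a wheel links against with the one available on the host. A minimal shell sketch; the two soname values are illustrative examples (in practice the first would come from something like `ldd` on the wheel's compiled extension, the second from listing `/usr/local/cuda/lib64`):

```shell
# Example sonames; replace with values observed on a real system.
linked="libcudart.so.11.0"   # runtime the wheel was compiled against
host="libcudart.so.12.0"     # runtime present on the host

# Extract the major version from each soname, e.g. "libcudart.so.11.0" -> "11".
linked_major=${linked#libcudart.so.}; linked_major=${linked_major%%.*}
host_major=${host#libcudart.so.};     host_major=${host_major%%.*}

if [ "$linked_major" != "$host_major" ]; then
    echo "mismatch: wheel needs CUDA ${linked_major}.x, host provides CUDA ${host_major}.x"
fi
```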
remove nccl installation since it is already in the docker image
LGTM
Motivation
Fix the issue of the Dockerfile missing flash-attn, timm, and nccl.