rocm docker github action build failed #2408

Open
KagurazakaNyaa opened this issue Jun 14, 2024 · 14 comments

@KagurazakaNyaa (Contributor)

Describe the bug
The "Create and publish docker image" action run failed.

https://github.com/TabbyML/tabby/actions/runs/9506018585
release-docker (rocm)
The hosted runner: GitHub Actions 15 lost communication with the server. Anything in your workflow that terminates the runner process, starves it for CPU/Memory, or blocks its network access can cause this error.

Additional context
In PR #2043, I attempted to update the action versions. In my fork the image builds normally; however, after merging, the ROCm docker image still fails to build. I recommend checking whether a self-hosted Actions runner has been configured incorrectly.

@wsxiaoys (Member)

Adding a revert PR to help isolate the problem:

#2409

@KagurazakaNyaa (Contributor, Author)

Adding a revert PR to help isolate the problem:

#2409

Reverting PR #2043 does not isolate this issue, because the issue existed before PR #2043. If anything, PR #2043 confirms that the problem lies in configuration outside the code repository.

@wsxiaoys (Member)

Right - I just created a branch without #2403 to check the latest successful ROCm image build and compare.

@KagurazakaNyaa (Contributor, Author)

The same action workflow, but without pushing the image, runs normally at https://github.com/KagurazakaNyaa/tabby/actions/runs/9496954836. That fork uses GitHub's default runner instead of the self-hosted runner.
From the error message in this issue, it seems to be a problem with the action runner rather than the workflow.
Is this repository using a GitHub-hosted runner or a self-hosted runner?

@rudiservo commented Jun 18, 2024

Also, the ROCm version (5.7.1) is somewhat outdated. While it is still compatible with older cards, version 6.1.2 is out and brings massive improvements for newer cards; I don't know how much this affects model performance.

I tried compiling the 0.12.0 tag and I get this error with my registry; I also tried locally with this command:
command: serve --model /data/models/rudiservo/StarCoder2-15b-Instruct-v0.1-Q8 --device rocm --no-webserver

tabby_1  | The application panicked (crashed).
tabby_1  | Message:  Invalid model_id <TabbyML/Nomic-Embed-Text>
tabby_1  | Location: crates/tabby-common/src/registry.rs:108
tabby_1  | 
tabby_1  |   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ BACKTRACE ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
tabby_1  |                                 ⋮ 7 frames hidden ⋮                               
tabby_1  |    8: tabby_common::registry::ModelRegistry::get_model_info::h4cf4522936634953
tabby_1  |       at <unknown source file>:<unknown line>
tabby_1  |    9: tabby_download::download_model::{{closure}}::h8da4574c84d31459
tabby_1  |       at <unknown source file>:<unknown line>
tabby_1  |   10: tabby::services::model::download_model_if_needed::{{closure}}::h88e90df5ccbc9220
tabby_1  |       at <unknown source file>:<unknown line>
tabby_1  |   11: tabby::serve::main::{{closure}}::h895907983720205f
tabby_1  |       at <unknown source file>:<unknown line>
tabby_1  |   12: tokio::runtime::park::CachedParkThread::block_on::h69f0496402a974e5
tabby_1  |       at <unknown source file>:<unknown line>
tabby_1  |   13: tabby::main::h244e2d137a039971
tabby_1  |       at <unknown source file>:<unknown line>
tabby_1  |   14: std::sys_common::backtrace::__rust_begin_short_backtrace::h37fe2660d85af9e6
tabby_1  |       at <unknown source file>:<unknown line>
tabby_1  |   15: std::rt::lang_start::{{closure}}::hfc465164803e6038
tabby_1  |       at <unknown source file>:<unknown line>
tabby_1  |   16: std::rt::lang_start_internal::h3ed4fe7b2f419135
tabby_1  |       at <unknown source file>:<unknown line>
tabby_1  |   17: main<unknown>
tabby_1  |       at <unknown source file>:<unknown line>
tabby_1  |   18: __libc_start_call_main<unknown>
tabby_1  |       at ./csu/../sysdeps/nptl/libc_start_call_main.h:58
tabby_1  |   19: __libc_start_main_impl<unknown>
tabby_1  |       at ./csu/../csu/libc-start.c:392
tabby_1  |   20: _start<unknown>
tabby_1  |       at <unknown source file>:<unknown line>
tabby_1  | 

@rudiservo

The ROCm docker image is still not building; the latest published version is 0.11.

@wsxiaoys (Member) commented Jul 1, 2024

Hi - we turned off the ROCm build as our GitHub Actions runner is not able to complete it. As an alternative, I recommend using the Vulkan backend for AMD GPU deployments.

@rudiservo

@wsxiaoys well, llama.cpp's ROCm docker builds are also failing, but the Metal ones are OK.
I am going to try to fix the llama.cpp build, then check whether you have a similar issue or something I can quickly fix.

@rudiservo

@wsxiaoys so I figured out one part, but I am kind of hitting a wall; maybe some config is missing?

In build.rs you need to change this:
config.define("LLAMA_HIPBLAS", "ON");
to
config.define("GGML_HIPBLAS", "ON");

and add this for compatibility with ROCm and to future-proof for 6.1.2:

config.define( "CMAKE_HIP_COMPILER", format!("{}/llvm/bin/clang++", rocm_root), ); config.define( "CMAKE_HIP_COMPILER", format!("{}/llvm/bin/clang++", rocm_root), ); config.define( "HIPCXX", format!("{}/llvm/bin/clang", rocm_root), ); config.define( "HIP_PATH", format!("{}", rocm_root), );

but now I get this error
WARN llama_cpp_server::supervisor: crates/llama-cpp-server/src/supervisor.rs:111: <embedding>: /opt/tabby/bin/llama-server: error while loading shared libraries: libomp.so: cannot open shared object file: No such file or directory

I can't figure out why llama-server can't access libomp.so.

I managed to build llama.cpp's llama-server with the ROCm 5.7.1 and 6.1.2 docker images, and it runs great.

Everything was tested today with tabby's master branch.

Any pointers why this happens?
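
For reference, the defines above would sit inside a cmake-crate-driven build script roughly like this. This is just a sketch with assumed paths and layout (llama.cpp checkout location, ROCM_PATH fallback), not tabby's actual build.rs:

// Sketch only: drive llama.cpp's CMake build with the HIP-related defines
// discussed above. Paths and defaults are assumptions, not tabby's code.
use cmake::Config;

fn main() {
    // Assume ROCm lives in /opt/rocm unless ROCM_PATH says otherwise.
    let rocm_root = std::env::var("ROCM_PATH").unwrap_or_else(|_| "/opt/rocm".to_string());

    let mut config = Config::new("llama.cpp");
    // llama.cpp renamed LLAMA_HIPBLAS to GGML_HIPBLAS.
    config.define("GGML_HIPBLAS", "ON");
    // Point CMake's HIP support at ROCm's bundled clang toolchain.
    config.define("CMAKE_HIP_COMPILER", format!("{}/llvm/bin/clang++", rocm_root));
    config.define("HIPCXX", format!("{}/llvm/bin/clang", rocm_root));
    config.define("HIP_PATH", &rocm_root);
    config.build();
}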

@JayPi4c commented Jul 12, 2024

I tried to get v0.13.1 working with an AMD GPU and came across the very same warning as @rudiservo (WARN llama_cpp_server::supervisor: crates/llama-cpp-server/src/supervisor.rs:111: <embedding>: /opt/tabby/bin/llama-server: error while loading shared libraries: libomp.so: cannot open shared object file: No such file or directory).
I had a look inside the container and, like llama_cpp_server, I could not find libomp.so or anything related such as libomp.so.5. So I tried adding libomp-dev to the packages installed in the runtime image, which installs /usr/lib/x86_64-linux-gnu/libomp.so.5 (among other things). Creating a symlink then does in fact solve the problem, and I was able to run tabby v0.13.1 built from Dockerfile.rocm.
So again, what it comes down to is this part in the runtime image:

RUN apt-get update && \
    apt-get install -y --no-install-recommends \
    git \
    curl \
    openssh-client \
    ca-certificates \
    libssl3 \
    rocblas \
    hipblas \
    libgomp1 \
    # add the package that provides libomp.so
    libomp-dev \
    && \
    apt-get clean && \
    rm -rf /var/lib/apt/lists/* && \
    #  create the symlink
    ln -s /usr/lib/x86_64-linux-gnu/libomp.so.5 /usr/lib/x86_64-linux-gnu/libomp.so

I want to stress that I have no experience with any of this. I'm just tinkering around to get tabby with AMD working on my machine, so I don't know whether libomp-dev is the appropriate package for this problem, or whether other packages already install libomp.so and it simply isn't on the library search path or wherever it needs to be defined. It also feels wrong to have to create the symlink manually.

@rudiservo

@JayPi4c That is strange; libomp exists under /opt/rocm.
With my llama.cpp docker image (not the tabby one) it works fine, but it was built with make, not cmake... maybe there is a cmake option that does not pick up /opt/rocm.

root@3a7c21116e01:/app# find -L /opt -name "libomp.so"
/opt/rocm/lib/llvm/lib/libomp.so
/opt/rocm/lib/llvm/lib-debug/libomp.so
/opt/rocm/llvm/lib/libomp.so
/opt/rocm/llvm/lib-debug/libomp.so
/opt/rocm-6.1.2/lib/llvm/lib/libomp.so
/opt/rocm-6.1.2/lib/llvm/lib-debug/libomp.so
/opt/rocm-6.1.2/llvm/lib/libomp.so
/opt/rocm-6.1.2/llvm/lib-debug/libomp.so

@JayPi4c commented Jul 12, 2024

Thanks! I did not know about /opt/rocm.

Currently PATH looks like this:

root@d17bd20c90d1:/# echo $PATH
/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/opt/tabby/bin

So there is no reference to /opt/rocm. But there is no reference to /usr/lib/x86_64-linux-gnu either, and libomp.so is still picked up from there. I also quickly checked, and simply adding /opt/rocm/lib/llvm/lib to PATH does not solve the problem. So I guess there needs to be some other configuration to point llama-server to the correct location of libomp.so.
Sadly I don't know C++, Rust, or their build tools, so I don't know where to put that reference. But I found documentation on how to use ROCm with CMake, and it mentions CMAKE_PREFIX_PATH. Again, though, I don't know what to do with this information.

@rudiservo

@JayPi4c well, there is this in llama.cpp's Makefile:

	MK_LDFLAGS += -L$(ROCM_PATH)/lib -Wl,-rpath=$(ROCM_PATH)/lib
	MK_LDFLAGS += -L$(ROCM_PATH)/lib64 -Wl,-rpath=$(ROCM_PATH)/lib64
	MK_LDFLAGS += -lhipblas -lamdhip64 -lrocblas

I can understand Makefiles; with CMake I will admit my ignorance.

I do not know whether these flags are even passed to the compiler with CMake.

In the Makefile, MK_LDFLAGS is merged into LDFLAGS:

override LDFLAGS := $(MK_LDFLAGS) $(LDFLAGS)

and for llama-server it is passed via:

$(CXX) $(CXXFLAGS) $(filter-out %.h %.hpp $<,$^) -Iexamples/server $(call GET_OBJ_FILE, $<) -o $@ $(LDFLAGS) $(LWINSOCK2)

Well, time to learn CMake and figure out what is going on.
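
If the same flags need to reach CMake, one translation might be CMAKE_EXE_LINKER_FLAGS, passed through the same cmake-crate defines as above. A sketch only, assuming the stock /opt/rocm layout and that the build is driven by cmake::Config; untested:

// Sketch only: reproduce the Makefile's -L/-rpath linker flags through CMake
// so llama-server links against ROCm's libraries and can resolve libomp.so
// (under <rocm>/llvm/lib) at runtime. Layout and paths are assumptions.
use cmake::Config;

fn main() {
    let rocm_root = std::env::var("ROCM_PATH").unwrap_or_else(|_| "/opt/rocm".to_string());

    let mut config = Config::new("llama.cpp");
    config.define("GGML_HIPBLAS", "ON");
    config.define(
        "CMAKE_EXE_LINKER_FLAGS",
        format!(
            "-L{0}/lib -L{0}/llvm/lib -Wl,-rpath,{0}/lib:{0}/llvm/lib",
            rocm_root
        ),
    );
    config.build();
}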

@rudiservo

I think I might have found the issue.

Going to try compiling llama.cpp with CMake to test.

I think that in llama.cpp's cmake/llama-config.cmake.in, the GGML_HIPBLAS variable has a find_package but does not add the ROCm path as an add_library.

I will refer to the issue I opened in llama.cpp:
ggerganov/llama.cpp#8213
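
Until that is fixed upstream, a possible workaround on tabby's side might be to hand CMake the ROCm prefix explicitly: CMAKE_PREFIX_PATH (the variable @JayPi4c found in the ROCm docs) is what find_package() consults when locating the hip/rocblas/hipblas configs. A sketch only, again assuming the build goes through the cmake crate as in the earlier snippets; untested:

// Sketch only: point find_package() at the ROCm install so the hipblas and
// rocblas packages are resolved from /opt/rocm. Structure is assumed.
use cmake::Config;

fn main() {
    let rocm_root = std::env::var("ROCM_PATH").unwrap_or_else(|_| "/opt/rocm".to_string());

    let mut config = Config::new("llama.cpp");
    config.define("GGML_HIPBLAS", "ON");
    config.define("CMAKE_PREFIX_PATH", &rocm_root);
    config.build();
}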
