Skip to content

Commit

Permalink
Always use tinyBLAS with AMD GPUs on Windows
Browse files Browse the repository at this point in the history
When llamafile uses hipBLAS with ROCm SDK 5.7.1 on Windows10 the process
crashes shortly after tokens start getting printed. This is possibly the
worst heisenbug I've ever seen in my career. It seems to to crash in AMD
code, in a separate thread, inside hipGraphicsUnregisterResource, when a
vqmovdqu instruction is being executed. While this happens, cosmo's main
thread is usually doing something like std::string and std::locale stuff
which appears unrelated. Could possibly be related to C++ exceptions and
thread-local storage. Using --tinyblas appears to make it go away, but I
can't say for certain it has anything to do with hipBLAS, since it might
simply not manifest itself, because the binary footprint, stack, or heap
memory layout changed. Let's keep our fingers crossed that tinyBLAS will
save us from this issue. Note also that no one else has reported the bug
even though it's been impacting me for months.
  • Loading branch information
jart committed Jun 22, 2024
1 parent a28250b commit 60404a8
Show file tree
Hide file tree
Showing 2 changed files with 16 additions and 7 deletions.
6 changes: 4 additions & 2 deletions llama.cpp/ggml-cuda.cu
Original file line number Diff line number Diff line change
Expand Up @@ -10884,8 +10884,8 @@ static ggml_cuda_device_info ggml_cuda_init() {
// Workaround for a rocBLAS bug when using multiple graphics cards:
// https://github.com/ROCmSoftwarePlatform/rocBLAS/issues/1346
#ifndef GGML_USE_TINYBLAS
rocblas_initialize();
CUDA_CHECK(cudaDeviceSynchronize());
// rocblas_initialize(); // already called
// CUDA_CHECK(cudaDeviceSynchronize());
#endif
#endif

Expand Down Expand Up @@ -13507,7 +13507,9 @@ GGML_CALL static enum ggml_status ggml_backend_cuda_graph_compute(ggml_backend_t
GGML_ASSERT(stat == cudaSuccess);
}
// Launch graph
printf("cudaGraphLaunch begin\n");
CUDA_CHECK(cudaGraphLaunch(cuda_ctx->cuda_graph->instance, cuda_ctx->stream()));
printf("cudaGraphLaunch done\n");
#else
graph_evaluated_or_captured = true;
#endif // USE_CUDA_GRAPH
Expand Down
17 changes: 12 additions & 5 deletions llamafile/cuda.c
Original file line number Diff line number Diff line change
Expand Up @@ -559,7 +559,14 @@ static bool compile_amd_windows(const char *clangxx, const char *dso, const char
(char *)offload_arch,
"-Wno-ignored-attributes",
"-D_CRT_SECURE_NO_WARNINGS",
COMMON_FLAGS,
"-DGGML_BUILD=1",
"-DGGML_SHARED=1",
"-DGGML_MULTIPLATFORM",
"-DGGML_CUDA_DMMV_X=32",
"-DK_QUANTS_PER_ITERATION=2",
"-DGGML_CUDA_PEER_MAX_BATCH_SIZE=128",
"-DGGML_CUDA_MMV_Y=1",
"-DGGML_USE_TINYBLAS",
"-o",
(char *)tmpdso,
(char *)src,
Expand All @@ -571,10 +578,10 @@ static bool compile_amd_windows(const char *clangxx, const char *dso, const char
"-amdgpu-early-inline-all=true",
"-isystem",
gc(xasprintf("%s/include", hip_path)),
BLAS_ONLY("-l"),
BLAS_ONLY(gc(xasprintf("%s/lib/hipblas.%s", hip_path, lib))),
BLAS_ONLY("-l"),
BLAS_ONLY(gc(xasprintf("%s/lib/rocblas.%s", hip_path, lib))),
/* BLAS_ONLY("-l"), */
/* BLAS_ONLY(gc(xasprintf("%s/lib/hipblas.%s", hip_path, lib))), */
/* BLAS_ONLY("-l"), */
/* BLAS_ONLY(gc(xasprintf("%s/lib/rocblas.%s", hip_path, lib))), */
"-l",
gc(xasprintf("%s/lib/amdhip64.%s", hip_path, lib)),
"-lkernel32",
Expand Down

0 comments on commit 60404a8

Please sign in to comment.