Description
Prerequisites
Current git llama.cpp with Python bindings (llama-cpp-python).
Expected Behavior
Inference works as before.
Current Behavior
Inference fails with a CUDA error and llama.cpp crashes.
Environment and Context
Python 3.10 / CUDA 11.8
Failure Information (for bugs)
llm_load_print_meta: BOS token = 1 '<s>'
llm_load_print_meta: EOS token = 2 '</s>'
llm_load_print_meta: UNK token = 0 '<unk>'
llm_load_print_meta: LF token = 13 '<0x0A>'
llm_load_tensors: ggml ctx size = 0.26 MB
llm_load_tensors: using CUDA for GPU acceleration
ggml_cuda_set_main_device: using device 0 (NVIDIA GeForce RTX 3090) as main device
llm_load_tensors: mem required = 140.89 MB
llm_load_tensors: offloading 80 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloaded 83/83 layers to GPU
llm_load_tensors: VRAM used: 39362.61 MB
....................................................................................................
llama_new_context_with_model: n_ctx = 4096
llama_new_context_with_model: freq_base = 10000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init: offloading v cache to GPU
llama_kv_cache_init: offloading k cache to GPU
llama_kv_cache_init: VRAM kv self = 1280.00 MB
llama_new_context_with_model: kv self size = 1280.00 MB
llama_build_graph: non-view tensors processed: 1844/1844
llama_new_context_with_model: compute buffer total size = 574.63 MB
llama_new_context_with_model: VRAM scratch buffer: 568.00 MB
llama_new_context_with_model: total VRAM used: 41210.61 MB (model: 39362.61 MB, context: 1848.00 MB)
AVX = 1 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 |
2023-11-02 17:16:43 INFO:Loaded the model in 37.54 seconds.
Enabled NVLINK P2P 0->1
Enabled NVLINK P2P 1->0
CUDA error 1 at /home/supermicro/ai/llama-cpp-python-gguf-cuda/vendor/llama.cpp/ggml-cuda.cu:7068: invalid argument
current device: 1
Relevant Code
I have added some printf's for NVLink (as you can see in the log), so the line numbers are a little off, but here is the snippet that sets it off.
// copy src0, src1 to device if necessary
if (src1->backend == GGML_BACKEND_GPU && src1_is_contiguous) {
    if (id != g_main_device) {
        if (convert_src1_to_q8_1) {
            char * src1_ddq_i_source = src1_ddq[g_main_device] + src1_ddq_i_offset;
            // ****> this is the CUDA_CHECK that fires
            CUDA_CHECK(cudaMemcpyAsync(src1_ddq_i, src1_ddq_i_source, src1_ncols*src1_padded_col_size*q8_1_ts/q8_1_bs,
                                       cudaMemcpyDeviceToDevice, stream));
        } else {
            float * src1_ddf_i_source = (float *) src1_extra->data_device[g_main_device];
            src1_ddf_i_source += (i0*ne11 + src1_col_0) * ne10;
            CUDA_CHECK(cudaMemcpyAsync(src1_ddf_i, src1_ddf_i_source, src1_ncols*ne10*sizeof(float),
                                       cudaMemcpyDeviceToDevice, stream));
        }
    }
}
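For context, the "Enabled NVLINK P2P" lines in the log come from my local patch. Since I can't paste the real diff from that machine, this is only a rough reconstruction from memory and the details may differ; it just enables peer access between the cards and prints those messages:

    // rough sketch of my local NVLink patch, NOT the exact code: enable peer
    // access between every pair of devices and print the messages seen above
    for (int id = 0; id < g_device_count; ++id) {
        for (int id_other = 0; id_other < g_device_count; ++id_other) {
            if (id == id_other) {
                continue;
            }
            int can_access_peer = 0;
            CUDA_CHECK(cudaDeviceCanAccessPeer(&can_access_peer, id, id_other));
            if (can_access_peer) {
                CUDA_CHECK(cudaSetDevice(id));
                cudaError_t err = cudaDeviceEnablePeerAccess(id_other, 0);
                if (err == cudaSuccess || err == cudaErrorPeerAccessAlreadyEnabled) {
                    printf("Enabled NVLINK P2P %d->%d\n", id, id_other);
                }
            }
        }
    }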
One of the arguments to cudaMemcpyAsync is invalid; I haven't checked yet which one. The day before, it was trying to allocate 5 TB of system RAM after loading the model, but subsequent commits fixed that up. Since the code is so new, I waited a little to see if the same would happen with this one, and I can't access GitHub from that machine, so I have to bring the logs over here.
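To narrow down which argument is bad, I plan to dump what CUDA thinks of each pointer right before the call, with a throwaway helper along these lines (my own debugging sketch, not part of ggml-cuda.cu):

    #include <cstdio>
    #include <cuda_runtime.h>

    // throwaway debugging helper (hypothetical, not in llama.cpp): report what
    // kind of memory a pointer is and which device owns it
    static void dump_ptr(const char * label, const void * ptr) {
        cudaPointerAttributes attr;
        cudaError_t err = cudaPointerGetAttributes(&attr, ptr);
        if (err != cudaSuccess) {
            printf("%s: %p -> cudaPointerGetAttributes: %s\n", label, ptr, cudaGetErrorString(err));
            (void) cudaGetLastError(); // clear the sticky error so CUDA_CHECK doesn't trip later
            return;
        }
        // type: 0 = unregistered, 1 = host, 2 = device, 3 = managed
        printf("%s: %p type=%d device=%d\n", label, ptr, (int) attr.type, attr.device);
    }

    // dropped in right before the failing cudaMemcpyAsync:
    //   dump_ptr("src1_ddq_i       ", src1_ddq_i);
    //   dump_ptr("src1_ddq_i_source", src1_ddq_i_source);
    //   printf("copy size = %zu\n", (size_t) (src1_ncols*src1_padded_col_size*q8_1_ts/q8_1_bs));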
It happens with both P40s and 3090s, and it is independent of whether I force MMQ or not.
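For reference, a plain cross-device cudaMemcpyAsync with peer access can be tested outside llama.cpp with something like the following, to see whether simple P2P copies work at all on this box or whether the problem is specific to the arguments llama.cpp passes. This is just a sketch I put together; the device ids and buffer size are arbitrary:

    #include <cstdio>
    #include <cuda_runtime.h>

    #define CHECK(call) do { cudaError_t err_ = (call); if (err_ != cudaSuccess) { \
        fprintf(stderr, "CUDA error %d at %s:%d: %s\n", (int) err_, __FILE__, __LINE__, cudaGetErrorString(err_)); \
        return 1; } } while (0)

    int main() {
        int can01 = 0, can10 = 0;
        CHECK(cudaDeviceCanAccessPeer(&can01, 0, 1));
        CHECK(cudaDeviceCanAccessPeer(&can10, 1, 0));
        printf("peer access 0->1: %d, 1->0: %d\n", can01, can10);

        const size_t nbytes = 1 << 20; // 1 MiB test buffer
        void * buf0 = nullptr;
        void * buf1 = nullptr;

        CHECK(cudaSetDevice(0));
        if (can01) CHECK(cudaDeviceEnablePeerAccess(1, 0));
        CHECK(cudaMalloc(&buf0, nbytes));

        CHECK(cudaSetDevice(1));
        if (can10) CHECK(cudaDeviceEnablePeerAccess(0, 0));
        CHECK(cudaMalloc(&buf1, nbytes));

        cudaStream_t stream;
        CHECK(cudaStreamCreate(&stream));

        // same shape as the failing call: async device-to-device copy across GPUs
        CHECK(cudaMemcpyAsync(buf1, buf0, nbytes, cudaMemcpyDeviceToDevice, stream));
        CHECK(cudaStreamSynchronize(stream));

        printf("cross-device cudaMemcpyAsync OK\n");
        return 0;
    }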