We do not need to do a device-to-host/host-to-device (DTH/HTD) copy of the tensor data every time. This is similar to how PyTorch does it.
In self-attention, the KV cache could still live on the host, but the host launches kernels on the device data based on the cache.
The ggml_tensor provides methods to sync to the operator's device type, which hides the complexity.
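For concreteness, a rough sketch of what such a lazily synced tensor could look like (the names ggml_tensor_ext, data_device and ggml_sync_to_device are made up here for illustration, not ggml's actual layout or API):

#include <stddef.h>
#include <cuda_runtime.h>

// Hypothetical extension of ggml_tensor: host pointer, device pointer,
// and a flag recording which copy currently holds the valid data.
enum backend_kind { BACKEND_HOST, BACKEND_CUDA };

struct ggml_tensor_ext {
    void * data;               // host buffer, as in ggml today
    void * data_device;        // lazily allocated device buffer
    enum backend_kind where;   // which copy is currently valid
    size_t nbytes;
};

// Ensure the tensor is valid on the device before a CUDA operator consumes it;
// if it already is, the HTD copy is skipped entirely.
static void ggml_sync_to_device(struct ggml_tensor_ext * t, cudaStream_t stream) {
    if (t->where == BACKEND_CUDA) {
        return;                // already resident on the device
    }
    if (t->data_device == NULL) {
        cudaMalloc(&t->data_device, t->nbytes);
    }
    cudaMemcpyAsync(t->data_device, t->data, t->nbytes, cudaMemcpyHostToDevice, stream);
    t->where = BACKEND_CUDA;
}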
Unfortunately, lazy sync is not the smartest approach: knowing the full compute graph makes it much easier to identify the sync points, and copy and compute can then be overlapped wherever a sync is required.
example:
if (graph.sync_required(&tensor)) {
    // e.g. a device-to-host (DTH) copy issued on a stream, so it can overlap with compute
    cudaMemcpyAsync(dst, src, nbytes, cudaMemcpyDeviceToHost, stream);
}
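And a rough sketch of the graph-aware version, reusing the hypothetical ggml_tensor_ext / ggml_sync_to_device from above (launch_op is just a stand-in for running a node's CUDA kernel, and each node is simplified to a single input that may need syncing): the HTD copy for node i+1 is issued on a copy stream while node i computes, with an event keeping the compute stream from running ahead of its inputs.

// Placeholder for whatever runs a node's CUDA kernel on the given stream.
void launch_op(struct ggml_tensor_ext * node, cudaStream_t stream);

// Overlap the prefetch of node i+1's input (copy stream) with the compute of node i.
void run_graph(struct ggml_tensor_ext ** nodes, int n_nodes,
               cudaStream_t compute, cudaStream_t copy) {
    cudaEvent_t ready;
    cudaEventCreate(&ready);

    // prefetch the first node's input and mark the copy stream's progress
    ggml_sync_to_device(nodes[0], copy);
    cudaEventRecord(ready, copy);

    for (int i = 0; i < n_nodes; ++i) {
        // node i's kernel must not start before its prefetched input has arrived
        cudaStreamWaitEvent(compute, ready, 0);
        launch_op(nodes[i], compute);

        // meanwhile, start copying node i+1's input; this overlaps with node i's compute
        if (i + 1 < n_nodes) {
            ggml_sync_to_device(nodes[i + 1], copy);
            cudaEventRecord(ready, copy);
        }
    }

    cudaStreamSynchronize(compute);
    cudaEventDestroy(ready);
}

This kind of overlap is only possible because the graph makes the sync points known up front; a purely lazy scheme has to stall at each copy.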