We do not need to do a device-to-host/host-to-device (DTH/HTD) copy of the tensor data every time. This is similar to how PyTorch does it.
In self-attention, the KV cache could still live on the host, but the host launches kernels on the device data based on the cache.
The ggml_tensor provides methods to sync to the operator's device type, which hides the complexity.
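For concreteness, a rough sketch of what such a lazily synced tensor could look like (the names ggml_tensor_ext, data_device and ggml_sync_to_device are made up here for illustration, not ggml's actual layout or API):

#include <stddef.h>
#include <cuda_runtime.h>

// Hypothetical extension of ggml_tensor: host pointer, device pointer,
// and a flag recording which copy currently holds the valid data.
enum backend_kind { BACKEND_HOST, BACKEND_CUDA };

struct ggml_tensor_ext {
    void * data;               // host buffer, as in ggml today
    void * data_device;        // lazily allocated device buffer
    enum backend_kind where;   // which copy is currently valid
    size_t nbytes;
};

// Ensure the tensor is valid on the device before a CUDA operator consumes it;
// if it already is, the HTD copy is skipped entirely.
static void ggml_sync_to_device(struct ggml_tensor_ext * t, cudaStream_t stream) {
    if (t->where == BACKEND_CUDA) {
        return;                // already resident on the device
    }
    if (t->data_device == NULL) {
        cudaMalloc(&t->data_device, t->nbytes);
    }
    cudaMemcpyAsync(t->data_device, t->data, t->nbytes, cudaMemcpyHostToDevice, stream);
    t->where = BACKEND_CUDA;
}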
Unfortunately, lazy sync is not the smartest approach: knowing the full compute graph makes it much easier to identify the sync points, and copy and compute can then be overlapped wherever a sync is required.
example:
if (graph.sync_required(&tensor)) {
    // e.g. a device-to-host (DTH) copy issued on a stream, so it can overlap with compute
    cudaMemcpyAsync(dst, src, nbytes, cudaMemcpyDeviceToHost, stream);
}
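And a rough sketch of the graph-aware version, reusing the hypothetical ggml_tensor_ext / ggml_sync_to_device from above (launch_op is just a stand-in for running a node's CUDA kernel, and each node is simplified to a single input that may need syncing): the HTD copy for node i+1 is issued on a copy stream while node i computes, with an event keeping the compute stream from running ahead of its inputs.

// Placeholder for whatever runs a node's CUDA kernel on the given stream.
void launch_op(struct ggml_tensor_ext * node, cudaStream_t stream);

// Overlap the prefetch of node i+1's input (copy stream) with the compute of node i.
void run_graph(struct ggml_tensor_ext ** nodes, int n_nodes,
               cudaStream_t compute, cudaStream_t copy) {
    cudaEvent_t ready;
    cudaEventCreate(&ready);

    // prefetch the first node's input and mark the copy stream's progress
    ggml_sync_to_device(nodes[0], copy);
    cudaEventRecord(ready, copy);

    for (int i = 0; i < n_nodes; ++i) {
        // node i's kernel must not start before its prefetched input has arrived
        cudaStreamWaitEvent(compute, ready, 0);
        launch_op(nodes[i], compute);

        // meanwhile, start copying node i+1's input; this overlaps with node i's compute
        if (i + 1 < n_nodes) {
            ggml_sync_to_device(nodes[i + 1], copy);
            cudaEventRecord(ready, copy);
        }
    }

    cudaStreamSynchronize(compute);
    cudaEventDestroy(ready);
}

This kind of overlap is only possible because the graph makes the sync points known up front; a purely lazy scheme has to stall at each copy.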