65B model eventually fails with "ggml_new_tensor_impl: not enough space in the scratch memory" #1152
When the context swap occurs and it has to re-evaluate the second half of the context, it runs out of scratch memory. The solution is:
diff --git a/llama.cpp b/llama.cpp
index 8c1d657..e860ea1 100644
--- a/llama.cpp
+++ b/llama.cpp
@@ -54,7 +54,7 @@ static const std::map<e_model, size_t> & MEM_REQ_SCRATCH0()
{ MODEL_7B, 512ull * MB },
{ MODEL_13B, 512ull * MB },
{ MODEL_30B, 512ull * MB },
- { MODEL_65B, 512ull * MB },
+ { MODEL_65B, 2048ull * MB },
};
return _MEM_REQ_SCRATCH0;
}
@@ -65,7 +65,7 @@ static const std::map<e_model, size_t> & MEM_REQ_SCRATCH1()
{ MODEL_7B, 512ull * MB },
{ MODEL_13B, 512ull * MB },
{ MODEL_30B, 512ull * MB },
- { MODEL_65B, 512ull * MB },
+ { MODEL_65B, 2048ull * MB },
};
return _MEM_REQ_SCRATCH1;
}
@@ -1290,7 +1290,7 @@ static bool llama_eval_internal(
mem_per_token = ggml_used_mem(ctx0)/N;
}
-#if 0
+#if 1
printf("\n%s: used_mem = %.3f MB, scratch -- %.3f MB %.3f MB\n", __func__,
ggml_used_mem(ctx0)/1024.0/1024.0,
lctx.get_buf_max_mem(0)/1024.0/1024.0,
lctx.get_buf_max_mem(1)/1024.0/1024.0);
It's a very sloppy process for determining the necessary scratch buffer size. Will try to improve this in the future. P.S. I just bumped the buffers to
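For context, a rough sketch of how these scratch buffers get consumed during evaluation. This is illustrative only: `ggml_set_scratch` and `struct ggml_scratch` are the actual ggml API of that era, but the buffer allocation and the `use_buf` helper below are simplified stand-ins for what llama.cpp does internally, not the real implementation.

#include <cstdlib>
#include "ggml.h"

// Two fixed-size scratch buffers. While the graph is built, intermediate
// tensors are placed alternately into buffer 0 and buffer 1 so that each
// layer's temporaries can reuse the memory of the layer before it. If a
// single layer's temporaries don't fit, ggml prints "not enough space in
// the scratch memory" and tensor creation fails -- the error seen here.
static const size_t kScratchSize = 512ull*1024*1024; // cf. MEM_REQ_SCRATCH0/1
static void * g_scratch[2] = { nullptr, nullptr };

static void use_buf(struct ggml_context * ctx, int i) {
    if (i < 0) {
        // switch back to the context's main buffer (e.g. for the final logits)
        ggml_set_scratch(ctx, { 0, 0, nullptr });
    } else {
        if (!g_scratch[i]) {
            g_scratch[i] = malloc(kScratchSize);
        }
        ggml_set_scratch(ctx, { 0, kScratchSize, g_scratch[i] });
    }
}

The diff above simply raises the 65B entries so that the largest layer's temporaries fit within the per-buffer size.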
Thanks! Is there some way I can generate a prompt of exactly 1024 tokens? E.g. maybe some character sequence that I could repeat 1024 times?
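One way to do that (a sketch only: `llama_context_default_params`, `llama_init_from_file`, `llama_tokenize`, and `llama_free` are the public C API in llama.h of this vintage, but the repeat-and-count loop and the `vocab_only` shortcut are assumptions for illustration, not an established recipe) is to grow a prompt word by word and re-tokenize until the count reaches 1024:

// count_tokens.cpp -- build against the llama.cpp sources of this vintage
#include <cstdio>
#include <string>
#include <vector>
#include "llama.h"

int main(int argc, char ** argv) {
    if (argc < 2) {
        fprintf(stderr, "usage: %s <model.bin>\n", argv[0]);
        return 1;
    }

    llama_context_params params = llama_context_default_params();
    params.n_ctx      = 2048;
    params.vocab_only = true; // tokenization should only need the vocabulary (assumption)

    llama_context * ctx = llama_init_from_file(argv[1], params);
    if (!ctx) {
        fprintf(stderr, "failed to load %s\n", argv[1]);
        return 1;
    }

    // Grow the prompt one word at a time until it tokenizes to >= 1024 tokens.
    // If the repeated word maps to a single token, this lands on exactly 1024.
    std::string prompt;
    std::vector<llama_token> tokens(4096);
    int n = 0;
    while (n < 1024) {
        prompt += " hello";
        n = llama_tokenize(ctx, prompt.c_str(), tokens.data(), (int) tokens.size(), true);
    }
    printf("%d tokens from %zu characters\n", n, prompt.size());

    llama_free(ctx);
    return 0;
}

The exact characters shouldn't matter, only the token count the tokenizer reports.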
I'm running the 65B model on a machine with 256 GB of (CPU) RAM, with the context size set to 2048. The same thing happens with both llama65b and alpaca65b, every single time I run it in interactive mode: it works fine for a while, but eventually fails with:
ggml_new_tensor_impl: not enough space in the scratch memory
Segmentation fault (core dumped)
Maybe it's using up more and more RAM over time, until it runs out?
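For reference, and tying back to the explanation above: the failure comes at the context swap rather than from a slow leak. Below is a paraphrase of the swap logic from examples/main/main.cpp of the same era (simplified types, not a verbatim quote):

#include <vector>

// When the context window is full, keep the first n_keep prompt tokens and
// re-insert the last half of the remaining tokens so generation can continue
// "infinitely". Everything pushed into `embd` here is then re-evaluated in a
// single llama_eval() call, so with n_ctx = 2048 that is roughly 1024 tokens
// at once -- the point where the 65B model's 512 MB scratch buffers overflow.
static void context_swap(std::vector<int> & embd,                // tokens pending evaluation
                         const std::vector<int> & last_n_tokens, // ring buffer of the last n_ctx tokens
                         int & n_past, int n_ctx, int n_keep) {
    if (n_past + (int) embd.size() > n_ctx) {
        const int n_left = n_past - n_keep;

        n_past = n_keep;

        // re-insert the last n_left/2 tokens in front of the pending batch
        embd.insert(embd.begin(),
                    last_n_tokens.begin() + n_ctx - n_left/2 - embd.size(),
                    last_n_tokens.end()   - embd.size());
    }
}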
The exact params:
llama_model_load_internal: format = ggjt v1 (latest)
llama_model_load_internal: n_vocab = 32000
llama_model_load_internal: n_ctx = 2048
llama_model_load_internal: n_embd = 8192
llama_model_load_internal: n_mult = 256
llama_model_load_internal: n_head = 64
llama_model_load_internal: n_layer = 80
llama_model_load_internal: n_rot = 128
llama_model_load_internal: ftype = 4 (mostly Q4_1, some F16)
llama_model_load_internal: n_ff = 22016
llama_model_load_internal: n_parts = 1
llama_model_load_internal: model size = 65B
llama_model_load_internal: ggml ctx size = 146.86 KB
llama_model_load_internal: mem required = 41477.67 MB (+ 5120.00 MB per state)
llama_init_from_file: kv self size = 5120.00 MB
system_info: n_threads = 16 / 64 | AVX = 1 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | VSX = 0
main: interactive mode on.
sampling: temp = 1.000000, top_k = 40, top_p = 0.950000, repeat_last_n = 64, repeat_penalty = 1.000000
generate: n_ctx = 2048, n_batch = 8, n_predict = -1, n_keep = 0