Unable to generate constant output #175

Closed
rikoras opened this issue Mar 27, 2024 · 2 comments
Labels
bug-unconfirmed (Unconfirmed bugs)

rikoras commented Mar 27, 2024

Prerequisites

Before submitting your issue, please ensure the following:

  • [√] I am running the latest version of PowerInfer. Development is rapid, and as of now, there are no tagged versions.
  • [√] I have carefully read and followed the instructions in the README.md.
  • [√] I searched using keywords relevant to my issue to make sure that I am creating a new issue that is not already open (or closed).

Problem description

I am conducting a series of performance analyses on PowerInfer. For stability, I need the output to be identical on every run. I have referred to #109, but it does not solve the problem.

Command

./main -m ../../models/llama-relu-7b-sparse/llama-7b-relu.powerinfer.gguf --temp 0 -n 256 --seed 0 -t 8 --top-k 1 -p "Here is a code to calculate the first 20 primes"

Current behaviour

./main -m ../../models/llama-relu-7b-sparse/llama-7b-relu.powerinfer.gguf --temp 0 -n 256 --seed 0 -t 8 --top-k 1 -p "Here is a code to calculate the first 20 primes"
Log start
main: build = 1572 (47e9d7e)
main: built with cc (Ubuntu 9.4.0-1ubuntu1~20.04.2) 9.4.0 for x86_64-linux-gnu
main: seed = 0
ggml_init_cublas: GGML_CUDA_FORCE_MMQ: no
ggml_init_cublas: CUDA_USE_TENSOR_CORES: yes
ggml_init_cublas: found 1 CUDA devices:
Device 0: NVIDIA GeForce RTX 4060 Ti, compute capability 8.9
llama_model_loader: loaded meta data with 18 key-value pairs and 355 tensors from ../../models/llama-relu-7b-sparse/llama-7b-relu.powerinfer.gguf (version GGUF V3 (latest))
llama_model_loader: - tensor 0: token_embd.weight f16 [ 4096, 32000, 1, 1 ]
llama_model_loader: - tensor 1: blk.0.attn_q.weight f16 [ 4096, 4096, 1, 1 ]
llama_model_loader: - tensor 2: blk.0.attn_k.weight f16 [ 4096, 4096, 1, 1 ]
llama_model_loader: - tensor 3: blk.0.attn_v.weight f16 [ 4096, 4096, 1, 1 ]
llama_model_loader: - tensor 4: blk.0.attn_output.weight f16 [ 4096, 4096, 1, 1 ]
llama_model_loader: - tensor 5: blk.0.ffn_gate.weight f16 [ 4096, 11008, 1, 1 ]
llama_model_loader: - tensor 6: blk.0.ffn_up.weight f16 [ 4096, 11008, 1, 1 ]
llama_model_loader: - tensor 7: blk.0.ffn_down_t.weight f16 [ 4096, 11008, 1, 1 ]
llama_model_loader: - tensor 8: blk.0.attn_norm.weight f32 [ 4096, 1, 1, 1 ]
llama_model_loader: - tensor 9: blk.0.ffn_norm.weight f32 [ 4096, 1, 1, 1 ]
llama_model_loader: - tensor 10: blk.1.attn_q.weight f16 [ 4096, 4096, 1, 1 ]
llama_model_loader: - tensor 11: blk.1.attn_k.weight f16 [ 4096, 4096, 1, 1 ]
llama_model_loader: - tensor 12: blk.1.attn_v.weight f16 [ 4096, 4096, 1, 1 ]
llama_model_loader: - tensor 13: blk.1.attn_output.weight f16 [ 4096, 4096, 1, 1 ]
llama_model_loader: - tensor 14: blk.1.ffn_gate.weight f16 [ 4096, 11008, 1, 1 ]
llama_model_loader: - tensor 15: blk.1.ffn_up.weight f16 [ 4096, 11008, 1, 1 ]
llama_model_loader: - tensor 16: blk.1.ffn_down_t.weight f16 [ 4096, 11008, 1, 1 ]
llama_model_loader: - tensor 17: blk.1.attn_norm.weight f32 [ 4096, 1, 1, 1 ]
llama_model_loader: - tensor 18: blk.1.ffn_norm.weight f32 [ 4096, 1, 1, 1 ]
llama_model_loader: - tensor 19: blk.2.attn_q.weight f16 [ 4096, 4096, 1, 1 ]
llama_model_loader: - tensor 20: blk.2.attn_k.weight f16 [ 4096, 4096, 1, 1 ]
llama_model_loader: - tensor 21: blk.2.attn_v.weight f16 [ 4096, 4096, 1, 1 ]
llama_model_loader: - tensor 22: blk.2.attn_output.weight f16 [ 4096, 4096, 1, 1 ]
llama_model_loader: - tensor 23: blk.2.ffn_gate.weight f16 [ 4096, 11008, 1, 1 ]
llama_model_loader: - tensor 24: blk.2.ffn_up.weight f16 [ 4096, 11008, 1, 1 ]
llama_model_loader: - tensor 25: blk.2.ffn_down_t.weight f16 [ 4096, 11008, 1, 1 ]
llama_model_loader: - tensor 26: blk.2.attn_norm.weight f32 [ 4096, 1, 1, 1 ]
llama_model_loader: - tensor 27: blk.2.ffn_norm.weight f32 [ 4096, 1, 1, 1 ]
llama_model_loader: - tensor 28: blk.3.attn_q.weight f16 [ 4096, 4096, 1, 1 ]
llama_model_loader: - tensor 29: blk.3.attn_k.weight f16 [ 4096, 4096, 1, 1 ]
llama_model_loader: - tensor 30: blk.3.attn_v.weight f16 [ 4096, 4096, 1, 1 ]
llama_model_loader: - tensor 31: blk.3.attn_output.weight f16 [ 4096, 4096, 1, 1 ]
llama_model_loader: - tensor 32: blk.3.ffn_gate.weight f16 [ 4096, 11008, 1, 1 ]
llama_model_loader: - tensor 33: blk.3.ffn_up.weight f16 [ 4096, 11008, 1, 1 ]
llama_model_loader: - tensor 34: blk.3.ffn_down_t.weight f16 [ 4096, 11008, 1, 1 ]
llama_model_loader: - tensor 35: blk.3.attn_norm.weight f32 [ 4096, 1, 1, 1 ]
llama_model_loader: - tensor 36: blk.3.ffn_norm.weight f32 [ 4096, 1, 1, 1 ]
llama_model_loader: - tensor 37: blk.4.attn_q.weight f16 [ 4096, 4096, 1, 1 ]
llama_model_loader: - tensor 38: blk.4.attn_k.weight f16 [ 4096, 4096, 1, 1 ]
llama_model_loader: - tensor 39: blk.4.attn_v.weight f16 [ 4096, 4096, 1, 1 ]
llama_model_loader: - tensor 40: blk.4.attn_output.weight f16 [ 4096, 4096, 1, 1 ]
llama_model_loader: - tensor 41: blk.4.ffn_gate.weight f16 [ 4096, 11008, 1, 1 ]
llama_model_loader: - tensor 42: blk.4.ffn_up.weight f16 [ 4096, 11008, 1, 1 ]
llama_model_loader: - tensor 43: blk.4.ffn_down_t.weight f16 [ 4096, 11008, 1, 1 ]
llama_model_loader: - tensor 44: blk.4.attn_norm.weight f32 [ 4096, 1, 1, 1 ]
llama_model_loader: - tensor 45: blk.4.ffn_norm.weight f32 [ 4096, 1, 1, 1 ]
llama_model_loader: - tensor 46: blk.5.attn_q.weight f16 [ 4096, 4096, 1, 1 ]
llama_model_loader: - tensor 47: blk.5.attn_k.weight f16 [ 4096, 4096, 1, 1 ]
llama_model_loader: - tensor 48: blk.5.attn_v.weight f16 [ 4096, 4096, 1, 1 ]
llama_model_loader: - tensor 49: blk.5.attn_output.weight f16 [ 4096, 4096, 1, 1 ]
llama_model_loader: - tensor 50: blk.5.ffn_gate.weight f16 [ 4096, 11008, 1, 1 ]
llama_model_loader: - tensor 51: blk.5.ffn_up.weight f16 [ 4096, 11008, 1, 1 ]
llama_model_loader: - tensor 52: blk.5.ffn_down_t.weight f16 [ 4096, 11008, 1, 1 ]
llama_model_loader: - tensor 53: blk.5.attn_norm.weight f32 [ 4096, 1, 1, 1 ]
llama_model_loader: - tensor 54: blk.5.ffn_norm.weight f32 [ 4096, 1, 1, 1 ]
llama_model_loader: - tensor 55: blk.6.attn_q.weight f16 [ 4096, 4096, 1, 1 ]
llama_model_loader: - tensor 56: blk.6.attn_k.weight f16 [ 4096, 4096, 1, 1 ]
llama_model_loader: - tensor 57: blk.6.attn_v.weight f16 [ 4096, 4096, 1, 1 ]
llama_model_loader: - tensor 58: blk.6.attn_output.weight f16 [ 4096, 4096, 1, 1 ]
llama_model_loader: - tensor 59: blk.6.ffn_gate.weight f16 [ 4096, 11008, 1, 1 ]
llama_model_loader: - tensor 60: blk.6.ffn_up.weight f16 [ 4096, 11008, 1, 1 ]
llama_model_loader: - tensor 61: blk.6.ffn_down_t.weight f16 [ 4096, 11008, 1, 1 ]
llama_model_loader: - tensor 62: blk.6.attn_norm.weight f32 [ 4096, 1, 1, 1 ]
llama_model_loader: - tensor 63: blk.6.ffn_norm.weight f32 [ 4096, 1, 1, 1 ]
llama_model_loader: - tensor 64: blk.7.attn_q.weight f16 [ 4096, 4096, 1, 1 ]
llama_model_loader: - tensor 65: blk.7.attn_k.weight f16 [ 4096, 4096, 1, 1 ]
llama_model_loader: - tensor 66: blk.7.attn_v.weight f16 [ 4096, 4096, 1, 1 ]
llama_model_loader: - tensor 67: blk.7.attn_output.weight f16 [ 4096, 4096, 1, 1 ]
llama_model_loader: - tensor 68: blk.7.ffn_gate.weight f16 [ 4096, 11008, 1, 1 ]
llama_model_loader: - tensor 69: blk.7.ffn_up.weight f16 [ 4096, 11008, 1, 1 ]
llama_model_loader: - tensor 70: blk.7.ffn_down_t.weight f16 [ 4096, 11008, 1, 1 ]
llama_model_loader: - tensor 71: blk.7.attn_norm.weight f32 [ 4096, 1, 1, 1 ]
llama_model_loader: - tensor 72: blk.7.ffn_norm.weight f32 [ 4096, 1, 1, 1 ]
llama_model_loader: - tensor 73: blk.8.attn_q.weight f16 [ 4096, 4096, 1, 1 ]
llama_model_loader: - tensor 74: blk.8.attn_k.weight f16 [ 4096, 4096, 1, 1 ]
llama_model_loader: - tensor 75: blk.8.attn_v.weight f16 [ 4096, 4096, 1, 1 ]
llama_model_loader: - tensor 76: blk.8.attn_output.weight f16 [ 4096, 4096, 1, 1 ]
llama_model_loader: - tensor 77: blk.8.ffn_gate.weight f16 [ 4096, 11008, 1, 1 ]
llama_model_loader: - tensor 78: blk.8.ffn_up.weight f16 [ 4096, 11008, 1, 1 ]
llama_model_loader: - tensor 79: blk.8.ffn_down_t.weight f16 [ 4096, 11008, 1, 1 ]
llama_model_loader: - tensor 80: blk.8.attn_norm.weight f32 [ 4096, 1, 1, 1 ]
llama_model_loader: - tensor 81: blk.8.ffn_norm.weight f32 [ 4096, 1, 1, 1 ]
llama_model_loader: - tensor 82: blk.9.attn_q.weight f16 [ 4096, 4096, 1, 1 ]
llama_model_loader: - tensor 83: blk.9.attn_k.weight f16 [ 4096, 4096, 1, 1 ]
llama_model_loader: - tensor 84: blk.9.attn_v.weight f16 [ 4096, 4096, 1, 1 ]
llama_model_loader: - tensor 85: blk.9.attn_output.weight f16 [ 4096, 4096, 1, 1 ]
llama_model_loader: - tensor 86: blk.9.ffn_gate.weight f16 [ 4096, 11008, 1, 1 ]
llama_model_loader: - tensor 87: blk.9.ffn_up.weight f16 [ 4096, 11008, 1, 1 ]
llama_model_loader: - tensor 88: blk.9.ffn_down_t.weight f16 [ 4096, 11008, 1, 1 ]
llama_model_loader: - tensor 89: blk.9.attn_norm.weight f32 [ 4096, 1, 1, 1 ]
llama_model_loader: - tensor 90: blk.9.ffn_norm.weight f32 [ 4096, 1, 1, 1 ]
llama_model_loader: - tensor 91: blk.10.attn_q.weight f16 [ 4096, 4096, 1, 1 ]
llama_model_loader: - tensor 92: blk.10.attn_k.weight f16 [ 4096, 4096, 1, 1 ]
llama_model_loader: - tensor 93: blk.10.attn_v.weight f16 [ 4096, 4096, 1, 1 ]
llama_model_loader: - tensor 94: blk.10.attn_output.weight f16 [ 4096, 4096, 1, 1 ]
llama_model_loader: - tensor 95: blk.10.ffn_gate.weight f16 [ 4096, 11008, 1, 1 ]
llama_model_loader: - tensor 96: blk.10.ffn_up.weight f16 [ 4096, 11008, 1, 1 ]
llama_model_loader: - tensor 97: blk.10.ffn_down_t.weight f16 [ 4096, 11008, 1, 1 ]
llama_model_loader: - tensor 98: blk.10.attn_norm.weight f32 [ 4096, 1, 1, 1 ]
llama_model_loader: - tensor 99: blk.10.ffn_norm.weight f32 [ 4096, 1, 1, 1 ]
llama_model_loader: - tensor 100: blk.11.attn_q.weight f16 [ 4096, 4096, 1, 1 ]
llama_model_loader: - tensor 101: blk.11.attn_k.weight f16 [ 4096, 4096, 1, 1 ]
llama_model_loader: - tensor 102: blk.11.attn_v.weight f16 [ 4096, 4096, 1, 1 ]
llama_model_loader: - tensor 103: blk.11.attn_output.weight f16 [ 4096, 4096, 1, 1 ]
llama_model_loader: - tensor 104: blk.11.ffn_gate.weight f16 [ 4096, 11008, 1, 1 ]
llama_model_loader: - tensor 105: blk.11.ffn_up.weight f16 [ 4096, 11008, 1, 1 ]
llama_model_loader: - tensor 106: blk.11.ffn_down_t.weight f16 [ 4096, 11008, 1, 1 ]
llama_model_loader: - tensor 107: blk.11.attn_norm.weight f32 [ 4096, 1, 1, 1 ]
llama_model_loader: - tensor 108: blk.11.ffn_norm.weight f32 [ 4096, 1, 1, 1 ]
llama_model_loader: - tensor 109: blk.12.attn_q.weight f16 [ 4096, 4096, 1, 1 ]
llama_model_loader: - tensor 110: blk.12.attn_k.weight f16 [ 4096, 4096, 1, 1 ]
llama_model_loader: - tensor 111: blk.12.attn_v.weight f16 [ 4096, 4096, 1, 1 ]
llama_model_loader: - tensor 112: blk.12.attn_output.weight f16 [ 4096, 4096, 1, 1 ]
llama_model_loader: - tensor 113: blk.12.ffn_gate.weight f16 [ 4096, 11008, 1, 1 ]
llama_model_loader: - tensor 114: blk.12.ffn_up.weight f16 [ 4096, 11008, 1, 1 ]
llama_model_loader: - tensor 115: blk.12.ffn_down_t.weight f16 [ 4096, 11008, 1, 1 ]
llama_model_loader: - tensor 116: blk.12.attn_norm.weight f32 [ 4096, 1, 1, 1 ]
llama_model_loader: - tensor 117: blk.12.ffn_norm.weight f32 [ 4096, 1, 1, 1 ]
llama_model_loader: - tensor 118: blk.13.attn_q.weight f16 [ 4096, 4096, 1, 1 ]
llama_model_loader: - tensor 119: blk.13.attn_k.weight f16 [ 4096, 4096, 1, 1 ]
llama_model_loader: - tensor 120: blk.13.attn_v.weight f16 [ 4096, 4096, 1, 1 ]
llama_model_loader: - tensor 121: blk.13.attn_output.weight f16 [ 4096, 4096, 1, 1 ]
llama_model_loader: - tensor 122: blk.13.ffn_gate.weight f16 [ 4096, 11008, 1, 1 ]
llama_model_loader: - tensor 123: blk.13.ffn_up.weight f16 [ 4096, 11008, 1, 1 ]
llama_model_loader: - tensor 124: blk.13.ffn_down_t.weight f16 [ 4096, 11008, 1, 1 ]
llama_model_loader: - tensor 125: blk.13.attn_norm.weight f32 [ 4096, 1, 1, 1 ]
llama_model_loader: - tensor 126: blk.13.ffn_norm.weight f32 [ 4096, 1, 1, 1 ]
llama_model_loader: - tensor 127: blk.14.attn_q.weight f16 [ 4096, 4096, 1, 1 ]
llama_model_loader: - tensor 128: blk.14.attn_k.weight f16 [ 4096, 4096, 1, 1 ]
llama_model_loader: - tensor 129: blk.14.attn_v.weight f16 [ 4096, 4096, 1, 1 ]
llama_model_loader: - tensor 130: blk.14.attn_output.weight f16 [ 4096, 4096, 1, 1 ]
llama_model_loader: - tensor 131: blk.14.ffn_gate.weight f16 [ 4096, 11008, 1, 1 ]
llama_model_loader: - tensor 132: blk.14.ffn_up.weight f16 [ 4096, 11008, 1, 1 ]
llama_model_loader: - tensor 133: blk.14.ffn_down_t.weight f16 [ 4096, 11008, 1, 1 ]
llama_model_loader: - tensor 134: blk.14.attn_norm.weight f32 [ 4096, 1, 1, 1 ]
llama_model_loader: - tensor 135: blk.14.ffn_norm.weight f32 [ 4096, 1, 1, 1 ]
llama_model_loader: - tensor 136: blk.15.attn_q.weight f16 [ 4096, 4096, 1, 1 ]
llama_model_loader: - tensor 137: blk.15.attn_k.weight f16 [ 4096, 4096, 1, 1 ]
llama_model_loader: - tensor 138: blk.15.attn_v.weight f16 [ 4096, 4096, 1, 1 ]
llama_model_loader: - tensor 139: blk.15.attn_output.weight f16 [ 4096, 4096, 1, 1 ]
llama_model_loader: - tensor 140: blk.15.ffn_gate.weight f16 [ 4096, 11008, 1, 1 ]
llama_model_loader: - tensor 141: blk.15.ffn_up.weight f16 [ 4096, 11008, 1, 1 ]
llama_model_loader: - tensor 142: blk.15.ffn_down_t.weight f16 [ 4096, 11008, 1, 1 ]
llama_model_loader: - tensor 143: blk.15.attn_norm.weight f32 [ 4096, 1, 1, 1 ]
llama_model_loader: - tensor 144: blk.15.ffn_norm.weight f32 [ 4096, 1, 1, 1 ]
llama_model_loader: - tensor 145: blk.16.attn_q.weight f16 [ 4096, 4096, 1, 1 ]
llama_model_loader: - tensor 146: blk.16.attn_k.weight f16 [ 4096, 4096, 1, 1 ]
llama_model_loader: - tensor 147: blk.16.attn_v.weight f16 [ 4096, 4096, 1, 1 ]
llama_model_loader: - tensor 148: blk.16.attn_output.weight f16 [ 4096, 4096, 1, 1 ]
llama_model_loader: - tensor 149: blk.16.ffn_gate.weight f16 [ 4096, 11008, 1, 1 ]
llama_model_loader: - tensor 150: blk.16.ffn_up.weight f16 [ 4096, 11008, 1, 1 ]
llama_model_loader: - tensor 151: blk.16.ffn_down_t.weight f16 [ 4096, 11008, 1, 1 ]
llama_model_loader: - tensor 152: blk.16.attn_norm.weight f32 [ 4096, 1, 1, 1 ]
llama_model_loader: - tensor 153: blk.16.ffn_norm.weight f32 [ 4096, 1, 1, 1 ]
llama_model_loader: - tensor 154: blk.17.attn_q.weight f16 [ 4096, 4096, 1, 1 ]
llama_model_loader: - tensor 155: blk.17.attn_k.weight f16 [ 4096, 4096, 1, 1 ]
llama_model_loader: - tensor 156: blk.17.attn_v.weight f16 [ 4096, 4096, 1, 1 ]
llama_model_loader: - tensor 157: blk.17.attn_output.weight f16 [ 4096, 4096, 1, 1 ]
llama_model_loader: - tensor 158: blk.17.ffn_gate.weight f16 [ 4096, 11008, 1, 1 ]
llama_model_loader: - tensor 159: blk.17.ffn_up.weight f16 [ 4096, 11008, 1, 1 ]
llama_model_loader: - tensor 160: blk.17.ffn_down_t.weight f16 [ 4096, 11008, 1, 1 ]
llama_model_loader: - tensor 161: blk.17.attn_norm.weight f32 [ 4096, 1, 1, 1 ]
llama_model_loader: - tensor 162: blk.17.ffn_norm.weight f32 [ 4096, 1, 1, 1 ]
llama_model_loader: - tensor 163: blk.18.attn_q.weight f16 [ 4096, 4096, 1, 1 ]
llama_model_loader: - tensor 164: blk.18.attn_k.weight f16 [ 4096, 4096, 1, 1 ]
llama_model_loader: - tensor 165: blk.18.attn_v.weight f16 [ 4096, 4096, 1, 1 ]
llama_model_loader: - tensor 166: blk.18.attn_output.weight f16 [ 4096, 4096, 1, 1 ]
llama_model_loader: - tensor 167: blk.18.ffn_gate.weight f16 [ 4096, 11008, 1, 1 ]
llama_model_loader: - tensor 168: blk.18.ffn_up.weight f16 [ 4096, 11008, 1, 1 ]
llama_model_loader: - tensor 169: blk.18.ffn_down_t.weight f16 [ 4096, 11008, 1, 1 ]
llama_model_loader: - tensor 170: blk.18.attn_norm.weight f32 [ 4096, 1, 1, 1 ]
llama_model_loader: - tensor 171: blk.18.ffn_norm.weight f32 [ 4096, 1, 1, 1 ]
llama_model_loader: - tensor 172: blk.19.attn_q.weight f16 [ 4096, 4096, 1, 1 ]
llama_model_loader: - tensor 173: blk.19.attn_k.weight f16 [ 4096, 4096, 1, 1 ]
llama_model_loader: - tensor 174: blk.19.attn_v.weight f16 [ 4096, 4096, 1, 1 ]
llama_model_loader: - tensor 175: blk.19.attn_output.weight f16 [ 4096, 4096, 1, 1 ]
llama_model_loader: - tensor 176: blk.19.ffn_gate.weight f16 [ 4096, 11008, 1, 1 ]
llama_model_loader: - tensor 177: blk.19.ffn_up.weight f16 [ 4096, 11008, 1, 1 ]
llama_model_loader: - tensor 178: blk.19.ffn_down_t.weight f16 [ 4096, 11008, 1, 1 ]
llama_model_loader: - tensor 179: blk.19.attn_norm.weight f32 [ 4096, 1, 1, 1 ]
llama_model_loader: - tensor 180: blk.19.ffn_norm.weight f32 [ 4096, 1, 1, 1 ]
llama_model_loader: - tensor 181: blk.20.attn_q.weight f16 [ 4096, 4096, 1, 1 ]
llama_model_loader: - tensor 182: blk.20.attn_k.weight f16 [ 4096, 4096, 1, 1 ]
llama_model_loader: - tensor 183: blk.20.attn_v.weight f16 [ 4096, 4096, 1, 1 ]
llama_model_loader: - tensor 184: blk.20.attn_output.weight f16 [ 4096, 4096, 1, 1 ]
llama_model_loader: - tensor 185: blk.20.ffn_gate.weight f16 [ 4096, 11008, 1, 1 ]
llama_model_loader: - tensor 186: blk.20.ffn_up.weight f16 [ 4096, 11008, 1, 1 ]
llama_model_loader: - tensor 187: blk.20.ffn_down_t.weight f16 [ 4096, 11008, 1, 1 ]
llama_model_loader: - tensor 188: blk.20.attn_norm.weight f32 [ 4096, 1, 1, 1 ]
llama_model_loader: - tensor 189: blk.20.ffn_norm.weight f32 [ 4096, 1, 1, 1 ]
llama_model_loader: - tensor 190: blk.21.attn_q.weight f16 [ 4096, 4096, 1, 1 ]
llama_model_loader: - tensor 191: blk.21.attn_k.weight f16 [ 4096, 4096, 1, 1 ]
llama_model_loader: - tensor 192: blk.21.attn_v.weight f16 [ 4096, 4096, 1, 1 ]
llama_model_loader: - tensor 193: blk.21.attn_output.weight f16 [ 4096, 4096, 1, 1 ]
llama_model_loader: - tensor 194: blk.21.ffn_gate.weight f16 [ 4096, 11008, 1, 1 ]
llama_model_loader: - tensor 195: blk.21.ffn_up.weight f16 [ 4096, 11008, 1, 1 ]
llama_model_loader: - tensor 196: blk.21.ffn_down_t.weight f16 [ 4096, 11008, 1, 1 ]
llama_model_loader: - tensor 197: blk.21.attn_norm.weight f32 [ 4096, 1, 1, 1 ]
llama_model_loader: - tensor 198: blk.21.ffn_norm.weight f32 [ 4096, 1, 1, 1 ]
llama_model_loader: - tensor 199: blk.22.attn_q.weight f16 [ 4096, 4096, 1, 1 ]
llama_model_loader: - tensor 200: blk.22.attn_k.weight f16 [ 4096, 4096, 1, 1 ]
llama_model_loader: - tensor 201: blk.22.attn_v.weight f16 [ 4096, 4096, 1, 1 ]
llama_model_loader: - tensor 202: blk.22.attn_output.weight f16 [ 4096, 4096, 1, 1 ]
llama_model_loader: - tensor 203: blk.22.ffn_gate.weight f16 [ 4096, 11008, 1, 1 ]
llama_model_loader: - tensor 204: blk.22.ffn_up.weight f16 [ 4096, 11008, 1, 1 ]
llama_model_loader: - tensor 205: blk.22.ffn_down_t.weight f16 [ 4096, 11008, 1, 1 ]
llama_model_loader: - tensor 206: blk.22.attn_norm.weight f32 [ 4096, 1, 1, 1 ]
llama_model_loader: - tensor 207: blk.22.ffn_norm.weight f32 [ 4096, 1, 1, 1 ]
llama_model_loader: - tensor 208: blk.23.attn_q.weight f16 [ 4096, 4096, 1, 1 ]
llama_model_loader: - tensor 209: blk.23.attn_k.weight f16 [ 4096, 4096, 1, 1 ]
llama_model_loader: - tensor 210: blk.23.attn_v.weight f16 [ 4096, 4096, 1, 1 ]
llama_model_loader: - tensor 211: blk.23.attn_output.weight f16 [ 4096, 4096, 1, 1 ]
llama_model_loader: - tensor 212: blk.23.ffn_gate.weight f16 [ 4096, 11008, 1, 1 ]
llama_model_loader: - tensor 213: blk.23.ffn_up.weight f16 [ 4096, 11008, 1, 1 ]
llama_model_loader: - tensor 214: blk.23.ffn_down_t.weight f16 [ 4096, 11008, 1, 1 ]
llama_model_loader: - tensor 215: blk.23.attn_norm.weight f32 [ 4096, 1, 1, 1 ]
llama_model_loader: - tensor 216: blk.23.ffn_norm.weight f32 [ 4096, 1, 1, 1 ]
llama_model_loader: - tensor 217: blk.24.attn_q.weight f16 [ 4096, 4096, 1, 1 ]
llama_model_loader: - tensor 218: blk.24.attn_k.weight f16 [ 4096, 4096, 1, 1 ]
llama_model_loader: - tensor 219: blk.24.attn_v.weight f16 [ 4096, 4096, 1, 1 ]
llama_model_loader: - tensor 220: blk.24.attn_output.weight f16 [ 4096, 4096, 1, 1 ]
llama_model_loader: - tensor 221: blk.24.ffn_gate.weight f16 [ 4096, 11008, 1, 1 ]
llama_model_loader: - tensor 222: blk.24.ffn_up.weight f16 [ 4096, 11008, 1, 1 ]
llama_model_loader: - tensor 223: blk.24.ffn_down_t.weight f16 [ 4096, 11008, 1, 1 ]
llama_model_loader: - tensor 224: blk.24.attn_norm.weight f32 [ 4096, 1, 1, 1 ]
llama_model_loader: - tensor 225: blk.24.ffn_norm.weight f32 [ 4096, 1, 1, 1 ]
llama_model_loader: - tensor 226: blk.25.attn_q.weight f16 [ 4096, 4096, 1, 1 ]
llama_model_loader: - tensor 227: blk.25.attn_k.weight f16 [ 4096, 4096, 1, 1 ]
llama_model_loader: - tensor 228: blk.25.attn_v.weight f16 [ 4096, 4096, 1, 1 ]
llama_model_loader: - tensor 229: blk.25.attn_output.weight f16 [ 4096, 4096, 1, 1 ]
llama_model_loader: - tensor 230: blk.25.ffn_gate.weight f16 [ 4096, 11008, 1, 1 ]
llama_model_loader: - tensor 231: blk.25.ffn_up.weight f16 [ 4096, 11008, 1, 1 ]
llama_model_loader: - tensor 232: blk.25.ffn_down_t.weight f16 [ 4096, 11008, 1, 1 ]
llama_model_loader: - tensor 233: blk.25.attn_norm.weight f32 [ 4096, 1, 1, 1 ]
llama_model_loader: - tensor 234: blk.25.ffn_norm.weight f32 [ 4096, 1, 1, 1 ]
llama_model_loader: - tensor 235: blk.26.attn_q.weight f16 [ 4096, 4096, 1, 1 ]
llama_model_loader: - tensor 236: blk.26.attn_k.weight f16 [ 4096, 4096, 1, 1 ]
llama_model_loader: - tensor 237: blk.26.attn_v.weight f16 [ 4096, 4096, 1, 1 ]
llama_model_loader: - tensor 238: blk.26.attn_output.weight f16 [ 4096, 4096, 1, 1 ]
llama_model_loader: - tensor 239: blk.26.ffn_gate.weight f16 [ 4096, 11008, 1, 1 ]
llama_model_loader: - tensor 240: blk.26.ffn_up.weight f16 [ 4096, 11008, 1, 1 ]
llama_model_loader: - tensor 241: blk.26.ffn_down_t.weight f16 [ 4096, 11008, 1, 1 ]
llama_model_loader: - tensor 242: blk.26.attn_norm.weight f32 [ 4096, 1, 1, 1 ]
llama_model_loader: - tensor 243: blk.26.ffn_norm.weight f32 [ 4096, 1, 1, 1 ]
llama_model_loader: - tensor 244: blk.27.attn_q.weight f16 [ 4096, 4096, 1, 1 ]
llama_model_loader: - tensor 245: blk.27.attn_k.weight f16 [ 4096, 4096, 1, 1 ]
llama_model_loader: - tensor 246: blk.27.attn_v.weight f16 [ 4096, 4096, 1, 1 ]
llama_model_loader: - tensor 247: blk.27.attn_output.weight f16 [ 4096, 4096, 1, 1 ]
llama_model_loader: - tensor 248: blk.27.ffn_gate.weight f16 [ 4096, 11008, 1, 1 ]
llama_model_loader: - tensor 249: blk.27.ffn_up.weight f16 [ 4096, 11008, 1, 1 ]
llama_model_loader: - tensor 250: blk.27.ffn_down_t.weight f16 [ 4096, 11008, 1, 1 ]
llama_model_loader: - tensor 251: blk.27.attn_norm.weight f32 [ 4096, 1, 1, 1 ]
llama_model_loader: - tensor 252: blk.27.ffn_norm.weight f32 [ 4096, 1, 1, 1 ]
llama_model_loader: - tensor 253: blk.28.attn_q.weight f16 [ 4096, 4096, 1, 1 ]
llama_model_loader: - tensor 254: blk.28.attn_k.weight f16 [ 4096, 4096, 1, 1 ]
llama_model_loader: - tensor 255: blk.28.attn_v.weight f16 [ 4096, 4096, 1, 1 ]
llama_model_loader: - tensor 256: blk.28.attn_output.weight f16 [ 4096, 4096, 1, 1 ]
llama_model_loader: - tensor 257: blk.28.ffn_gate.weight f16 [ 4096, 11008, 1, 1 ]
llama_model_loader: - tensor 258: blk.28.ffn_up.weight f16 [ 4096, 11008, 1, 1 ]
llama_model_loader: - tensor 259: blk.28.ffn_down_t.weight f16 [ 4096, 11008, 1, 1 ]
llama_model_loader: - tensor 260: blk.28.attn_norm.weight f32 [ 4096, 1, 1, 1 ]
llama_model_loader: - tensor 261: blk.28.ffn_norm.weight f32 [ 4096, 1, 1, 1 ]
llama_model_loader: - tensor 262: blk.29.attn_q.weight f16 [ 4096, 4096, 1, 1 ]
llama_model_loader: - tensor 263: blk.29.attn_k.weight f16 [ 4096, 4096, 1, 1 ]
llama_model_loader: - tensor 264: blk.29.attn_v.weight f16 [ 4096, 4096, 1, 1 ]
llama_model_loader: - tensor 265: blk.29.attn_output.weight f16 [ 4096, 4096, 1, 1 ]
llama_model_loader: - tensor 266: blk.29.ffn_gate.weight f16 [ 4096, 11008, 1, 1 ]
llama_model_loader: - tensor 267: blk.29.ffn_up.weight f16 [ 4096, 11008, 1, 1 ]
llama_model_loader: - tensor 268: blk.29.ffn_down_t.weight f16 [ 4096, 11008, 1, 1 ]
llama_model_loader: - tensor 269: blk.29.attn_norm.weight f32 [ 4096, 1, 1, 1 ]
llama_model_loader: - tensor 270: blk.29.ffn_norm.weight f32 [ 4096, 1, 1, 1 ]
llama_model_loader: - tensor 271: blk.30.attn_q.weight f16 [ 4096, 4096, 1, 1 ]
llama_model_loader: - tensor 272: blk.30.attn_k.weight f16 [ 4096, 4096, 1, 1 ]
llama_model_loader: - tensor 273: blk.30.attn_v.weight f16 [ 4096, 4096, 1, 1 ]
llama_model_loader: - tensor 274: blk.30.attn_output.weight f16 [ 4096, 4096, 1, 1 ]
llama_model_loader: - tensor 275: blk.30.ffn_gate.weight f16 [ 4096, 11008, 1, 1 ]
llama_model_loader: - tensor 276: blk.30.ffn_up.weight f16 [ 4096, 11008, 1, 1 ]
llama_model_loader: - tensor 277: blk.30.ffn_down_t.weight f16 [ 4096, 11008, 1, 1 ]
llama_model_loader: - tensor 278: blk.30.attn_norm.weight f32 [ 4096, 1, 1, 1 ]
llama_model_loader: - tensor 279: blk.30.ffn_norm.weight f32 [ 4096, 1, 1, 1 ]
llama_model_loader: - tensor 280: blk.31.attn_q.weight f16 [ 4096, 4096, 1, 1 ]
llama_model_loader: - tensor 281: blk.31.attn_k.weight f16 [ 4096, 4096, 1, 1 ]
llama_model_loader: - tensor 282: blk.31.attn_v.weight f16 [ 4096, 4096, 1, 1 ]
llama_model_loader: - tensor 283: blk.31.attn_output.weight f16 [ 4096, 4096, 1, 1 ]
llama_model_loader: - tensor 284: blk.31.ffn_gate.weight f16 [ 4096, 11008, 1, 1 ]
llama_model_loader: - tensor 285: blk.31.ffn_up.weight f16 [ 4096, 11008, 1, 1 ]
llama_model_loader: - tensor 286: blk.31.ffn_down_t.weight f16 [ 4096, 11008, 1, 1 ]
llama_model_loader: - tensor 287: blk.31.attn_norm.weight f32 [ 4096, 1, 1, 1 ]
llama_model_loader: - tensor 288: blk.31.ffn_norm.weight f32 [ 4096, 1, 1, 1 ]
llama_model_loader: - tensor 289: output_norm.weight f32 [ 4096, 1, 1, 1 ]
llama_model_loader: - tensor 290: output.weight f16 [ 4096, 32000, 1, 1 ]
llama_model_loader: - tensor 291: blk.0.fc1.weight f16 [ 4096, 1024, 1, 1 ]
llama_model_loader: - tensor 292: blk.0.fc2.weight f16 [ 1024, 11008, 1, 1 ]
llama_model_loader: - tensor 293: blk.1.fc1.weight f16 [ 4096, 2048, 1, 1 ]
llama_model_loader: - tensor 294: blk.1.fc2.weight f16 [ 2048, 11008, 1, 1 ]
llama_model_loader: - tensor 295: blk.2.fc1.weight f16 [ 4096, 2048, 1, 1 ]
llama_model_loader: - tensor 296: blk.2.fc2.weight f16 [ 2048, 11008, 1, 1 ]
llama_model_loader: - tensor 297: blk.3.fc1.weight f16 [ 4096, 2048, 1, 1 ]
llama_model_loader: - tensor 298: blk.3.fc2.weight f16 [ 2048, 11008, 1, 1 ]
llama_model_loader: - tensor 299: blk.4.fc1.weight f16 [ 4096, 2048, 1, 1 ]
llama_model_loader: - tensor 300: blk.4.fc2.weight f16 [ 2048, 11008, 1, 1 ]
llama_model_loader: - tensor 301: blk.5.fc1.weight f16 [ 4096, 2048, 1, 1 ]
llama_model_loader: - tensor 302: blk.5.fc2.weight f16 [ 2048, 11008, 1, 1 ]
llama_model_loader: - tensor 303: blk.6.fc1.weight f16 [ 4096, 2048, 1, 1 ]
llama_model_loader: - tensor 304: blk.6.fc2.weight f16 [ 2048, 11008, 1, 1 ]
llama_model_loader: - tensor 305: blk.7.fc1.weight f16 [ 4096, 1536, 1, 1 ]
llama_model_loader: - tensor 306: blk.7.fc2.weight f16 [ 1536, 11008, 1, 1 ]
llama_model_loader: - tensor 307: blk.8.fc1.weight f16 [ 4096, 1536, 1, 1 ]
llama_model_loader: - tensor 308: blk.8.fc2.weight f16 [ 1536, 11008, 1, 1 ]
llama_model_loader: - tensor 309: blk.9.fc1.weight f16 [ 4096, 1024, 1, 1 ]
llama_model_loader: - tensor 310: blk.9.fc2.weight f16 [ 1024, 11008, 1, 1 ]
llama_model_loader: - tensor 311: blk.10.fc1.weight f16 [ 4096, 1024, 1, 1 ]
llama_model_loader: - tensor 312: blk.10.fc2.weight f16 [ 1024, 11008, 1, 1 ]
llama_model_loader: - tensor 313: blk.11.fc1.weight f16 [ 4096, 1024, 1, 1 ]
llama_model_loader: - tensor 314: blk.11.fc2.weight f16 [ 1024, 11008, 1, 1 ]
llama_model_loader: - tensor 315: blk.12.fc1.weight f16 [ 4096, 1280, 1, 1 ]
llama_model_loader: - tensor 316: blk.12.fc2.weight f16 [ 1280, 11008, 1, 1 ]
llama_model_loader: - tensor 317: blk.13.fc1.weight f16 [ 4096, 1280, 1, 1 ]
llama_model_loader: - tensor 318: blk.13.fc2.weight f16 [ 1280, 11008, 1, 1 ]
llama_model_loader: - tensor 319: blk.14.fc1.weight f16 [ 4096, 1536, 1, 1 ]
llama_model_loader: - tensor 320: blk.14.fc2.weight f16 [ 1536, 11008, 1, 1 ]
llama_model_loader: - tensor 321: blk.15.fc1.weight f16 [ 4096, 1536, 1, 1 ]
llama_model_loader: - tensor 322: blk.15.fc2.weight f16 [ 1536, 11008, 1, 1 ]
llama_model_loader: - tensor 323: blk.16.fc1.weight f16 [ 4096, 1536, 1, 1 ]
llama_model_loader: - tensor 324: blk.16.fc2.weight f16 [ 1536, 11008, 1, 1 ]
llama_model_loader: - tensor 325: blk.17.fc1.weight f16 [ 4096, 1536, 1, 1 ]
llama_model_loader: - tensor 326: blk.17.fc2.weight f16 [ 1536, 11008, 1, 1 ]
llama_model_loader: - tensor 327: blk.18.fc1.weight f16 [ 4096, 1536, 1, 1 ]
llama_model_loader: - tensor 328: blk.18.fc2.weight f16 [ 1536, 11008, 1, 1 ]
llama_model_loader: - tensor 329: blk.19.fc1.weight f16 [ 4096, 1792, 1, 1 ]
llama_model_loader: - tensor 330: blk.19.fc2.weight f16 [ 1792, 11008, 1, 1 ]
llama_model_loader: - tensor 331: blk.20.fc1.weight f16 [ 4096, 1792, 1, 1 ]
llama_model_loader: - tensor 332: blk.20.fc2.weight f16 [ 1792, 11008, 1, 1 ]
llama_model_loader: - tensor 333: blk.21.fc1.weight f16 [ 4096, 2048, 1, 1 ]
llama_model_loader: - tensor 334: blk.21.fc2.weight f16 [ 2048, 11008, 1, 1 ]
llama_model_loader: - tensor 335: blk.22.fc1.weight f16 [ 4096, 2048, 1, 1 ]
llama_model_loader: - tensor 336: blk.22.fc2.weight f16 [ 2048, 11008, 1, 1 ]
llama_model_loader: - tensor 337: blk.23.fc1.weight f16 [ 4096, 2048, 1, 1 ]
llama_model_loader: - tensor 338: blk.23.fc2.weight f16 [ 2048, 11008, 1, 1 ]
llama_model_loader: - tensor 339: blk.24.fc1.weight f16 [ 4096, 2048, 1, 1 ]
llama_model_loader: - tensor 340: blk.24.fc2.weight f16 [ 2048, 11008, 1, 1 ]
llama_model_loader: - tensor 341: blk.25.fc1.weight f16 [ 4096, 2048, 1, 1 ]
llama_model_loader: - tensor 342: blk.25.fc2.weight f16 [ 2048, 11008, 1, 1 ]
llama_model_loader: - tensor 343: blk.26.fc1.weight f16 [ 4096, 2048, 1, 1 ]
llama_model_loader: - tensor 344: blk.26.fc2.weight f16 [ 2048, 11008, 1, 1 ]
llama_model_loader: - tensor 345: blk.27.fc1.weight f16 [ 4096, 2048, 1, 1 ]
llama_model_loader: - tensor 346: blk.27.fc2.weight f16 [ 2048, 11008, 1, 1 ]
llama_model_loader: - tensor 347: blk.28.fc1.weight f16 [ 4096, 2048, 1, 1 ]
llama_model_loader: - tensor 348: blk.28.fc2.weight f16 [ 2048, 11008, 1, 1 ]
llama_model_loader: - tensor 349: blk.29.fc1.weight f16 [ 4096, 2048, 1, 1 ]
llama_model_loader: - tensor 350: blk.29.fc2.weight f16 [ 2048, 11008, 1, 1 ]
llama_model_loader: - tensor 351: blk.30.fc1.weight f16 [ 4096, 2048, 1, 1 ]
llama_model_loader: - tensor 352: blk.30.fc2.weight f16 [ 2048, 11008, 1, 1 ]
llama_model_loader: - tensor 353: blk.31.fc1.weight f16 [ 4096, 1536, 1, 1 ]
llama_model_loader: - tensor 354: blk.31.fc2.weight f16 [ 1536, 11008, 1, 1 ]
llama_model_loader: - kv 0: general.architecture str
llama_model_loader: - kv 1: general.name str
llama_model_loader: - kv 2: llama.context_length u32
llama_model_loader: - kv 3: llama.embedding_length u32
llama_model_loader: - kv 4: llama.block_count u32
llama_model_loader: - kv 5: llama.feed_forward_length u32
llama_model_loader: - kv 6: llama.rope.dimension_count u32
llama_model_loader: - kv 7: llama.attention.head_count u32
llama_model_loader: - kv 8: llama.attention.head_count_kv u32
llama_model_loader: - kv 9: llama.attention.layer_norm_rms_epsilon f32
llama_model_loader: - kv 10: general.file_type u32
llama_model_loader: - kv 11: tokenizer.ggml.model str
llama_model_loader: - kv 12: tokenizer.ggml.tokens arr
llama_model_loader: - kv 13: tokenizer.ggml.scores arr
llama_model_loader: - kv 14: tokenizer.ggml.token_type arr
llama_model_loader: - kv 15: tokenizer.ggml.bos_token_id u32
llama_model_loader: - kv 16: tokenizer.ggml.eos_token_id u32
llama_model_loader: - kv 17: tokenizer.ggml.padding_token_id u32
llama_model_loader: - type f32: 65 tensors
llama_model_loader: - type f16: 290 tensors
llama_model_load: PowerInfer model loaded. Sparse inference will be used.
llm_load_vocab: special tokens definition check successful ( 259/32000 ).
llm_load_print_meta: format = GGUF V3 (latest)
llm_load_print_meta: arch = llama
llm_load_print_meta: vocab type = SPM
llm_load_print_meta: n_vocab = 32000
llm_load_print_meta: n_merges = 0
llm_load_print_meta: n_ctx_train = 2048
llm_load_print_meta: n_embd = 4096
llm_load_print_meta: n_head = 32
llm_load_print_meta: n_head_kv = 32
llm_load_print_meta: n_layer = 32
llm_load_print_meta: n_rot = 128
llm_load_print_meta: n_gqa = 1
llm_load_print_meta: f_norm_eps = 0.0e+00
llm_load_print_meta: f_norm_rms_eps = 1.0e-05
llm_load_print_meta: f_clamp_kqv = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: n_ff = 11008
llm_load_print_meta: rope scaling = linear
llm_load_print_meta: freq_base_train = 10000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_yarn_orig_ctx = 2048
llm_load_print_meta: rope_finetuned = unknown
llm_load_print_meta: model type = 7B
llm_load_print_meta: model ftype = mostly F16
llm_load_print_meta: model params = 7.57 B
llm_load_print_meta: model size = 14.11 GiB (16.00 BPW)
llm_load_print_meta: general.name = syx
llm_load_print_meta: BOS token = 1 '<s>'
llm_load_print_meta: EOS token = 2 '</s>'
llm_load_print_meta: UNK token = 0 '<unk>'
llm_load_print_meta: PAD token = 0 '<unk>'
llm_load_print_meta: LF token = 13 '<0x0A>'
llm_load_print_meta: sparse_pred_threshold = 0.00
llm_load_sparse_model_tensors: ggml ctx size = 0.13 MB
llm_load_sparse_model_tensors: using CUDA for GPU acceleration
llm_load_sparse_model_tensors: offloaded layers from VRAM budget(7090864128 bytes): 33/32
llm_load_sparse_model_tensors: mem required = 14446.15 MB
llm_load_sparse_model_tensors: VRAM used: 5939.52 MB
....................................................................................................
invoking powerinfer Python module to generate gpu split for 566.86 MiB of VRAM
/home/rikora/anaconda3/envs/meta_kotoba/bin/python3: No module named powerinfer
llm_load_gpu_split_with_budget: error: failed to generate gpu split
llm_load_gpu_split: error: failed to generate gpu split, an empty one will be used
offload_ffn_split: applying augmentation to model - please wait ...
................................ done (6.02 ms)
llm_load_gpu_split: offloaded 0.00 MiB of FFN weights to GPU
llama_new_context_with_model: n_ctx = 512
llama_new_context_with_model: freq_base = 10000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init: offloading v cache to GPU
llama_kv_cache_init: offloading k cache to GPU
llama_kv_cache_init: VRAM kv self = 256.00 MB
llama_new_context_with_model: kv self size = 256.00 MB
llama_build_graph: non-view tensors processed: 548/836
llama_build_graph: ****************************************************************
llama_build_graph: not all non-view tensors have been processed with a callback
llama_build_graph: this can indicate an inefficiency in the graph implementation
llama_build_graph: build with LLAMA_OFFLOAD_DEBUG for more info
llama_build_graph: ref: ggml-org/llama.cpp#3837
llama_build_graph: ****************************************************************
llama_new_context_with_model: compute buffer total size = 6.91 MB
llama_new_context_with_model: VRAM scratch buffer: 5.34 MB
llama_new_context_with_model: total VRAM used: 6200.86 MB (model: 5939.52 MB, context: 261.34 MB)

system_info: n_threads = 8 / 24 | AVX = 1 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 |
sampling:
repeat_last_n = 64, repeat_penalty = 1.100, frequency_penalty = 0.000, presence_penalty = 0.000
top_k = 1, tfs_z = 1.000, top_p = 0.950, min_p = 0.050, typical_p = 1.000, temp = 0.000
mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000
generate: n_ctx = 512, n_batch = 32, n_predict = 256, n_keep = 0

Here is a code to calculate the first 20 primes.

def prime_sieve(n):
    primes = []
    for i in range(1, n+1):
        if not (i % 2) and not i in primes:
            primes.append(i)
    return primes

[end of text]

llama_print_timings: load time = 1080.21 ms
llama_print_timings: sample time = 6.95 ms / 68 runs ( 0.10 ms per token, 9785.58 tokens per second)
llama_print_timings: prompt eval time = 253.00 ms / 14 tokens ( 18.07 ms per token, 55.34 tokens per second)
llama_print_timings: eval time = 5391.85 ms / 67 runs ( 80.48 ms per token, 12.43 tokens per second)
llama_print_timings: total time = 5668.73 ms
Log end

On a second execution, the log up to this point was identical, but I got different output text:

Here is a code to calculate the first 20 primes.

def prime_sieve(n):
    """
    Generate a list of primes up to n, using the sieve of Eratosthenes.
    
    Args:
        n (int): The upper limit for the primes.
        
    Returns:
        A list of primes up to n.
    """
    primes = [True] * (n // 2) + [False] * (n // 2)
    
    # Mark all multiples of each prime as false.
    for i in range(1, n // 2):
        if primes[i // 2]:
            primes[i // 2] = False
            
    # Mark the first prime as true.
    primes[0] = True
    
    return [primes[i // 2]] * (n // 2) + [False] * (n // 2)

[end of text]

I wonder if the predictors have an effect on sampling.

Environment

This inconsistency does NOT appear on another device with:

  • CPU: i3-12100K
  • GPU: 2080 Ti 22G
  • DRAM: 16G
  • everything else the same as above
YixinSong-e (Collaborator) commented Mar 27, 2024

Actually, this is caused by our sparse down operator in the FFN. We use axpy to implement the matmul operator, so each output element is assembled from many concurrent add operations, and their ordering introduces slight floating-point fluctuations. For a stable output, we advise running PowerInfer with pure CPU inference on a single thread.
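
To see why reduction order matters here: IEEE-754 addition is not associative, so the same addends grouped differently can round to different totals. Below is a minimal, self-contained Python sketch (an editorial illustration, not PowerInfer code; the values are contrived purely to make the rounding visible):

```python
# Floating-point addition is not associative: regrouping the same four
# addends changes the rounded result.
vals = [1e16, 1.0, -1e16, 1.0]

# Strict left-to-right sum: the first 1.0 is absorbed into 1e16
# (it is below one ulp of 1e16) and lost to rounding.
seq = ((vals[0] + vals[1]) + vals[2]) + vals[3]    # -> 1.0

# A different grouping, of the kind a multi-threaded reduction produces
# when partial sums are combined in a nondeterministic order.
par = (vals[0] + vals[2]) + (vals[1] + vals[3])    # -> 2.0

print(seq, par)   # 1.0 2.0 — same inputs, different order, different sum
```

With `--top-k 1` decoding is greedy, but whenever two candidate logits are nearly tied, a perturbation of this kind can flip the argmax and send generation down a different path, which matches the symptom above. Presumably, rerunning the original command with `-t 1` and the FFN kept off the GPU, per the advice above, removes the concurrency and makes runs repeat exactly.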

rikoras (Author) commented Apr 1, 2024

Actually, this is caused by our sparse down operator in the FFN. We use axpy to implement the matmul operator, so each output element is assembled from many concurrent add operations, and their ordering introduces slight floating-point fluctuations. For a stable output, we advise running PowerInfer with pure CPU inference on a single thread.

That makes it very clear! Thanks!
