llama.cpp on Nvidia RTX-3500, RTX-A4500 dual, RTX-4090 dual #10
49G on CPU (of 64G RAM) - RTX-3500, Lenovo P1 Gen 6, i7-13800H
llama.cpp output excerpt (prompt "binary tree in java", generated Java truncated):

binary tree in java // create a node class to store data and pointers to left and right child nodes
[...]
}
}
You can also pass in an array of values to populate the tree: int[] values = { 4, 5, 7, 8 };
BinaryTree tree = new BinaryTree(values);
License: This project is released under the MIT license. See LICENSE for more details.
[end of text]

llama_print_timings: load time = 14205.60 ms
llama_print_timings: load time = 12756.12 ms
system_info: n_threads = 10 / 20 | AVX = 1 | AVX_VNNI = 1 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | SSSE3 = 1 | VSX = 0
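The system_info line above shows a CPU-only build (BLAS = 0) using 10 of the i7-13800H's 20 hardware threads. A minimal sketch of how such a run can be launched, assuming a locally built llama.cpp main binary and a placeholder GGUF path:

```python
import subprocess

# Placeholder paths: adjust to the local llama.cpp build and GGUF model.
LLAMA_MAIN = "./main"
MODEL = "models/capybarahermes-2.5-mistral-7b.Q8_0.gguf"

# CPU-only run matching the log above: 10 threads (of 20), no -ngl offload.
subprocess.run([
    LLAMA_MAIN,
    "-m", MODEL,
    "-t", "10",
    "-n", "512",
    "-p", "binary tree in java",
])
```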
Falcon 40B on CPU needs 80-100 GB (Falcon 180B needs ~400 GB).
At ~2.2 GB/s write on a Samsung 990 Pro NVMe it takes about a minute to combine the two split parts into one 96 GB file; take out -ngl 64.
All three parts a/b/c total 150 GB on SSD and 140 GB of RAM.
160 of 192 GB RAM in use at 91% CPU on the 13900K. I think I need segment c as well: 96 != 137.
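The split uploads have to be concatenated into a single GGUF before llama.cpp can load them. A minimal sketch of that step, with hypothetical file names following the *-split-a/b/c pattern of the parts mentioned above; only the byte concatenation is shown, and the NVMe write speed dominates the runtime:

```python
import shutil

# Hypothetical split-part names; substitute the real a/b/c segments.
parts = [
    "falcon-180b.Q4_K_M.gguf-split-a",
    "falcon-180b.Q4_K_M.gguf-split-b",
    "falcon-180b.Q4_K_M.gguf-split-c",  # segment c is needed too (96G != 137G)
]
combined = "falcon-180b.Q4_K_M.gguf"

# Straight byte concatenation into one file, in 64 MB chunks.
with open(combined, "wb") as out:
    for part in parts:
        with open(part, "rb") as src:
            shutil.copyfileobj(src, out, length=64 * 1024 * 1024)
```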
On the 13800H P1 Gen 6
On the 13900K desktop
Why slower?
CUDA on llama.cpp: adjusting the ENV variable works well - see below, or as a shortened fix, add the CUDA toolkit to PATH.
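A sketch of that environment setup, assuming a Windows machine with the CUDA 12.1 toolkit in its default location; both the path and the device index are assumptions to adjust locally:

```python
import os
import subprocess

# Assumed CUDA toolkit location; change the version directory to match the install.
cuda_bin = r"C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v12.1\bin"
os.environ["PATH"] = cuda_bin + os.pathsep + os.environ["PATH"]

# Optionally pin work to one card on the dual RTX-A4500 / RTX-4090 boxes.
os.environ["CUDA_VISIBLE_DEVICES"] = "0"

# Confirm the driver and toolkit are visible before building/running llama.cpp.
subprocess.run(["nvidia-smi"])
subprocess.run(["nvcc", "--version"])
```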
Solving this using https://github.com/obrienlabs/CUDA-Programs/tree/main/Chapter01/gpusum as a reference, part of the book "Programming in Parallel with CUDA" by Richard Ansorge of the University of Cambridge: https://www.cambridge.org/core/books/programming-in-parallel-with-cuda/C43652A69033C25AD6933368CDBE084C
Revisit llama.cpp for NVIDIA GPUs; look at abetlen/llama-cpp-python#871
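For the llama-cpp-python route, a minimal sketch assuming the package is installed with CUDA support and a local GGUF file; the model path and layer count are placeholders:

```python
from llama_cpp import Llama

llm = Llama(
    model_path="models/capybarahermes-2.5-mistral-7b.Q8_0.gguf",  # placeholder path
    n_gpu_layers=35,   # layers offloaded to the RTX GPU; tune to fit VRAM
    n_ctx=4096,
)

out = llm("Q: write a binary tree in Java\nA:", max_tokens=256)
print(out["choices"][0]["text"])
```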
PyTorch install: https://pytorch.org/get-started/locally/ - use the CUDA 12.1 build, not 12.2.
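A quick sanity check that the cu121 build of PyTorch actually sees the GPUs:

```python
import torch

print(torch.__version__)           # expect a +cu121 suffix
print(torch.version.cuda)          # expect "12.1", not "12.2"
print(torch.cuda.is_available())
print(torch.cuda.device_count())   # 2 on the dual-GPU machines
if torch.cuda.is_available():
    print(torch.cuda.get_device_name(0))
```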
Checking the context length: outputs = model.generate(**input_ids, max_new_tokens=1000) is working.
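A minimal sketch around that generate call, with a placeholder Hugging Face model id and prompt; it only illustrates where max_new_tokens=1000 fits:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "mistralai/Mistral-7B-v0.1"  # placeholder model id

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.float16).to("cuda")

input_ids = tokenizer("write a binary tree in java", return_tensors="pt").to("cuda")
outputs = model.generate(**input_ids, max_new_tokens=1000)  # the line from the note above
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```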
Python pip summary
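A small sketch that prints the installed versions of the packages used elsewhere in this issue; the package list itself is an assumption:

```python
from importlib.metadata import version, PackageNotFoundError

for pkg in ("torch", "transformers", "accelerate", "llama-cpp-python"):
    try:
        print(f"{pkg}=={version(pkg)}")
    except PackageNotFoundError:
        print(f"{pkg}: not installed")
```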
llama-server on RTX-3500
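A sketch of querying a locally running llama.cpp server via its /completion endpoint; the host, port, and startup flags in the comment are assumptions:

```python
import json
import urllib.request

# Assumes the server was started with something like:
#   ./server -m capybarahermes-2.5-mistral-7b.Q8_0.gguf -ngl 35 --port 8080
payload = {"prompt": "write a binary tree in java", "n_predict": 256}

req = urllib.request.Request(
    "http://127.0.0.1:8080/completion",
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    print(json.loads(resp.read())["content"])
```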
see #7
test:
git clone https://github.com/ggerganov/llama.cpp
model:
https://huggingface.co/TheBloke/CapybaraHermes-2.5-Mistral-7B-GGUF
https://huggingface.co/TheBloke/CapybaraHermes-2.5-Mistral-7B-GGUF/blob/main/capybarahermes-2.5-mistral-7b.Q8_0.gguf
using w64devkit on the Lenovo P1 Gen 6 (RTX-3500 12G)
https://github.com/skeeto/w64devkit/releases
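A sketch tying the steps together: download the Q8_0 GGUF listed above (assuming huggingface_hub is installed) and run the w64devkit-built main binary with partial offload to the 12 GB RTX-3500; the layer count is a guess sized to fit VRAM:

```python
import subprocess
from huggingface_hub import hf_hub_download  # assumption: pip install huggingface_hub

# Repo and filename taken from the links above.
model_path = hf_hub_download(
    repo_id="TheBloke/CapybaraHermes-2.5-Mistral-7B-GGUF",
    filename="capybarahermes-2.5-mistral-7b.Q8_0.gguf",
)

subprocess.run([
    "./main",
    "-m", model_path,
    "-ngl", "28",   # partial offload for 12 GB of VRAM (a guess)
    "-n", "512",
    "-p", "write a binary tree in java",
])
```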