Ingestion of documents with Ollama is incredibly slow #1691

Open
Zirgite opened this issue Mar 8, 2024 · 18 comments

Comments

@Zirgite

Zirgite commented Mar 8, 2024

I upgraded to the latest version of privateGPT and the ingestion speed is much slower than in previous versions. It is so slow to the point of being unusable.
I use the recommended Ollama option. After more than an hour, the document is still not finished. I have a 3090 and an 18-core CPU, and I am using the very small Mistral model.
I am ingesting a 105 KB PDF file (37 pages of text).
Later I switched to the less recommended 'llms-llama-cpp' option in privateGPT, and the problem was solved. But is there any way to get fast ingestion with Ollama?

@Zirgite Zirgite changed the title Ingestion of documents is incredibly slow Ingestion of documents with Ollama is incredibly slow Mar 9, 2024
@yangyushi

I have the exact same issue with the Ollama embedding mode pre-configured in the file settings-ollama.yaml.

I ingested my documents at a reasonable (much faster) speed with the huggingface embedding mode.

@imartinez
Collaborator

Interesting. The Ollama embedding model is much bigger than the default huggingface one, which may be the main cause. The dimensionality of the vectors is double in Ollama's embedding model.

@iotnxt

iotnxt commented Mar 12, 2024

I can confirm a performance degradation on 0.4.0 when running with this:
poetry install --extras "ui llms-ollama embeddings-ollama vector-stores-qdrant"
It is unusable on older PCs.
...and if I try:
poetry install --extras "ui llms-llama-cpp embeddings-huggingface vector-stores-qdrant"
then I have authentication issues with huggingface when running this: poetry run python scripts/setup
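
If the authentication error is about a gated model download, logging in with a Hugging Face token before running the setup script may help. A minimal sketch, assuming the huggingface_hub package is installed and a valid access token is available in the HF_TOKEN environment variable (a generic workaround, not privateGPT-specific code):

    # Hedged workaround sketch: authenticate with Hugging Face before running
    # "poetry run python scripts/setup". Assumes huggingface_hub is installed
    # and that HF_TOKEN holds a valid access token.
    import os
    from huggingface_hub import login

    login(token=os.environ["HF_TOKEN"])  # caches the token for subsequent downloads

The same can be done from the shell with huggingface-cli login.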

@dbzoo
Contributor

dbzoo commented Mar 12, 2024

Embedding model changes:

  • BAAI/bge-small-en-v1.5 has a vector size of 384
  • nomic-embed-text has a vector size of 768

If you are using Ollama with the default configuration, you are using the larger vector size. This will take longer, but it will also give you better context searching. FWIW: on an M2 Mac it did not feel that much slower. (You can confirm the vector size you are actually getting with the sketch below.)
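
A quick way to check the dimensionality is to ask Ollama directly. A minimal sketch, assuming Ollama is listening on its default port and nomic-embed-text has already been pulled:

    # Minimal sketch: query Ollama's embeddings endpoint and print the vector
    # length. Assumes the default endpoint at 127.0.0.1:11434 and that the
    # nomic-embed-text model has been pulled.
    import requests

    resp = requests.post(
        "http://127.0.0.1:11434/api/embeddings",
        json={"model": "nomic-embed-text", "prompt": "hello world"},
        timeout=60,
    )
    resp.raise_for_status()
    print(len(resp.json()["embedding"]))  # expect 768 for nomic-embed-text, 384 for bge-small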

@iotnxt

iotnxt commented Mar 12, 2024

Thanks @dbzoo but I think it might be more than just that.

During the 60+ min it was ingesting, there was very modest resource utilisation:
~8.4% of 32 GB RAM
~20% CPU (8-core 3.2 GHz)
Sporadic, small spikes of 1.5 TB SSD activity

At least one of those resources should have been very high (on average) during those 60+ minutes while processing that small PDF, before I decided to cancel it.

Note:
No GPU on my modest system but not long ago the same file took 20min on an earlier version of privateGPT and it worked when asking questions (replies were slow but it did work).

cc: @imartinez
Feature request:
- Please show a progress bar or a percentage indicating how much has been ingested.
(Maybe I cancelled it without knowing there was just one minute left.)

@imartinez
Collaborator

@iotnxt maybe Ollama's support for embedding models is not fully optimized yet. Could be the case. Go back to Huggingface embeddings for intensive use cases.

About the feature request, feel free to contribute through a PR! Being transparent: the roadmap is full of functional improvements, and the "progress bar" would never be prioritized, but it is perfect for a contribution.

@Robinsane
Contributor

Robinsane commented Mar 13, 2024

For me it's very slow too, and I keep getting the error below after a certain amount of time:
(posted at #1723)

chipgpt | Traceback (most recent call last):
chipgpt |   File "/home/worker/app/.venv/lib/python3.11/site-packages/gradio/queueing.py", line 495, in call_prediction
chipgpt |     output = await route_utils.call_process_api(
chipgpt |   File "/home/worker/app/.venv/lib/python3.11/site-packages/gradio/route_utils.py", line 235, in call_process_api
chipgpt |     output = await app.get_blocks().process_api(
chipgpt |   File "/home/worker/app/.venv/lib/python3.11/site-packages/gradio/blocks.py", line 1627, in process_api
chipgpt |     result = await self.call_function(
chipgpt |   File "/home/worker/app/.venv/lib/python3.11/site-packages/gradio/blocks.py", line 1173, in call_function
chipgpt |     prediction = await anyio.to_thread.run_sync(
chipgpt |   File "/home/worker/app/.venv/lib/python3.11/site-packages/anyio/to_thread.py", line 33, in run_sync
chipgpt |     return await get_asynclib().run_sync_in_worker_thread(
chipgpt |   File "/home/worker/app/.venv/lib/python3.11/site-packages/anyio/_backends/_asyncio.py", line 877, in run_sync_in_worker_thread
chipgpt |     return await future
chipgpt |   File "/home/worker/app/.venv/lib/python3.11/site-packages/anyio/_backends/_asyncio.py", line 807, in run
chipgpt |     result = context.run(func, *args)
chipgpt |   File "/home/worker/app/.venv/lib/python3.11/site-packages/gradio/utils.py", line 690, in wrapper
chipgpt |     response = f(*args, **kwargs)
chipgpt |   File "/home/worker/app/private_gpt/ui/ui.py", line 266, in _upload_file
chipgpt |     self._ingest_service.bulk_ingest([(str(path.name), path) for path in paths])
chipgpt |   File "/home/worker/app/private_gpt/server/ingest/ingest_service.py", line 84, in bulk_ingest
chipgpt |     documents = self.ingest_component.bulk_ingest(files)
chipgpt |   File "/home/worker/app/private_gpt/components/ingest/ingest_component.py", line 198, in bulk_ingest
chipgpt |     return self._save_docs(documents)
chipgpt |   File "/home/worker/app/private_gpt/components/ingest/ingest_component.py", line 210, in _save_docs
chipgpt |     self._index.insert_nodes(nodes, show_progress=True)
chipgpt |   File "/home/worker/app/.venv/lib/python3.11/site-packages/llama_index/core/indices/vector_store/base.py", line 320, in insert_nodes
chipgpt |     self._insert(nodes, **insert_kwargs)
chipgpt |   File "/home/worker/app/.venv/lib/python3.11/site-packages/llama_index/core/indices/vector_store/base.py", line 311, in _insert
chipgpt |     self._add_nodes_to_index(self._index_struct, nodes, **insert_kwargs)
chipgpt |   File "/home/worker/app/.venv/lib/python3.11/site-packages/llama_index/core/indices/vector_store/base.py", line 233, in _add_nodes_to_index
chipgpt |     new_ids = self._vector_store.add(nodes_batch, **insert_kwargs)
chipgpt |   File "/home/worker/app/.venv/lib/python3.11/site-packages/llama_index/vector_stores/qdrant/base.py", line 254, in add
chipgpt |     points, ids = self._build_points(nodes)
chipgpt |   File "/home/worker/app/.venv/lib/python3.11/site-packages/llama_index/vector_stores/qdrant/base.py", line 221, in _build_points
chipgpt |     vectors.append(node.get_embedding())
chipgpt |   File "/home/worker/app/.venv/lib/python3.11/site-packages/llama_index/core/schema.py", line 344, in get_embedding
chipgpt |     raise ValueError("embedding not set.")
chipgpt | ValueError: embedding not set.

@RolT

RolT commented Mar 14, 2024

There's an issue with ollama + nomic-embed-text. Fixed but not yet released. Using ollama 0.1.29 fixed the issue for me.

ollama/ollama#3029

@btonasse

+1. It takes ~2 s to generate embeddings for a 4-word phrase.

@codespearhead

Ollama v0.1.30 has recently been released.

Is this issue still reproducible in that version?

@fcarsten

I seem to have the same or a very similar problem with the "ollama" default settings, running ollama v0.1.32.

The console says I get parsing nodes at ~1000 it/s and generating embeddings at ~2 s/it.

The strange thing is that private-gpt/ollama seem to use hardly any of the available resources: CPU < 4%, memory < 50%, GPU < 4% processing (1.5/12 GB GPU memory), disk < 1%, etc., on an Intel i7-13700K, 32 GB RAM, RTX 4070.

Example output from the console log:
[...]
Generating embeddings: 0it [00:00, ?it/s]
Parsing nodes: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 998.88it/s]
Generating embeddings: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 19/19 [00:39<00:00, 2.08s/it]
Generating embeddings: 0it [00:00, ?it/s]
Parsing nodes: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 999.83it/s]
Generating embeddings: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 18/18 [00:37<00:00, 2.10s/it]
Generating embeddings: 0it [00:00, ?it/s]
[...]

@stevenlafl

stevenlafl commented May 2, 2024

Still excruciatingly slow, with it barely hitting the GPU. Embeddings run at ~8 it/s on a 3080. It does use the GPU; I confirmed that much. If I double the number of workers, it halves the it/s performance, so there is zero recourse there.

I'm on ollama 0.1.33-rc6, so that patch should have been applied.
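
One rough way to check whether Ollama is effectively serialising embedding requests (which would explain why adding workers halves the per-worker rate) is to run the same requests sequentially and concurrently and compare wall-clock times. A diagnostic sketch, assuming the default Ollama endpoint and nomic-embed-text; the chunk texts are made up for illustration:

    # Diagnostic sketch: compare sequential vs. concurrent embedding throughput.
    # If the concurrent run takes about as long as the sequential one, requests
    # are being serialised server-side and extra ingestion workers won't help.
    import time
    from concurrent.futures import ThreadPoolExecutor

    import requests

    URL = "http://127.0.0.1:11434/api/embeddings"

    def embed(text):
        r = requests.post(URL, json={"model": "nomic-embed-text", "prompt": text}, timeout=120)
        r.raise_for_status()
        return r.json()["embedding"]

    texts = [f"sample chunk number {i}" for i in range(16)]

    start = time.perf_counter()
    for t in texts:  # sequential baseline
        embed(t)
    sequential = time.perf_counter() - start

    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=4) as pool:  # four requests in flight
        list(pool.map(embed, texts))
    concurrent = time.perf_counter() - start

    print(f"sequential: {sequential:.2f}s  concurrent: {concurrent:.2f}s")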

@zubairahmed-ai

Can confirm. Using mxbai-embed-large from HF, even a 1.44 MB file takes close to an hour and is still unfinished, while CPU and GPU utilization stay below 50% and 10% respectively, on the latest version of PrivateGPT.

Any fixes @imartinez?

@dougy83

dougy83 commented Jun 9, 2024

+1. It takes ~2 s to generate embeddings for a 4-word phrase.

I noticed the same when using the HTTP API and the Python interface. The server says it took <50 ms (CPU), so I'm guessing the problem is with detecting that the response is complete. Setting my request timeout to 100 ms makes each request take 100 ms.

If I use fetch() in nodejs, the response takes <30ms.

I've never used private-gpt, but I'm guessing it's the same problem.

EDIT: The Python request is fast if I use http://127.0.0.1 rather than http://localhost.
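
A small timing sketch that may help others confirm whether the extra latency is host-name resolution rather than the model itself, assuming the default Ollama endpoint and nomic-embed-text:

    # Diagnostic sketch: time the same embedding request against 127.0.0.1 and
    # localhost. A large gap suggests the delay is in name resolution (e.g. an
    # IPv6-first lookup), not in the embedding computation.
    import time

    import requests

    for host in ("127.0.0.1", "localhost"):
        start = time.perf_counter()
        requests.post(
            f"http://{host}:11434/api/embeddings",
            json={"model": "nomic-embed-text", "prompt": "four word test phrase"},
            timeout=60,
        ).raise_for_status()
        print(host, f"{time.perf_counter() - start:.3f}s")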

@Castolus

Castolus commented Jun 25, 2024

Thanks @dbzoo but I think it might be more than just that.

During the 60+ min it was ingesting, there was very modest resource utilisation: ~8.4% of 32 GB RAM, ~20% CPU (8-core 3.2 GHz), sporadic small spikes of 1.5 TB SSD activity.

At least one of those resources should have been very high (on average) during those 60+ minutes while processing that small PDF, before I decided to cancel it.

Note: No GPU on my modest system, but not long ago the same file took 20 min on an earlier version of privateGPT and it worked when asking questions (replies were slow but it did work).

cc: @imartinez Feature request: please show a progress bar or a percentage indicating how much has been ingested. (Maybe I cancelled it without knowing there was just one minute left.)

Hi,
maybe I'm too late, but I'll post it anyway.

You can get a progress bar in the console by editing ui.py:

Instead of this line (345):

    self._ingest_service.bulk_ingest([(str(path.name), path) for path in paths_to_ingest])

put this one:

    for path in tqdm(paths_to_ingest, desc="Ingesting files"):
        # one bulk_ingest call per file, so tqdm can report per-file progress
        self._ingest_service.bulk_ingest([(str(path.name), path)])

By using tqdm you'll be able to see something like this in the console:

Ingesting files: 40%|████ | 2/5 [00:38<00:49, 16.44s/it]14:10:07.319 [INFO ] private_gpt.server.ingest.ingest_service - Ingesting

Don't forget to import the library:

from tqdm import tqdm

I'll probably integrate it into the UI in the future. I have some other features that may be interesting to @imartinez.
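
For the UI side, Gradio can mirror the same tqdm loop as a progress bar in the web interface via gr.Progress(track_tqdm=True). A rough sketch with illustrative names (ingest_one and upload_files are stand-ins, not the actual privateGPT handler):

    # Rough sketch: gr.Progress(track_tqdm=True) makes Gradio render any tqdm
    # loop inside the callback as a progress bar in the web UI. Names below are
    # illustrative stand-ins, not privateGPT's real code.
    import time

    import gradio as gr
    from tqdm import tqdm

    def ingest_one(path):
        time.sleep(1)  # stand-in for the real per-file ingestion call

    def upload_files(paths, progress=gr.Progress(track_tqdm=True)):
        for path in tqdm(paths, desc="Ingesting files"):
            ingest_one(path)
        return f"Ingested {len(paths)} file(s)"

    demo = gr.Interface(fn=upload_files, inputs=gr.File(file_count="multiple"), outputs="text")
    # demo.launch()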

Cheers

@HolisticCoder

I'm also having this issue. It's weird: the Ollama log shows a new line every ~2-3 seconds, but each line says it took ~10 ms, so what is it doing for the other ~2490 ms?
Compare the timestamps to the amount of time taken:
[screenshot: Ollama log timestamps vs. reported request durations]

@manbehindthemadness

manbehindthemadness commented Aug 2, 2024

I am experiencing this as well,

+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.54.15              Driver Version: 550.54.15      CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA RTX 4000 Ada Gene...    Off |   00000000:01:00.0 Off |                  Off |
| 30%   35C    P8             13W /  130W |       3MiB /  20475MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   1  NVIDIA RTX 4000 Ada Gene...    Off |   00000000:02:00.0 Off |                  Off |
| 30%   43C    P2             32W /  130W |    1891MiB /  20475MiB |      2%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
                                                                                         
+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|    1   N/A  N/A    551430      C   ...unners/cuda_v11/ollama_llama_server       1886MiB |
+-----------------------------------------------------------------------------------------+

It feels like the ingestion is running single-threaded on CPU...
[screenshots: system resource utilization during ingestion]

The ollama.service log shows the process hammering the API with what looks like single chunks/jobs:
[screenshot of the ollama.service log]

@manbehindthemadness

Hmm, it seems to run until it halts the Ollama service. Additionally it's lazy-loading, so the upload process doesn't begin until I supply a prompt:

Aug 02 12:08:12 ai-buffoli ollama[542149]: time=2024-08-02T12:08:12.461Z level=INFO source=sched.go:495 msg="updated VRAM based on existing loaded models" gpu=GPU-88ad38ac-ec75-e3d5-ff99-2cc43700d8bb library=cuda total="19.7 GiB" available="19.5 GiB"
Aug 02 12:08:12 ai-buffoli ollama[542149]: time=2024-08-02T12:08:12.461Z level=INFO source=sched.go:495 msg="updated VRAM based on existing loaded models" gpu=GPU-36340268-892b-0df5-782b-b86438c1404e library=cuda total="19.7 GiB" available="17.7 GiB"
Aug 02 12:08:12 ai-buffoli ollama[542149]: time=2024-08-02T12:08:12.463Z level=INFO source=sched.go:717 msg="new model will fit in available VRAM, loading" model=/usr/share/ollama/.ollama/models/blobs/sha256-181667b384afa32256c00c240ead7f7f69b7c13c5a15b260b4eeccc1356310c1 library=cuda parallel=1 required="29.8 GiB"
Aug 02 12:08:12 ai-buffoli ollama[542149]: time=2024-08-02T12:08:12.463Z level=INFO source=memory.go:309 msg="offload to cuda" layers.requested=-1 layers.model=41 layers.offload=41 layers.split=21,20 memory.available="[19.5 GiB 17.7 GiB]" memory.required.full="29.8 GiB" memory.required.partial="29.8 GiB" memory.required.kv="4.8 GiB" memory.required.allocations="[15.7 GiB 14.1 GiB]" memory.weights.total="22.0 GiB" memory.weights.repeating="20.4 GiB" memory.weights.nonrepeating="1.6 GiB" memory.graph.full="2.1 GiB" memory.graph.partial="2.1 GiB"
Aug 02 12:08:12 ai-buffoli ollama[542149]: time=2024-08-02T12:08:12.464Z level=INFO source=server.go:383 msg="starting llama server" cmd="/tmp/ollama1564075646/runners/cuda_v11/ollama_llama_server --model /usr/share/ollama/.ollama/models/blobs/sha256-181667b384afa32256c00c240ead7f7f69b7c13c5a15b260b4eeccc1356310c1 --ctx-size 3900 --batch-size 512 --embedding --log-disable --n-gpu-layers 41 --parallel 1 --tensor-split 21,20 --port 45953"
Aug 02 12:08:12 ai-buffoli ollama[542149]: time=2024-08-02T12:08:12.464Z level=INFO source=sched.go:437 msg="loaded runners" count=2
Aug 02 12:08:12 ai-buffoli ollama[542149]: time=2024-08-02T12:08:12.464Z level=INFO source=server.go:571 msg="waiting for llama runner to start responding"
Aug 02 12:08:12 ai-buffoli ollama[542149]: time=2024-08-02T12:08:12.464Z level=INFO source=server.go:612 msg="waiting for server to become available" status="llm server error"
Aug 02 12:08:12 ai-buffoli ollama[552767]: INFO [main] build info | build=1 commit="a8db2a9" tid="123284990672896" timestamp=1722600492
Aug 02 12:08:12 ai-buffoli ollama[552767]: INFO [main] system info | n_threads=8 n_threads_batch=-1 system_info="AVX = 1 | AVX_VNNI = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | AVX512_BF16 = 0 | FMA = 0 | NEON = 0 | SVE = 0 | ARM_FMA = 0 | F16C = 0 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 0 | " tid="123284990672896" timestamp=1722600492 total_threads=32
Aug 02 12:08:12 ai-buffoli ollama[552767]: INFO [main] HTTP server listening | hostname="127.0.0.1" n_threads_http="31" port="45953" tid="123284990672896" timestamp=1722600492
Aug 02 12:08:12 ai-buffoli ollama[542149]: llama_model_loader: loaded meta data with 27 key-value pairs and 322 tensors from /usr/share/ollama/.ollama/models/blobs/sha256-181667b384afa32256c00c240ead7f7f69b7c13c5a15b260b4eeccc1356310c1 (version GGUF V3 (latest))
Aug 02 12:08:12 ai-buffoli ollama[542149]: llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
Aug 02 12:08:12 ai-buffoli ollama[542149]: llama_model_loader: - kv   0:                       general.architecture str              = command-r
Aug 02 12:08:12 ai-buffoli ollama[542149]: llama_model_loader: - kv   1:                               general.name str              = aya-23-35B
Aug 02 12:08:12 ai-buffoli ollama[542149]: llama_model_loader: - kv   2:                      command-r.block_count u32              = 40
Aug 02 12:08:12 ai-buffoli ollama[542149]: llama_model_loader: - kv   3:                   command-r.context_length u32              = 8192
Aug 02 12:08:12 ai-buffoli ollama[542149]: llama_model_loader: - kv   4:                 command-r.embedding_length u32              = 8192
Aug 02 12:08:12 ai-buffoli ollama[542149]: llama_model_loader: - kv   5:              command-r.feed_forward_length u32              = 22528
Aug 02 12:08:12 ai-buffoli ollama[542149]: llama_model_loader: - kv   6:             command-r.attention.head_count u32              = 64
Aug 02 12:08:12 ai-buffoli ollama[542149]: llama_model_loader: - kv   7:          command-r.attention.head_count_kv u32              = 64
Aug 02 12:08:12 ai-buffoli ollama[542149]: llama_model_loader: - kv   8:                   command-r.rope.freq_base f32              = 8000000.000000
Aug 02 12:08:12 ai-buffoli ollama[542149]: llama_model_loader: - kv   9:     command-r.attention.layer_norm_epsilon f32              = 0.000010
Aug 02 12:08:12 ai-buffoli ollama[542149]: llama_model_loader: - kv  10:                          general.file_type u32              = 2
Aug 02 12:08:12 ai-buffoli ollama[542149]: llama_model_loader: - kv  11:                      command-r.logit_scale f32              = 0.062500
Aug 02 12:08:12 ai-buffoli ollama[542149]: llama_model_loader: - kv  12:                command-r.rope.scaling.type str              = none
Aug 02 12:08:12 ai-buffoli ollama[542149]: llama_model_loader: - kv  13:                       tokenizer.ggml.model str              = gpt2
Aug 02 12:08:12 ai-buffoli ollama[542149]: llama_model_loader: - kv  14:                      tokenizer.ggml.tokens arr[str,256000]  = ["<PAD>", "<UNK>", "<CLS>", "<SEP>", ...
Aug 02 12:08:12 ai-buffoli ollama[542149]: llama_model_loader: - kv  15:                  tokenizer.ggml.token_type arr[i32,256000]  = [3, 3, 3, 3, 3, 3, 3, 3, 1, 1, 1, 1, ...
Aug 02 12:08:12 ai-buffoli ollama[542149]: llama_model_loader: - kv  16:                      tokenizer.ggml.merges arr[str,253333]  = ["Ġ Ġ", "Ġ t", "e r", "i n", "Ġ a...
Aug 02 12:08:12 ai-buffoli ollama[542149]: llama_model_loader: - kv  17:                tokenizer.ggml.bos_token_id u32              = 5
Aug 02 12:08:12 ai-buffoli ollama[542149]: llama_model_loader: - kv  18:                tokenizer.ggml.eos_token_id u32              = 255001
Aug 02 12:08:12 ai-buffoli ollama[542149]: llama_model_loader: - kv  19:            tokenizer.ggml.padding_token_id u32              = 0
Aug 02 12:08:12 ai-buffoli ollama[542149]: llama_model_loader: - kv  20:               tokenizer.ggml.add_bos_token bool             = true
Aug 02 12:08:12 ai-buffoli ollama[542149]: llama_model_loader: - kv  21:               tokenizer.ggml.add_eos_token bool             = false
Aug 02 12:08:12 ai-buffoli ollama[542149]: llama_model_loader: - kv  22:           tokenizer.chat_template.tool_use str              = {{ bos_token }}{% if messages[0]['rol...
Aug 02 12:08:12 ai-buffoli ollama[542149]: llama_model_loader: - kv  23:                tokenizer.chat_template.rag str              = {{ bos_token }}{% if messages[0]['rol...
Aug 02 12:08:12 ai-buffoli ollama[542149]: llama_model_loader: - kv  24:                   tokenizer.chat_templates arr[str,2]       = ["rag", "tool_use"]
Aug 02 12:08:12 ai-buffoli ollama[542149]: llama_model_loader: - kv  25:                    tokenizer.chat_template str              = {{ bos_token }}{% if messages[0]['rol...
Aug 02 12:08:12 ai-buffoli ollama[542149]: llama_model_loader: - kv  26:               general.quantization_version u32              = 2
Aug 02 12:08:12 ai-buffoli ollama[542149]: llama_model_loader: - type  f32:   41 tensors
Aug 02 12:08:12 ai-buffoli ollama[542149]: llama_model_loader: - type q4_0:  280 tensors
Aug 02 12:08:12 ai-buffoli ollama[542149]: llama_model_loader: - type q6_K:    1 tensors
Aug 02 12:08:12 ai-buffoli ollama[542149]: time=2024-08-02T12:08:12.715Z level=INFO source=server.go:612 msg="waiting for server to become available" status="llm server loading model"
Aug 02 12:08:12 ai-buffoli ollama[542149]: llm_load_vocab: missing or unrecognized pre-tokenizer type, using: 'default'
Aug 02 12:08:12 ai-buffoli ollama[542149]: llm_load_vocab: special tokens cache size = 1008
Aug 02 12:08:13 ai-buffoli ollama[542149]: llm_load_vocab: token to piece cache size = 1.8528 MB
Aug 02 12:08:13 ai-buffoli ollama[542149]: llm_load_print_meta: format           = GGUF V3 (latest)
Aug 02 12:08:13 ai-buffoli ollama[542149]: llm_load_print_meta: arch             = command-r
Aug 02 12:08:13 ai-buffoli ollama[542149]: llm_load_print_meta: vocab type       = BPE
Aug 02 12:08:13 ai-buffoli ollama[542149]: llm_load_print_meta: n_vocab          = 256000
Aug 02 12:08:13 ai-buffoli ollama[542149]: llm_load_print_meta: n_merges         = 253333
Aug 02 12:08:13 ai-buffoli ollama[542149]: llm_load_print_meta: vocab_only       = 0
Aug 02 12:08:13 ai-buffoli ollama[542149]: llm_load_print_meta: n_ctx_train      = 8192
Aug 02 12:08:13 ai-buffoli ollama[542149]: llm_load_print_meta: n_embd           = 8192
Aug 02 12:08:13 ai-buffoli ollama[542149]: llm_load_print_meta: n_layer          = 40
Aug 02 12:08:13 ai-buffoli ollama[542149]: llm_load_print_meta: n_head           = 64
Aug 02 12:08:13 ai-buffoli ollama[542149]: llm_load_print_meta: n_head_kv        = 64
Aug 02 12:08:13 ai-buffoli ollama[542149]: llm_load_print_meta: n_rot            = 128
Aug 02 12:08:13 ai-buffoli ollama[542149]: llm_load_print_meta: n_swa            = 0
Aug 02 12:08:13 ai-buffoli ollama[542149]: llm_load_print_meta: n_embd_head_k    = 128
Aug 02 12:08:13 ai-buffoli ollama[542149]: llm_load_print_meta: n_embd_head_v    = 128
Aug 02 12:08:13 ai-buffoli ollama[542149]: llm_load_print_meta: n_gqa            = 1
Aug 02 12:08:13 ai-buffoli ollama[542149]: llm_load_print_meta: n_embd_k_gqa     = 8192
Aug 02 12:08:13 ai-buffoli ollama[542149]: llm_load_print_meta: n_embd_v_gqa     = 8192
Aug 02 12:08:13 ai-buffoli ollama[542149]: llm_load_print_meta: f_norm_eps       = 1.0e-05
Aug 02 12:08:13 ai-buffoli ollama[542149]: llm_load_print_meta: f_norm_rms_eps   = 0.0e+00
Aug 02 12:08:13 ai-buffoli ollama[542149]: llm_load_print_meta: f_clamp_kqv      = 0.0e+00
Aug 02 12:08:13 ai-buffoli ollama[542149]: llm_load_print_meta: f_max_alibi_bias = 0.0e+00
Aug 02 12:08:13 ai-buffoli ollama[542149]: llm_load_print_meta: f_logit_scale    = 6.2e-02
Aug 02 12:08:13 ai-buffoli ollama[542149]: llm_load_print_meta: n_ff             = 22528
Aug 02 12:08:13 ai-buffoli ollama[542149]: llm_load_print_meta: n_expert         = 0
Aug 02 12:08:13 ai-buffoli ollama[542149]: llm_load_print_meta: n_expert_used    = 0
Aug 02 12:08:13 ai-buffoli ollama[542149]: llm_load_print_meta: causal attn      = 1
Aug 02 12:08:13 ai-buffoli ollama[542149]: llm_load_print_meta: pooling type     = 0
Aug 02 12:08:13 ai-buffoli ollama[542149]: llm_load_print_meta: rope type        = 0
Aug 02 12:08:13 ai-buffoli ollama[542149]: llm_load_print_meta: rope scaling     = none
Aug 02 12:08:13 ai-buffoli ollama[542149]: llm_load_print_meta: freq_base_train  = 8000000.0
Aug 02 12:08:13 ai-buffoli ollama[542149]: llm_load_print_meta: freq_scale_train = 1
Aug 02 12:08:13 ai-buffoli ollama[542149]: llm_load_print_meta: n_ctx_orig_yarn  = 8192
Aug 02 12:08:13 ai-buffoli ollama[542149]: llm_load_print_meta: rope_finetuned   = unknown
Aug 02 12:08:13 ai-buffoli ollama[542149]: llm_load_print_meta: ssm_d_conv       = 0
Aug 02 12:08:13 ai-buffoli ollama[542149]: llm_load_print_meta: ssm_d_inner      = 0
Aug 02 12:08:13 ai-buffoli ollama[542149]: llm_load_print_meta: ssm_d_state      = 0
Aug 02 12:08:13 ai-buffoli ollama[542149]: llm_load_print_meta: ssm_dt_rank      = 0
Aug 02 12:08:13 ai-buffoli ollama[542149]: llm_load_print_meta: model type       = 35B
Aug 02 12:08:13 ai-buffoli ollama[542149]: llm_load_print_meta: model ftype      = Q4_0
Aug 02 12:08:13 ai-buffoli ollama[542149]: llm_load_print_meta: model params     = 34.98 B
Aug 02 12:08:13 ai-buffoli ollama[542149]: llm_load_print_meta: model size       = 18.83 GiB (4.62 BPW)
Aug 02 12:08:13 ai-buffoli ollama[542149]: llm_load_print_meta: general.name     = aya-23-35B
Aug 02 12:08:13 ai-buffoli ollama[542149]: llm_load_print_meta: BOS token        = 5 '<BOS_TOKEN>'
Aug 02 12:08:13 ai-buffoli ollama[542149]: llm_load_print_meta: EOS token        = 255001 '<|END_OF_TURN_TOKEN|>'
Aug 02 12:08:13 ai-buffoli ollama[542149]: llm_load_print_meta: PAD token        = 0 '<PAD>'
Aug 02 12:08:13 ai-buffoli ollama[542149]: llm_load_print_meta: LF token         = 136 'Ä'
Aug 02 12:08:13 ai-buffoli ollama[542149]: llm_load_print_meta: max token length = 1024
Aug 02 12:08:13 ai-buffoli ollama[542149]: ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
Aug 02 12:08:13 ai-buffoli ollama[542149]: ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
Aug 02 12:08:13 ai-buffoli ollama[542149]: ggml_cuda_init: found 2 CUDA devices:
Aug 02 12:08:13 ai-buffoli ollama[542149]:   Device 0: NVIDIA RTX 4000 Ada Generation, compute capability 8.9, VMM: yes
Aug 02 12:08:13 ai-buffoli ollama[542149]:   Device 1: NVIDIA RTX 4000 Ada Generation, compute capability 8.9, VMM: yes
Aug 02 12:08:13 ai-buffoli ollama[542149]: llm_load_tensors: ggml ctx size =    0.47 MiB
Aug 02 12:08:13 ai-buffoli ollama[542149]: llm_load_tensors: offloading 40 repeating layers to GPU
Aug 02 12:08:13 ai-buffoli ollama[542149]: llm_load_tensors: offloading non-repeating layers to GPU
Aug 02 12:08:13 ai-buffoli ollama[542149]: llm_load_tensors: offloaded 41/41 layers to GPU
Aug 02 12:08:13 ai-buffoli ollama[542149]: llm_load_tensors:        CPU buffer size =  1640.62 MiB
Aug 02 12:08:13 ai-buffoli ollama[542149]: llm_load_tensors:      CUDA0 buffer size =  9261.66 MiB
Aug 02 12:08:13 ai-buffoli ollama[542149]: llm_load_tensors:      CUDA1 buffer size = 10020.25 MiB
Aug 02 12:08:15 ai-buffoli ollama[542149]: llama_new_context_with_model: n_ctx      = 3904
Aug 02 12:08:15 ai-buffoli ollama[542149]: llama_new_context_with_model: n_batch    = 512
Aug 02 12:08:15 ai-buffoli ollama[542149]: llama_new_context_with_model: n_ubatch   = 512
Aug 02 12:08:15 ai-buffoli ollama[542149]: llama_new_context_with_model: flash_attn = 0
Aug 02 12:08:15 ai-buffoli ollama[542149]: llama_new_context_with_model: freq_base  = 8000000.0
Aug 02 12:08:15 ai-buffoli ollama[542149]: llama_new_context_with_model: freq_scale = 1
Aug 02 12:08:15 ai-buffoli ollama[542149]: llama_kv_cache_init:      CUDA0 KV buffer size =  2562.00 MiB
Aug 02 12:08:15 ai-buffoli ollama[542149]: llama_kv_cache_init:      CUDA1 KV buffer size =  2318.00 MiB
Aug 02 12:08:15 ai-buffoli ollama[542149]: llama_new_context_with_model: KV self size  = 4880.00 MiB, K (f16): 2440.00 MiB, V (f16): 2440.00 MiB
Aug 02 12:08:15 ai-buffoli ollama[542149]: llama_new_context_with_model:  CUDA_Host  output buffer size =     1.01 MiB
Aug 02 12:08:15 ai-buffoli ollama[542149]: llama_new_context_with_model: pipeline parallelism enabled (n_copies=4)
Aug 02 12:08:15 ai-buffoli ollama[542149]: llama_new_context_with_model:      CUDA0 compute buffer size =   646.51 MiB
Aug 02 12:08:15 ai-buffoli ollama[542149]: llama_new_context_with_model:      CUDA1 compute buffer size =   646.52 MiB
Aug 02 12:08:15 ai-buffoli ollama[542149]: llama_new_context_with_model:  CUDA_Host compute buffer size =    46.52 MiB
Aug 02 12:08:15 ai-buffoli ollama[542149]: llama_new_context_with_model: graph nodes  = 1208
Aug 02 12:08:15 ai-buffoli ollama[542149]: llama_new_context_with_model: graph splits = 3
Aug 02 12:08:16 ai-buffoli ollama[552767]: INFO [main] model loaded | tid="123284990672896" timestamp=1722600496
Aug 02 12:08:16 ai-buffoli ollama[542149]: time=2024-08-02T12:08:16.984Z level=INFO source=server.go:617 msg="llama runner started in 4.52 seconds"
Aug 02 12:08:19 ai-buffoli ollama[542149]: [GIN] 2024/08/02 - 12:08:19 | 200 |  7.237094677s |       127.0.0.1 | POST     "/api/chat"
