Creating embeddings with ollama extremely slow #1787

Open
jabbor opened this issue Mar 22, 2024 · 15 comments
Comments

@jabbor

jabbor commented Mar 22, 2024

This is a Windows setup, also using Ollama for Windows.

System:

  • Windows 11
  • 64GB memory
  • RTX 4090 (cuda installed)

Setup: poetry install --extras "ui vector-stores-qdrant llms-ollama embeddings-ollama"

Ollama: pull mixtral, then pull nomic-embed-text.
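That is, the models were pulled with:

ollama pull mixtral
ollama pull nomic-embed-text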

This is what the log shows (startup, then loading a 1 KB txt file). It is taking a long time.

Did I do something wrong?

Using python3 (3.11.8)
13:21:55.666 [INFO ] private_gpt.settings.settings_loader - Starting application with profiles=['default', 'ollama']
None of PyTorch, TensorFlow >= 2.0, or Flax have been found. Models won't be available and only tokenizers, configuration and file/data utilities can be used.
tokenizer_config.json: 100%|██████████████████████████████████████████████████████████████| 1.46k/1.46k [00:00<?, ?B/s]
13:22:03.875 [WARNING ] py.warnings - C:\Users\jwbor\AppData\Local\pypoetry\Cache\virtualenvs\private-gpt-TFCUF6yI-py3.11\Lib\site-packages\huggingface_hub\file_download.py:147: UserWarning: huggingface_hub cache-system uses symlinks by default to efficiently store duplicated files but your machine does not support them in D:\privategpt\models\cache. Caching files will still work but in a degraded version that might require more space on your disk. This warning can be disabled by setting the HF_HUB_DISABLE_SYMLINKS_WARNING environment variable. For more details, see https://huggingface.co/docs/huggingface_hub/how-to-cache#limitations.
To support symlinks on Windows, you either need to activate Developer Mode or to run Python as an administrator. In order to see activate developer mode, see this article: https://docs.microsoft.com/en-us/windows/apps/get-started/enable-your-device-for-development
warnings.warn(message)

tokenizer.model: 100%|██████████████████████████████████████████████████████████████| 493k/493k [00:00<00:00, 39.6MB/s]
tokenizer.json: 100%|█████████████████████████████████████████████████████████████| 1.80M/1.80M [00:00<00:00, 3.74MB/s]
special_tokens_map.json: 100%|███████████████████████████████████████████████████████| 72.0/72.0 [00:00<00:00, 144kB/s]
13:22:05.412 [INFO ] private_gpt.components.llm.llm_component - Initializing the LLM in mode=ollama
13:22:06.695 [INFO ] private_gpt.components.embedding.embedding_component - Initializing the embedding model in mode=ollama
13:22:06.706 [INFO ] llama_index.core.indices.loading - Loading all indices.
13:22:06.706 [INFO ] private_gpt.components.ingest.ingest_component - Creating a new vector store index
Parsing nodes: 0it [00:00, ?it/s]
Generating embeddings: 0it [00:00, ?it/s]
13:22:06.827 [INFO ] private_gpt.ui.ui - Mounting the gradio UI, at path=/
13:22:06.983 [INFO ] uvicorn.error - Started server process [1572]
13:22:06.983 [INFO ] uvicorn.error - Waiting for application startup.
13:22:06.983 [INFO ] uvicorn.error - Application startup complete.
13:22:06.983 [INFO ] uvicorn.error - Uvicorn running on http://0.0.0.0:8001 (Press CTRL+C to quit)
13:22:33.469 [INFO ] uvicorn.access - 127.0.0.1:57963 - "GET / HTTP/1.1" 200
13:22:33.559 [INFO ] uvicorn.access - 127.0.0.1:57963 - "GET /info HTTP/1.1" 200
13:22:33.563 [INFO ] uvicorn.access - 127.0.0.1:57963 - "GET /theme.css HTTP/1.1" 200
13:22:33.768 [INFO ] uvicorn.access - 127.0.0.1:57963 - "POST /run/predict HTTP/1.1" 200
13:22:33.774 [INFO ] uvicorn.access - 127.0.0.1:57963 - "POST /queue/join HTTP/1.1" 200
13:22:33.777 [INFO ] uvicorn.access - 127.0.0.1:57963 - "GET /queue/data?session_hash=94yqqpkh9p HTTP/1.1" 200
13:22:42.139 [INFO ] uvicorn.access - 127.0.0.1:57964 - "POST /upload HTTP/1.1" 200
13:22:42.144 [INFO ] uvicorn.access - 127.0.0.1:57964 - "POST /queue/join HTTP/1.1" 200
13:22:42.148 [INFO ] uvicorn.access - 127.0.0.1:57964 - "GET /queue/data?session_hash=94yqqpkh9p HTTP/1.1" 200
13:22:42.209 [INFO ] private_gpt.server.ingest.ingest_service - Ingesting file_names=['boericke_zizia.txt']
Parsing nodes: 100%|███████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 1001.03it/s]
Generating embeddings: 100%|███████████████████████████████████████████████████████████| 18/18 [00:37<00:00, 2.10s/it]
Generating embeddings: 0it [00:00, ?it/s]
13:23:21.988 [INFO ] private_gpt.server.ingest.ingest_service - Finished ingestion file_name=['boericke_zizia.txt']
13:23:22.054 [INFO ] uvicorn.access - 127.0.0.1:57964 - "POST /queue/join HTTP/1.1" 200
13:23:22.057 [INFO ] uvicorn.access - 127.0.0.1:57964 - "GET /queue/data?session_hash=94yqqpkh9p HTTP/1.1" 200
13:23:22.167 [INFO ] uvicorn.access - 127.0.0.1:57964 - "POST /queue/join HTTP/1.1" 200
13:23:22.171 [INFO ] uvicorn.access - 127.0.0.1:57964 - "GET /queue/data?session_hash=94yqqpkh9p HTTP/1.1" 200

@JTMarsh556

I'm in for the answer here. Ollama is very slow for me. I switched to llama and it is much faster. There is something broken with Ollama and ingestion.

@dbzoo
Contributor

dbzoo commented Mar 22, 2024

What ingest_mode are you using? It has a significant impact on timing; pipeline is the fastest (#1750). Also, there is a problem with Ollama that was fixed in version 0.1.29 (ref #1691).
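For reference, a minimal sketch of what that would look like in the embedding section of settings-ollama.yaml (the worker count is just an example; tune it for your machine):

embedding:
  mode: ollama
  ingest_mode: pipeline
  count_workers: 4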

@jabbor
Author

jabbor commented Mar 22, 2024

I use simple, Ollama version 0.1.29.

@JTMarsh556

I updated the settings-ollama.yaml file to what you linked and verified my Ollama version was 0.1.29, but I'm not seeing much of a speed improvement and my GPU doesn't seem to be getting tasked. Neither the available RAM nor the CPU seems to be driven much either. Three files totaling roughly 6.5 MB are taking close to 30 minutes (20 minutes @ 8 workers) to ingest, whereas llama completed the ingest in less than a minute. Am I doing something wrong?

@jabbor
Author

jabbor commented Mar 22, 2024

I've switched over to LM Studio (0.2.17) with Mixtral Instruct 8x Q4_K_M and started the server in LM Studio.

I installed privateGPT with the following command:
poetry install --extras "ui llms-openai-like embeddings-huggingface vector-stores-qdrant"

settings-vllm.yaml:

server:
  env_name: ${APP_ENV:vllm}

llm:
  mode: openailike

embedding:
  mode: huggingface
  ingest_mode: simple

local:
  embedding_hf_model_name: nomic-embed-text

openai:
  api_base: http://localhost:1234/v1
  api_key: EMPTY
  model: mixtral

set PGPT_PROFILES=vllm
make run

And: it now ingests much faster!

But I hope that you can amend privateGPT so that it also runs fast with Ollama!

@paul-asvb

The pipeline mode from @dbzoo is super fast 🚀! For my tests (>100k docs) the bottleneck is still the index writing.

@dbzoo
Contributor

dbzoo commented Mar 27, 2024

@paul-asvb Index writing will always be a bottleneck. With pipeline mode the index updates in the background while ingestion (the embedding work) continues. Depending on how long the index update takes, I have seen the embed workers' output queue fill up, which stalls the workers; this is on purpose, as per the design. We could tweak the queue sizes a little, but at the end of the day writing everything to one index will always be an issue. Glad you noticed that pipeline is fast; I thought so.

@Robinsane
Contributor

It could also be the swapping from the LLM to the embedding model and back that makes it very slow; see my PR #1800.

@Stego72

Stego72 commented Apr 15, 2024

Same here. There seems to be a "hard limit" somewhere setting the pace to 2.06 to 2.11 s/it on "Generating embeddings": when embedding multiple files in "pipeline" mode, all workers are capped at that speed.

@stevenlafl

pipeline does not help for me at all.

@DesertReady1

Anyone find a fix for this yet? I tried the pipeline settings in the YAML file and it only increased the speed a little bit. It's still taking a long time to ingest a 46-page, 1.5 MB file.

@quincy451

I got similar results... 40 docs took all night, but only about an hour on version 0.2. Going to try the LM Studio idea presented in this thread as a workaround.

@nopmop

nopmop commented Jun 29, 2024

I kind of debugged this. There are multiple reasons.
On the privateGPT side, the embedding section in settings-ollama.yaml ships without two parameters:

  ingest_mode: parallel
  count_workers: <workers_count>

which is referenced here:

elif ingest_mode == "parallel":

If you add those parameters, the files will be processed in parallel. But there will still be bottlenecks in your vector store and in Ollama.
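As a sketch, the amended embedding section would then look something like this (the worker count is arbitrary):

embedding:
  mode: ollama
  ingest_mode: parallel
  count_workers: 8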

On the Ollama side, the problem is that Ollama starts by default with --parallel 1.
You can't just pass the --parallel parameter yourself, because the model is started by the Ollama server, so before the server starts you need to set the OLLAMA_NUM_PARALLEL environment variable. However... under Linux, it seems that the Ollama server (i.e. the ollama serve command run by systemd -> /etc/systemd/system/ollama.service) doesn't even see the environment passed to it via Environment=... systemd directives or via subshell/export tricks.
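For reference, these are roughly the two ways to try passing it (a sketch; the value 4 is arbitrary):

# When starting the server by hand:
export OLLAMA_NUM_PARALLEL=4
ollama serve

# Or for the systemd service, under [Service] in /etc/systemd/system/ollama.service
# (followed by systemctl daemon-reload and systemctl restart ollama):
# Environment="OLLAMA_NUM_PARALLEL=4"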

On the vector store side, if you use Qdrant, the problem is that you can't rely on concurrent access if you configure it with path: local_data..., so you need to run the Qdrant server. A further problem arises when the server gives a subtle error because of max_optimization_threads: null, which is a default config parameter in Qdrant; the value shouldn't be null.
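As a sketch of the two config changes I mean (assuming privateGPT's qdrant settings section and Qdrant's own config.yaml; key names may differ between versions):

# settings-ollama.yaml: talk to a running Qdrant server instead of the embedded, file-based store
vectorstore:
  database: qdrant
qdrant:
  url: http://localhost:6333

# Qdrant config.yaml: give the optimizer an explicit thread count instead of null
storage:
  optimizers:
    max_optimization_threads: 2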

Overall, a big performance improvement can always be achieved by using a memfs for both the source files and the datastore, regardless of your configuration.
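For example, on Linux (a sketch; the mount point and size are arbitrary):

sudo mount -t tmpfs -o size=8G tmpfs /mnt/ramdisk
mkdir -p /mnt/ramdisk/docs /mnt/ramdisk/local_data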

Anyway, it needs a bit of work. Maybe I'll write a PR and make a curses dashboard for ingestion. It seems needed, especially for large datasets, where an estimate of how long the process will take is useful.
I hope this helps.

@stevenlafl

stevenlafl commented Jun 29, 2024 via email

@nopmop

nopmop commented Aug 11, 2024

I have tried my best, but I cannot resolve the GPU underutilization.
I've fixed the Ollama --parallel setting (I was wrong: Ollama does indeed pick up the environment variable) and I have tried 3 or 4 different parallelization mechanisms for the ingestion process. I managed to achieve exactly the same speed as pipeline mode (the fastest), but no matter what I try I can't get it to go faster. I haven't done extensive profiling, but I think the bottleneck is not in the ingestion process itself but elsewhere. As previously suggested by others, it's better to switch to LM Studio.
My 2 cents.
