Slow document processing #569
Replies: 11 comments
-
All of the time is taken by the second step, "gen_embeddings", after summarization.
-
Hi @KSemenenko, pipeline steps cannot run in parallel because each step depends on the previous one. However, if you import multiple documents, each document is processed in parallel. If you use the async pipelines, e.g. with Azure Queue or RabbitMQ, the ingestion runs in the background without blocking your apps. Summarizing still takes time, but it runs after indexing the document chunks, so you shouldn't see much difference. In the latest NuGet package the summarization step is disabled by default, so you might see an improvement, unless you still want to use summaries.
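For illustration, a minimal sketch of importing several documents concurrently with the serverless setup, assuming the Microsoft.KernelMemory builder and IKernelMemory.ImportDocumentAsync API (exact builder methods may differ between package versions):
using System;
using System.Linq;
using System.Threading.Tasks;
using Microsoft.KernelMemory;
// Assumption: serverless setup with OpenAI defaults; swap in your own connectors.
var memory = new KernelMemoryBuilder()
    .WithOpenAIDefaults(Environment.GetEnvironmentVariable("OPENAI_API_KEY"))
    .Build<MemoryServerless>();
// Each document goes through the pipeline independently, so starting several
// imports at once lets the documents be processed in parallel.
var files = new[] { "manual.pdf", "faq.docx", "notes.txt" };
var imports = files.Select((file, i) => memory.ImportDocumentAsync(file, documentId: $"doc{i:000}"));
await Task.WhenAll(imports);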
-
@dluc I've made a PR and I think these files can definitely be run in parallel.
Beta Was this translation helpful? Give feedback.
-
We have removed Summarization from the default pipeline, so by default ingestion is quite fast now. About the PR, I think the best approach is to use that as a custom handler, rather than changing the existing one. We will soon provide some examples showing how to plug in custom handlers, both in the service and in the serverless option.
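A rough sketch of what such a custom handler could look like, assuming the IPipelineStepHandler shape from the 0.x packages; the interface members, namespace, and registration call may differ in your version:
using System.Threading;
using System.Threading.Tasks;
using Microsoft.KernelMemory.Pipeline;
// Assumption: IPipelineStepHandler as in the 0.x packages; verify against your version.
public class ParallelFileHandler : IPipelineStepHandler
{
    public string StepName { get; }

    public ParallelFileHandler(string stepName = "parallel_files")
    {
        this.StepName = stepName;
    }

    public async Task<(bool success, DataPipeline updatedPipeline)> InvokeAsync(
        DataPipeline pipeline, CancellationToken cancellationToken = default)
    {
        // Work on the document's files here, e.g. fanning out with Task.WhenAll,
        // then hand the pipeline back so the orchestrator can run the next step.
        await Task.CompletedTask;
        return (true, pipeline);
    }
}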
-
I did as you suggested. Still, embedding is slow. I increased the number of tokens, but it's still not as fast as some services. That PR is more of a code example for discussion.
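For reference, one way to raise the tokens per chunk is through the partitioning options; this is a sketch assuming the WithCustomTextPartitioningOptions builder extension and TextPartitioningOptions type are available under these names in the installed version:
using System;
using Microsoft.KernelMemory;
using Microsoft.KernelMemory.Configuration;
// Assumption: option names match the installed package; adjust if they differ.
// Larger partitions mean fewer embedding calls per document, at the cost of
// coarser chunks at retrieval time.
var memory = new KernelMemoryBuilder()
    .WithOpenAIDefaults(Environment.GetEnvironmentVariable("OPENAI_API_KEY"))
    .WithCustomTextPartitioningOptions(new TextPartitioningOptions
    {
        MaxTokensPerParagraph = 2000,
        MaxTokensPerLine = 1000,
        OverlappingTokens = 100,
    })
    .Build<MemoryServerless>();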
-
@dluc
-
Also, it would be good to see progress for processing, maybe some action in addition to public void MarkProcessedBy(IPipelineStepHandler handler).
-
@KSemenenko about seeing progress, the status endpoint returns the document status, including the completed and remaining steps. For instance:
GET http://localhost:9001/upload-status?documentId=doc001
{
"completed": true,
"failed": false,
"empty": false,
"index": "default",
"document_id": "doc001",
"tags": {},
"creation": "2023-11-30T05:46:44.808832+00:00",
"last_update": "2023-11-30T05:46:47.400603+00:00",
"steps": [
"extract",
"partition",
"gen_embeddings",
"save_records"
],
"remaining_steps": [],
"completed_steps": [
"extract",
"partition",
"gen_embeddings",
"save_records"
]
}
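A small sketch of turning that response into a coarse progress value by polling the endpoint and comparing completed_steps against steps; the URL and field names come from the response above, while the client loop itself is only illustrative:
using System;
using System.Linq;
using System.Net.Http;
using System.Text.Json;
using System.Threading.Tasks;
// Poll the status endpoint shown above and report per-step progress.
var http = new HttpClient();
var url = "http://localhost:9001/upload-status?documentId=doc001";
while (true)
{
    using var doc = JsonDocument.Parse(await http.GetStringAsync(url));
    var root = doc.RootElement;

    int total = root.GetProperty("steps").GetArrayLength();
    int done = root.GetProperty("completed_steps").GetArrayLength();
    var remaining = root.GetProperty("remaining_steps").EnumerateArray().Select(s => s.GetString());
    Console.WriteLine($"Progress: {done}/{total} steps (remaining: {string.Join(", ", remaining)})");

    if (root.GetProperty("completed").GetBoolean() || root.GetProperty("failed").GetBoolean()) { break; }

    await Task.Delay(TimeSpan.FromSeconds(2));
}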
-
Yes, but I was thinking more about progress within a step.
-
#147 merged. For now the new handlers are experimental and used only on demand, using the
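On-demand usage presumably means passing an explicit step list at import time; a hedged sketch, reusing the memory instance from the earlier snippet, where the step names mirror the status response above and the extra "summarize" step name is an assumption to be checked against the handler's StepName:
using Microsoft.KernelMemory;
// Assumption: optional handlers are requested per document via the "steps"
// parameter of ImportDocumentAsync; step names and ordering may differ.
await memory.ImportDocumentAsync(
    "report.pdf",
    documentId: "doc002",
    steps: new[] { "extract", "partition", "gen_embeddings", "save_records", "summarize" });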
-
Hello, is there a way to parallelize the RunPipelineAsync method so it executes steps in parallel?
For example summarization, or uploading different parts of documents?