This issue was moved to a discussion.
You can continue the conversation there. Go to discussion →
Slow document processing #131
All the time is taken by the second step, "gen_embeddings", which runs after summarization.
hi @KSemenenko, pipeline steps cannot run in parallel because each step depends on the previous one. However, if you import multiple documents, each document is processed in parallel. If you use the async pipelines, e.g. with Azure Queues or RabbitMQ, the ingestion runs in the background without blocking your apps. Summarization still takes time, but it runs after indexing the document chunks, so you shouldn't see much difference. In the latest NuGet package the summarization step is disabled by default, so you might see an improvement, unless you still want to use summaries.
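To illustrate the per-document parallelism described above, here is a minimal sketch. `ImportDocumentAsync` stands in for the Kernel Memory client method of the same name, and the file names are placeholders; the stand-in only simulates the pipeline so the snippet is self-contained.

```csharp
using System;
using System.Linq;
using System.Threading.Tasks;

class ParallelImportDemo
{
    // Stand-in for IKernelMemory.ImportDocumentAsync: each document's
    // pipeline is independent, so the tasks can run concurrently.
    static async Task<string> ImportDocumentAsync(string file)
    {
        await Task.Delay(100); // simulate extract/partition/gen_embeddings/save_records
        return file;
    }

    static async Task Main()
    {
        var files = new[] { "doc1.pdf", "doc2.pdf", "doc3.pdf" };

        // Steps inside one pipeline are sequential, but documents are not:
        // start all imports at once, then await them together.
        var done = await Task.WhenAll(files.Select(ImportDocumentAsync));
        Console.WriteLine(string.Join(",", done));
    }
}
```

`Task.WhenAll` preserves the order of the input tasks, so the output lists the files in the order they were submitted, even though the imports overlap in time.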
@dluc I've made a PR, and I think these files can definitely be processed in parallel.
We have removed Summarization from the default pipeline, so by default ingestion is quite fast now. About the PR, I think the best approach is using that as a custom handler, rather than changing the existing one. We will soon provide some examples showing how to plug in custom handlers, both in the service and in the serverless option.
I did as you suggested. Anyway, embedding is slow. I increased the number of tokens, but it's still not as fast as some services. That PR is more of a code example for discussion.
@dluc |
Also, it would be good to see progress during processing, maybe some action in addition to public void MarkProcessedBy(IPipelineStepHandler handler)
@KSemenenko about seeing progress, the status endpoint returns the document's processing status, including the completed and remaining steps. For instance:

GET http://localhost:9001/upload-status?documentId=doc001

{
"completed": true,
"failed": false,
"empty": false,
"index": "default",
"document_id": "doc001",
"tags": {},
"creation": "2023-11-30T05:46:44.808832+00:00",
"last_update": "2023-11-30T05:46:47.400603+00:00",
"steps": [
"extract",
"partition",
"gen_embeddings",
"save_records"
],
"remaining_steps": [],
"completed_steps": [
"extract",
"partition",
"gen_embeddings",
"save_records"
]
}
Yes, but I was thinking more about progress within a single step.
## Motivation and Context (Why the change? What's the scenario?)

Add 2 additional experimental handlers: one to generate embeddings in parallel, and a second to summarize a document, also working in parallel. Related to #131

Co-authored-by: Devis Lucato <devis@microsoft.com>
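The general pattern behind a "generate embeddings in parallel" handler can be sketched as follows. This is not the PR's code; `EmbedAsync` is a stand-in for a real embedding service call, and the chunk names are invented for illustration. A `SemaphoreSlim` bounds concurrency so the embedding service isn't flooded.

```csharp
using System;
using System.Linq;
using System.Threading;
using System.Threading.Tasks;

class ParallelEmbeddingsDemo
{
    // Stand-in for an embedding service call; returns a tiny fake vector.
    static async Task<float[]> EmbedAsync(string chunk)
    {
        await Task.Delay(50); // simulate network latency
        return new float[] { chunk.Length };
    }

    static async Task Main()
    {
        var chunks = Enumerable.Range(0, 8).Select(i => $"chunk-{i}").ToArray();

        // Allow at most 4 embedding requests in flight at a time.
        using var gate = new SemaphoreSlim(4);
        var tasks = chunks.Select(async chunk =>
        {
            await gate.WaitAsync();
            try { return await EmbedAsync(chunk); }
            finally { gate.Release(); }
        });

        float[][] vectors = await Task.WhenAll(tasks);
        Console.WriteLine(vectors.Length); // one vector per chunk
    }
}
```

Compared to embedding chunks one at a time, this overlaps the network latency of the individual requests, which is where most of the "gen_embeddings" time goes.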
#147 merged. For now the new handlers are experimental and used only on demand, using the |
Hello, is there a way to parallelize the RunPipelineAsync method so it executes steps in parallel?
For example summarization, or uploading different parts of documents?