This issue was moved to a discussion.

You can continue the conversation there. Go to discussion →


Slow document processing #131

Closed
KSemenenko opened this issue Nov 1, 2023 · 11 comments

Comments

@KSemenenko
Contributor

Hello, is there a way to parallelize the RunPipelineAsync method so that steps execute in parallel?
For example summarization, or uploading different parts of documents?

@KSemenenko KSemenenko changed the title Bad performance Slow document processing Nov 1, 2023
@KSemenenko
Contributor Author

Most of the time is taken by the second step, "gen_embeddings", which runs after summarization.

@KSemenenko
Contributor Author

[Screenshot 2023-11-06 at 22:04:57]

@dluc
Collaborator

dluc commented Nov 7, 2023

hi @KSemenenko, pipeline steps cannot run in parallel because each step depends on the previous one. However, if you import multiple documents, each document is processed in parallel. If you use the async pipelines, e.g. with Azure Queues or RabbitMQ, the ingestion runs in the background without blocking your app.

Summarization still takes time, but it runs after indexing the document chunks, so you shouldn't see much difference.

In the latest NuGet package the summarization step is disabled by default, so you might see an improvement, unless you still want to use summaries.
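For reference, a minimal sketch of the "multiple documents in parallel" approach described above. This assumes an `IKernelMemory` instance named `memory` built elsewhere with `KernelMemoryBuilder`; the file names and document IDs are placeholders:

```csharp
using System.Linq;
using Microsoft.KernelMemory;

// Each pipeline runs its steps sequentially, but separate documents
// can be imported concurrently.
var files = new[] { "doc1.pdf", "doc2.pdf", "doc3.pdf" };

var imports = files.Select(
    (file, i) => memory.ImportDocumentAsync(file, documentId: $"doc{i + 1:000}"));

await Task.WhenAll(imports);
```

With the queue-based async pipelines, each `ImportDocumentAsync` call simply enqueues the work, so the calls return quickly and the service processes the documents in the background.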

@dluc dluc self-assigned this Nov 7, 2023
@dluc dluc added the "question" (Further information is requested) label Nov 7, 2023
@KSemenenko
Contributor Author

@dluc I've made a PR, and I think these files can definitely be processed in parallel.
What do you think?

@dluc
Collaborator

dluc commented Nov 17, 2023

We have removed Summarization from the default pipeline, so by default ingestion is quite fast now. About the PR, I think the best approach is using it as a custom handler, rather than changing the existing one. We will soon provide some examples showing how to plug in custom handlers, both in the service and in the serverless mode.
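For anyone following along, the "custom handler" route mentioned above boils down to implementing `IPipelineStepHandler`. The skeleton below is only a sketch: it assumes the handler interface shape from that era of the library (the exact signature and registration mechanism may differ across versions), and the step name is a placeholder:

```csharp
using Microsoft.KernelMemory.Pipeline;

// Hypothetical custom step that could process a document's partitions
// concurrently. "my_gen_embeddings" is a placeholder step name.
public class MyParallelEmbeddingHandler : IPipelineStepHandler
{
    public string StepName => "my_gen_embeddings";

    public async Task<(bool success, DataPipeline updatedPipeline)> InvokeAsync(
        DataPipeline pipeline, CancellationToken cancellationToken = default)
    {
        // ... iterate pipeline.Files and generate embeddings, possibly in parallel ...
        return (true, pipeline);
    }
}
```

The handler is then registered with the orchestrator under its step name and invoked only for documents whose pipeline includes that step.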

@KSemenenko
Contributor Author

I did as you suggested.

Anyway, embedding is still slow. I increased the number of tokens per request, but it's still not as fast as some services.

That PR is more of a code example for discussion.

@KSemenenko
Contributor Author

@dluc
National Planning Policy Framework.pdf
you can also find it here https://www.gov.uk/government/publications/national-planning-policy-framework--2
I have a favorite test PDF, and for me processing takes forever.

@KSemenenko
Contributor Author

KSemenenko commented Nov 20, 2023

It would also be good to see progress during processing, maybe some action in addition to

public void MarkProcessedBy(IPipelineStepHandler handler)

@dluc
Collaborator

dluc commented Nov 30, 2023

@KSemenenko about seeing progress: the status endpoint returns a DataPipelineStatus, which lets you see the pipeline progress. See the completed, steps, remaining_steps and completed_steps fields. The same data is used by IKernelMemory.IsDocumentReadyAsync()

For instance:

GET http://localhost:9001/upload-status?documentId=doc001

{
  "completed": true,
  "failed": false,
  "empty": false,
  "index": "default",
  "document_id": "doc001",
  "tags": {},
  "creation": "2023-11-30T05:46:44.808832+00:00",
  "last_update": "2023-11-30T05:46:47.400603+00:00",
  "steps": [
    "extract",
    "partition",
    "gen_embeddings",
    "save_records"
  ],
  "remaining_steps": [],
  "completed_steps": [
    "extract",
    "partition",
    "gen_embeddings",
    "save_records"
  ]
}
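From code, the same status can be polled until ingestion finishes. A rough sketch, assuming a `MemoryWebClient` pointed at the service above and that `GetDocumentStatusAsync` returns the `DataPipelineStatus` shown in the JSON (property names per the C# client, not the snake_case JSON):

```csharp
using Microsoft.KernelMemory;

var memory = new MemoryWebClient("http://localhost:9001");

DataPipelineStatus? status;
do
{
    await Task.Delay(TimeSpan.FromSeconds(2));
    status = await memory.GetDocumentStatusAsync(documentId: "doc001");
    Console.WriteLine(
        $"Completed steps: {status?.CompletedSteps.Count}/{status?.Steps.Count}");
}
while (status is { Completed: false, Failed: false });
```

Note this reports progress only at step granularity, which is exactly the limitation raised in the next comment.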

@KSemenenko
Contributor Author

Yes, but I'm thinking more about progress within a step.
For example, for embeddings: if we have 1k partitions, it would be good to know that the process is making progress, especially when we hit the TPM limit; then we would see it. Right now it's just "step 3" and that's all. For small documents that's fine, but for a 100+ page document you never know the actual progress within a step.

dluc added a commit that referenced this issue Apr 16, 2024
## Motivation and Context (Why the change? What's the scenario?)

Add two additional experimental handlers: one to generate embeddings in parallel, and a second to summarize a document in parallel.

Related to #131

---------

Co-authored-by: Devis Lucato <devis@microsoft.com>
@dluc
Collaborator

dluc commented Apr 16, 2024

#147 merged.

For now the new handlers are experimental and are used only on demand, via the steps parameter.
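Opting into non-default handlers per document looks roughly like this, using the `steps` parameter of `ImportDocumentAsync`. The step names below are illustrative only; check the handlers registered in your deployment (or the #147 diff) for the exact names of the experimental parallel handlers:

```csharp
using Microsoft.KernelMemory;

// Override the default pipeline for this one document, swapping in a
// hypothetical parallel embedding step ("gen_embeddings_parallel" is a
// placeholder name, not confirmed by this thread).
await memory.ImportDocumentAsync(
    filePath: "National Planning Policy Framework.pdf",
    documentId: "doc001",
    steps: new[] { "extract", "partition", "gen_embeddings_parallel", "save_records" });
```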

@dluc dluc closed this as completed Apr 16, 2024
@microsoft microsoft locked and limited conversation to collaborators Jun 4, 2024
@dluc dluc converted this issue into discussion #569 Jun 4, 2024
@dluc dluc added the "discussion" label and removed the "question" label Jun 4, 2024

