This issue was moved to a discussion.

You can continue the conversation there. Go to discussion →


Slow document processing #131

Closed
KSemenenko opened this issue Nov 1, 2023 · 11 comments

Comments

@KSemenenko
Contributor

Hello, is there a way to parallelize the RunPipelineAsync method so that steps execute in parallel?
For example summarization, or uploading different parts of documents?

@KSemenenko KSemenenko changed the title Bad performance Slow document processing Nov 1, 2023
@KSemenenko
Contributor Author

Most of the time is taken by the second step, "gen_embeddings", which runs after summarization.

@KSemenenko
Contributor Author

[Screenshot 2023-11-06 at 22:04:57]

@dluc
Collaborator

dluc commented Nov 7, 2023

hi @KSemenenko, pipeline steps cannot run in parallel because each step depends on the previous one. However, if you import multiple documents, each document is processed in parallel. If you use the async pipelines, e.g. with Azure Queues or RabbitMQ, the ingestion runs in the background without blocking your app.

Summarization still takes time, but it runs after indexing the document chunks, so you shouldn't see much difference.

In the latest NuGet package the summarization step is disabled by default, so you might see an improvement, unless you still want to use summaries.
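For reference, a minimal sketch of the "multiple documents in parallel" approach described above. This assumes an `IKernelMemory` instance named `memory` built elsewhere with `KernelMemoryBuilder`; the file names and document IDs are placeholders:

```csharp
using System.Linq;
using Microsoft.KernelMemory;

// Each pipeline runs its steps sequentially, but separate documents
// can be imported concurrently.
var files = new[] { "doc1.pdf", "doc2.pdf", "doc3.pdf" };

var imports = files.Select(
    (file, i) => memory.ImportDocumentAsync(file, documentId: $"doc{i + 1:000}"));

await Task.WhenAll(imports);
```

With the queue-based async pipelines, each `ImportDocumentAsync` call simply enqueues the work, so the calls return quickly and the service processes the documents in the background.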

@dluc dluc self-assigned this Nov 7, 2023
@dluc dluc added the "question" (Further information is requested) label Nov 7, 2023
@KSemenenko
Contributor Author

@dluc I've made a PR, and I think these files can definitely be processed in parallel.
What do you think?

@dluc
Collaborator

dluc commented Nov 17, 2023

We have removed Summarization from the default pipeline, so by default ingestion is quite fast now. About the PR, I think the best approach is using it as a custom handler, rather than changing the existing one. We will soon provide some examples showing how to plug in custom handlers, both in the service and in the serverless mode.
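For anyone following along, the "custom handler" route mentioned above boils down to implementing `IPipelineStepHandler`. The skeleton below is only a sketch: it assumes the handler interface shape from that era of the library (the exact signature and registration mechanism may differ across versions), and the step name is a placeholder:

```csharp
using Microsoft.KernelMemory.Pipeline;

// Hypothetical custom step that could process a document's partitions
// concurrently. "my_gen_embeddings" is a placeholder step name.
public class MyParallelEmbeddingHandler : IPipelineStepHandler
{
    public string StepName => "my_gen_embeddings";

    public async Task<(bool success, DataPipeline updatedPipeline)> InvokeAsync(
        DataPipeline pipeline, CancellationToken cancellationToken = default)
    {
        // ... iterate pipeline.Files and generate embeddings, possibly in parallel ...
        return (true, pipeline);
    }
}
```

The handler is then registered with the orchestrator under its step name and invoked only for documents whose pipeline includes that step.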

@KSemenenko
Contributor Author

I did as you suggested.

Anyway, embedding is still slow. I increased the number of tokens per request, but it's still not as fast as some services.

That PR is more of a code example for discussion.

@KSemenenko
Contributor Author

@dluc
National Planning Policy Framework.pdf
you can also find it here https://www.gov.uk/government/publications/national-planning-policy-framework--2
I have a favorite test PDF, and for me processing takes forever.

@KSemenenko
Contributor Author

KSemenenko commented Nov 20, 2023

It would also be good to see progress during processing, maybe some action in addition to

public void MarkProcessedBy(IPipelineStepHandler handler)

@dluc
Collaborator

dluc commented Nov 30, 2023

@KSemenenko about seeing progress: the status endpoint returns a DataPipelineStatus, which lets you see the pipeline progress. See the completed, steps, remaining_steps and completed_steps fields. The same data is used by IKernelMemory.IsDocumentReadyAsync()

For instance:

GET http://localhost:9001/upload-status?documentId=doc001

{
  "completed": true,
  "failed": false,
  "empty": false,
  "index": "default",
  "document_id": "doc001",
  "tags": {},
  "creation": "2023-11-30T05:46:44.808832+00:00",
  "last_update": "2023-11-30T05:46:47.400603+00:00",
  "steps": [
    "extract",
    "partition",
    "gen_embeddings",
    "save_records"
  ],
  "remaining_steps": [],
  "completed_steps": [
    "extract",
    "partition",
    "gen_embeddings",
    "save_records"
  ]
}
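From code, the same status can be polled until ingestion finishes. A rough sketch, assuming a `MemoryWebClient` pointed at the service above and that `GetDocumentStatusAsync` returns the `DataPipelineStatus` shown in the JSON (property names per the C# client, not the snake_case JSON):

```csharp
using Microsoft.KernelMemory;

var memory = new MemoryWebClient("http://localhost:9001");

DataPipelineStatus? status;
do
{
    await Task.Delay(TimeSpan.FromSeconds(2));
    status = await memory.GetDocumentStatusAsync(documentId: "doc001");
    Console.WriteLine(
        $"Completed steps: {status?.CompletedSteps.Count}/{status?.Steps.Count}");
}
while (status is { Completed: false, Failed: false });
```

Note this reports progress only at step granularity, which is exactly the limitation raised in the next comment.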

@KSemenenko
Contributor Author

Yes, but I'm thinking more about progress within a step.
For example, for embeddings: if we have 1k partitions, it would be good to know that the process is making progress, especially when we hit the TPM limit; then we would see it. Right now it's just "step 3" and that's all. For small documents that's fine, but for a 100+ page document you never know the actual progress within a step.

dluc added a commit that referenced this issue Apr 16, 2024
## Motivation and Context (Why the change? What's the scenario?)

Add two additional experimental handlers: one to generate embeddings in parallel, and a second to summarize a document in parallel.

Related to #131

---------

Co-authored-by: Devis Lucato <devis@microsoft.com>
@dluc
Collaborator

dluc commented Apr 16, 2024

#147 merged.

For now the new handlers are experimental and are used only on demand, via the steps parameter.
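Opting into non-default handlers per document looks roughly like this, using the `steps` parameter of `ImportDocumentAsync`. The step names below are illustrative only; check the handlers registered in your deployment (or the #147 diff) for the exact names of the experimental parallel handlers:

```csharp
using Microsoft.KernelMemory;

// Override the default pipeline for this one document, swapping in a
// hypothetical parallel embedding step ("gen_embeddings_parallel" is a
// placeholder name, not confirmed by this thread).
await memory.ImportDocumentAsync(
    filePath: "National Planning Policy Framework.pdf",
    documentId: "doc001",
    steps: new[] { "extract", "partition", "gen_embeddings_parallel", "save_records" });
```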

@dluc dluc closed this as completed Apr 16, 2024
@microsoft microsoft locked and limited conversation to collaborators Jun 4, 2024
@dluc dluc converted this issue into discussion #569 Jun 4, 2024
@dluc dluc added the "discussion" label and removed the "question" label Jun 4, 2024

