Slow document processing #569
Replies: 11 comments
-
All of the time is taken by the second step, "gen_embeddings", after summarization.
-
Hi @KSemenenko, pipeline steps cannot run in parallel because each step depends on the previous one. However, if you import multiple documents, each document is processed in parallel. If you use the async pipelines, e.g. with Azure Queue or RabbitMQ, the ingestion runs in the background without blocking your apps. Summarizing still takes time, but it runs after indexing the document chunks, so you shouldn't see much difference. In the latest NuGet package the summarization step is disabled by default, so you might see an improvement, unless you still want to use summaries.
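For illustration, a minimal sketch of importing several documents concurrently with the serverless setup, assuming the Microsoft.KernelMemory builder and IKernelMemory.ImportDocumentAsync API (exact builder methods may differ between package versions):
using System;
using System.Linq;
using System.Threading.Tasks;
using Microsoft.KernelMemory;
// Assumption: serverless setup with OpenAI defaults; swap in your own connectors.
var memory = new KernelMemoryBuilder()
    .WithOpenAIDefaults(Environment.GetEnvironmentVariable("OPENAI_API_KEY"))
    .Build<MemoryServerless>();
// Each document goes through the pipeline independently, so starting several
// imports at once lets the documents be processed in parallel.
var files = new[] { "manual.pdf", "faq.docx", "notes.txt" };
var imports = files.Select((file, i) => memory.ImportDocumentAsync(file, documentId: $"doc{i:000}"));
await Task.WhenAll(imports);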
-
@dluc I've made a PR and I think these files can definitely be run in parallel.
Beta Was this translation helpful? Give feedback.
-
We have removed Summarization from the default pipeline, so by default ingestion is quite fast now. About the PR, I think the best approach is to use that as a custom handler, rather than changing the existing one. We will soon provide some examples showing how to plug in custom handlers, both in the service and in the serverless option.
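A rough sketch of what such a custom handler could look like, assuming the IPipelineStepHandler shape from the 0.x packages; the interface members, namespace, and registration call may differ in your version:
using System.Threading;
using System.Threading.Tasks;
using Microsoft.KernelMemory.Pipeline;
// Assumption: IPipelineStepHandler as in the 0.x packages; verify against your version.
public class ParallelFileHandler : IPipelineStepHandler
{
    public string StepName { get; }

    public ParallelFileHandler(string stepName = "parallel_files")
    {
        this.StepName = stepName;
    }

    public async Task<(bool success, DataPipeline updatedPipeline)> InvokeAsync(
        DataPipeline pipeline, CancellationToken cancellationToken = default)
    {
        // Work on the document's files here, e.g. fanning out with Task.WhenAll,
        // then hand the pipeline back so the orchestrator can run the next step.
        await Task.CompletedTask;
        return (true, pipeline);
    }
}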
-
I did as you suggested. Still, embedding is slow. I increased the number of tokens, but it's still not as fast as some services. That PR is more of a code example for discussion.
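For reference, one way to raise the tokens per chunk is through the partitioning options; this is a sketch assuming the WithCustomTextPartitioningOptions builder extension and TextPartitioningOptions type are available under these names in the installed version:
using System;
using Microsoft.KernelMemory;
using Microsoft.KernelMemory.Configuration;
// Assumption: option names match the installed package; adjust if they differ.
// Larger partitions mean fewer embedding calls per document, at the cost of
// coarser chunks at retrieval time.
var memory = new KernelMemoryBuilder()
    .WithOpenAIDefaults(Environment.GetEnvironmentVariable("OPENAI_API_KEY"))
    .WithCustomTextPartitioningOptions(new TextPartitioningOptions
    {
        MaxTokensPerParagraph = 2000,
        MaxTokensPerLine = 1000,
        OverlappingTokens = 100,
    })
    .Build<MemoryServerless>();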
-
@dluc
-
Also, it would be good to see progress for processing, maybe some action in addition to public void MarkProcessedBy(IPipelineStepHandler handler).
-
@KSemenenko about seeing progress, the status endpoint returns the document status, including the completed and remaining steps. For instance:
GET http://localhost:9001/upload-status?documentId=doc001
{
"completed": true,
"failed": false,
"empty": false,
"index": "default",
"document_id": "doc001",
"tags": {},
"creation": "2023-11-30T05:46:44.808832+00:00",
"last_update": "2023-11-30T05:46:47.400603+00:00",
"steps": [
"extract",
"partition",
"gen_embeddings",
"save_records"
],
"remaining_steps": [],
"completed_steps": [
"extract",
"partition",
"gen_embeddings",
"save_records"
]
}
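A small sketch of turning that response into a coarse progress value by polling the endpoint and comparing completed_steps against steps; the URL and field names come from the response above, while the client loop itself is only illustrative:
using System;
using System.Linq;
using System.Net.Http;
using System.Text.Json;
using System.Threading.Tasks;
// Poll the status endpoint shown above and report per-step progress.
var http = new HttpClient();
var url = "http://localhost:9001/upload-status?documentId=doc001";
while (true)
{
    using var doc = JsonDocument.Parse(await http.GetStringAsync(url));
    var root = doc.RootElement;

    int total = root.GetProperty("steps").GetArrayLength();
    int done = root.GetProperty("completed_steps").GetArrayLength();
    var remaining = root.GetProperty("remaining_steps").EnumerateArray().Select(s => s.GetString());
    Console.WriteLine($"Progress: {done}/{total} steps (remaining: {string.Join(", ", remaining)})");

    if (root.GetProperty("completed").GetBoolean() || root.GetProperty("failed").GetBoolean()) { break; }

    await Task.Delay(TimeSpan.FromSeconds(2));
}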
-
Yes, but I was thinking more about progress within a step.
-
#147 merged. For now the new handlers are experimental and used only on demand, using the
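On-demand usage presumably means passing an explicit step list at import time; a hedged sketch, reusing the memory instance from the earlier snippet, where the step names mirror the status response above and the extra "summarize" step name is an assumption to be checked against the handler's StepName:
using Microsoft.KernelMemory;
// Assumption: optional handlers are requested per document via the "steps"
// parameter of ImportDocumentAsync; step names and ordering may differ.
await memory.ImportDocumentAsync(
    "report.pdf",
    documentId: "doc002",
    steps: new[] { "extract", "partition", "gen_embeddings", "save_records", "summarize" });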
-
Hello, is there a way to parallelize the RunPipelineAsync method so it executes steps in parallel?
For example summarization, or uploading different parts of documents?