[Feature Request] Auto-throttling the embedding generation speed thru the use of x-ratelimit-* headers #381
Labels: enhancement
Context / Scenario
I was trying to ingest a large (26MB) PDF file using a Serverless KM instance locally the other day and found that it took a really long time for the indexing/embedding to complete. I profiled the code and realized that the actual extraction step finishes quickly.
The reason it took so long is that GenerateEmbeddingsHandler calls ITextEmbeddingGenerator sequentially, in a plain foreach loop. We could theoretically convert the existing code to Parallel.ForEach to drastically improve the embedding speed, since the embeddings for partitionFiles are not logically coupled.
Example:
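A rough sketch of the idea (not the actual KM handler code; the delegate stands in for ITextEmbeddingGenerator and the persistence step is elided):

```csharp
// Hedged sketch: independent partitions can be embedded concurrently
// with Parallel.ForEachAsync instead of a sequential foreach loop.
using System;
using System.Collections.Generic;
using System.Threading;
using System.Threading.Tasks;

public static class ParallelEmbeddingSketch
{
    public static async Task EmbedPartitionsAsync(
        IReadOnlyList<string> partitionTexts,
        Func<string, CancellationToken, Task<float[]>> generateEmbeddingAsync, // stand-in for ITextEmbeddingGenerator
        int maxDegreeOfParallelism = 4,
        CancellationToken ct = default)
    {
        var options = new ParallelOptions
        {
            MaxDegreeOfParallelism = maxDegreeOfParallelism,
            CancellationToken = ct
        };

        await Parallel.ForEachAsync(partitionTexts, options, async (text, token) =>
        {
            float[] embedding = await generateEmbeddingAsync(text, token);
            // ...persist the embedding next to its partition record...
        });
    }
}
```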
However, although this works for me, it is still not an ideal solution, since both OpenAI and Azure OpenAI have built-in rate limiters that prevent clients from abusing the endpoint.
But the point is: even without converting the code to Parallel.ForEach, we could still see 429 errors, because there is no guarantee we stay within the rate limit without knowing the current usage, especially if we run multiple KM instances that may call the embedding API at the same time.
The problem
We could implement our own GenerateEmbeddingsHandler, or even a better ITextEmbeddingGenerator implementation, that does parallel embedding and handles 429 errors with exponential-backoff retries. But this is still not ideal, because we would need to carefully configure the KM instance (or multiple KM instances) with the maximum TPM available for the chosen model or embedding service provider at any given moment.
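For illustration, a minimal sketch of such a backoff loop, assuming the client surfaces throttling as an HttpRequestException with a 429 status code (real SDKs report it differently):

```csharp
// Hedged sketch of exponential-backoff retries on 429 responses;
// the exception handling is illustrative only.
using System;
using System.Net;
using System.Net.Http;
using System.Threading;
using System.Threading.Tasks;

public static class RetrySketch
{
    public static async Task<T> WithBackoffAsync<T>(
        Func<CancellationToken, Task<T>> action,
        int maxAttempts = 5,
        CancellationToken ct = default)
    {
        for (int attempt = 1; ; attempt++)
        {
            try
            {
                return await action(ct);
            }
            catch (HttpRequestException ex) when (
                ex.StatusCode == HttpStatusCode.TooManyRequests && attempt < maxAttempts)
            {
                // Exponential backoff with a little jitter: ~1s, 2s, 4s, 8s, ...
                var delay = TimeSpan.FromSeconds(Math.Pow(2, attempt - 1))
                          + TimeSpan.FromMilliseconds(Random.Shared.Next(0, 250));
                await Task.Delay(delay, ct);
            }
        }
    }
}
```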
Luckily, both the OpenAI service and the Azure OpenAI service return rate-limit information in the response headers of the Chat and Embeddings REST APIs, for example:

- `x-ratelimit-limit-requests` / `x-ratelimit-limit-tokens`
- `x-ratelimit-remaining-requests` / `x-ratelimit-remaining-tokens`
- `x-ratelimit-reset-requests` / `x-ratelimit-reset-tokens`

(the exact set varies by provider and deployment)
So we could, in theory, use this per-response information to decide when to scale the embedding speed up or down, making sure we use the service at its maximum rate without abusing it. And it would be extremely useful when multiple KM instances run at the same time: since every response carries the current rate-limit numbers, we don't need a separate mechanism to share that knowledge across the distributed KMs.
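A minimal sketch of what that self-throttling could look like; the header name follows the x-ratelimit-* convention, while HeaderDrivenThrottle, its thresholds, and the assumption that the raw HTTP response is accessible are all hypothetical:

```csharp
// Hedged sketch of header-driven throttling; only the header name is taken
// from the OpenAI convention, everything else is illustrative.
using System;
using System.Linq;
using System.Net.Http;
using System.Threading;
using System.Threading.Tasks;

public sealed class HeaderDrivenThrottle
{
    private long _remainingTokens = long.MaxValue;

    // Record the latest quota info from an embedding (or chat) response.
    public void Observe(HttpResponseMessage response)
    {
        if (response.Headers.TryGetValues("x-ratelimit-remaining-tokens", out var values)
            && long.TryParse(values.FirstOrDefault(), out var remaining))
        {
            Interlocked.Exchange(ref _remainingTokens, remaining);
        }
    }

    // Before the next request: slow down as the shared quota runs low.
    public Task WaitIfNeededAsync(CancellationToken ct = default)
    {
        long remaining = Interlocked.Read(ref _remainingTokens);
        if (remaining > 10_000) return Task.CompletedTask;                      // plenty of headroom
        if (remaining > 1_000)  return Task.Delay(TimeSpan.FromMilliseconds(250), ct);
        return Task.Delay(TimeSpan.FromSeconds(2), ct);                         // nearly exhausted
    }
}
```

A pipeline hook, such as a DelegatingHandler on the underlying HttpClient, would be one possible place to call Observe on every response.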
We probably don't need to go this far for the Chat APIs or chat use cases, but it is very applicable and valuable for embedding scenarios.
Proposed solution
Here are the things that would need to be implemented to achieve what I described above, if we decide to do it; a rough sketch of how the pieces could fit together is shown below.
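Purely as an illustration of how those pieces could combine (reusing the hypothetical HeaderDrivenThrottle and RetrySketch from above; none of this is the actual KM implementation):

```csharp
// Hypothetical wiring: parallel loop + backoff retries + header-driven throttle.
using System;
using System.Collections.Generic;
using System.Threading;
using System.Threading.Tasks;

public static class ThrottledEmbeddingSketch
{
    public static async Task EmbedAsync(
        IReadOnlyList<string> partitionTexts,
        Func<string, CancellationToken, Task<float[]>> generateEmbeddingAsync,
        HeaderDrivenThrottle throttle,
        CancellationToken ct = default)
    {
        var options = new ParallelOptions { MaxDegreeOfParallelism = 4, CancellationToken = ct };

        await Parallel.ForEachAsync(partitionTexts, options, async (text, token) =>
        {
            await throttle.WaitIfNeededAsync(token);            // respect the latest quota info
            float[] embedding = await RetrySketch.WithBackoffAsync(
                innerCt => generateEmbeddingAsync(text, innerCt), maxAttempts: 5, ct: token);
            // throttle.Observe(response) would run wherever the raw HTTP response is
            // visible, e.g. in a DelegatingHandler on the underlying HttpClient.
        });
    }
}
```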
Importance
would be great to have