fix: Avoid thread starvation on many concurrent requests by making use of asyncio to lock llama_proxy context #1798
Supersedes previous MR #1795
The previous implementation creates and locks threads when acquiring `llama_proxy`, which can cause thread starvation under many parallel requests.

It also prevents the call to

```python
await run_in_threadpool(llama.create_chat_completion, **kwargs)
```

from proceeding: all worker threads are stuck awaiting the lock, so no progress can be made. This MR adapts the acquisition of `llama_proxy` to an async pattern that takes advantage of asyncio mechanisms. `ExitStack` is replaced with `AsyncExitStack`, and the improper closing of the `ExitStack` is addressed.
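A minimal sketch of the resulting pattern, assuming a module-level `asyncio.Lock` guarding the shared proxy (the names `llama_lock` and `_llama_proxy` are illustrative, not the PR's exact code):

```python
import asyncio
import contextlib

from starlette.concurrency import run_in_threadpool

# Illustrative module-level state; names are hypothetical, not the PR's exact code.
llama_lock = asyncio.Lock()
_llama_proxy = None  # the shared LlamaProxy, created at server startup


@contextlib.asynccontextmanager
async def get_llama_proxy():
    # Awaiting an asyncio.Lock suspends the coroutine instead of blocking a
    # threadpool worker, so queued requests no longer pin worker threads.
    async with llama_lock:
        yield _llama_proxy


async def create_chat_completion(**kwargs):
    # AsyncExitStack (rather than ExitStack) keeps the proxy acquired for the
    # lifetime of a streaming response and guarantees it is closed afterwards.
    exit_stack = contextlib.AsyncExitStack()
    try:
        llama = await exit_stack.enter_async_context(get_llama_proxy())
        # The blocking model call still runs in the threadpool, but workers
        # are only occupied while computing, never while waiting on the lock.
        return await run_in_threadpool(llama.create_chat_completion, **kwargs)
    finally:
        await exit_stack.aclose()
```

Because waiting happens in the event loop rather than on a threadpool worker, the pool stays free to run `create_chat_completion`, avoiding the starvation described above.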