[Core] Introduce asyncio within Ray Actors handling LLMClient #8

Merged · 23 commits merged into main from users/anmol/async on Jul 17, 2024

Conversation

@anmolagarwalcp810 (Collaborator) commented on Jul 11, 2024

Issue

Since the number of Ray actors can be limited to a certain amount, this caps the number of concurrent requests that can be sent to the server.

Fix

To scale up, we have implemented asyncio primitives within each Ray actor so that a single actor can handle multiple requests asynchronously.

Instead of specifying --num-concurrent-requests, we now specify --num-ray-clients and --num-concurrent-requests-per-client. The total number of concurrent requests is the product of the two (e.g., 4 Ray clients × 8 concurrent requests per client = 32 concurrent requests).

Changes Made

  1. Made OpenAIChatCompletionsClient() send requests asynchronously through httpx (see the httpx sketch further below).
  2. Introduced a new class RequestsManager, which acts as a Ray actor and handles requests asynchronously (see the sketch after this list).
  3. Modified RequestsLauncher to launch multiple instances of RequestsManager (equal to --num-ray-clients).
  4. Removed common.py as it wasn't used.
  5. Made send_llm_request async across all llm_client files.
  6. Modified run_benchmark.py to work with the two-level request scheduling paradigm.
  7. Modified the prefill profiler and capacity search code to support the updated CLI arguments of run_benchmark.
  8. Updated YAML files, the README, and the docs to reflect the updated arguments.
  9. Also poll for completed requests asynchronously in a non-blocking manner and update service metrics. With this logic, run_benchmark was sending more requests than required, so the following guard was added (sketched in run_requests after this list):
    • When sent requests < max requests: send new requests.
    • When sent requests >= max requests: send a new request only to replace an errored request that has not yet been handled; otherwise keep polling for completed requests.
  10. Hardcoded the number of requests and concurrent requests inside the prefill profiler itself.
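
For items 2, 3, and 9, here is a minimal end-to-end sketch of the pattern, not the PR's actual implementation: the class names RequestsManager and RequestsLauncher come from this PR, but the method names, the semaphore-based per-actor cap, the round-robin dispatch, the dummy request body, and the run_requests driver loop are illustrative assumptions.

```python
import asyncio

import ray


@ray.remote
class RequestsManager:
    """Ray actor that serves up to `max_concurrent` requests at once on its own
    asyncio event loop (Ray runs `async def` actor methods concurrently)."""

    def __init__(self, max_concurrent: int):
        self._max_concurrent = max_concurrent
        self._semaphore = None  # created lazily inside the actor's event loop

    async def send_llm_request(self, request_id: int) -> dict:
        if self._semaphore is None:
            self._semaphore = asyncio.Semaphore(self._max_concurrent)
        async with self._semaphore:
            await asyncio.sleep(0.1)  # stand-in for the real async HTTP call
            return {"request_id": request_id, "ok": True}


class RequestsLauncher:
    """Creates `num_ray_clients` actors; total concurrency is
    num_ray_clients * num_concurrent_requests_per_client."""

    def __init__(self, num_ray_clients: int, num_concurrent_requests_per_client: int):
        self.managers = [
            RequestsManager.remote(num_concurrent_requests_per_client)
            for _ in range(num_ray_clients)
        ]

    def launch(self, request_id: int) -> ray.ObjectRef:
        # Round-robin requests across the actor pool.
        manager = self.managers[request_id % len(self.managers)]
        return manager.send_llm_request.remote(request_id)


def run_requests(launcher: RequestsLauncher, max_requests: int) -> list:
    """Send/poll guard from item 9: keep going until `max_requests` requests
    have completed successfully, resending only to replace errored requests."""
    sent, unhandled_errors = 0, 0
    pending, completed = [], []
    while len(completed) < max_requests:
        # Send while under budget, or to replace an unhandled errored request.
        if sent < max_requests or unhandled_errors > 0:
            pending.append(launcher.launch(sent))
            sent += 1
            if sent > max_requests:
                unhandled_errors -= 1
        # Non-blocking poll: timeout=0 returns immediately with whatever is done
        # (a real loop would likely sleep briefly between polls).
        done, pending = ray.wait(pending, num_returns=1, timeout=0)
        for ref in done:
            result = ray.get(ref)
            if result.get("ok"):
                completed.append(result)  # update service-level metrics here
            else:
                unhandled_errors += 1     # triggers one replacement send
    return completed


if __name__ == "__main__":
    ray.init(ignore_reinit_error=True)
    launcher = RequestsLauncher(num_ray_clients=4, num_concurrent_requests_per_client=8)
    # Up to 4 * 8 = 32 requests can be in flight at once.
    print(len(run_requests(launcher, max_requests=100)), "requests completed")
```

The key idea is that each actor multiplies its own asyncio concurrency by the number of actors, giving num-ray-clients × num-concurrent-requests-per-client requests in flight.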

Pending changes

  1. Make the other clients asynchronous as well (can be raised as a separate issue or handled in this PR); the httpx-based pattern they would follow is sketched below.
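
For reference, a minimal sketch of the httpx-based async pattern that item 1 applies to OpenAIChatCompletionsClient and that the remaining clients would follow, assuming an OpenAI-compatible /v1/chat/completions endpoint; the base URL, payload fields, and latency bookkeeping are illustrative, not the PR's exact code.

```python
import asyncio
import time

import httpx


async def send_llm_request(client: httpx.AsyncClient, prompt: str, model: str) -> dict:
    """Send one chat-completion request and return basic latency info."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": 64,
    }
    start = time.perf_counter()
    response = await client.post("/v1/chat/completions", json=payload)
    response.raise_for_status()
    return {"latency_s": time.perf_counter() - start, "body": response.json()}


async def main() -> None:
    # One AsyncClient is shared across requests so connections are pooled.
    async with httpx.AsyncClient(base_url="http://localhost:8000", timeout=30.0) as client:
        results = await asyncio.gather(
            *(send_llm_request(client, f"prompt {i}", "my-model") for i in range(8))
        )
        print(sum(r["latency_s"] for r in results) / len(results), "s mean latency")


if __name__ == "__main__":
    asyncio.run(main())
```

Keeping send_llm_request awaitable is what lets the RequestsManager event loop interleave many requests within a single Ray actor.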

@anmolagarwalcp810 merged commit c009611 into main on Jul 17, 2024
1 check passed
@anmolagarwalcp810 deleted the users/anmol/async branch on July 17, 2024 at 20:24