[Core] Introduce asyncio within Ray Actors handling LLMClient #8

Merged · 23 commits merged into main from users/anmol/async on Jul 17, 2024

Conversation

@anmolagarwalcp810 (Collaborator) commented on Jul 11, 2024

Issue

Since the number of Ray actors can be limited to a certain amount, this caps the number of concurrent requests that can be sent to the server.

Fix

To scale up, we have implemented asyncio primitives within each Ray actor so that a single actor can handle multiple requests asynchronously.

Instead of specifying --num-concurrent-requests, we now specify --num-ray-clients and --num-concurrent-requests-per-client. The total number of concurrent requests is the product of the two (e.g., 4 Ray clients × 8 concurrent requests per client = 32 concurrent requests).

Changes Made

  1. Made OpenAIChatCompletionsClient() send requests asynchronously through httpx (see the httpx sketch further below).
  2. Introduced a new class RequestsManager, which acts as a Ray actor and handles requests asynchronously (see the sketch after this list).
  3. Modified RequestsLauncher to launch multiple instances of RequestsManager (equal to --num-ray-clients).
  4. Removed common.py as it wasn't used.
  5. Made send_llm_request async across all llm_client files.
  6. Modified run_benchmark.py to work with the two-level request scheduling paradigm.
  7. Modified the prefill profiler and capacity search code to support the updated CLI arguments of run_benchmark.
  8. Updated YAML files, the README, and the docs to reflect the updated arguments.
  9. Also poll for completed requests asynchronously in a non-blocking manner and update service metrics. With this logic, run_benchmark was sending more requests than required, so the following guard was added (sketched in run_requests after this list):
    • When sent requests < max requests: send new requests.
    • When sent requests >= max requests: send a new request only to replace an errored request that has not yet been handled; otherwise keep polling for completed requests.
  10. Hardcoded the number of requests and concurrent requests inside the prefill profiler itself.
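
For items 2, 3, and 9, here is a minimal end-to-end sketch of the pattern, not the PR's actual implementation: the class names RequestsManager and RequestsLauncher come from this PR, but the method names, the semaphore-based per-actor cap, the round-robin dispatch, the dummy request body, and the run_requests driver loop are illustrative assumptions.

```python
import asyncio

import ray


@ray.remote
class RequestsManager:
    """Ray actor that serves up to `max_concurrent` requests at once on its own
    asyncio event loop (Ray runs `async def` actor methods concurrently)."""

    def __init__(self, max_concurrent: int):
        self._max_concurrent = max_concurrent
        self._semaphore = None  # created lazily inside the actor's event loop

    async def send_llm_request(self, request_id: int) -> dict:
        if self._semaphore is None:
            self._semaphore = asyncio.Semaphore(self._max_concurrent)
        async with self._semaphore:
            await asyncio.sleep(0.1)  # stand-in for the real async HTTP call
            return {"request_id": request_id, "ok": True}


class RequestsLauncher:
    """Creates `num_ray_clients` actors; total concurrency is
    num_ray_clients * num_concurrent_requests_per_client."""

    def __init__(self, num_ray_clients: int, num_concurrent_requests_per_client: int):
        self.managers = [
            RequestsManager.remote(num_concurrent_requests_per_client)
            for _ in range(num_ray_clients)
        ]

    def launch(self, request_id: int) -> ray.ObjectRef:
        # Round-robin requests across the actor pool.
        manager = self.managers[request_id % len(self.managers)]
        return manager.send_llm_request.remote(request_id)


def run_requests(launcher: RequestsLauncher, max_requests: int) -> list:
    """Send/poll guard from item 9: keep going until `max_requests` requests
    have completed successfully, resending only to replace errored requests."""
    sent, unhandled_errors = 0, 0
    pending, completed = [], []
    while len(completed) < max_requests:
        # Send while under budget, or to replace an unhandled errored request.
        if sent < max_requests or unhandled_errors > 0:
            pending.append(launcher.launch(sent))
            sent += 1
            if sent > max_requests:
                unhandled_errors -= 1
        # Non-blocking poll: timeout=0 returns immediately with whatever is done
        # (a real loop would likely sleep briefly between polls).
        done, pending = ray.wait(pending, num_returns=1, timeout=0)
        for ref in done:
            result = ray.get(ref)
            if result.get("ok"):
                completed.append(result)  # update service-level metrics here
            else:
                unhandled_errors += 1     # triggers one replacement send
    return completed


if __name__ == "__main__":
    ray.init(ignore_reinit_error=True)
    launcher = RequestsLauncher(num_ray_clients=4, num_concurrent_requests_per_client=8)
    # Up to 4 * 8 = 32 requests can be in flight at once.
    print(len(run_requests(launcher, max_requests=100)), "requests completed")
```

The key idea is that each actor multiplies its own asyncio concurrency by the number of actors, giving num-ray-clients × num-concurrent-requests-per-client requests in flight.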

Pending changes

  1. Make the other clients asynchronous as well (can be raised as a separate issue or handled in this PR); the httpx-based pattern they would follow is sketched below.
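
For reference, a minimal sketch of the httpx-based async pattern that item 1 applies to OpenAIChatCompletionsClient and that the remaining clients would follow, assuming an OpenAI-compatible /v1/chat/completions endpoint; the base URL, payload fields, and latency bookkeeping are illustrative, not the PR's exact code.

```python
import asyncio
import time

import httpx


async def send_llm_request(client: httpx.AsyncClient, prompt: str, model: str) -> dict:
    """Send one chat-completion request and return basic latency info."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": 64,
    }
    start = time.perf_counter()
    response = await client.post("/v1/chat/completions", json=payload)
    response.raise_for_status()
    return {"latency_s": time.perf_counter() - start, "body": response.json()}


async def main() -> None:
    # One AsyncClient is shared across requests so connections are pooled.
    async with httpx.AsyncClient(base_url="http://localhost:8000", timeout=30.0) as client:
        results = await asyncio.gather(
            *(send_llm_request(client, f"prompt {i}", "my-model") for i in range(8))
        )
        print(sum(r["latency_s"] for r in results) / len(results), "s mean latency")


if __name__ == "__main__":
    asyncio.run(main())
```

Keeping send_llm_request awaitable is what lets the RequestsManager event loop interleave many requests within a single Ray actor.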

@anmolagarwalcp810 merged commit c009611 into main on Jul 17, 2024
1 check passed
@anmolagarwalcp810 deleted the users/anmol/async branch on July 17, 2024 at 20:24