Add LLM Pipeline #137

Open · wants to merge 9 commits into main

Conversation

kyriediculous

No description provided.

kyriediculous marked this pull request as ready for review on July 31, 2024, 00:31
@ad-astra-video (Collaborator)

@rickstaa I have reviewed this and confirmed it works. The code needed to be rebased with the new code-gen updates from recent SDK releases. @kyriediculous can update this PR, or we can move to the other PR.

Some brief research showed that there are other implementations for serving LLM pipelines, which was also briefly discussed with @kyriediculous. We settled on researching and testing alternative implementations if the need arises from user feedback. The LLM SPE will continue to support and enhance this pipeline to suit the network's requirements for the LLM pipeline as the network evolves.

Notes from review/testing:

  • I like that the streamed response simply starts a second thread to do the inference, using a pre-built text streamer from the transformers library to send the text chunks back (see the sketch below). Note that the API for this class may change in the future, per a note in the transformers documentation.
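A minimal sketch of that pattern, assuming a generic Hugging Face causal LM; the model name and generation parameters below are illustrative and not taken from this PR:

```python
from threading import Thread

from transformers import AutoModelForCausalLM, AutoTokenizer, TextIteratorStreamer

# Illustrative model choice, not the one wired into this PR.
model_id = "meta-llama/Meta-Llama-3.1-8B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

def stream_generate(prompt: str):
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    streamer = TextIteratorStreamer(tokenizer, skip_prompt=True, skip_special_tokens=True)
    # Run generation on a second thread so the handler can yield chunks as they arrive.
    thread = Thread(
        target=model.generate,
        kwargs={**inputs, "streamer": streamer, "max_new_tokens": 256},
    )
    thread.start()
    for chunk in streamer:
        yield chunk  # forward each decoded text chunk to the client
    thread.join()
```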

There were only a couple of small changes I made in addition to the changes needed to rebase this PR:

  1. Moved check_torch_cuda.py to the dev folder since it only provides a helper to check the CUDA version.
  2. Fixed the logic for returning managed containers. For streamed responses, the container was returned right after the streamed response started, which would let another request land on the GPU and potentially slow down the first request significantly while it was still processing. I suggest we start with one request in flight per GPU for managed containers and target a future enhancement to increase this once thorough testing and documentation of multiple requests in flight on one GPU can be completed (see the sketch after this list).
    • Note that external containers are not limited to one request in flight at a time; they are expected to have their own load-balancing logic and to return a 500 error when overloaded.
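For illustration only, a rough sketch of the one-request-in-flight idea for managed containers, assuming a hypothetical ContainerPool and a hypothetical container.generate_stream API (neither name is from this PR's code):

```python
import asyncio

class ContainerPool:
    """Hypothetical pool that hands out one managed container per GPU."""

    def __init__(self, containers):
        self._queue = asyncio.Queue()
        for container in containers:
            self._queue.put_nowait(container)

    async def borrow(self):
        return await self._queue.get()

    def give_back(self, container):
        self._queue.put_nowait(container)

async def handle_streamed_request(pool: ContainerPool, prompt: str):
    container = await pool.borrow()
    try:
        # Assumed streaming API on the container; illustrative only.
        async for chunk in container.generate_stream(prompt):
            yield chunk
    finally:
        # Return the container only after the stream is fully consumed,
        # not right after streaming starts, so no second request can land
        # on the same GPU while the first is still generating.
        pool.give_back(container)
```

External containers would bypass a pool like this and rely on their own load balancing, returning a 500 when overloaded, as noted above.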
