Add LLM Pipeline #137

Open · wants to merge 9 commits into main

Conversation

kyriediculous

No description provided.

kyriediculous marked this pull request as ready for review on July 31, 2024, 00:31
@ad-astra-video (Collaborator)

@rickstaa I have reviewed this and confirmed it works. The code needed to be rebased with the new code-gen updates from recent SDK releases. @kyriediculous can update this PR, or we can move to the other PR.

Some brief research showed that there are other implementations for serving LLM pipelines, which was also briefly discussed with @kyriediculous. We settled on researching and testing alternative implementations if the need arises from user feedback. The LLM SPE will continue to support and enhance this pipeline to suit the network's requirements for the LLM pipeline as the network evolves.

Notes from review/testing:

  • I like that the streamed response simply starts a second thread to do the inference, using a pre-built text streamer from the transformers library to send the text chunks back (see the sketch below). Note that the API for this class may change in the future, per a note in the transformers documentation.
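A minimal sketch of that pattern, assuming a generic Hugging Face causal LM; the model name and generation parameters below are illustrative and not taken from this PR:

```python
from threading import Thread

from transformers import AutoModelForCausalLM, AutoTokenizer, TextIteratorStreamer

# Illustrative model choice, not the one wired into this PR.
model_id = "meta-llama/Meta-Llama-3.1-8B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

def stream_generate(prompt: str):
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    streamer = TextIteratorStreamer(tokenizer, skip_prompt=True, skip_special_tokens=True)
    # Run generation on a second thread so the handler can yield chunks as they arrive.
    thread = Thread(
        target=model.generate,
        kwargs={**inputs, "streamer": streamer, "max_new_tokens": 256},
    )
    thread.start()
    for chunk in streamer:
        yield chunk  # forward each decoded text chunk to the client
    thread.join()
```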

There were only a couple of small changes I made in addition to the changes needed to rebase this PR:

  1. Moved check_torch_cuda.py to the dev folder since it only provides a helper to check the CUDA version.
  2. Fixed the logic for returning managed containers. For streamed responses, the container was returned right after the streamed response started, which would let another request land on the GPU and potentially slow down the first request significantly while it was still processing. I suggest we start with one request in flight per GPU for managed containers and target a future enhancement to increase this once thorough testing and documentation of multiple requests in flight on one GPU can be completed (see the sketch after this list).
    • Note that external containers are not limited to one request in flight at a time; they are expected to have their own load-balancing logic and to return a 500 error when overloaded.
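For illustration only, a rough sketch of the one-request-in-flight idea for managed containers, assuming a hypothetical ContainerPool and a hypothetical container.generate_stream API (neither name is from this PR's code):

```python
import asyncio

class ContainerPool:
    """Hypothetical pool that hands out one managed container per GPU."""

    def __init__(self, containers):
        self._queue = asyncio.Queue()
        for container in containers:
            self._queue.put_nowait(container)

    async def borrow(self):
        return await self._queue.get()

    def give_back(self, container):
        self._queue.put_nowait(container)

async def handle_streamed_request(pool: ContainerPool, prompt: str):
    container = await pool.borrow()
    try:
        # Assumed streaming API on the container; illustrative only.
        async for chunk in container.generate_stream(prompt):
            yield chunk
    finally:
        # Return the container only after the stream is fully consumed,
        # not right after streaming starts, so no second request can land
        # on the same GPU while the first is still generating.
        pool.give_back(container)
```

External containers would bypass a pool like this and rely on their own load balancing, returning a 500 when overloaded, as noted above.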
