Add RayService vLLM TPU Inference script #1467
Conversation
Force-pushed 3ff9092 to 717b6ef (commits: bug fixes; remove extra ray init; read hf token from os; fix bugs; remove hf token logic; fix serve script).
Force-pushed 717b6ef to dfd04dd.
Do we need a RayService YAML in the repo with region tags that you can reference in the GCP docs?
# See the License for the specific language governing permissions and
# limitations under the License.

# NOTE: this file was inspired from: https://github.com/richardsliu/vllm/blob/rayserve/examples/rayserve_tpu.py
@richardsliu can we get this example merged into the vllm repo?
I opened this one a while back: vllm-project/vllm#8038
I'll ping them again on it.
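For readers skimming this thread without the diff: the kind of deployment that the referenced rayserve_tpu.py example (and this PR's serve_tpu.py) builds is roughly sketched below. This is a minimal illustration, not the actual script from the PR; the model id and tensor_parallel_size are placeholder assumptions.

```python
# Minimal sketch of a Ray Serve deployment wrapping vLLM (not the PR's serve_tpu.py).
from ray import serve
from vllm import LLM, SamplingParams


@serve.deployment
class VLLMDeployment:
    def __init__(self, model_id: str, tensor_parallel_size: int):
        # vLLM shards the model across accelerator chips via tensor parallelism.
        self.llm = LLM(model=model_id, tensor_parallel_size=tensor_parallel_size)

    async def __call__(self, request):
        # Expects a JSON body like {"prompt": "..."} and returns generated text.
        body = await request.json()
        params = SamplingParams(temperature=0.7, max_tokens=256)
        outputs = self.llm.generate([body["prompt"]], params)
        return {"text": outputs[0].outputs[0].text}


# Deployable via `serve run` or referenced from a RayService spec.
app = VLLMDeployment.bind(
    model_id="meta-llama/Meta-Llama-3-70B-Instruct",  # assumed model id
    tensor_parallel_size=8,                           # assumed chip count
)
```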
Here is the summary of changes. You are about to add 4 region tags.
This comment is generated by snippet-bot.
Yeah, that sounds good. I'm still testing out the 405B RayService, but I added the 8B and 70B ones in fe6440c; we can then use …
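For anyone unfamiliar with the snippet workflow: region tags are paired comments around the lines a docs page embeds, and snippet-bot counts the START/END pairs. A hypothetical example follows (the tag name is made up; real tags follow the repo's naming scheme). The same convention works in the RayService YAML manifests, since they share `#` comment syntax.

```python
# [START hypothetical_gke_rayserve_vllm_tpu_sampling]
from vllm import SamplingParams

# Only the lines between START and END are embedded in the docs page.
sampling_params = SamplingParams(temperature=0.7, max_tokens=256)
# [END hypothetical_gke_rayserve_vllm_tpu_sampling]
```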
I've tried running Llama-3.1-405B with TPU slice sizes up to 4x4x8 v4 and 8x16 v5e and ran into a few issues:

If the user has sufficient quota for TPU chips and SSD in their region, a v4 4x4x8 or a v5e 8x16 slice is large enough to run multi-host inference with Llama-3.1-405B. However, I'm wondering whether I'm missing anything obvious here (with the current amount of TPU support in vLLM) that could allow us to a) load the model faster and b) require less disk space when initializing the model.
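For concreteness only (not an answer to the question above): the knobs involved are the engine's weight-cache location and parallelism, so one option sometimes tried is pointing the download directory at a volume shared across hosts rather than each host's local SSD. A hedged sketch; the model id, mount path, and chip count are assumptions:

```python
# Hedged sketch: model id, download_dir, and tensor_parallel_size are placeholders,
# and this alone does not address the load-time issues described above.
from vllm import LLM

llm = LLM(
    model="meta-llama/Llama-3.1-405B-Instruct",  # assumed Hugging Face model id
    download_dir="/mnt/shared-model-cache",      # assumed volume mounted on every host
    tensor_parallel_size=128,                    # assumed: total chips in the slice
)
```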
Resolved (outdated) review threads on:
- ai-ml/gke-ray/rayserve/llm/llama-3-8b-it/ray-cluster-v5e-tpu.yaml
- ai-ml/gke-ray/rayserve/llm/llama-3.1-70b/ray-cluster-v4-tpu.yaml
Description
This PR adds a simple inference script to be used for a Ray multi-host TPU example serving Meta-Llama-3-70B. Similar to the other scripts in the /llm/ folder, serve_tpu.py builds a Serve deployment for vLLM, which can then be queried with text prompts to generate output. This script will be used as part of a tutorial in the GKE and Ray docs.
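Once the RayService is up, the deployment could be queried along these lines (the address, route, and payload shape are assumptions, not the tutorial's actual interface):

```python
# Hypothetical client call; the endpoint URL and JSON shape depend on how
# serve_tpu.py exposes its route.
import requests

resp = requests.post(
    "http://localhost:8000/",  # assumed Serve ingress address
    json={"prompt": "What are the top 5 places to visit in Tokyo?"},
    timeout=120,
)
resp.raise_for_status()
print(resp.json())
```

Tasks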