
Add Baseline for SGLang Benchmark Test #602

Merged
merged 48 commits into main from users/stbaione/sgl-benchmark-add-baseline on Dec 4, 2024

Conversation

stbaione
Contributor

Description

The SGLang Benchmark Test has been running for a while, but it only benchmarks the shortfin server itself. To establish a baseline metric and enable long-term performance convergence, we need to track metrics for the SGLang server using the same benchmark method.

This adds an `sglang_benchmark_test` to complement the `shortfin_benchmark_test`. It also restructures `app_tests/benchmark_tests/llm` -> `app_tests/benchmark_tests/llm/sglang_benchmarks`, which keeps the benchmark tests organized and lets the folder be extended with other types of benchmarks in the future.
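For reference, a rough sketch of the resulting layout (the file names are illustrative assumptions, not taken from the PR):

```
app_tests/
└── benchmark_tests/
    └── llm/
        └── sglang_benchmarks/
            ├── sglang_benchmark_test.py    # hypothetical name: benchmarks the SGLang server
            └── shortfin_benchmark_test.py  # hypothetical name: the existing shortfin benchmark
```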

Why are we using docker to start the SGLang server?

Currently, the `pyproject.toml` file inside of SGLang requires `vllm==0.6.3.dev13` to run on ROCm. I looked into building vLLM from source for this test, but couldn't find a branch, tag, or release matching that version. From their own comments inside `pyproject.toml`, it appears to be available only inside a ROCm base image:

```toml
# HIP (Heterogeneous-computing Interface for Portability) for AMD
# => base docker rocm/vllm-dev:20241022, not from public vllm whl
srt_hip = ["sglang[runtime_common]", "torch", "vllm==0.6.3.dev13"]
```

Their [instructions](https://sgl-project.github.io/start/install.html#method-3-using-docker) for installing SGLang and running it on ROCm also suggest the docker method:

Instructions from their docs for running with ROCm

```
docker build --build-arg SGL_BRANCH=v0.3.5.post2 -t v0.3.5.post2-rocm620 -f Dockerfile.rocm .

alias drun='docker run -it --rm --network=host --device=/dev/kfd --device=/dev/dri --ipc=host \
    --shm-size 16G --group-add video --cap-add=SYS_PTRACE --security-opt seccomp=unconfined \
    -v $HOME/dockerx:/dockerx -v /data:/data'

drun -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<secret>" \
    v0.3.5.post2-rocm620 \
    python3 -m sglang.launch_server --model-path meta-llama/Llama-3.1-8B-Instruct --host 0.0.0.0 --port 30000
```
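Once the container is running, a quick readiness probe looks like this (a hedged sketch; it assumes the server exposes a `/health` route, which may differ across SGLang versions):

```sh
# Expect HTTP 200 once the model weights are loaded and serving begins.
curl -s -o /dev/null -w "%{http_code}\n" http://localhost:30000/health
```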

The workflow file handles starting the container and cleaning it up once the workflow is done. I set the timeout for waiting for the server to start to 10 minutes, giving the SGLang server enough time to load the necessary model weights and start up.
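A minimal sketch of that wait-for-server logic, assuming a shell step that polls a health endpoint (the 10-minute budget is from the PR; the endpoint and step structure are illustrative):

```sh
#!/bin/bash
# Poll until the server responds, or give up after 10 minutes.
TIMEOUT=600    # total budget in seconds (10 minutes)
INTERVAL=10    # seconds between polls
elapsed=0
until curl -sf http://localhost:30000/health > /dev/null; do
    if [ "$elapsed" -ge "$TIMEOUT" ]; then
        echo "SGLang server did not start within 10 minutes" >&2
        exit 1
    fi
    sleep "$INTERVAL"
    elapsed=$((elapsed + INTERVAL))
done
echo "SGLang server is up after ~${elapsed}s"
```

Cleanup can then be a `docker stop` step guarded with `if: always()` so the container goes away even when the benchmark step fails.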

stbaione and others added 20 commits November 22, 2024 01:12
- Add sgl server benchmark to workflow file
- Restructure `app_tests/benchmark_tests`
- Temporarily comment out shortfin job to verify sglang benchmark job
- Update benchmark tests to download model on demand
- Add `--disable-cuda-graph` option to allow the server to run properly (see the sketch after this list)
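A hedged sketch of the launch command with that flag appended (the exact invocation in the workflow may differ):

```sh
# Same drun invocation as above, with CUDA graphs disabled so the
# server runs reliably on this setup (per the commit message).
drun -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<secret>" \
    v0.3.5.post2-rocm620 \
    python3 -m sglang.launch_server --model-path meta-llama/Llama-3.1-8B-Instruct \
    --host 0.0.0.0 --port 30000 --disable-cuda-graph
```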
@stbaione stbaione marked this pull request as ready for review December 2, 2024 21:00
@stbaione stbaione requested review from renxida and ScottTodd December 2, 2024 21:00
@renxida (Contributor) reviewed on Dec 2, 2024:
Looks good to me! Would be nice to get @ScottTodd's look too if he's got time.

@stbaione stbaione requested a review from ScottTodd December 3, 2024 16:56
.github/workflows/ci-sglang-benchmark.yml — four review threads (outdated, resolved)
- Always use python3.11 for merging reports
- Make merging reports one step
- Temporarily enable PR trigger for validation
@stbaione stbaione requested a review from ScottTodd December 3, 2024 21:56
@stbaione stbaione merged commit fc22312 into main Dec 4, 2024
8 checks passed
@stbaione stbaione deleted the users/stbaione/sgl-benchmark-add-baseline branch December 4, 2024 17:51
monorimet pushed a commit that referenced this pull request Dec 13, 2024