[misc] [doc] [frontend] LLM torch profiler support #7943

Merged · 6 commits · Sep 7, 2024
20 changes: 17 additions & 3 deletions docs/source/dev/profiling/profiling_index.rst
@@ -17,14 +17,28 @@ Traces can be visualized using https://ui.perfetto.dev/.
.. tip::

Only send a few requests through vLLM when profiling, as the traces can get quite large. Also, there is no need to untar the traces; they can be viewed directly.

Example commands:

.. tip::

Stopping the profiler flushes all of the profile trace files to the directory, which takes time: for roughly 100 requests' worth of data for a Llama 70B model, it takes about 10 minutes to flush out on an H100. Before starting the server, set the environment variable ``VLLM_RPC_GET_DATA_TIMEOUT_MS`` to a large value, for example 30 minutes (1,800,000 ms):
``export VLLM_RPC_GET_DATA_TIMEOUT_MS=1800000``

Example commands and usage:
===========================

Offline Inference:
------------------

Refer to `examples/offline_inference_with_profiler.py <https://github.com/vllm-project/vllm/blob/main/examples/offline_inference_with_profiler.py>`_ for an example.


OpenAI Server:
--------------

.. code-block:: bash

VLLM_TORCH_PROFILER_DIR=/mnt/traces/ python -m vllm.entrypoints.openai.api_server --model meta-llama/Meta-Llama-3-70B
VLLM_TORCH_PROFILER_DIR=./vllm_profile python -m vllm.entrypoints.openai.api_server --model meta-llama/Meta-Llama-3-70B
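
A request-level workflow for the server is sketched below. This is an illustration only: it assumes the OpenAI-compatible server exposes ``/start_profile`` and ``/stop_profile`` routes when ``VLLM_TORCH_PROFILER_DIR`` is set; the route names, port, and model name here are assumptions, not taken from this diff.

.. code-block:: python

    # Sketch: drive the server-side profiler from a client. Assumes the server
    # above is running on localhost:8000 with VLLM_TORCH_PROFILER_DIR set and
    # that it exposes /start_profile and /stop_profile endpoints.
    import requests

    BASE_URL = "http://localhost:8000"

    # Begin collecting a torch profiler trace on the server.
    requests.post(f"{BASE_URL}/start_profile")

    # Send only a few requests while profiling, since traces grow quickly.
    requests.post(
        f"{BASE_URL}/v1/completions",
        json={
            "model": "meta-llama/Meta-Llama-3-70B",
            "prompt": "The capital of France is",
            "max_tokens": 16,
        },
    )

    # Stop profiling; the server flushes the trace files to
    # VLLM_TORCH_PROFILER_DIR, which can take several minutes for large runs
    # (see the tip above).
    requests.post(f"{BASE_URL}/stop_profile")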

benchmark_serving.py:

33 changes: 33 additions & 0 deletions examples/offline_inference_with_profiler.py
@@ -0,0 +1,33 @@
import os

from vllm import LLM, SamplingParams

# Enable the torch profiler; VLLM_TORCH_PROFILER_DIR can also be set on the command line.
os.environ["VLLM_TORCH_PROFILER_DIR"] = "./vllm_profile"

# Sample prompts.
prompts = [
"Hello, my name is",
"The president of the United States is",
"The capital of France is",
"The future of AI is",
]
# Create a sampling params object.
sampling_params = SamplingParams(temperature=0.8, top_p=0.95)

# Create an LLM.
llm = LLM(model="facebook/opt-125m")

llm.start_profile()

# Generate texts from the prompts. The output is a list of RequestOutput objects
# that contain the prompt, generated text, and other information.
outputs = llm.generate(prompts, sampling_params)

llm.stop_profile()

# Print the outputs.
for output in outputs:
prompt = output.prompt
generated_text = output.outputs[0].text
print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")
6 changes: 6 additions & 0 deletions vllm/engine/llm_engine.py
@@ -1914,6 +1914,12 @@ def check_health(self) -> None:
self.tokenizer.check_health()
self.model_executor.check_health()

def start_profile(self) -> None:
self.model_executor.start_profile()

def stop_profile(self) -> None:
self.model_executor.stop_profile()

def is_tracing_enabled(self) -> bool:
return self.tracer is not None

6 changes: 6 additions & 0 deletions vllm/entrypoints/llm.py
@@ -560,6 +560,12 @@ def encode(
outputs = self._run_engine(use_tqdm=use_tqdm)
return LLMEngine.validate_outputs(outputs, EmbeddingRequestOutput)

def start_profile(self) -> None:
self.llm_engine.start_profile()

def stop_profile(self) -> None:
self.llm_engine.stop_profile()

# LEGACY
def _convert_v1_inputs(
self,
6 changes: 6 additions & 0 deletions vllm/executor/cpu_executor.py
@@ -296,6 +296,12 @@ def _wait_for_tasks_completion(self, parallel_worker_tasks: Any) -> None:
for result in parallel_worker_tasks:
result.get()

def start_profile(self) -> None:
self.driver_method_invoker(self.driver_worker, "start_profile")

def stop_profile(self) -> None:
self.driver_method_invoker(self.driver_worker, "stop_profile")


class CPUExecutorAsync(CPUExecutor, ExecutorAsyncBase):

6 changes: 6 additions & 0 deletions vllm/executor/gpu_executor.py
@@ -169,6 +169,12 @@ def check_health(self) -> None:
# it's running.
return

def start_profile(self) -> None:

Contributor:

Maybe, we should also add start_profile/stop_profile for CPU-only targets?

Member:

+1, either that or make it clear in the documentation & example that this is currently only supported on GPUs

Contributor Author:

Thanks for the suggestion, missed it. @DamonFool @ywang96 added to cpu_executor.py in 2b23e8f PTAL.

self.driver_worker.start_profile()

def stop_profile(self) -> None:
self.driver_worker.stop_profile()


class GPUExecutorAsync(GPUExecutor, ExecutorAsyncBase):

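
For context on where these calls end up: the executor-level start_profile/stop_profile methods above simply forward to the driver worker. The snippet below is a minimal illustrative sketch, not the actual vLLM worker code touched by this PR, of how a driver worker could implement these hooks on top of torch.profiler, writing traces to the directory named by VLLM_TORCH_PROFILER_DIR as described in the docs above.

# Illustrative sketch only -- not the vLLM worker implementation from this PR.
# It shows one way a driver worker could back start_profile()/stop_profile()
# with torch.profiler, writing traces to VLLM_TORCH_PROFILER_DIR.
import os

import torch


class ProfilingWorkerSketch:

    def __init__(self) -> None:
        trace_dir = os.getenv("VLLM_TORCH_PROFILER_DIR")
        self.profiler = None
        if trace_dir:
            # Always profile CPU; add CUDA activity only when a GPU is present
            # (the review thread above notes that CPU-only targets matter too).
            activities = [torch.profiler.ProfilerActivity.CPU]
            if torch.cuda.is_available():
                activities.append(torch.profiler.ProfilerActivity.CUDA)
            self.profiler = torch.profiler.profile(
                activities=activities,
                # Writes one trace file per process into trace_dir; the files
                # can be opened in https://ui.perfetto.dev/ without untarring.
                on_trace_ready=torch.profiler.tensorboard_trace_handler(
                    trace_dir, use_gzip=True),
            )

    def start_profile(self) -> None:
        if self.profiler is None:
            raise RuntimeError("VLLM_TORCH_PROFILER_DIR is not set.")
        self.profiler.start()

    def stop_profile(self) -> None:
        if self.profiler is None:
            raise RuntimeError("VLLM_TORCH_PROFILER_DIR is not set.")
        # Stopping flushes the trace files, which can take several minutes for
        # large runs (see the VLLM_RPC_GET_DATA_TIMEOUT_MS tip above).
        self.profiler.stop()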