[Doc] Documentation Updates (#4)
anmolagarwalcp810 authored Jul 8, 2024
1 parent dc25f1c commit 5d160db
Showing 38 changed files with 232 additions and 302 deletions.
File renamed without changes.
18 changes: 16 additions & 2 deletions README.md
@@ -1,6 +1,20 @@
# Metron
<p align="center">
<picture>
<source media="(prefers-color-scheme: dark)" srcset="docs/_static/logo/dark.png">
<img alt="Metron" src="docs/_static/logo/light.png" width="50%">
</picture>
</p>

A tool for benchmarking the performance of LLM Inference Systems.
<h3 align="center">
Tool to benchmark LLM Inference Systems
</h3>

<p align="center">
| <a href="https://project-metron.readthedocs.io/en/latest/"><b>Documentation</b></a> | <a href=""><b>Paper</b></a> |

</p>

---

## Setup

201 changes: 0 additions & 201 deletions docs/LICENSE

This file was deleted.

Binary file added docs/_static/assets/capacity_bars.png
File renamed without changes
Binary file added docs/_static/assets/deadline.png
File renamed without changes
Binary file added docs/_static/assets/miss_rate_cdf.png
Binary file added docs/_static/assets/tbt_cdf_api.png
Binary file added docs/_static/assets/tbt_cdf_api_1.png
Binary file added docs/_static/assets/ttft_violin_plots.png
Binary file added docs/_static/assets/wandb_dashboard.png
Binary file added docs/_static/assets/yi-arxiv-banner.png
Binary file added docs/_static/assets/yi-normalized-latency-cdf.png
Binary file added docs/_static/assets/yi-prefill-time-curve.png
Binary file added docs/_static/assets/yi-scheduling-delay-cdf.png
Binary file added docs/_static/assets/yi-tbt-cdf.png
Binary file added docs/_static/assets/yi-token_throughput_bars.png
Binary file added docs/_static/logo/dark.png
Binary file added docs/_static/logo/light.png
Binary file removed docs/assets/wandb_dashboard.png
Binary file not shown.
8 changes: 8 additions & 0 deletions docs/conf.py
@@ -30,3 +30,11 @@

# -- Added configurations ----------------------------------------------------
html_title = "metron"

html_context = {
"display_github": True, # Integrate GitHub
"github_user": "project-metron", # Username
"github_repo": "metron", # Repo name
"github_version": "main", # Version
"conf_py_path": "/docs/", # Path in the checkout to the docs root
}
2 changes: 1 addition & 1 deletion docs/guides/guide.rst
@@ -1,7 +1,7 @@
Guides
======

This section contains additional topics and features that are not covered in the :doc:`../getting_started` section. These guides are meant to provide a deeper understanding of the library and its features, and to help you get the most out of ``metron``.
This section contains additional topics and features that are not covered in the :doc:`../how_to_use` section. These guides are meant to provide a deeper understanding of the library and its features, and to help you get the most out of ``metron``.

Check out the following guides to learn more:

7 changes: 3 additions & 4 deletions docs/getting_started.rst → docs/how_to_use.rst
@@ -1,19 +1,18 @@
Getting Started
===============
How to use metron
=================

.. toctree::
:maxdepth: 2
:hidden:

tutorials/blackbox_evaluation
tutorials/measuring_qps
tutorials/metrics

``metron`` can evaluate LLM inference systems as a black box and can also determine their serving capacity.

``metron`` provides two evaluation recipes as described below:

- **Black-box Evaluation**: ``metron`` sends a set of requests with different prompt lengths to an LLM inference server exposed through an API endpoint and tracks when each output token is generated. This allows ``metron`` to calculate several metrics such as TTFT, TBT, TPOT, and *unified-deadline*. Using the *unified-deadline* metric, ``metron`` also infers the minimum TBT required to meet a target acceptance rate threshold, for example, that 99% of requests should have an acceptance rate of at least 90%.
- **Black-box Evaluation**: ``metron`` sends a set of requests with different prompt lengths to an LLM inference server exposed through an API endpoint and tracks when each output token is generated. This allows ``metron`` to calculate several metrics such as TTFT, TBT, TPOT, and *fluidity-index*. Using the *fluidity-index* metric, ``metron`` also infers the minimum TBT required to meet a target acceptance rate threshold, for example, that 99% of requests should have an acceptance rate of at least 90%, and from it obtains the *fluid-token-generation-rate* metric.

- **Capacity Evaluation**: When deploying an LLM inference system, the operator needs to know how many requests the system can serve. This helps the operator determine the configuration of the system, for example, the number of GPUs needed to meet certain service quality requirements. To help with this process, ``metron`` provides a capacity evaluation module that determines the maximum capacity each replica can provide under different request loads while meeting target SLO requirements.
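
As a rough illustration of the black-box recipe, the conventional metrics can be derived from per-token arrival timestamps alone. The following is a minimal sketch, not ``metron``'s implementation: the helper name ``latency_metrics`` and its inputs are hypothetical, and TPOT is assumed here to be the mean decode time per output token after the first.

```python
from statistics import mean


def latency_metrics(request_sent_at, token_times):
    """Compute TTFT, per-token TBT values, and TPOT for one request.

    request_sent_at: time the request was issued, in seconds.
    token_times: arrival time of each output token, in seconds, ascending.
    Hypothetical helper for illustration only.
    """
    if not token_times:
        raise ValueError("need at least one output token")
    # TTFT: delay between sending the request and receiving the first token.
    ttft = token_times[0] - request_sent_at
    # TBT: gap between each pair of consecutive output tokens.
    tbts = [b - a for a, b in zip(token_times, token_times[1:])]
    # TPOT (assumed definition): mean decode time per token after the first.
    tpot = mean(tbts) if tbts else 0.0
    return {"ttft": ttft, "tbt": tbts, "tpot": tpot}


metrics = latency_metrics(0.0, [0.5, 0.75, 1.0, 1.5])
```

Because only client-side timestamps are needed, the same calculation applies to any server that streams tokens, which is what makes the evaluation black-box.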

5 changes: 3 additions & 2 deletions docs/index.rst
@@ -10,7 +10,7 @@ metron

Serving large language models (LLMs) in production is very expensive, and this challenge has prompted recent advances in inference system optimizations. As of today, these systems are evaluated through conventional metrics like TTFT (time to first token), TBT (time between tokens), normalized latency, and TPOT (time per output token). However, these metrics fail to capture the nuances of LLM inference, leading to an incomplete assessment of user-facing performance in real-time applications like chat.

``metron`` is a holistic performance evaluation framework that includes a new metric, *unified-deadline*, alongside existing conventional metrics. This new metric reflects the intricacies of the LLM inference process and its impact on real-time user experience.
``metron`` is a holistic performance evaluation framework that includes new metrics, :ref:`fluidity-index` and :ref:`fluid-token-generation-rate`, alongside existing conventional metrics. The new metrics reflect the intricacies of the LLM inference process and its impact on real-time user experience.

``metron`` is designed to be easy to use, and can be integrated with any LLM inference system. It is built on top of Ray, a popular distributed computing framework, and can be used to benchmark LLM inference systems on a single machine or a cluster.

@@ -20,7 +20,8 @@ Check out the following resources to get started:
:maxdepth: 2

installation
getting_started
tutorials/metrics
how_to_use
guides/guide

Citation
5 changes: 1 addition & 4 deletions docs/setup.md
@@ -2,7 +2,7 @@

## Setup
```bash
pip install sphinx furo
pip install -r requirements.txt
```

## Run
@@ -11,6 +11,3 @@ make html
```

An `index.html` file will be generated under the `_build/html` directory. Open it in a browser.

## Website
<img width="1335" alt="Screenshot 2024-07-07 at 12 58 35 AM" src="https://github.com/gatech-sysml/metron/assets/42912887/ac199358-1491-43c0-a518-4cd78400c3de">
38 changes: 37 additions & 1 deletion docs/tutorials/blackbox_evaluation.rst
@@ -3,10 +3,46 @@ Black-box Evaluation

``metron`` performs black-box evaluation of both proprietary and open-source systems.

Check out the following resources to learn more:
Check out the following resources to learn how to run ``metron`` with both proprietary and open-source systems:

.. toctree::
:maxdepth: 2

public_apis
open_source_systems

The following figures show evaluations performed by ``metron``:

.. _token_rate_comparison_api:

.. figure:: ../_static/assets/token_rate_comparison_api.png
:alt: token_rate_comparison_api
:align: center

**Token Rate Comparison**

The figure above depicts the throughput measured by ``metron`` for different systems based on three different metrics:

* TPOT
* TBT
* *fluid-token-generation-rate*: the inverse of the minimum TBT latency at which 99% of requests have a *fluidity-index* of at least 0.9.
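
The *fluid-token-generation-rate* definition above can be sketched as a search over candidate TBT deadlines. This is a simplified illustration, not ``metron``'s implementation: it assumes *fluidity-index* is the fraction of inter-token gaps that meet the deadline (that is, one minus the deadline miss rate), treats each gap independently, and uses the hypothetical helper names ``fluidity_index`` and ``min_tbt_deadline``.

```python
import math


def fluidity_index(tbts, deadline):
    """Fraction of inter-token gaps that meet the TBT deadline (simplified)."""
    if not tbts:
        return 1.0
    return sum(1 for t in tbts if t <= deadline) / len(tbts)


def min_tbt_deadline(requests, candidates, request_fraction=0.99, min_index=0.9):
    """Return the smallest candidate deadline for which at least
    `request_fraction` of requests have fluidity-index >= `min_index`.

    Returns None if no candidate qualifies.
    """
    required = math.ceil(request_fraction * len(requests))
    for deadline in sorted(candidates):
        ok = sum(
            1 for tbts in requests if fluidity_index(tbts, deadline) >= min_index
        )
        if ok >= required:
            return deadline
    return None


# Each inner list holds the inter-token gaps (seconds) of one request.
requests = [[0.02, 0.03, 0.02, 0.05], [0.02, 0.02, 0.02, 0.02]]
deadline = min_tbt_deadline(requests, candidates=[0.02, 0.03, 0.05, 0.1])
rate = 1.0 / deadline  # fluid-token-generation-rate, in tokens per second
```

A linear scan over sorted candidates keeps the sketch easy to follow; with many candidates, a binary search would work as well because compliance can only improve as the deadline is relaxed.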

.. _tbt_cdf_api:

.. figure:: ../_static/assets/tbt_cdf_api_1.png
:alt: tbt_cdf_api
:align: center

**TBT CDF**

The figure above depicts the TBT CDF for different systems. From this plot alone, it is difficult to interpret the difference in TBT across systems.

.. _tbt_acceptance_rate_curve:

.. figure:: ../_static/assets/tbt_acceptance_rate_curve.png
:alt: tbt_acceptance_rate_curve
:align: center

**TBT Acceptance Rate Curve**

The figure above clearly highlights the difference in TBT across systems, which was difficult to interpret in the previous figure, :ref:`tbt_cdf_api`.
24 changes: 15 additions & 9 deletions docs/tutorials/capacity_search.rst
@@ -7,14 +7,20 @@ Capacity Search

Capacity Search is a tool that helps find the maximal QPS under different SLOs. There are three types of SLOs:

1. **Deadline based:** performs QPS search based on a deadline SLO and a deadline miss rate SLO. Also uses the deadline-miss-rate percentile.
2. **TBT-TTFT based:** performs QPS search based on TBT and TTFT SLOs with their percentiles.
3. **TTFT-TPOT based:** performs QPS search based on TTFT and TPOT SLOs with their percentiles.
1. **Fluidity-Index based:** performs QPS search based on a deadline SLO and a deadline miss rate (1 - *fluidity-index*) SLO. Also uses the request-level deadline miss rate percentile.
2. **TBT based:** performs QPS search based on TBT and TTFT SLOs with their percentiles.
3. **TPOT based:** performs QPS search based on TTFT and TPOT SLOs with their percentiles.
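
To make the idea of a QPS search concrete, here is a minimal sketch that assumes a binary search; the source does not state which search strategy ``metron`` uses. The function ``max_qps`` and the predicate ``meets_slo`` are hypothetical stand-ins: in practice the predicate would run a benchmark at the given QPS and check the chosen SLO.

```python
def max_qps(meets_slo, low=0.0, high=64.0, max_iterations=10):
    """Binary search for the largest QPS at which `meets_slo(qps)` is True.

    Assumes SLO compliance is monotonic: if a QPS fails, every higher QPS
    fails too. Returns None if no tested QPS satisfies the SLO.
    """
    best = None
    for _ in range(max_iterations):
        mid = (low + high) / 2
        if meets_slo(mid):
            # SLO met: remember this QPS and try a higher load.
            best, low = mid, mid
        else:
            # SLO violated: lower the upper bound.
            high = mid
    return best


# Hypothetical stand-in for a benchmark run: the SLO holds up to 10 QPS.
result = max_qps(lambda qps: qps <= 10.0)
```

Each iteration halves the search interval, so the iteration budget bounds both the number of benchmark runs and the precision of the reported capacity.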

The figure below shows the maximum capacity achieved under different SLOs for Llama-3-8B on different traces and open-source systems on an H100 GPU:

.. image:: ../_static/assets/capacity_bars.png
:alt: capacity_bars
:align: center

The following sections explain how to run capacity search for each of the above SLOs.

Deadline Based SLO
~~~~~~~~~~~~~~~~~~
Fluidity-Index Based SLO
~~~~~~~~~~~~~~~~~~~~~~~~

.. code-block:: shell
@@ -33,8 +39,8 @@ Deadline Based SLO

``--profile-dir`` should point to the directory where the ``prefill_predictor.pkl`` model (obtained by running the prefill profiler) is stored for a given model and open-source system.

TBT-TTFT Based SLO
~~~~~~~~~~~~~~~~~~
TBT Based SLO
~~~~~~~~~~~~~

.. code-block:: shell
@@ -48,8 +54,8 @@ TBT-TTFT Based SLO
--max-iterations 10 \
--config-path ./metron/capacity_search/config/llama_8b.yml
TTFT-TPOT Based SLO
~~~~~~~~~~~~~~~~~~~
TPOT Based SLO
~~~~~~~~~~~~~~

.. code-block:: shell