[Doc] Documentation Updates (#4)

project-etalon · Jul 8, 2024 · 5d160db
1 parent dc25f1c
Showing 38 changed files with 232 additions and 302 deletions.
diff --git a/LICENSE.txt → LICENSE b/LICENSE.txt → LICENSE
diff --git a/README.md b/README.md
@@ -1,6 +1,20 @@
-# Metron
+<p align="center">
+  <picture>
+    <source media="(prefers-color-scheme: dark)" srcset="docs/_static/logo/dark.png">
+    <img alt="metron" src="docs/_static/logo/light.png" width=50%>
+  </picture>
+</p>
 
-A tool for benchmarking the performance of LLM Inference Systems.
+<h3 align="center">
+Tool to benchmark LLM Inference Systems
+</h3>
+
+<p align="center">
+| <a href="https://project-metron.readthedocs.io/en/latest/"><b>Documentation</b></a> | <a href=""><b>Paper</b></a> |
+
+</p>
+
+---
 
 ## Setup
 

diff --git a/docs/LICENSE b/docs/LICENSE
diff --git a/docs/_static/assets/capacity_bars.png b/docs/_static/assets/capacity_bars.png
diff --git a/docs/assets/charts.png → docs/_static/assets/charts.png b/docs/assets/charts.png → docs/_static/assets/charts.png
diff --git a/docs/_static/assets/deadline.png b/docs/_static/assets/deadline.png
diff --git a/docs/assets/metric_chart.png → docs/_static/assets/metric_chart.png b/docs/assets/metric_chart.png → docs/_static/assets/metric_chart.png
diff --git a/docs/_static/assets/miss_rate_cdf.png b/docs/_static/assets/miss_rate_cdf.png
diff --git a/docs/_static/assets/tbt_acceptance_rate_curve.png b/docs/_static/assets/tbt_acceptance_rate_curve.png
diff --git a/docs/_static/assets/tbt_cdf_api.png b/docs/_static/assets/tbt_cdf_api.png
diff --git a/docs/_static/assets/tbt_cdf_api_1.png b/docs/_static/assets/tbt_cdf_api_1.png
diff --git a/docs/_static/assets/token_rate_comparison_api.png b/docs/_static/assets/token_rate_comparison_api.png
diff --git a/docs/_static/assets/ttft_violin_plots.png b/docs/_static/assets/ttft_violin_plots.png
diff --git a/docs/_static/assets/wandb_dashboard.png b/docs/_static/assets/wandb_dashboard.png
diff --git a/docs/_static/assets/yi-arxiv-banner.png b/docs/_static/assets/yi-arxiv-banner.png
diff --git a/docs/_static/assets/yi-normalized-latency-cdf.png b/docs/_static/assets/yi-normalized-latency-cdf.png
diff --git a/docs/_static/assets/yi-prefill-time-curve.png b/docs/_static/assets/yi-prefill-time-curve.png
diff --git a/docs/_static/assets/yi-scheduling-delay-cdf.png b/docs/_static/assets/yi-scheduling-delay-cdf.png
diff --git a/docs/_static/assets/yi-tbt-cdf.png b/docs/_static/assets/yi-tbt-cdf.png
diff --git a/docs/_static/assets/yi-token_throughput_bars.png b/docs/_static/assets/yi-token_throughput_bars.png
diff --git a/docs/_static/logo/dark.png b/docs/_static/logo/dark.png
diff --git a/docs/_static/logo/light.png b/docs/_static/logo/light.png
diff --git a/docs/assets/wandb_dashboard.png b/docs/assets/wandb_dashboard.png
diff --git a/docs/conf.py b/docs/conf.py
@@ -30,3 +30,11 @@
 
 # -- Added configurations ----------------------------------------------------
 html_title = "metron"
+
+html_context = {
+    "display_github": True, # Integrate GitHub
+    "github_user": "project-metron", # Username
+    "github_repo": "metron", # Repo name
+    "github_version": "main", # Version
+    "conf_py_path": "/docs/", # Path in the checkout to the docs root
+}
diff --git a/docs/guides/guide.rst b/docs/guides/guide.rst
@@ -1,7 +1,7 @@
 Guides
 ======
 
-This section contains additional topics and features that are not covered in the :doc:`../getting_started` section. These guides are meant to provide a deeper understanding of the library and its features, and help in getting the most out of the ``metron``.
+This section contains additional topics and features that are not covered in the :doc:`../how_to_use` section. These guides are meant to provide a deeper understanding of the library and its features, and to help you get the most out of ``metron``.
 
 Check out the following guides to learn more:
 

diff --git a/docs/getting_started.rst → docs/how_to_use.rst b/docs/getting_started.rst → docs/how_to_use.rst
@@ -1,19 +1,18 @@
-Getting Started
-===============
+How to use metron
+=================
 
 .. toctree::
    :maxdepth: 2
    :hidden:
 
    tutorials/blackbox_evaluation
    tutorials/measuring_qps
-   tutorials/metrics
 
 ``metron`` can evaluate LLM inference systems as a black-box and also determine the serving capacity of the LLM inference system.
 
 ``metron`` provides two evaluation recipes as described below:
 
-- **Black-box Evaluation**: ``metron`` hits LLM inference server exposed through API endpoint with set of requests with different prompt lengths and tracks when each output token is generated. This allows ``metron`` to calculate several metrics like TTFT, TBT, TPOT, and *unified-deadline*. With *unified-deadline* metric, ``metron`` also infers the minimum TBT required to meet target acceptance rate threshold, such as - 99% of requests should have acceptance rate of at least 90%. 
+- **Black-box Evaluation**: ``metron`` sends a set of requests with different prompt lengths to an LLM inference server exposed through an API endpoint and tracks when each output token is generated. This allows ``metron`` to calculate several metrics such as TTFT, TBT, TPOT, and *fluidity-index*. Using the *fluidity-index* metric, ``metron`` also infers the minimum TBT required to meet a target acceptance rate threshold (for example, 99% of requests should have an acceptance rate of at least 90%) and derives the *fluid-token-generation-rate* metric from it.
 
 - **Capacity Evaluation**: When deploying LLM inference system, operator needs to know how many requests can be served by the system. This will help operator in determining the configuration of the system, for example, the number of GPUs needed, to meet certain service quality requirements. To help with this process, ``metron`` provides a capacity evaluation module which determines maximum capacity each replica can provide under different request loads while meeting target SLO requirements.
 

diff --git a/docs/index.rst b/docs/index.rst
@@ -10,7 +10,7 @@ metron
 
 Serving large language models (LLMs) in production is very expensive, and this challenge has prompted recent advances in inference system optimizations. As of today, these systems are evaluated through conventional metrics like TTFT (time to first token), TBT (time between tokens), normalized latency, and TPOT (time per output token). However, these metrics fail to capture the nuances of LLM inference leading to incomplete assessment of user-facing performance in real-time applications like chat.
 
-``metron`` is a holistic performance evaluation framework that includes new metric, *unified-deadline*, alongside existing conventional metrics. This new metric reflects the intricacies of LLM inference process and its impact on real-time user experience.
+``metron`` is a holistic performance evaluation framework that includes two new metrics, :ref:`fluidity-index` and :ref:`fluid-token-generation-rate`, alongside existing conventional metrics. The new metrics reflect the intricacies of the LLM inference process and its impact on real-time user experience.
 
 ``metron`` is designed to be easy to use, and can be integrated with any LLM inference system. It is built on top of Ray, a popular distributed computing framework, and can be used to benchmark LLM inference systems on a single machine or a cluster.
 
@@ -20,7 +20,8 @@ Check out the following resources to get started:
    :maxdepth: 2
 
    installation
-   getting_started
+   tutorials/metrics
+   how_to_use
    guides/guide
 
 Citation

diff --git a/docs/setup.md b/docs/setup.md
@@ -2,7 +2,7 @@
 
 ## Setup
 ```bash
-pip install sphinx furo
+pip install -r requirements.txt
 ```
 
 ## Run
@@ -11,6 +11,3 @@ make html
 ```
 
 An `index.html` file will be generated under `_build/html` directory. Open it in browser.
-
-## Website
-<img width="1335" alt="Screenshot 2024-07-07 at 12 58 35 AM" src="https://github.com/gatech-sysml/metron/assets/42912887/ac199358-1491-43c0-a518-4cd78400c3de">
diff --git a/docs/tutorials/blackbox_evaluation.rst b/docs/tutorials/blackbox_evaluation.rst
@@ -3,10 +3,46 @@ Black-box Evaluation
 
 ``metron`` performs black-box evaluation of both proprietary and open-source systems.
 
-Check out the following resources to learn more:
+Check out the following resources to learn how to run ``metron`` with both proprietary and open-source systems:
 
 .. toctree::
     :maxdepth: 2
 
     public_apis
     open_source_systems
+
+The following figures show evaluations produced by ``metron``:
+
+.. _token_rate_comparison_api:
+
+.. figure:: ../_static/assets/token_rate_comparison_api.png
+    :alt: token_rate_comparison_api
+    :align: center
+
+    **Token Rate Comparison**
+
+The figure above depicts the throughput measured by ``metron`` for different systems, based on three metrics:
+
+* TPOT
+* TBT
+* *fluid-token-generation-rate*: ``metron`` first finds the minimum TBT latency such that 99% of requests have a *fluidity-index* of at least 0.9. The inverse of that TBT latency is the *fluid-token-generation-rate*.
+
+.. _tbt_cdf_api:
+
+.. figure:: ../_static/assets/tbt_cdf_api_1.png
+    :alt: tbt_cdf_api
+    :align: center
+
+    **TBT CDF**
+
+The figure above depicts the TBT CDF for different systems. The differences in TBT between systems are difficult to read from this plot.
+
+.. _tbt_acceptance_rate_curve:
+
+.. figure:: ../_static/assets/tbt_acceptance_rate_curve.png
+    :alt: tbt_acceptance_rate_curve
+    :align: center
+
+    **TBT Acceptance Rate Curve**
+
+The figure above clearly highlights the differences in TBT across systems, which were difficult to interpret in the previous figure, :ref:`tbt_cdf_api`.
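
Reviewer note (not part of the commit): the *fluid-token-generation-rate* procedure described in this file can be sketched in Python as below. This is only an illustration under stated assumptions, not ``metron``'s implementation. All function and parameter names are hypothetical, and the deadline scheme is a simplification: the i-th output token is assumed to be due at ``ttft_deadline + i * tbt_deadline``, with no adjustment after a missed deadline.

```python
from typing import List, Optional, Sequence


def fluidity_index(token_times: Sequence[float],
                   ttft_deadline: float,
                   tbt_deadline: float) -> float:
    """Fraction of a request's tokens that arrive by their deadline.

    Assumption (simplified): token i is due at ttft_deadline + i * tbt_deadline.
    token_times are arrival times in seconds, measured from request start.
    """
    if not token_times:
        return 0.0
    met = sum(1 for i, t in enumerate(token_times)
              if t <= ttft_deadline + i * tbt_deadline)
    return met / len(token_times)


def min_tbt_deadline(requests: List[Sequence[float]],
                     ttft_deadline: float,
                     candidates: Sequence[float],
                     index_target: float = 0.9,
                     request_fraction: float = 0.99) -> Optional[float]:
    """Smallest candidate TBT deadline for which at least `request_fraction`
    of requests reach a fluidity index of at least `index_target`."""
    if not requests:
        return None
    # A looser deadline can only raise the index, so scan candidates in
    # ascending order and stop at the first one that satisfies the target.
    for tbt in sorted(candidates):
        ok = sum(1 for r in requests
                 if fluidity_index(r, ttft_deadline, tbt) >= index_target)
        if ok >= request_fraction * len(requests):
            return tbt
    return None


def fluid_token_generation_rate(tbt_deadline: float) -> float:
    """Inverse of the minimum TBT deadline, in tokens per second."""
    return 1.0 / tbt_deadline


# Hypothetical token arrival times (seconds) for two requests.
slow = [1.0, 1.5, 2.0, 2.5]
fast = [1.0, 1.25, 1.5, 1.75]
best = min_tbt_deadline([slow, fast], ttft_deadline=1.0,
                        candidates=[0.25, 0.5, 1.0])
if best is not None:
    print(best, fluid_token_generation_rate(best))
```

Scanning a fixed list of candidate deadlines keeps the sketch short. Because the index is monotone in the deadline under this simplified scheme, a binary search over a continuous range would also work.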
diff --git a/docs/tutorials/capacity_search.rst b/docs/tutorials/capacity_search.rst
@@ -7,14 +7,20 @@ Capacity Search
 
 Capacity Search is a tool to help find maximal QPS given different SLOs. There are three types of SLOs:
 
-1. **Deadline based:** does QPS search based on deadline slo and deadline miss rate slo. Also leverages deadline-miss-rate percentile.
-2. **TBT-TTFT based:** does QPS search based on tbt and ttft slo with their percentiles.
-3. **TTFT-TPOT based:** does QPS search based on ttft and tpot slo with their percentiles.
+1. **Fluidity-Index based:** searches for the QPS based on the deadline SLO and the deadline miss rate (1 - *fluidity-index*) SLO. It also uses the request-level deadline miss rate percentile.
+2. **TBT based:** searches for the QPS based on the TBT and TTFT SLOs with their percentiles.
+3. **TPOT based:** searches for the QPS based on the TTFT and TPOT SLOs with their percentiles.
+
+The figure below shows the maximum capacity achieved under different SLOs for Llama-3-8B on different traces and open-source systems on an H100 GPU:
+
+.. image:: ../_static/assets/capacity_bars.png
+    :alt: capacity_bars
+    :align: center
 
 Following sections explain running capacity search for each of the above SLOs.
 
-Deadline Based SLO
-~~~~~~~~~~~~~~~~~~
+Fluidity-Index Based SLO
+~~~~~~~~~~~~~~~~~~~~~~~~
 
 .. code-block:: shell
 
@@ -33,8 +39,8 @@ Deadline Based SLO
 
     ``--profile-dir`` should point to where ``prefill_predictor.pkl`` model (obtained when running prefill profiler) is stored for a given model and open source system.
 
-TBT-TTFT Based SLO
-~~~~~~~~~~~~~~~~~~
+TBT Based SLO
+~~~~~~~~~~~~~
 
 .. code-block:: shell
 
@@ -48,8 +54,8 @@ TBT-TTFT Based SLO
     --max-iterations 10 \
     --config-path ./metron/capacity_search/config/llama_8b.yml
 
-TTFT-TPOT Based SLO
-~~~~~~~~~~~~~~~~~~~
+TPOT Based SLO
+~~~~~~~~~~~~~~
 
 .. code-block:: shell