diff --git a/LICENSE.txt b/LICENSE similarity index 100% rename from LICENSE.txt rename to LICENSE diff --git a/README.md b/README.md index d1c4644..5d0772f 100644 --- a/README.md +++ b/README.md @@ -1,6 +1,20 @@ -# Metron +[centered HTML logo block stripped during extraction; image alt text: "vLLM"]
-A tool for benchmarking the performance of LLM Inference Systems. +Tool to benchmark LLM Inference Systems
+| Documentation | Paper |
+ +--- ## Setup diff --git a/docs/LICENSE b/docs/LICENSE deleted file mode 100644 index 261eeb9..0000000 --- a/docs/LICENSE +++ /dev/null @@ -1,201 +0,0 @@ - Apache License - Version 2.0, January 2004 - http://www.apache.org/licenses/ - - TERMS AND CONDITIONS FOR USE, REPRODUCTION, AND DISTRIBUTION - - 1. Definitions. - - "License" shall mean the terms and conditions for use, reproduction, - and distribution as defined by Sections 1 through 9 of this document. - - "Licensor" shall mean the copyright owner or entity authorized by - the copyright owner that is granting the License. - - "Legal Entity" shall mean the union of the acting entity and all - other entities that control, are controlled by, or are under common - control with that entity. For the purposes of this definition, - "control" means (i) the power, direct or indirect, to cause the - direction or management of such entity, whether by contract or - otherwise, or (ii) ownership of fifty percent (50%) or more of the - outstanding shares, or (iii) beneficial ownership of such entity. - - "You" (or "Your") shall mean an individual or Legal Entity - exercising permissions granted by this License. - - "Source" form shall mean the preferred form for making modifications, - including but not limited to software source code, documentation - source, and configuration files. - - "Object" form shall mean any form resulting from mechanical - transformation or translation of a Source form, including but - not limited to compiled object code, generated documentation, - and conversions to other media types. - - "Work" shall mean the work of authorship, whether in Source or - Object form, made available under the License, as indicated by a - copyright notice that is included in or attached to the work - (an example is provided in the Appendix below). 
- - "Derivative Works" shall mean any work, whether in Source or Object - form, that is based on (or derived from) the Work and for which the - editorial revisions, annotations, elaborations, or other modifications - represent, as a whole, an original work of authorship. For the purposes - of this License, Derivative Works shall not include works that remain - separable from, or merely link (or bind by name) to the interfaces of, - the Work and Derivative Works thereof. - - "Contribution" shall mean any work of authorship, including - the original version of the Work and any modifications or additions - to that Work or Derivative Works thereof, that is intentionally - submitted to Licensor for inclusion in the Work by the copyright owner - or by an individual or Legal Entity authorized to submit on behalf of - the copyright owner. For the purposes of this definition, "submitted" - means any form of electronic, verbal, or written communication sent - to the Licensor or its representatives, including but not limited to - communication on electronic mailing lists, source code control systems, - and issue tracking systems that are managed by, or on behalf of, the - Licensor for the purpose of discussing and improving the Work, but - excluding communication that is conspicuously marked or otherwise - designated in writing by the copyright owner as "Not a Contribution." - - "Contributor" shall mean Licensor and any individual or Legal Entity - on behalf of whom a Contribution has been received by Licensor and - subsequently incorporated within the Work. - - 2. Grant of Copyright License. Subject to the terms and conditions of - this License, each Contributor hereby grants to You a perpetual, - worldwide, non-exclusive, no-charge, royalty-free, irrevocable - copyright license to reproduce, prepare Derivative Works of, - publicly display, publicly perform, sublicense, and distribute the - Work and such Derivative Works in Source or Object form. - - 3. 
Grant of Patent License. Subject to the terms and conditions of - this License, each Contributor hereby grants to You a perpetual, - worldwide, non-exclusive, no-charge, royalty-free, irrevocable - (except as stated in this section) patent license to make, have made, - use, offer to sell, sell, import, and otherwise transfer the Work, - where such license applies only to those patent claims licensable - by such Contributor that are necessarily infringed by their - Contribution(s) alone or by combination of their Contribution(s) - with the Work to which such Contribution(s) was submitted. If You - institute patent litigation against any entity (including a - cross-claim or counterclaim in a lawsuit) alleging that the Work - or a Contribution incorporated within the Work constitutes direct - or contributory patent infringement, then any patent licenses - granted to You under this License for that Work shall terminate - as of the date such litigation is filed. - - 4. Redistribution. You may reproduce and distribute copies of the - Work or Derivative Works thereof in any medium, with or without - modifications, and in Source or Object form, provided that You - meet the following conditions: - - (a) You must give any other recipients of the Work or - Derivative Works a copy of this License; and - - (b) You must cause any modified files to carry prominent notices - stating that You changed the files; and - - (c) You must retain, in the Source form of any Derivative Works - that You distribute, all copyright, patent, trademark, and - attribution notices from the Source form of the Work, - excluding those notices that do not pertain to any part of - the Derivative Works; and - - (d) If the Work includes a "NOTICE" text file as part of its - distribution, then any Derivative Works that You distribute must - include a readable copy of the attribution notices contained - within such NOTICE file, excluding those notices that do not - pertain to any part of the Derivative 
Works, in at least one - of the following places: within a NOTICE text file distributed - as part of the Derivative Works; within the Source form or - documentation, if provided along with the Derivative Works; or, - within a display generated by the Derivative Works, if and - wherever such third-party notices normally appear. The contents - of the NOTICE file are for informational purposes only and - do not modify the License. You may add Your own attribution - notices within Derivative Works that You distribute, alongside - or as an addendum to the NOTICE text from the Work, provided - that such additional attribution notices cannot be construed - as modifying the License. - - You may add Your own copyright statement to Your modifications and - may provide additional or different license terms and conditions - for use, reproduction, or distribution of Your modifications, or - for any such Derivative Works as a whole, provided Your use, - reproduction, and distribution of the Work otherwise complies with - the conditions stated in this License. - - 5. Submission of Contributions. Unless You explicitly state otherwise, - any Contribution intentionally submitted for inclusion in the Work - by You to the Licensor shall be under the terms and conditions of - this License, without any additional terms or conditions. - Notwithstanding the above, nothing herein shall supersede or modify - the terms of any separate license agreement you may have executed - with Licensor regarding such Contributions. - - 6. Trademarks. This License does not grant permission to use the trade - names, trademarks, service marks, or product names of the Licensor, - except as required for reasonable and customary use in describing the - origin of the Work and reproducing the content of the NOTICE file. - - 7. Disclaimer of Warranty. 
Unless required by applicable law or - agreed to in writing, Licensor provides the Work (and each - Contributor provides its Contributions) on an "AS IS" BASIS, - WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or - implied, including, without limitation, any warranties or conditions - of TITLE, NON-INFRINGEMENT, MERCHANTABILITY, or FITNESS FOR A - PARTICULAR PURPOSE. You are solely responsible for determining the - appropriateness of using or redistributing the Work and assume any - risks associated with Your exercise of permissions under this License. - - 8. Limitation of Liability. In no event and under no legal theory, - whether in tort (including negligence), contract, or otherwise, - unless required by applicable law (such as deliberate and grossly - negligent acts) or agreed to in writing, shall any Contributor be - liable to You for damages, including any direct, indirect, special, - incidental, or consequential damages of any character arising as a - result of this License or out of the use or inability to use the - Work (including but not limited to damages for loss of goodwill, - work stoppage, computer failure or malfunction, or any and all - other commercial damages or losses), even if such Contributor - has been advised of the possibility of such damages. - - 9. Accepting Warranty or Additional Liability. While redistributing - the Work or Derivative Works thereof, You may choose to offer, - and charge a fee for, acceptance of support, warranty, indemnity, - or other liability obligations and/or rights consistent with this - License. However, in accepting such obligations, You may act only - on Your own behalf and on Your sole responsibility, not on behalf - of any other Contributor, and only if You agree to indemnify, - defend, and hold each Contributor harmless for any liability - incurred by, or claims asserted against, such Contributor by reason - of your accepting any such warranty or additional liability. 
- - END OF TERMS AND CONDITIONS - - APPENDIX: How to apply the Apache License to your work. - - To apply the Apache License to your work, attach the following - boilerplate notice, with the fields enclosed by brackets "[]" - replaced with your own identifying information. (Don't include - the brackets!) The text should be enclosed in the appropriate - comment syntax for the file format. We also recommend that a - file or class name and description of purpose be included on the - same "printed page" as the copyright notice for easier - identification within third-party archives. - - Copyright [yyyy] [name of copyright owner] - - Licensed under the Apache License, Version 2.0 (the "License"); - you may not use this file except in compliance with the License. - You may obtain a copy of the License at - - http://www.apache.org/licenses/LICENSE-2.0 - - Unless required by applicable law or agreed to in writing, software - distributed under the License is distributed on an "AS IS" BASIS, - WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. - See the License for the specific language governing permissions and - limitations under the License. 
diff --git a/docs/_static/assets/capacity_bars.png b/docs/_static/assets/capacity_bars.png new file mode 100644 index 0000000..28ae4f7 Binary files /dev/null and b/docs/_static/assets/capacity_bars.png differ diff --git a/docs/assets/charts.png b/docs/_static/assets/charts.png similarity index 100% rename from docs/assets/charts.png rename to docs/_static/assets/charts.png diff --git a/docs/_static/assets/deadline.png b/docs/_static/assets/deadline.png new file mode 100644 index 0000000..4962a3a Binary files /dev/null and b/docs/_static/assets/deadline.png differ diff --git a/docs/assets/metric_chart.png b/docs/_static/assets/metric_chart.png similarity index 100% rename from docs/assets/metric_chart.png rename to docs/_static/assets/metric_chart.png diff --git a/docs/_static/assets/miss_rate_cdf.png b/docs/_static/assets/miss_rate_cdf.png new file mode 100644 index 0000000..a1f5d9e Binary files /dev/null and b/docs/_static/assets/miss_rate_cdf.png differ diff --git a/docs/_static/assets/tbt_acceptance_rate_curve.png b/docs/_static/assets/tbt_acceptance_rate_curve.png new file mode 100644 index 0000000..4717d8d Binary files /dev/null and b/docs/_static/assets/tbt_acceptance_rate_curve.png differ diff --git a/docs/_static/assets/tbt_cdf_api.png b/docs/_static/assets/tbt_cdf_api.png new file mode 100644 index 0000000..4ebec1b Binary files /dev/null and b/docs/_static/assets/tbt_cdf_api.png differ diff --git a/docs/_static/assets/tbt_cdf_api_1.png b/docs/_static/assets/tbt_cdf_api_1.png new file mode 100644 index 0000000..4ebec1b Binary files /dev/null and b/docs/_static/assets/tbt_cdf_api_1.png differ diff --git a/docs/_static/assets/token_rate_comparison_api.png b/docs/_static/assets/token_rate_comparison_api.png new file mode 100644 index 0000000..d5d1a9c Binary files /dev/null and b/docs/_static/assets/token_rate_comparison_api.png differ diff --git a/docs/_static/assets/ttft_violin_plots.png b/docs/_static/assets/ttft_violin_plots.png new file mode 100644 index 
0000000..b754152 Binary files /dev/null and b/docs/_static/assets/ttft_violin_plots.png differ diff --git a/docs/_static/assets/wandb_dashboard.png b/docs/_static/assets/wandb_dashboard.png new file mode 100644 index 0000000..926f01e Binary files /dev/null and b/docs/_static/assets/wandb_dashboard.png differ diff --git a/docs/_static/assets/yi-arxiv-banner.png b/docs/_static/assets/yi-arxiv-banner.png new file mode 100644 index 0000000..9e15b03 Binary files /dev/null and b/docs/_static/assets/yi-arxiv-banner.png differ diff --git a/docs/_static/assets/yi-normalized-latency-cdf.png b/docs/_static/assets/yi-normalized-latency-cdf.png new file mode 100644 index 0000000..db6cc7e Binary files /dev/null and b/docs/_static/assets/yi-normalized-latency-cdf.png differ diff --git a/docs/_static/assets/yi-prefill-time-curve.png b/docs/_static/assets/yi-prefill-time-curve.png new file mode 100644 index 0000000..c3630c4 Binary files /dev/null and b/docs/_static/assets/yi-prefill-time-curve.png differ diff --git a/docs/_static/assets/yi-scheduling-delay-cdf.png b/docs/_static/assets/yi-scheduling-delay-cdf.png new file mode 100644 index 0000000..373c216 Binary files /dev/null and b/docs/_static/assets/yi-scheduling-delay-cdf.png differ diff --git a/docs/_static/assets/yi-tbt-cdf.png b/docs/_static/assets/yi-tbt-cdf.png new file mode 100644 index 0000000..c9b8343 Binary files /dev/null and b/docs/_static/assets/yi-tbt-cdf.png differ diff --git a/docs/_static/assets/yi-token_throughput_bars.png b/docs/_static/assets/yi-token_throughput_bars.png new file mode 100644 index 0000000..e8a0329 Binary files /dev/null and b/docs/_static/assets/yi-token_throughput_bars.png differ diff --git a/docs/_static/logo/dark.png b/docs/_static/logo/dark.png new file mode 100644 index 0000000..7c458b2 Binary files /dev/null and b/docs/_static/logo/dark.png differ diff --git a/docs/_static/logo/light.png b/docs/_static/logo/light.png new file mode 100644 index 0000000..6cf8076 Binary files /dev/null 
and b/docs/_static/logo/light.png differ diff --git a/docs/assets/wandb_dashboard.png b/docs/assets/wandb_dashboard.png deleted file mode 100644 index 97d2d59..0000000 Binary files a/docs/assets/wandb_dashboard.png and /dev/null differ diff --git a/docs/conf.py b/docs/conf.py index ae1bf08..746b1d6 100644 --- a/docs/conf.py +++ b/docs/conf.py @@ -30,3 +30,11 @@ # -- Added configurations ---------------------------------------------------- html_title = "metron" + +html_context = { + "display_github": True, # Integrate GitHub + "github_user": "project-metron", # Username + "github_repo": "metron", # Repo name + "github_version": "main", # Version + "conf_py_path": "/docs/", # Path in the checkout to the docs root +} diff --git a/docs/guides/guide.rst b/docs/guides/guide.rst index e09a5e8..a1e5b34 100644 --- a/docs/guides/guide.rst +++ b/docs/guides/guide.rst @@ -1,7 +1,7 @@ Guides ====== -This section contains additional topics and features that are not covered in the :doc:`../getting_started` section. These guides are meant to provide a deeper understanding of the library and its features, and help in getting the most out of the ``metron``. +This section contains additional topics and features that are not covered in the :doc:`../how_to_use` section. These guides are meant to provide a deeper understanding of the library and its features, and help in getting the most out of the ``metron``. Check out the following guides to learn more: diff --git a/docs/getting_started.rst b/docs/how_to_use.rst similarity index 80% rename from docs/getting_started.rst rename to docs/how_to_use.rst index 73be2a4..c0f70fc 100644 --- a/docs/getting_started.rst +++ b/docs/how_to_use.rst @@ -1,5 +1,5 @@ -Getting Started -=============== +How to use metron +================= .. 
toctree:: :maxdepth: 2 @@ -7,13 +7,12 @@ Getting Started tutorials/blackbox_evaluation tutorials/measuring_qps - tutorials/metrics ``metron`` can evaluate LLM inference systems as a black box and also determine the serving capacity of the LLM inference system. ``metron`` provides the two evaluation recipes described below: -- **Black-box Evaluation**: ``metron`` hits LLM inference server exposed through API endpoint with set of requests with different prompt lengths and tracks when each output token is generated. This allows ``metron`` to calculate several metrics like TTFT, TBT, TPOT, and *unified-deadline*. With *unified-deadline* metric, ``metron`` also infers the minimum TBT required to meet target acceptance rate threshold, such as - 99% of requests should have acceptance rate of at least 90%. +- **Black-box Evaluation**: ``metron`` hits an LLM inference server exposed through an API endpoint with a set of requests of different prompt lengths and tracks when each output token is generated. This allows ``metron`` to calculate several metrics like TTFT, TBT, TPOT, and *fluidity-index*. With the *fluidity-index* metric, ``metron`` also infers the minimum TBT required to meet a target acceptance rate threshold (e.g., 99% of requests should have an acceptance rate of at least 90%), and obtains the *fluid-token-generation-rate* metric. - **Capacity Evaluation**: When deploying an LLM inference system, the operator needs to know how many requests the system can serve. This helps the operator determine the configuration of the system, for example, the number of GPUs needed, to meet certain service quality requirements. To help with this process, ``metron`` provides a capacity evaluation module which determines the maximum capacity each replica can provide under different request loads while meeting target SLO requirements. 
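The black-box recipe above boils down to timestamping every output token. As a rough illustration (this is a hypothetical sketch, not ``metron``'s actual implementation; the function name `conventional_metrics` is invented), the conventional metrics can be derived from per-token receive times like this:

```python
# Hypothetical sketch (not metron's actual code): deriving the conventional
# metrics from per-token receive timestamps recorded for a single request.
def conventional_metrics(request_arrival: float, token_times: list[float]) -> dict:
    """token_times[i] is the wall-clock time the i-th output token was received."""
    ttft = token_times[0] - request_arrival  # scheduling delay + prompt processing
    # TBT: gaps between consecutive output tokens.
    tbts = [b - a for a, b in zip(token_times, token_times[1:])]
    # One common TPOT convention: decode time divided by number of decode steps.
    tpot = (token_times[-1] - token_times[0]) / max(len(token_times) - 1, 1)
    return {"ttft": ttft, "tbt": tbts, "tpot": tpot}
```

Note that TBT is a distribution per request (one value per decode step), while TTFT and TPOT are single values; this is why the documentation reports TBT via CDFs and percentiles.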
diff --git a/docs/index.rst b/docs/index.rst index fdbee97..b2b596f 100644 --- a/docs/index.rst +++ b/docs/index.rst @@ -10,7 +10,7 @@ metron Serving large language models (LLMs) in production is very expensive, and this challenge has prompted recent advances in inference system optimizations. As of today, these systems are evaluated through conventional metrics like TTFT (time to first token), TBT (time between tokens), normalized latency, and TPOT (time per output token). However, these metrics fail to capture the nuances of LLM inference, leading to an incomplete assessment of user-facing performance in real-time applications like chat. -``metron`` is a holistic performance evaluation framework that includes new metric, *unified-deadline*, alongside existing conventional metrics. This new metric reflects the intricacies of LLM inference process and its impact on real-time user experience. +``metron`` is a holistic performance evaluation framework that includes two new metrics, :ref:`fluidity-index` and :ref:`fluid-token-generation-rate`, alongside existing conventional metrics. The new metrics reflect the intricacies of the LLM inference process and their impact on real-time user experience. ``metron`` is designed to be easy to use, and can be integrated with any LLM inference system. It is built on top of Ray, a popular distributed computing framework, and can be used to benchmark LLM inference systems on a single machine or a cluster. @@ -20,7 +20,8 @@ Check out the following resources to get started: :maxdepth: 2 installation - getting_started + tutorials/metrics + how_to_use guides/guide Citation diff --git a/docs/setup.md b/docs/setup.md index d74f9ea..ad9573b 100644 --- a/docs/setup.md +++ b/docs/setup.md @@ -2,7 +2,7 @@ ## Setup ```bash -pip install sphinx furo +pip install -r requirements.txt ``` ## Run @@ -11,6 +11,3 @@ make html ``` An `index.html` file will be generated under the `_build/html` directory. Open it in a browser. 
- -## Website -Screenshot 2024-07-07 at 12 58 35 AM diff --git a/docs/tutorials/blackbox_evaluation.rst b/docs/tutorials/blackbox_evaluation.rst index b438e2a..8877030 100644 --- a/docs/tutorials/blackbox_evaluation.rst +++ b/docs/tutorials/blackbox_evaluation.rst @@ -3,10 +3,46 @@ Black-box Evaluation ``metron`` performs black-box evaluation of both proprietary and open-source systems. -Check out the following resources to learn more: +Check out the following resources to learn how to run ``metron`` with both proprietary and open-source systems: .. toctree:: :maxdepth: 2 public_apis open_source_systems + +The following figures show evaluations produced by ``metron``: + +.. _token_rate_comparison_api: + +.. figure:: ../_static/assets/token_rate_comparison_api.png + :alt: token_rate_comparison_api + :align: center + + **Token Rate Comparison** + +The figure above depicts throughput measured by ``metron`` for different systems based on three different metrics: + +* TPOT +* TBT +* *fluid-token-generation-rate*: here we find the minimum TBT latency such that 99% of requests have a *fluidity-index* of at least 0.9; the inverse of this TBT latency is the *fluid-token-generation-rate*. + +.. _tbt_cdf_api: + +.. figure:: ../_static/assets/tbt_cdf_api_1.png + :alt: tbt_cdf_api + :align: center + + **TBT CDF** + +The figure above depicts the TBT CDF for different systems. It is difficult to interpret the difference in TBT across systems from this view. + +.. _tbt_acceptance_rate_curve: + +.. figure:: ../_static/assets/tbt_acceptance_rate_curve.png + :alt: tbt_acceptance_rate_curve + :align: center + + **TBT Acceptance Rate Curve** + +The figure above clearly highlights the difference in TBT across systems, which was difficult to interpret in the previous figure, :ref:`tbt_cdf_api`. 
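The acceptance-rate curve in these figures can be sketched as follows. This is a hypothetical illustration under the documentation's own definitions, not ``metron``'s actual code (the function names `fluidity_index` and `acceptance_rate_curve` are invented): per request, *fluidity-index* is the fraction of tokens that meet their deadline, with unused slack carried forward, and the curve reports, for each candidate TBT deadline, the fraction of requests whose *fluidity-index* reaches the threshold.

```python
# Hypothetical sketch (not metron's actual code) of the fluidity-index and
# the TBT acceptance-rate curve discussed above.
def fluidity_index(arrival: float, finish_times: list[float],
                   prefill_deadline: float, decode_deadline: float) -> float:
    """Fraction of tokens meeting their deadline; unused slack carries over."""
    slack, met, release = 0.0, 0, arrival
    for i, finish in enumerate(finish_times):
        budget = prefill_deadline if i == 0 else decode_deadline
        s = finish - release          # actual generation time of this token
        if s <= budget + slack:       # i.e. finish <= release + budget + slack
            slack += budget - s       # leftover budget helps later tokens
            met += 1
        else:
            slack = 0.0               # a miss resets the accumulated slack
        release = finish
    return met / len(finish_times)

def acceptance_rate_curve(requests, prefill_deadline, tbt_deadlines, threshold=0.9):
    """requests: list of (arrival, per-token finish times). For each candidate
    TBT deadline, return the fraction of requests reaching the threshold."""
    curve = {}
    for d in tbt_deadlines:
        ok = sum(fluidity_index(arr, toks, prefill_deadline, d) >= threshold
                 for arr, toks in requests)
        curve[d] = ok / len(requests)
    return curve
```

The *fluid-token-generation-rate* would then be the inverse of the smallest TBT deadline whose acceptance rate reaches 99%.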
diff --git a/docs/tutorials/capacity_search.rst b/docs/tutorials/capacity_search.rst index 7ed7f7d..2442a28 100644 --- a/docs/tutorials/capacity_search.rst +++ b/docs/tutorials/capacity_search.rst @@ -7,14 +7,20 @@ Capacity Search Capacity Search is a tool to help find the maximal QPS given different SLOs. There are three types of SLOs: -1. **Deadline based:** does QPS search based on deadline slo and deadline miss rate slo. Also leverages deadline-miss-rate percentile. -2. **TBT-TTFT based:** does QPS search based on tbt and ttft slo with their percentiles. -3. **TTFT-TPOT based:** does QPS search based on ttft and tpot slo with their percentiles. +1. **Fluidity-Index based:** performs QPS search based on the deadline SLO and the deadline miss rate (1 - *fluidity-index*) SLO. Also leverages the request-level deadline miss rate percentile. +2. **TBT based:** performs QPS search based on the TBT and TTFT SLOs with their percentiles. +3. **TPOT based:** performs QPS search based on the TTFT and TPOT SLOs with their percentiles. + +The figure below shows the maximum capacity achieved under different SLOs for Llama-3-8B on different traces and open-source systems on an H100 GPU: + +.. image:: ../_static/assets/capacity_bars.png + :alt: capacity_bars + :align: center The following sections explain how to run capacity search for each of the above SLOs. -Deadline Based SLO -~~~~~~~~~~~~~~~~~~ +Fluidity-Index Based SLO +~~~~~~~~~~~~~~~~~~~~~~~~ .. code-block:: shell @@ -33,8 +39,8 @@ Deadline Based SLO ``--profile-dir`` should point to where the ``prefill_predictor.pkl`` model (obtained when running the prefill profiler) is stored for a given model and open-source system. -TBT-TTFT Based SLO -~~~~~~~~~~~~~~~~~~ +TBT Based SLO +~~~~~~~~~~~~~ .. code-block:: shell @@ -48,8 +54,8 @@ TBT-TTFT Based SLO --max-iterations 10 \ --config-path ./metron/capacity_search/config/llama_8b.yml -TTFT-TPOT Based SLO -~~~~~~~~~~~~~~~~~~~ +TPOT Based SLO +~~~~~~~~~~~~~~ .. 
code-block:: shell diff --git a/docs/tutorials/measuring_qps.rst b/docs/tutorials/measuring_qps.rst index 8a66a79..cf965a3 100644 --- a/docs/tutorials/measuring_qps.rst +++ b/docs/tutorials/measuring_qps.rst @@ -18,5 +18,15 @@ It is defined as maximum request load (queries-per-second) a system can sustain Steps to Measure Capacity ------------------------- -1. First ``metron`` needs to profile the prefill times of the given open source system and model combination. Refer to :doc:`prefill_profiler` for more details on how to run prefill profiler. -2. And then ``metron`` runs capacity search to find the maximum QPS. Refer to :doc:`capacity_search` for more details on how to run capacity search. +Prefill Profiler +~~~~~~~~~~~~~~~~ +First, ``metron`` needs to profile the prefill times of the given open-source system and model combination. + +Refer to :doc:`prefill_profiler` for more details on how to run the prefill profiler. + + +Capacity Search +~~~~~~~~~~~~~~~ +``metron`` then runs capacity search to find the maximum QPS. + +Refer to :doc:`capacity_search` for more details on how to run capacity search. diff --git a/docs/tutorials/metrics.rst b/docs/tutorials/metrics.rst index 3ba4c09..dd6a8ca 100644 --- a/docs/tutorials/metrics.rst +++ b/docs/tutorials/metrics.rst @@ -1,57 +1,14 @@ Metrics ======= -``metron`` provides a set of metrics to evaluate the performance of LLM inference systems. These metrics are designed to capture the nuances of LLM inference process and its impact on real-time user experience. +``metron`` provides a set of metrics to evaluate the performance of LLM inference systems. These metrics are designed to capture the nuances of the LLM inference process and its impact on real-time user experience. +``metron`` also enables visualizing these metrics to understand and compare the performance across different LLM inference systems. 
-``metron`` also enables visualizing these metrics to understand and compare the performance across different LLM inference systems. Check out the following to learn more: +Check the following resources to learn more: .. toctree:: - :maxdepth: 1 + :maxdepth: 2 + metrics_used visualizing_metrics - -The description of each metric is provided below: - -Time to First Token (TTFT) -^^^^^^^^^^^^^^^^^^^^^^^^^^ - -It is defined as the time taken between arrival and first output token generated by system for each request. TTFT includes both scheduling delay and prompt processing time. Lower TTFT is better. - -Time Between Tokens (TBT) -^^^^^^^^^^^^^^^^^^^^^^^^^ - -It is defined as the time taken between two consecutive output tokens generated by system for each request. Lower TBT is better. - -Time Per Output Token (TPOT) -^^^^^^^^^^^^^^^^^^^^^^^^^^^^ - -It is defined as total time taken to generate all output tokens divided by the number of output tokens generated. Lower TPOT is better. - -Normalized Latency -^^^^^^^^^^^^^^^^^^ - -It is defined as total execution time of request divided by the number of output tokens generated. It includes scheduling delay, prompt generation time and time taken to generate all decode tokens. Lower Normalized Latency is better. - -*unified-deadline* -^^^^^^^^^^^^^^^^^^ - -.. note:: - - *unified-deadline* is a new metric introduced by ``metron`` to evaluate LLM inference systems. It is designed to capture the nuances of LLM inference process and its impact on real-time user experience. - - -Given prefill SLO and TBT SLO, *unified-deadline* is defined as token deadline miss rate of a given request. It accounts for slack which is the difference between actual time taken to generate token and deadline for that token. That slack is used by subsequent tokens if current token is generated before deadline. Lower *unified-deadline* is better. - -More formally, let prefill SLO be :math:`D_p` and TBT SLO be :math:`D_d`. 
Every token generation is characterized as a periodic task :math:`r_i = (t_i, d_i, s_i)`, where :math:`t_i` is the arrival time of :math:`i^{th}` token, :math:`d_i` is deadline for :math:`i^{th}` token, (i.e., :math:`t_i + D + slack_{i-1}`, where :math:`D = D_p` if first token else :math:`D = D_d`) and :math:`s_i` is the actual time taken to generate :math:`i^{th}` token. - -If :math:`s_i \leq d_i`, then :math:`slack_{i} = slack_{i-1} + d_i - s_i`, else :math:`slack_{i} = 0`. :math:`slack_{0} = 0` for first token. - -The *unified-deadline* is calculated as follows: - -.. math:: - - \textit{unified-deadline miss rate} = \frac{\sum_{i=1}^{n} \mathbb{I}\{s_i > d_i\}}{n} - -, where :math:`\mathbb{I}\{s_i > d_i\} = 1` if :math:`s_i > d_i` else :math:`0` and :math:`n` is the number of decode tokens generated. - diff --git a/docs/tutorials/metrics_used.rst b/docs/tutorials/metrics_used.rst new file mode 100644 index 0000000..2b74344 --- /dev/null +++ b/docs/tutorials/metrics_used.rst @@ -0,0 +1,64 @@ +Metrics used by metron +====================== + +``metron`` supports four conventional metrics: TTFT, TBT, TPOT, and Normalized Latency. + +Additionally, it introduces two new metrics, *fluidity-index* and *fluid-token-generation-rate*, to evaluate LLM inference systems. + +The description of each metric is provided below: + +Time to First Token (TTFT) +^^^^^^^^^^^^^^^^^^^^^^^^^^ + +It is defined as the time between a request's arrival and the first output token generated by the system. TTFT includes both scheduling delay and prompt processing time. Lower TTFT is better. + +Time Between Tokens (TBT) +^^^^^^^^^^^^^^^^^^^^^^^^^ + +It is defined as the time taken between two consecutive output tokens generated by the system for each request. Lower TBT is better. + +Time Per Output Token (TPOT) +^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + +It is defined as the total time taken to generate all output tokens divided by the number of output tokens generated. Lower TPOT is better. 
+ +Normalized Latency +^^^^^^^^^^^^^^^^^^ + +It is defined as the total execution time of a request divided by the number of output tokens generated. It includes scheduling delay, prompt processing time, and the time taken to generate all decode tokens. Lower Normalized Latency is better. + +.. _fluidity-index: + +*fluidity-index* +^^^^^^^^^^^^^^^^^^ + +.. note:: + + *fluidity-index* is a new metric introduced by ``metron`` to evaluate LLM inference systems. It is designed to capture the nuances of the LLM inference process and its impact on real-time user experience. + + +Given target prefill and decode latencies, *fluidity-index* is defined as the fraction of tokens that satisfy the target latencies for a given request. It accounts for slack, the difference between a token's deadline and the actual time taken to generate it; that slack can be used by subsequent tokens if the current token is generated before its deadline. Higher *fluidity-index* is better. + +More formally, let the target prefill latency be :math:`D_p` and the target decode latency be :math:`D_d`. Let :math:`D` denote the deadline budget for each token, where :math:`D = D_p` for the first token and :math:`D = D_d` for subsequent tokens. Every token generation is characterized as a periodic task :math:`r_i = (t_i, d_i, s_i)`, where :math:`t_i` is the arrival time of the :math:`i^{th}` token, :math:`d_i` is the deadline for the :math:`i^{th}` token (i.e., :math:`t_i + D + slack_{i-1}`), and :math:`s_i` is the actual time taken to generate the :math:`i^{th}` token. + +If :math:`t_i + s_i \leq d_i`, then :math:`slack_{i} = slack_{i-1} + D - s_i`; otherwise :math:`slack_{i} = 0`. :math:`slack_{0} = 0` for the first token. + +The *fluidity-index* is calculated as follows: + +.. math:: + + \textit{fluidity-index} = \frac{\sum_{i=1}^{n} \mathbb{I}\{t_i + s_i \leq d_i\}}{n} + +, where :math:`\mathbb{I}\{t_i + s_i \leq d_i\} = 1` if :math:`t_i + s_i \leq d_i` else :math:`0` and :math:`n` is the total number of output tokens generated. + +.. 
_fluid-token-generation-rate:
+
+*fluid token generation rate*
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+.. note::
+
+    *fluid token generation rate* is another new metric introduced by ``metron`` to evaluate LLM inference systems.
+
+*fluid token generation rate* is defined as the maximum number of tokens per second an inference system can serve such that 99% of requests achieve a *fluidity-index* of at least 0.9. Higher *fluid token generation rate* is better.
+
diff --git a/docs/tutorials/open_source_systems.rst b/docs/tutorials/open_source_systems.rst
index 3de27c5..045c717 100644
--- a/docs/tutorials/open_source_systems.rst
+++ b/docs/tutorials/open_source_systems.rst
@@ -34,23 +34,40 @@ Benchmark can be run as shown below:

     python -m metron.run_benchmark \
     --model "meta-llama/Meta-Llama-3-8B-Instruct" \
-    --max-num-completed-requests 150 \
-    --timeout 600 \
-    --num-concurrent-requests 10 \
-    --output-dir "result_outputs" \
-    --request-interval-generator-provider "poisson" \
-    --poisson-request-interval-generator-qps 0.5 \
-    --request-length-generator-provider "trace" \
-    --trace-request-length-generator-trace-file "./data/processed_traces/arxiv_summarization_filtered_stats_llama2_tokenizer.csv" \
+    --max-num-completed-requests 20 \
+    --request-interval-generator-provider "gamma" \
+    --request-length-generator-provider "zipf" \
     --request-generator-max-tokens 8192 \
-    --ttft-deadline 0.3 \
-    --tbt-deadline 0.03 \
+    --output-dir "results"
+
+Be sure to set the ``--model`` flag to the same model used to launch vLLM.
+
+.. note::
+
+    ``metron`` supports different generator providers for request interval and request length. For more details, refer to :doc:`../guides/request_generator_providers`.
+
+.. _wandb_args_open_source_systems:
+
+Specifying wandb args [Optional]
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+Optionally, you can specify the following arguments to log results to wandb:
+
+.. 
code-block:: shell
+
+    --should-write-metrics \
+    --wandb-project Project \
+    --wandb-group Group \
+    --wandb-run-name Run

-Be sure to update ``--model`` flag to same model used to launch vLLM.
+Other Arguments
+^^^^^^^^^^^^^^^

+There are many more arguments for running the benchmark; run the following to see the full list:
+
+.. code-block:: shell
+
+    python -m metron.run_benchmark -h
+
 Saving Results
 ~~~~~~~~~~~~~~~

diff --git a/docs/tutorials/prefill_profiler.rst b/docs/tutorials/prefill_profiler.rst
index fc19d47..962b218 100644
--- a/docs/tutorials/prefill_profiler.rst
+++ b/docs/tutorials/prefill_profiler.rst
@@ -3,6 +3,12 @@ Prefill Profiler

 To profile prefill times of open source systems and create a prefill time predictor for a given model and open source system combination, based on input prompt length, we can run ``metron.prefill_profiler``.

+.. image:: ../_static/assets/yi-prefill-time-curve.png
+    :align: center
+    :scale: 50%
+
+The figure above shows the prefill time curve for Yi-34B on 2 H100s; prefill time grows quadratically with prompt length.
+
 Launch any open source system and set up the API keys and URL as shown in :doc:`open_source_systems`. 
Then run the following command:

diff --git a/docs/tutorials/public_apis.rst b/docs/tutorials/public_apis.rst
index 5dd1058..4de669f 100644
--- a/docs/tutorials/public_apis.rst
+++ b/docs/tutorials/public_apis.rst
@@ -20,22 +20,35 @@ Running Benchmark

     python -m metron.run_benchmark \
     --model "meta-llama/Meta-Llama-3-8B-Instruct" \
-    --max-num-completed-requests 150 \
-    --timeout 600 \
-    --num-concurrent-requests 10 \
-    --output-dir "result_outputs" \
-    --request-interval-generator-provider "poisson" \
-    --poisson-request-interval-generator-qps 0.5 \
-    --request-length-generator-provider "trace" \
-    --trace-request-length-generator-trace-file "./data/processed_traces/arxiv_summarization_filtered_stats_llama2_tokenizer.csv" \
+    --max-num-completed-requests 20 \
+    --request-interval-generator-provider "gamma" \
+    --request-length-generator-provider "zipf" \
     --request-generator-max-tokens 8192 \
-    --ttft-deadline 0.3 \
-    --tbt-deadline 0.03 \
+    --output-dir "results"
+
+Be sure to set the ``--model`` flag to the model used by the proprietary system.
+
+.. note::
+
+    ``metron`` supports different generator providers for request interval and request length. For more details, refer to :doc:`../guides/request_generator_providers`.
+
+.. _wandb_args_proprietary_systems:
+
+Specifying wandb args [Optional]
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+Optionally, you can specify the following arguments to log results to wandb:
+
+.. code-block:: shell
+
+    --should-write-metrics \
+    --wandb-project Project \
+    --wandb-group Group \
+    --wandb-run-name Run

+Other Arguments
+^^^^^^^^^^^^^^^
+
 There are many more arguments for running the benchmark; run the following to see the full list:

.. 
code-block:: shell

diff --git a/docs/tutorials/visualizing_metrics.rst b/docs/tutorials/visualizing_metrics.rst
index 5710a11..c325ce8 100644
--- a/docs/tutorials/visualizing_metrics.rst
+++ b/docs/tutorials/visualizing_metrics.rst
@@ -1,26 +1,27 @@
 Visualizing Metrics
 ===================

-``metron`` logs all the metrics to wandb. You can visualize these metrics using wandb dashboard. Check :ref:`wandb_setup` to setup wandb.
+``metron`` logs all the metrics to wandb, and you can visualize them on the wandb dashboard. Check :ref:`wandb_setup` to set up wandb and :ref:`wandb_args_open_source_systems` to enable logging metrics to wandb.

 To visualize the metrics, follow the steps below:

 1. Go to the wandb dashboard at ``https://wandb.ai/<entity>/<project>``.

 2. Select the runs you want to visualize in the Workspace tab.

-.. image:: ../assets/wandb_dashboard.png
+.. image:: ../_static/assets/wandb_dashboard.png
    :alt: wandb_dashboard
    :align: center
+   :scale: 50%

 3. Go to any charts section you want to visualize.

-.. image:: ../assets/charts.png
+.. image:: ../_static/assets/charts.png
    :alt: wandb_charts
    :align: center

 4. Select the chart from the set of available charts in a given charts section.

-.. image:: ../assets/metric_chart.png
+.. image:: ../_static/assets/metric_chart.png
    :alt: wandb_chart
    :align: center

diff --git a/metron/run_benchmark.py b/metron/run_benchmark.py
index 050e389..d94bf26 100644
--- a/metron/run_benchmark.py
+++ b/metron/run_benchmark.py
@@ -434,13 +434,15 @@ def parse_args():
     args.add_argument(
         "--should-use-given-dir",  # added to prevent the creation of new directories for the capacity search
         type=bool,
-        default=False,
-        help=("Whether to add a new directory for the results. (default: %(default)s)"),
+        default=True,
+        help=(
+            "Whether to directly use the --output-dir directory or create new directories for the results. 
(default: %(default)s)" + ), ) args.add_argument( "--should-write-metrics", type=bool, - default=True, + default=False, action=argparse.BooleanOptionalAction, help=("Whether to write metrics to wandb. (default: %(default)s)"), )
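The final hunk above changes the default of ``--should-write-metrics``, a flag declared with ``argparse.BooleanOptionalAction``. A minimal standalone sketch of that flag pattern (Python 3.9+; illustrative only, not the full ``parse_args`` from ``run_benchmark.py``):

```python
import argparse

# BooleanOptionalAction generates a paired --flag / --no-flag switch that
# takes no value on the command line. This avoids the classic type=bool
# pitfall: bool("False") is True, so any non-empty argument parses as True.
parser = argparse.ArgumentParser()
parser.add_argument(
    "--should-write-metrics",
    default=False,
    action=argparse.BooleanOptionalAction,
    help="Whether to write metrics to wandb. (default: %(default)s)",
)

print(parser.parse_args([]).should_write_metrics)                            # False
print(parser.parse_args(["--should-write-metrics"]).should_write_metrics)    # True
print(parser.parse_args(["--no-should-write-metrics"]).should_write_metrics) # False
```

Because the action synthesizes both spellings, users can override the default in either direction without the string-to-bool ambiguity of ``type=bool``.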