Decoupling retry logics #49

Open · wants to merge 1 commit into base: main
README.md: 5 changes (4 additions & 1 deletion)
@@ -36,10 +36,13 @@ Consider the following guidelines when creating your benchmark tests
1. **Run your test long enough to reach a stable state**. Throttling is based on the total compute you have deployed and are utilizing. The utilization includes active calls. As a result, you will see a higher call rate when ramping up on an unloaded deployment because there are no existing active calls being processed. Once your deployment is fully loaded with a utilization near 100%, throttling will increase as new calls can only be processed as earlier ones are completed. To ensure an accurate measure, set the duration long enough for the throughput to stabilize, especially when running at or close to 100% utilization.
1. **Consider whether to use a retry strategy, and the effect of throttling on the resulting stats**. Selecting a retry strategy requires careful consideration, as the resulting latency statistics will be affected if the resource is pushed beyond its capacity to the point of throttling.
* When running a test with `retry=none`, any throttled request is counted as throttled and a new request is made to replace it, with the start time of the replacement request reset to a newer time. If the resource being tested starts returning 429s, the latency metrics from this tool will only reflect the final successful request, without including the time spent retrying against the resource until a successful response was received (which may not be representative of the real-world user experience). Use this setting when the workload being tested is within the resource's capacity and no throttling occurs, or when you want to understand what percentage of requests to a PTU instance might need to be diverted to a backup resource, such as during periods of peak load which require more throughput than the PTU resource can handle.
+ * When running a test with `retry=header-based`, any failed or throttled request will be retried based on the value of the "retry-after" header(s), up to a max of 60 seconds.
* When running a test with `retry=exponential`, any failed or throttled request will be retried with exponential backoff, up to a max of 60 seconds. While it is always recommended to deploy backup AOAI resources for use-cases that will experience periods of high load, this setting may be useful for simulating a scenario where no backup resource is available and throttled or failed requests must still be fulfilled by the resource. In this case, the TTFT and e2e latency metrics will represent the time from the first throttled request to the time that the final request was successful, and may be more reflective of the total time that an end user could spend waiting for a response, e.g. in a chat application. Use this option when you want to understand the latency of requests which are throttled and need to be retried on the same resource, and how the total latency of a request is impacted by multiple retries.
+ * When running a test with `retry=all`, any failed or throttled request will be retried based on the value of the "retry-after" header(s) as well as with exponential backoff. Think of it as `header-based + exponential` (a minimal sketch of these strategies appears after this list).
+ Note that retrying the same request with two strategies rarely makes sense in practice; the option is provided so you can test a particular scenario or match your production setup.
* As a practical example, if a PTU resource is tested beyond 100% capacity and starts returning 429s:
* With `retry=none`, the TTFT and e2e latency statistics will remain stable (and very low), since only the successful requests are included in the metrics. The number of throttled requests will be relatively high.
- * With `retry=exponential`, the TTFT/e2e latency metrics will increase (potentially up to the max of 60 seconds), while the number of throttled requests will remain lower (since a request is only treated as throttled after 60 seconds, regardless of how many attempts were made within the retry period).
+ * With `retry=header-based` or `retry=exponential`, the TTFT/e2e latency metrics will increase (potentially up to the max of 60 seconds), while the number of throttled requests will remain lower (since a request is only treated as throttled after 60 seconds, regardless of how many attempts were made within the retry period).
* Total throughput values (RPM, TPM) may be lower when `retry=none` if rate limiting is applied.
* As a best practice, any PTU resource should be deployed with a backup PayGO resource for times of peak load. Accordingly, any testing should be conducted with the values suggested by the AOAI capacity calculator (within the Azure AI Portal) to ensure that throttling does not occur during testing.
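
To make the behavioural differences concrete, here is a minimal sketch of a retry loop implementing the four strategies described above. It is illustrative only: the function name, the 1-second starting backoff, and the doubling factor are assumptions, not part of this PR or of the tool's actual implementation (which also records per-request statistics and manages connections between attempts).

```python
import asyncio
import time

import aiohttp

RETRY_AFTER_MS_HEADER = "retry-after-ms"  # assumed value of RETRY_AFTER_MS_HEADER in oairequester.py
MAX_RETRY_SECONDS = 60.0                  # overall retry cap described above


async def post_with_retry(session: aiohttp.ClientSession, url: str, headers: dict,
                          body: dict, retry: str = "none") -> aiohttp.ClientResponse:
    """Illustrative only: retry a POST according to the strategies above."""
    start = time.monotonic()
    backoff_seconds = 1.0  # assumed starting delay for the exponential strategy
    while True:
        response = await session.post(url, headers=headers, json=body)
        if response.status == 200 or retry == "none":
            # retry=none: a 429 or failure is reported as-is and the caller
            # issues a brand-new request with a fresh start time.
            return response
        if time.monotonic() - start >= MAX_RETRY_SECONDS:
            return response  # give up: the request is counted as throttled/failed
        if retry in ("header-based", "all") and RETRY_AFTER_MS_HEADER in response.headers:
            # header-based: honour the wait suggested by the service
            await asyncio.sleep(float(response.headers[RETRY_AFTER_MS_HEADER]) / 1000.0)
        elif retry in ("exponential", "all"):
            # exponential: wait progressively longer between attempts
            await asyncio.sleep(backoff_seconds)
            backoff_seconds = min(backoff_seconds * 2.0, MAX_RETRY_SECONDS)
        else:
            return response  # no applicable retry signal for this strategy
```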

benchmark/bench.py: 2 changes (1 addition & 1 deletion)
@@ -31,7 +31,7 @@ def main():
load_parser.add_argument("--temperature", type=float, help="Request temperature.")
load_parser.add_argument("--top-p", type=float, help="Request top_p.")
load_parser.add_argument("-f", "--output-format", type=str, default="human", help="Output format.", choices=["jsonl", "human"])
load_parser.add_argument("-t", "--retry", type=str, default="none", help="Request retry strategy. See README for details", choices=["none", "exponential"])
load_parser.add_argument("-t", "--retry", type=str, default="none", help="Request retry strategy. See README for details", choices=["none", "header-based", "exponential", "all"])
load_parser.add_argument("-e", "--deployment", type=str, help="Azure OpenAI deployment name.", required=True)
load_parser.add_argument("api_base_endpoint", help="Azure OpenAI deployment base endpoint.", nargs=1)
load_parser.set_defaults(func=load)
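With this change, `--retry` accepts `none`, `header-based`, `exponential`, or `all`. Assuming the tool is launched via its `load` subcommand (the module path, deployment, and endpoint below are placeholders, not taken from this PR), an invocation using the new option might look like `python -m benchmark.bench load --deployment <deployment> --retry header-based https://<resource>.openai.azure.com/`.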
benchmark/loadcmd.py: 6 changes (3 additions & 3 deletions)
@@ -105,7 +105,7 @@ def load(args):
api_key=api_key,
url=url,
rate_limiter=rate_limiter,
- backoff=args.retry=="exponential",
+ retry=args.retry,
request_count=args.requests,
duration=args.duration,
aggregation_duration=args.aggregation_window,
@@ -116,7 +116,7 @@ def _run_load(request_builder: Iterable[dict],
api_key: str,
url: str,
rate_limiter=None,
- backoff=False,
+ retry:str="none",
duration=None,
aggregation_duration=60,
request_count=None,
@@ -126,7 +126,7 @@ def _run_load(request_builder: Iterable[dict],
dump_duration=1,
clients=max_concurrency,
json_output=json_output)
- requester = OAIRequester(api_key, url, backoff=backoff)
+ requester = OAIRequester(api_key, url, retry=retry)

async def request_func(session:aiohttp.ClientSession):
nonlocal aggregator
benchmark/oairequester.py: 10 changes (5 additions & 5 deletions)
@@ -45,12 +45,12 @@ class OAIRequester:
statistics.
:param api_key: Azure OpenAI resource endpoint key.
:param url: Full deployment URL in the form of https://<resource>.openai.azure.com/openai/deployments/<deployment>/chat/completions?api-version=<api_version>
- :param backoff: Whether to retry throttled or unsuccessful requests.
+ :param retry: Retry strategy for throttled or unsuccessful requests ("none", "header-based", "exponential", or "all").
"""
- def __init__(self, api_key: str, url: str, backoff=False):
+ def __init__(self, api_key: str, url: str, retry:str="none"):
self.api_key = api_key
self.url = url
- self.backoff = backoff
+ self.retry = retry

async def call(self, session:aiohttp.ClientSession, body: dict) -> RequestStats:
"""
@@ -95,7 +95,7 @@ async def _call(self, session:aiohttp.ClientSession, body: dict, stats: RequestS
self._read_utilization(response, stats)
if response.status != 429:
break
- if self.backoff and RETRY_AFTER_MS_HEADER in response.headers:
+ if (self.retry=="header-based" or self.retry=="all") and RETRY_AFTER_MS_HEADER in response.headers:
try:
retry_after_str = response.headers[RETRY_AFTER_MS_HEADER]
retry_after_ms = float(retry_after_str)
@@ -111,7 +111,7 @@ async def _call(self, session:aiohttp.ClientSession, body: dict, stats: RequestS

if response.status != 200 and response.status != 429:
logging.warning(f"call failed: {REQUEST_ID_HEADER}={response.headers[REQUEST_ID_HEADER]} {response.status}: {response.reason}")
- if self.backoff:
+ if (self.retry=="exponential" or self.retry=="all"):
response.raise_for_status()
if response.status == 200:
await self._handle_response(response, stats)
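For readers tracing the exponential path: the diff above only has `_call()` raise on a failed response when `retry` is `exponential` or `all`, so the actual backoff and re-invocation presumably happen in an outer wrapper capped at 60 seconds, as described in the README. Below is a minimal sketch of that assumed wiring using the `backoff` package; the class name, constants, and decorator placement are illustrative guesses, not the tool's real code.

```python
import aiohttp
import backoff

MAX_RETRY_SECONDS = 60.0  # retry cap described in the README


class OAIRequesterSketch:
    """Illustrative stand-in for OAIRequester's assumed exponential wiring."""

    def __init__(self, api_key: str, url: str, retry: str = "none"):
        self.api_key = api_key
        self.url = url
        self.retry = retry

    @backoff.on_exception(backoff.expo, aiohttp.ClientError, max_time=MAX_RETRY_SECONDS)
    async def call_with_backoff(self, session: aiohttp.ClientSession, body: dict) -> dict:
        async with session.post(self.url, headers={"api-key": self.api_key}, json=body) as response:
            if response.status != 200:
                if self.retry in ("exponential", "all"):
                    # Raising turns the failure into a retriable exception that the
                    # backoff decorator catches and retries with exponential delays.
                    response.raise_for_status()
                return {"status": response.status}  # surface the failure to the caller
            return await response.json()
```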