Apply min_new_tokens=2 to mixtral-8x7b, address #1777 (PR #1884)

Merged: 1 commit, Oct 22, 2024
compliance/nvidia/TEST06/README.md (1 addition, 1 deletion):

```diff
@@ -10,7 +10,7 @@ This repository provides the config files and scripts to run and verify TEST 06

 The purpose of this test is to ensure the consistency of the output of the LLM (Llama2 and Mixtral) model and avoid a potential EOS exploit. This test will make a performance run, with a limit of 100 samples and logging them into `mlperf_log_accuracy.json`. To achieve a passing result in this test, three criteria must be met:
 - In the case the first token is reported independently (not applicable for Offline scenario), it should match for every query with the first token of the model output.
-- For each query, the model output should only end with zero or one EOS token. The only exception for 2 EOS tokens is when the entire output sequences are EOS tokens (i.e. output is [eos_token_id, eos_token_id])
+- For each query, the model output should only end with zero or one EOS token.
 - The number of reported tokens should match with the length of output sequence.

 ## Requisites
```
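For orientation, the first and third criteria reduce to direct comparisons between what the SUT reports and what the model actually produced (the second, EOS criterion is implemented in `run_verification.py` below). The sketch that follows is illustrative only, not the repository's verification code; the record layout (`first_token`, `output_tokens`, `reported_len`) is a hypothetical simplification of what the script extracts from the MLPerf logs.

```python
from typing import Any, Dict, List

def first_token_check(records: List[Dict[str, Any]]) -> bool:
    # Criterion 1: the independently reported first token must equal
    # the first token of the full model output for every query.
    return all(r["first_token"] == r["output_tokens"][0] for r in records)

def token_count_check(records: List[Dict[str, Any]]) -> bool:
    # Criterion 3: the reported token count must equal the length
    # of the output sequence.
    return all(r["reported_len"] == len(r["output_tokens"]) for r in records)

# Hypothetical records, not real log entries:
records = [
    {"first_token": 5, "output_tokens": [5, 9, 2], "reported_len": 3},
    {"first_token": 7, "output_tokens": [7, 4, 4, 2], "reported_len": 4},
]
print(first_token_check(records), token_count_check(records))  # True True
```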
compliance/nvidia/TEST06/run_verification.py (1 addition, 2 deletions):

```diff
@@ -51,8 +51,7 @@ def eos_check(acc_data, dtype, eos_token_id=2):
         if data[i] == eos_token_id:
             n_eos_tokens += 1
             if n_eos_tokens >= 2:
-                # Allow output to be [eos_token_id, eos_token_id]
-                return len(data) == 2
+                return False
         if data[i] != eos_token_id:
             break
         i-=1
```
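With the carve-out removed, any output carrying two or more trailing EOS tokens now fails outright. Below is a self-contained restatement of the check for experimentation; the loop framing (`while i >= 0`) is reconstructed from context, since the diff only shows the loop body, and the real script reads the token IDs out of `mlperf_log_accuracy.json` rather than taking a plain list.

```python
def eos_check(data, eos_token_id=2):
    # Scan backwards from the end of the output; reject as soon as a
    # second trailing EOS token is seen. The old [eos, eos] exception
    # is gone because min_new_tokens=2 makes that output impossible.
    i = len(data) - 1
    n_eos_tokens = 0
    while i >= 0:
        if data[i] == eos_token_id:
            n_eos_tokens += 1
            if n_eos_tokens >= 2:
                return False
        if data[i] != eos_token_id:
            break
        i -= 1
    return True

print(eos_check([5, 9, 2]))  # True: one trailing EOS is fine
print(eos_check([5, 9]))     # True: no EOS at all is fine
print(eos_check([5, 2, 2]))  # False: two trailing EOS tokens
print(eos_check([2, 2]))     # False: allowed before this PR, rejected now
```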
language/mixtral-8x7b/README.md (8 additions, 5 deletions):

````diff
@@ -109,6 +109,9 @@ rclone copyurl https://inference.mlcommons-storage.org/mixtral_8x7b%2F2024.06.06
 #### Using wget

 Alternatively, you can simply cd into the folder where you want to place the dataset and run
+
+TBD: The dataset is being replaced in v5.0 due to https://github.com/mlcommons/inference/issues/1777
+
 ```bash
 wget https://inference.mlcommons-storage.org/mixtral_8x7b%2F2024.06.06_mixtral_15k_v4.pkl
 ```
````
````diff
@@ -261,17 +264,17 @@ python -u evaluate-accuracy.py --checkpoint-path [path_to_model_checkpoint] \
 Reference scores:
 Open Orca:
 ```json
-{'rouge1': 45.4911, 'rouge2': 23.2829, 'rougeL': 30.3615}
+{'rouge1': 45.5989, 'rouge2': 23.3526, 'rougeL': 30.4608}
 ```
 GSM8K:
 ```json
-{'gsm8k': 73.78}
+{'gsm8k': 73.66}
 ```
 MBXP:
 ```json
-{'mbxp': 60.12}
+{'mbxp': 60.16}
 ```
-For official submissions, 99% of each reference score is enforced. Additionally, 90%-110% of the generated tokens_per_samples:
+For official submissions, 99% of each reference score is enforced. Additionally, 90%-110% of the generated tokens_per_samples (counting all the non-EOS tokens):
 ```json
-{'tokens_per_sample': 145.9}
+{'tokens_per_sample': 144.84}
 ```
````
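The 90%-110% band is applied to the mean tokens per sample counting only non-EOS tokens, so trailing EOS tokens do not pad the count. A minimal sketch of that bookkeeping, assuming Mixtral's EOS token ID of 2 and plain lists of generated token IDs (the submission checker's actual parsing and field names differ):

```python
EOS_TOKEN_ID = 2  # Mixtral's </s>; stated here as an assumption

def mean_tokens_per_sample(outputs, eos_token_id=EOS_TOKEN_ID):
    # Mean output length over all queries, counting only non-EOS tokens.
    counts = [sum(1 for t in seq if t != eos_token_id) for seq in outputs]
    return sum(counts) / len(counts)

# Hypothetical generated token IDs for three queries:
outputs = [[5, 9, 11, 2], [7, 4, 2], [8, 8, 8, 8]]
mean = mean_tokens_per_sample(outputs)          # (3 + 2 + 4) / 3 = 3.0
in_band = 0.9 * 144.84 <= mean <= 1.1 * 144.84  # bound from the reference score
print(mean, in_band)
```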
language/mixtral-8x7b/SUT.py (1 addition, 1 deletion):

```diff
@@ -27,7 +27,7 @@
 gen_kwargs = {
     "early_stopping": True,
     "max_new_tokens": 1024,
-    "min_new_tokens": 1,
+    "min_new_tokens": 2,
     "num_beams": 1,
     "do_sample": False
 }
```
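These are standard Hugging Face `transformers` generation arguments; raising `min_new_tokens` from 1 to 2 has `generate` suppress EOS until at least two new tokens have been produced, which rules out the degenerate `[eos_token_id, eos_token_id]` output that the compliance check previously had to special-case. A runnable sketch of the call pattern, using a small stand-in model instead of the benchmark's Mixtral-8x7B checkpoint and a shortened `max_new_tokens` for speed:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Stand-in model for illustration; the actual SUT loads Mixtral-8x7B.
tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

gen_kwargs = {
    "early_stopping": True,
    "max_new_tokens": 64,  # the SUT uses 1024; shortened here for a quick run
    "min_new_tokens": 2,   # EOS is suppressed until two new tokens exist
    "num_beams": 1,
    "do_sample": False,
}

inputs = tokenizer("The answer is", return_tensors="pt")
out = model.generate(**inputs, **gen_kwargs)
new_tokens = out[0][inputs["input_ids"].shape[1]:]
assert len(new_tokens) >= 2  # guaranteed by min_new_tokens=2
print(tokenizer.decode(new_tokens))
```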