[BFCL] Add BFCL_V2_Live Dataset #580

Merged (31 commits, Aug 19, 2024)
Commits (31):

- `5f7b238` standardize format for BFCL V1 dataset and possible answer (HuanzhiMao, Aug 13, 2024)
- `046560a` update eval_runner to support relevance category (HuanzhiMao, Aug 13, 2024)
- `5d0bb97` update checker with new format (HuanzhiMao, Aug 13, 2024)
- `82da573` update claude handler; remove outdated methods (HuanzhiMao, Aug 13, 2024)
- `283ea13` update utils and constant (HuanzhiMao, Aug 13, 2024)
- `c879235` update test file mapping (HuanzhiMao, Aug 14, 2024)
- `33c6b35` update model handlers accordingly (HuanzhiMao, Aug 14, 2024)
- `32b2ae8` add code to generate separate live csv for leaderboard (HuanzhiMao, Aug 14, 2024)
- `cad498e` fix typo (HuanzhiMao, Aug 14, 2024)
- `b8a6ca6` fix one more typo (HuanzhiMao, Aug 14, 2024)
- `6755ef2` add v2 dataset (HuanzhiMao, Aug 15, 2024)
- `179d97c` update README (HuanzhiMao, Aug 15, 2024)
- `ab957d6` rename categories for clarity (HuanzhiMao, Aug 15, 2024)
- `5294b17` Merge branch 'main' into bfcl_v2_live (HuanzhiMao, Aug 15, 2024)
- `832d37a` use weighted average instead of unweighted for summary column (HuanzhiMao, Aug 15, 2024)
- `4ac7876` nit: fix typo, add doc string (HuanzhiMao, Aug 15, 2024)
- `0c14a47` fix xlam handler (HuanzhiMao, Aug 16, 2024)
- `58743bd` revert back to unweighted average (HuanzhiMao, Aug 16, 2024)
- `2cd2d82` fix prompt processing logic for oss models (HuanzhiMao, Aug 16, 2024)
- `2eab158` nit: add explanation for dataset index (HuanzhiMao, Aug 16, 2024)
- `a4fc061` rename test files (HuanzhiMao, Aug 16, 2024)
- `73f17ce` update README test category options (HuanzhiMao, Aug 16, 2024)
- `d2a68c2` fix typo (HuanzhiMao, Aug 16, 2024)
- `58506a5` clean up (HuanzhiMao, Aug 16, 2024)
- `671af5f` fix dataset entry to use the new format (HuanzhiMao, Aug 16, 2024)
- `1a52d49` Merge branch 'main' into bfcl_v2_live (HuanzhiMao, Aug 16, 2024)
- `10f8b02` improve error log readability (HuanzhiMao, Aug 18, 2024)
- `8b0aa57` fixed empty param hadler parsing (CharlieJCJ, Aug 19, 2024)
- `ffa740a` use bfloat16 for OSS model response generation (HuanzhiMao, Aug 19, 2024)
- `3e468d0` update dataset entry (HuanzhiMao, Aug 19, 2024)
- `9898d35` update changelog (HuanzhiMao, Aug 19, 2024)
74 changes: 43 additions & 31 deletions berkeley-function-call-leaderboard/README.md
@@ -104,6 +104,8 @@ Below is *a table of models we support* to run our leaderboard evaluation against
|gpt-4o-2024-08-06 | Prompt|
|gpt-4o-2024-05-13-FC | Function Calling|
|gpt-4o-2024-05-13| Prompt|
|gpt-4o-mini-2024-07-18-FC | Function Calling|
|gpt-4o-mini-2024-07-18 | Prompt|
|google/gemma-7b-it 💻| Prompt|
|meetkai/functionary-medium-v3.1-FC| Function Calling|
|meetkai/functionary-small-{v3.1,v3.2}-FC| Function Calling|
@@ -145,33 +147,42 @@ For `Databrick-DBRX-instruct`, you need to create a Databricks Azure workspace an
### Available Test Category
In the following two sections, the optional `--test-category` parameter can be used to specify the category of tests to run. You can specify multiple categories separated by spaces. Available options include:

- `all`: Run all test categories.
- This is the default option if no test category is provided.
- `ast`: Abstract Syntax Tree tests.
- `executable`: Executable code evaluation tests.
- `python`: Tests specific to Python code.
- `non-python`: Tests for code in languages other than Python, such as Java and JavaScript.
- `python-ast`: Python Abstract Syntax Tree tests.
- Individual test categories:
- `simple`: Simple function calls.
- `parallel_function`: Multiple function calls in parallel.
- `multiple_function`: Multiple function calls in sequence.
- `parallel_multiple_function`: Multiple function calls in parallel and in sequence.
- `executable_simple`: Executable function calls.
- `executable_parallel_function`: Executable multiple function calls in parallel.
- `executable_multiple_function`: Executable multiple function calls in sequence.
- `executable_parallel_multiple_function`: Executable multiple function calls in parallel and in sequence.
- `java`: Java function calls.
- `javascript`: JavaScript function calls.
- `rest`: REST API function calls.
- `relevance`: Function calls with irrelevant function documentation.
- If no test category is provided, the script will run all available test categories. (same as `all`)

> If you want to run the `all` or `executable` or `python` category, make sure to register your REST API keys in `function_credential_config.json`. This is because Gorilla Openfunctions Leaderboard wants to test model's generated output on real world API!

> If you do not wish to provide API keys for REST API testing, set `test-category` to `ast` or any non-executable category.

> By setting the `--api-sanity-check` flag, or `-c` for short, if the test categories include `executable`, the evaluation process will perform the REST API sanity check first to ensure that all the API endpoints involved during the execution evaluation process are working properly. If any of them are not behaving as expected, we will flag those in the console and continue execution.
* Available test groups:
* `all`: All test categories.
* This is the default option if no test category is provided.
* `live`: All user-contributed live test categories.
* `non_live`: All non-user-contributed test categories (i.e., everything except `live`).
* `ast`: Abstract Syntax Tree tests.
* `executable`: Executable code evaluation tests.
* `python`: Tests specific to Python code.
* `non_python`: Tests for code in languages other than Python, such as Java and JavaScript.
* `python_ast`: Python Abstract Syntax Tree tests.
* Available individual test categories:
* `simple`: Simple function calls.
* `parallel`: Multiple function calls in parallel.
* `multiple`: Multiple function calls in sequence.
* `parallel_multiple`: Multiple function calls in parallel and in sequence.
* `java`: Java function calls.
* `javascript`: JavaScript function calls.
* `exec_simple`: Executable function calls.
* `exec_parallel`: Executable multiple function calls in parallel.
* `exec_multiple`: Executable multiple function calls in sequence.
* `exec_parallel_multiple`: Executable multiple function calls in parallel and in sequence.
* `rest`: REST API function calls.
* `irrelevance`: Function calls with irrelevant function documentation.
* `live_simple`: User-contributed simple function calls.
* `live_multiple`: User-contributed multiple function calls in sequence.
* `live_parallel`: User-contributed multiple function calls in parallel.
* `live_parallel_multiple`: User-contributed multiple function calls in parallel and in sequence.
* `live_irrelevance`: User-contributed function calls with irrelevant function documentation.
* `live_relevance`: User-contributed function calls with relevant function documentation.
* If no test category is provided, the script runs all available test categories (same as `all`).

> If you want to run the `all`, `non_live`, `executable`, or `python` categories, make sure to register your REST API keys in `function_credential_config.json`. This is because the executable test categories evaluate the model's generated output against real-world APIs.

> If you do not wish to provide API keys for REST API testing, set `test-category` to any non-executable category.

> If the `--api-sanity-check` flag (`-c` for short) is set and the test categories include any executable category (e.g., any test name containing `exec`), the evaluation process will first perform a REST API sanity check to ensure that all API endpoints involved in the execution evaluation are working properly. Any endpoints that are not behaving as expected are flagged in the console, and execution continues.
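The group options above are shorthand for sets of individual categories. A minimal sketch of how such a group-to-category expansion could work (the mapping and function names below are illustrative, reconstructed from the list above, and are not the actual constants in the codebase):

```python
# Illustrative expansion of test-category groups into individual categories.
# The group contents mirror the README list above; the names are assumptions.
GROUP_MAPPING = {
    "live": [
        "live_simple", "live_multiple", "live_parallel",
        "live_parallel_multiple", "live_irrelevance", "live_relevance",
    ],
    "executable": [
        "exec_simple", "exec_parallel", "exec_multiple",
        "exec_parallel_multiple", "rest",
    ],
    "ast": [
        "simple", "parallel", "multiple", "parallel_multiple",
        "java", "javascript",
    ],
}

def resolve_categories(args):
    """Expand group names into individual categories, de-duplicated, order kept."""
    resolved = []
    for arg in args:
        for category in GROUP_MAPPING.get(arg, [arg]):
            if category not in resolved:
                resolved.append(category)
    return resolved

print(resolve_categories(["live", "java"]))
```

Passing a group name alongside one of its members is harmless in this sketch, since duplicates are dropped while the original order is preserved.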


## Evaluating the LLM generations
@@ -181,7 +192,7 @@ In the following two sections, the optional `--test-category` parameter can be u
Navigate to the `gorilla/berkeley-function-call-leaderboard/eval_checker` directory and run the `eval_runner.py` script with the desired parameters. The basic syntax is as follows:

```bash
python eval_runner.py --model MODEL_NAME --test-category {TEST_CATEGORY,all,ast,executable,python,non-python}
python eval_runner.py --model MODEL_NAME --test-category TEST_CATEGORY
```

For available options for `MODEL_NAME` and `TEST_CATEGORY`, please refer to the [Models Available](#models-available) and [Available Test Category](#available-test-category) section.
@@ -202,16 +213,16 @@ If you want to evaluate all offline tests (do not require RapidAPI keys) for Ope
python eval_runner.py --model gpt-3.5-turbo-0125 --test-category ast
```

If you want to run `rest` tests for a few Claude models, you can use the following command:
If you want to run the `rest` tests for a few Claude models, you can use the following command:

```bash
python eval_runner.py --model claude-3-5-sonnet-20240620 claude-3-opus-20240229 claude-3-sonnet-20240229 --test-category rest
```

If you want to run `rest` and `javascript` tests for a few models and `gorilla-openfunctions-v2`, you can use the following command:
If you want to run `live_simple` and `javascript` tests for a few models and `gorilla-openfunctions-v2`, you can use the following command:

```bash
python eval_runner.py --model gorilla-openfunctions-v2 claude-3-5-sonnet-20240620 gpt-4-0125-preview gemini-1.5-pro-preview-0514 --test-category rest javascript
python eval_runner.py --model gorilla-openfunctions-v2 claude-3-5-sonnet-20240620 gpt-4-0125-preview gemini-1.5-pro-preview-0514 --test-category live_simple javascript
```
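Since `--model` and `--test-category` both accept multiple values, a command like the one above evaluates every (model, category) pair. A hypothetical sketch of that fan-out (not the actual `eval_runner.py` internals):

```python
from itertools import product

def plan_runs(models, categories):
    """Enumerate every (model, category) evaluation pair."""
    return list(product(models, categories))

models = ["gorilla-openfunctions-v2", "claude-3-5-sonnet-20240620"]
categories = ["live_simple", "javascript"]

for model, category in plan_runs(models, categories):
    print(f"evaluating {model} on {category}")
```

With two models and two categories, this plans four evaluation runs in total.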

### Model-Specific Optimization
@@ -221,6 +232,7 @@ Some companies have proposed some optimization strategies in their models' handl

## Changelog

* [August 19, 2024] [#580](https://github.com/ShishirPatil/gorilla/pull/580): Introduce BFCL V2 Live dataset, featuring user-contributed live prompts and function docs. To read more about the composition and construction of this dataset, please refer to our [blog](https://gorilla.cs.berkeley.edu/blogs/12_bfcl_v2_live.html). All CLI commands have been updated to support the new dataset.
* [August 8, 2024] [#574](https://github.com/ShishirPatil/gorilla/pull/574): Set temperature to 0.001 for all models for consistency and reproducibility.
* [August 7, 2024] [#571](https://github.com/ShishirPatil/gorilla/pull/571): Support parallel inference for hosted models. User can specify the number of threads to use for parallel inference by setting the `--num-threads` flag. The default is 1, which means no parallel inference.
* [August 6, 2024] [#569](https://github.com/ShishirPatil/gorilla/pull/569), [#570](https://github.com/ShishirPatil/gorilla/pull/570), [#573](https://github.com/ShishirPatil/gorilla/pull/573): Add the following new models to the leaderboard: