
[BFCL] Add BFCL_V2_Live Dataset #580

Merged · 31 commits · Aug 19, 2024
Changes from 16 commits
5f7b238
standardize format for BFCL V1 dataset and possible answer
HuanzhiMao Aug 13, 2024
046560a
update eval_runner to support relevance category
HuanzhiMao Aug 13, 2024
5d0bb97
update checker with new format
HuanzhiMao Aug 13, 2024
82da573
update claude handler; remove outdated methods
HuanzhiMao Aug 13, 2024
283ea13
update utils and constant
HuanzhiMao Aug 13, 2024
c879235
update test file mapping
HuanzhiMao Aug 14, 2024
33c6b35
update model handlers accordingly
HuanzhiMao Aug 14, 2024
32b2ae8
add code to generate separate live csv for leaderboard
HuanzhiMao Aug 14, 2024
cad498e
fix typo
HuanzhiMao Aug 14, 2024
b8a6ca6
fix one more typo
HuanzhiMao Aug 14, 2024
6755ef2
add v2 dataset
HuanzhiMao Aug 15, 2024
179d97c
update README
HuanzhiMao Aug 15, 2024
ab957d6
rename categories for clarity
HuanzhiMao Aug 15, 2024
5294b17
Merge branch 'main' into bfcl_v2_live
HuanzhiMao Aug 15, 2024
832d37a
use weighted average instead of unweighted for summary column
HuanzhiMao Aug 15, 2024
4ac7876
nit: fix typo, add doc string
HuanzhiMao Aug 15, 2024
0c14a47
fix xlam handler
HuanzhiMao Aug 16, 2024
58743bd
revert back to unweighted average
HuanzhiMao Aug 16, 2024
2cd2d82
fix prompt processing logic for oss models
HuanzhiMao Aug 16, 2024
2eab158
nit: add explanation for dataset index
HuanzhiMao Aug 16, 2024
a4fc061
rename test files
HuanzhiMao Aug 16, 2024
73f17ce
update README test category options
HuanzhiMao Aug 16, 2024
d2a68c2
fix typo
HuanzhiMao Aug 16, 2024
58506a5
clean up
HuanzhiMao Aug 16, 2024
671af5f
fix dataset entry to use the new format
HuanzhiMao Aug 16, 2024
1a52d49
Merge branch 'main' into bfcl_v2_live
HuanzhiMao Aug 16, 2024
10f8b02
improve error log readability
HuanzhiMao Aug 18, 2024
8b0aa57
fixed empty param handler parsing
CharlieJCJ Aug 19, 2024
ffa740a
use bfloat16 for OSS model response generation
HuanzhiMao Aug 19, 2024
3e468d0
update dataset entry
HuanzhiMao Aug 19, 2024
9898d35
update changelog
HuanzhiMao Aug 19, 2024
75 changes: 43 additions & 32 deletions berkeley-function-call-leaderboard/README.md
@@ -145,33 +145,43 @@ For `Databrick-DBRX-instruct`, you need to create a Databrick Azure workspace an
### Available Test Category
In the following two sections, the optional `--test-category` parameter can be used to specify the category of tests to run. You can specify multiple categories separated by spaces. Available options include:

- `all`: Run all test categories.
- This is the default option if no test category is provided.
- `ast`: Abstract Syntax Tree tests.
- `executable`: Executable code evaluation tests.
- `python`: Tests specific to Python code.
- `non-python`: Tests for code in languages other than Python, such as Java and JavaScript.
- `python-ast`: Python Abstract Syntax Tree tests.
- Individual test categories:
- `simple`: Simple function calls.
- `parallel_function`: Multiple function calls in parallel.
- `multiple_function`: Multiple function calls in sequence.
- `parallel_multiple_function`: Multiple function calls in parallel and in sequence.
- `executable_simple`: Executable function calls.
- `executable_parallel_function`: Executable multiple function calls in parallel.
- `executable_multiple_function`: Executable multiple function calls in sequence.
- `executable_parallel_multiple_function`: Executable multiple function calls in parallel and in sequence.
- `java`: Java function calls.
- `javascript`: JavaScript function calls.
- `rest`: REST API function calls.
- `relevance`: Function calls with irrelevant function documentation.
- If no test category is provided, the script will run all available test categories. (same as `all`)

> If you want to run the `all` or `executable` or `python` category, make sure to register your REST API keys in `function_credential_config.json`. This is because Gorilla Openfunctions Leaderboard wants to test model's generated output on real world API!

> If you do not wish to provide API keys for REST API testing, set `test-category` to `ast` or any non-executable category.

> By setting the `--api-sanity-check` flag, or `-c` for short, if the test categories include `executable`, the evaluation process will perform the REST API sanity check first to ensure that all the API endpoints involved during the execution evaluation process are working properly. If any of them are not behaving as expected, we will flag those in the console and continue execution.
* The following test groups are supported:
* `all`: Run all test categories (BFCL v1 & v2).
* This is the default option if no test category is provided.
* `v2_live`: Run all BFCL V2 user-contributed test categories.
* `v1_all`: Run all BFCL v1 test categories.
* `v1_ast`: Abstract Syntax Tree tests for BFCL v1.
* `v1_exec`: Executable code evaluation tests for BFCL v1.
* `v1_python`: Tests specific to Python code for BFCL v1.
* `v1_non_python`: Tests for code in languages other than Python, such as Java and JavaScript, for BFCL v1.
* `v1_python_ast`: Python Abstract Syntax Tree tests for BFCL v1.
* Individual test categories for BFCL v1:
* `v1_simple`: Simple function calls.
* `v1_parallel`: Multiple function calls in parallel.
* `v1_multiple`: Multiple function calls in sequence.
* `v1_parallel_multiple`: Multiple function calls in parallel and in sequence.
* `v1_java`: Java function calls.
* `v1_javascript`: JavaScript function calls.
* `v1_exec_simple`: Executable function calls.
* `v1_exec_parallel`: Executable multiple function calls in parallel.
* `v1_exec_multiple`: Executable multiple function calls in sequence.
* `v1_exec_parallel_multiple`: Executable multiple function calls in parallel and in sequence.
* `v1_rest`: REST API function calls.
* `v1_irrelevance`: Function calls with irrelevant function documentation.
* Individual test categories for BFCL v2:
* `v2_live_simple`: User-contributed simple function calls.
* `v2_live_multiple`: User-contributed multiple function calls in sequence.
* `v2_live_parallel`: User-contributed multiple function calls in parallel.
* `v2_live_parallel_multiple`: User-contributed multiple function calls in parallel and in sequence.
* `v2_live_irrelevance`: User-contributed function calls with irrelevant function documentation.
* `v2_live_relevance`: User-contributed function calls with relevant function documentation.
* If no test category is provided, the script will run all available test categories. (same as `all`)

> If you want to run the `all`, `v1_all`, `v1_exec` or `v1_python` category, make sure to register your REST API keys in `function_credential_config.json`. This is because the Gorilla Openfunctions Leaderboard tests the model's generated output against real-world APIs!

> If you do not wish to provide API keys for REST API testing, set `test-category` to any non-executable category.

> If the `--api-sanity-check` flag (`-c` for short) is set and the test categories include any executable categories (e.g., test names containing `exec`), the evaluation process first performs a REST API sanity check to ensure that all API endpoints involved in the execution evaluation are working properly. Any endpoints that are not behaving as expected are flagged in the console, and execution continues.
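
The test groups above can be thought of as a simple expansion from group names to lists of individual category names. The sketch below is illustrative only — the `TEST_GROUPS` mapping and `expand_test_categories` helper are hypothetical, not the repository's actual implementation:

```python
# Illustrative sketch: expanding --test-category group names into individual
# categories. The mapping and helper are hypothetical examples, not the
# leaderboard's actual code.

V2_LIVE = [
    "v2_live_simple",
    "v2_live_multiple",
    "v2_live_parallel",
    "v2_live_parallel_multiple",
    "v2_live_irrelevance",
    "v2_live_relevance",
]

V1_AST = [
    "v1_simple", "v1_parallel", "v1_multiple", "v1_parallel_multiple",
    "v1_java", "v1_javascript",
]

V1_EXEC = [
    "v1_exec_simple", "v1_exec_parallel", "v1_exec_multiple",
    "v1_exec_parallel_multiple", "v1_rest",
]

V1_ALL = V1_AST + V1_EXEC + ["v1_irrelevance"]

TEST_GROUPS = {
    "all": V1_ALL + V2_LIVE,
    "v1_all": V1_ALL,
    "v1_ast": V1_AST,
    "v1_exec": V1_EXEC,
    "v2_live": V2_LIVE,
}

def expand_test_categories(categories):
    """Expand group names into individual categories, de-duplicated, order kept.

    Names that are not groups (e.g. "v1_simple") pass through unchanged.
    """
    expanded = []
    for cat in categories:
        for item in TEST_GROUPS.get(cat, [cat]):
            if item not in expanded:
                expanded.append(item)
    return expanded
```

For example, `expand_test_categories(["v1_ast", "v2_live"])` would yield the six AST categories followed by the six live categories, and `expand_test_categories(["all"])` would cover everything.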


## Evaluating the LLM generations
@@ -199,19 +209,19 @@ python eval_runner.py --model gorilla-openfunctions-v2
If you want to evaluate all offline tests (those that do not require RapidAPI keys) for OpenAI GPT-3.5, you can use the following command:

```bash
python eval_runner.py --model gpt-3.5-turbo-0125 --test-category ast
python eval_runner.py --model gpt-3.5-turbo-0125 --test-category v1_ast v2_live
```

If you want to run `rest` tests for a few Claude models, you can use the following command:
If you want to run `v1_rest` tests for a few Claude models, you can use the following command:

```bash
python eval_runner.py --model claude-3-5-sonnet-20240620 claude-3-opus-20240229 claude-3-sonnet-20240229 --test-category rest
python eval_runner.py --model claude-3-5-sonnet-20240620 claude-3-opus-20240229 claude-3-sonnet-20240229 --test-category v1_rest
```

If you want to run `rest` and `javascript` tests for a few models and `gorilla-openfunctions-v2`, you can use the following command:
If you want to run `v1_rest` and `v1_javascript` tests for a few models and `gorilla-openfunctions-v2`, you can use the following command:

```bash
python eval_runner.py --model gorilla-openfunctions-v2 claude-3-5-sonnet-20240620 gpt-4-0125-preview gemini-1.5-pro-preview-0514 --test-category rest javascript
python eval_runner.py --model gorilla-openfunctions-v2 claude-3-5-sonnet-20240620 gpt-4-0125-preview gemini-1.5-pro-preview-0514 --test-category v1_rest v1_javascript
```

### Model-Specific Optimization
@@ -221,6 +231,7 @@ Some companies have proposed some optimization strategies in their models' handler

## Changelog

* [August 14, 2024] [#580](https://github.com/ShishirPatil/gorilla/pull/580): Introduce BFCL V2 dataset, featuring user-contributed prompts and function docs. To read more about the composition and construction of this dataset, please refer to our [blog](https://gorilla.cs.berkeley.edu/blogs/12_bfcl_v2_live.html). All CLI commands have been updated to support the new dataset.
* [August 8, 2024] [#574](https://github.com/ShishirPatil/gorilla/pull/574): Set temperature to 0.001 for all models for consistency and reproducibility.
* [August 7, 2024] [#571](https://github.com/ShishirPatil/gorilla/pull/571): Support parallel inference for hosted models. User can specify the number of threads to use for parallel inference by setting the `--num-threads` flag. The default is 1, which means no parallel inference.
* [August 6, 2024] [#569](https://github.com/ShishirPatil/gorilla/pull/569), [#570](https://github.com/ShishirPatil/gorilla/pull/570), [#573](https://github.com/ShishirPatil/gorilla/pull/573): Add the following new models to the leaderboard: