
[BFCL] Add BFCL_V2_Live Dataset #580

Merged · 31 commits · Aug 19, 2024
Changes from 16 commits
5f7b238
standardize format for BFCL V1 dataset and possible answer
HuanzhiMao Aug 13, 2024
046560a
update eval_runner to support relevance category
HuanzhiMao Aug 13, 2024
5d0bb97
update checker with new format
HuanzhiMao Aug 13, 2024
82da573
update claude handler; remove outdated methods
HuanzhiMao Aug 13, 2024
283ea13
update utils and constant
HuanzhiMao Aug 13, 2024
c879235
update test file mapping
HuanzhiMao Aug 14, 2024
33c6b35
update model handlers accordingly
HuanzhiMao Aug 14, 2024
32b2ae8
add code to generate separate live csv for leaderboard
HuanzhiMao Aug 14, 2024
cad498e
fix typo
HuanzhiMao Aug 14, 2024
b8a6ca6
fix one more typo
HuanzhiMao Aug 14, 2024
6755ef2
add v2 dataset
HuanzhiMao Aug 15, 2024
179d97c
update README
HuanzhiMao Aug 15, 2024
ab957d6
rename categories for clarity
HuanzhiMao Aug 15, 2024
5294b17
Merge branch 'main' into bfcl_v2_live
HuanzhiMao Aug 15, 2024
832d37a
use weighted average instead of unweighted for summary column
HuanzhiMao Aug 15, 2024
4ac7876
nit: fix typo, add doc string
HuanzhiMao Aug 15, 2024
0c14a47
fix xlam handler
HuanzhiMao Aug 16, 2024
58743bd
revert back to unweighted average
HuanzhiMao Aug 16, 2024
2cd2d82
fix prompt processing logic for oss models
HuanzhiMao Aug 16, 2024
2eab158
nit: add explanation for dataset index
HuanzhiMao Aug 16, 2024
a4fc061
rename test files
HuanzhiMao Aug 16, 2024
73f17ce
update README test category options
HuanzhiMao Aug 16, 2024
d2a68c2
fix typo
HuanzhiMao Aug 16, 2024
58506a5
clean up
HuanzhiMao Aug 16, 2024
671af5f
fix dataset entry to use the new format
HuanzhiMao Aug 16, 2024
1a52d49
Merge branch 'main' into bfcl_v2_live
HuanzhiMao Aug 16, 2024
10f8b02
improve error log readability
HuanzhiMao Aug 18, 2024
8b0aa57
fixed empty param handler parsing
CharlieJCJ Aug 19, 2024
ffa740a
use bfloat16 for OSS model response generation
HuanzhiMao Aug 19, 2024
3e468d0
update dataset entry
HuanzhiMao Aug 19, 2024
9898d35
update changelog
HuanzhiMao Aug 19, 2024
75 changes: 43 additions & 32 deletions berkeley-function-call-leaderboard/README.md
@@ -145,33 +145,43 @@ For `Databrick-DBRX-instruct`, you need to create a Databrick Azure workspace an
### Available Test Category
In the following two sections, the optional `--test-category` parameter can be used to specify the category of tests to run. You can specify multiple categories separated by spaces. Available options include:

- `all`: Run all test categories.
- This is the default option if no test category is provided.
- `ast`: Abstract Syntax Tree tests.
- `executable`: Executable code evaluation tests.
- `python`: Tests specific to Python code.
- `non-python`: Tests for code in languages other than Python, such as Java and JavaScript.
- `python-ast`: Python Abstract Syntax Tree tests.
- Individual test categories:
- `simple`: Simple function calls.
- `parallel_function`: Multiple function calls in parallel.
- `multiple_function`: Multiple function calls in sequence.
- `parallel_multiple_function`: Multiple function calls in parallel and in sequence.
- `executable_simple`: Executable function calls.
- `executable_parallel_function`: Executable multiple function calls in parallel.
- `executable_multiple_function`: Executable multiple function calls in sequence.
- `executable_parallel_multiple_function`: Executable multiple function calls in parallel and in sequence.
- `java`: Java function calls.
- `javascript`: JavaScript function calls.
- `rest`: REST API function calls.
- `relevance`: Function calls with irrelevant function documentation.
- If no test category is provided, the script will run all available test categories. (same as `all`)

> If you want to run the `all` or `executable` or `python` category, make sure to register your REST API keys in `function_credential_config.json`. This is because Gorilla Openfunctions Leaderboard wants to test model's generated output on real world API!

> If you do not wish to provide API keys for REST API testing, set `test-category` to `ast` or any non-executable category.

> By setting the `--api-sanity-check` flag, or `-c` for short, if the test categories include `executable`, the evaluation process will perform the REST API sanity check first to ensure that all the API endpoints involved during the execution evaluation process are working properly. If any of them are not behaving as expected, we will flag those in the console and continue execution.
* The following test groups are supported:
* `all`: Run all test categories (BFCL v1 & v2).
* This is the default option if no test category is provided.
* `v2_live`: Run all BFCL V2 user-contributed test categories.
* `v1_all`: Run all BFCL v1 test categories.
* `v1_ast`: Abstract Syntax Tree tests for BFCL v1.
* `v1_exec`: Executable code evaluation tests for BFCL v1.
* `v1_python`: Tests specific to Python code for BFCL v1.
* `v1_non_python`: Tests for code in languages other than Python, such as Java and JavaScript, for BFCL v1.
* `v1_python_ast`: Python Abstract Syntax Tree tests for BFCL v1.
* Individual test categories for BFCL v1:
* `v1_simple`: Simple function calls.
* `v1_parallel`: Multiple function calls in parallel.
* `v1_multiple`: Multiple function calls in sequence.
* `v1_parallel_multiple`: Multiple function calls in parallel and in sequence.
* `v1_java`: Java function calls.
* `v1_javascript`: JavaScript function calls.
* `v1_exec_simple`: Executable function calls.
* `v1_exec_parallel`: Executable multiple function calls in parallel.
* `v1_exec_multiple`: Executable multiple function calls in sequence.
* `v1_exec_parallel_multiple`: Executable multiple function calls in parallel and in sequence.
* `v1_rest`: REST API function calls.
* `v1_irrelevance`: Function calls with irrelevant function documentation.
* Individual test categories for BFCL v2:
* `v2_live_simple`: User-contributed simple function calls.
* `v2_live_multiple`: User-contributed multiple function calls in sequence.
* `v2_live_parallel`: User-contributed multiple function calls in parallel.
* `v2_live_parallel_multiple`: User-contributed multiple function calls in parallel and in sequence.
* `v2_live_irrelevance`: User-contributed function calls with irrelevant function documentation.
* `v2_live_relevance`: User-contributed function calls with relevant function documentation.
* If no test category is provided, the script will run all available test categories. (same as `all`)

> If you want to run the `all`, `v1_all`, `v1_exec` or `v1_python` category, make sure to register your REST API keys in `function_credential_config.json`. This is because the Gorilla Openfunctions Leaderboard tests the model's generated output against real-world APIs!

> If you do not wish to provide API keys for REST API testing, set `test-category` to any non-executable category.

> If the `--api-sanity-check` flag (`-c` for short) is set and the test categories include any executable categories (e.g., test names containing `exec`), the evaluation process first performs a REST API sanity check to ensure that all API endpoints involved in the execution evaluation are working properly. Any endpoints that are not behaving as expected are flagged in the console, and execution continues.
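
The test groups above can be thought of as a simple expansion from group names to lists of individual category names. The sketch below is illustrative only — the `TEST_GROUPS` mapping and `expand_test_categories` helper are hypothetical, not the repository's actual implementation:

```python
# Illustrative sketch: expanding --test-category group names into individual
# categories. The mapping and helper are hypothetical examples, not the
# leaderboard's actual code.

V2_LIVE = [
    "v2_live_simple",
    "v2_live_multiple",
    "v2_live_parallel",
    "v2_live_parallel_multiple",
    "v2_live_irrelevance",
    "v2_live_relevance",
]

V1_AST = [
    "v1_simple", "v1_parallel", "v1_multiple", "v1_parallel_multiple",
    "v1_java", "v1_javascript",
]

V1_EXEC = [
    "v1_exec_simple", "v1_exec_parallel", "v1_exec_multiple",
    "v1_exec_parallel_multiple", "v1_rest",
]

V1_ALL = V1_AST + V1_EXEC + ["v1_irrelevance"]

TEST_GROUPS = {
    "all": V1_ALL + V2_LIVE,
    "v1_all": V1_ALL,
    "v1_ast": V1_AST,
    "v1_exec": V1_EXEC,
    "v2_live": V2_LIVE,
}

def expand_test_categories(categories):
    """Expand group names into individual categories, de-duplicated, order kept.

    Names that are not groups (e.g. "v1_simple") pass through unchanged.
    """
    expanded = []
    for cat in categories:
        for item in TEST_GROUPS.get(cat, [cat]):
            if item not in expanded:
                expanded.append(item)
    return expanded
```

For example, `expand_test_categories(["v1_ast", "v2_live"])` would yield the six AST categories followed by the six live categories, and `expand_test_categories(["all"])` would cover everything.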


## Evaluating the LLM generations
@@ -199,19 +209,19 @@ python eval_runner.py --model gorilla-openfunctions-v2
If you want to evaluate all offline tests (those that do not require RapidAPI keys) for OpenAI GPT-3.5, you can use the following command:

```bash
python eval_runner.py --model gpt-3.5-turbo-0125 --test-category ast
python eval_runner.py --model gpt-3.5-turbo-0125 --test-category v1_ast v2_live
```

If you want to run `rest` tests for a few Claude models, you can use the following command:
If you want to run `v1_rest` tests for a few Claude models, you can use the following command:

```bash
python eval_runner.py --model claude-3-5-sonnet-20240620 claude-3-opus-20240229 claude-3-sonnet-20240229 --test-category rest
python eval_runner.py --model claude-3-5-sonnet-20240620 claude-3-opus-20240229 claude-3-sonnet-20240229 --test-category v1_rest
```

If you want to run `rest` and `javascript` tests for a few models and `gorilla-openfunctions-v2`, you can use the following command:
If you want to run `v1_rest` and `v1_javascript` tests for a few models and `gorilla-openfunctions-v2`, you can use the following command:

```bash
python eval_runner.py --model gorilla-openfunctions-v2 claude-3-5-sonnet-20240620 gpt-4-0125-preview gemini-1.5-pro-preview-0514 --test-category rest javascript
python eval_runner.py --model gorilla-openfunctions-v2 claude-3-5-sonnet-20240620 gpt-4-0125-preview gemini-1.5-pro-preview-0514 --test-category v1_rest v1_javascript
```

### Model-Specific Optimization
@@ -221,6 +231,7 @@ Some companies have proposed some optimization strategies in their models' handler

## Changelog

* [August 14, 2024] [#580](https://github.com/ShishirPatil/gorilla/pull/580): Introduce BFCL V2 dataset, featuring user-contributed prompts and function docs. To read more about the composition and construction of this dataset, please refer to our [blog](https://gorilla.cs.berkeley.edu/blogs/12_bfcl_v2_live.html). All CLI commands have been updated to support the new dataset.
* [August 8, 2024] [#574](https://github.com/ShishirPatil/gorilla/pull/574): Set temperature to 0.001 for all models for consistency and reproducibility.
* [August 7, 2024] [#571](https://github.com/ShishirPatil/gorilla/pull/571): Support parallel inference for hosted models. User can specify the number of threads to use for parallel inference by setting the `--num-threads` flag. The default is 1, which means no parallel inference.
* [August 6, 2024] [#569](https://github.com/ShishirPatil/gorilla/pull/569), [#570](https://github.com/ShishirPatil/gorilla/pull/570), [#573](https://github.com/ShishirPatil/gorilla/pull/573): Add the following new models to the leaderboard: