[BFCL] Add BFCL_V2_Live Dataset #580

Merged (31 commits, Aug 19, 2024)
Commits (31):

- `5f7b238` standardize format for BFCL V1 dataset and possible answer (HuanzhiMao, Aug 13, 2024)
- `046560a` update eval_runner to support relevance category (HuanzhiMao, Aug 13, 2024)
- `5d0bb97` update checker with new format (HuanzhiMao, Aug 13, 2024)
- `82da573` update claude handler; remove outdated methods (HuanzhiMao, Aug 13, 2024)
- `283ea13` update utils and constant (HuanzhiMao, Aug 13, 2024)
- `c879235` update test file mapping (HuanzhiMao, Aug 14, 2024)
- `33c6b35` update model handlers accordingly (HuanzhiMao, Aug 14, 2024)
- `32b2ae8` add code to generate separate live csv for leaderboard (HuanzhiMao, Aug 14, 2024)
- `cad498e` fix typo (HuanzhiMao, Aug 14, 2024)
- `b8a6ca6` fix one more typo (HuanzhiMao, Aug 14, 2024)
- `6755ef2` add v2 dataset (HuanzhiMao, Aug 15, 2024)
- `179d97c` update README (HuanzhiMao, Aug 15, 2024)
- `ab957d6` rename categories for clarity (HuanzhiMao, Aug 15, 2024)
- `5294b17` Merge branch 'main' into bfcl_v2_live (HuanzhiMao, Aug 15, 2024)
- `832d37a` use weighted average instead of unweighted for summary column (HuanzhiMao, Aug 15, 2024)
- `4ac7876` nit: fix typo, add doc string (HuanzhiMao, Aug 15, 2024)
- `0c14a47` fix xlam handler (HuanzhiMao, Aug 16, 2024)
- `58743bd` revert back to unweighted average (HuanzhiMao, Aug 16, 2024)
- `2cd2d82` fix prompt processing logic for oss models (HuanzhiMao, Aug 16, 2024)
- `2eab158` nit: add explanation for dataset index (HuanzhiMao, Aug 16, 2024)
- `a4fc061` rename test files (HuanzhiMao, Aug 16, 2024)
- `73f17ce` update README test category options (HuanzhiMao, Aug 16, 2024)
- `d2a68c2` fix typo (HuanzhiMao, Aug 16, 2024)
- `58506a5` clean up (HuanzhiMao, Aug 16, 2024)
- `671af5f` fix dataset entry to use the new format (HuanzhiMao, Aug 16, 2024)
- `1a52d49` Merge branch 'main' into bfcl_v2_live (HuanzhiMao, Aug 16, 2024)
- `10f8b02` improve error log readability (HuanzhiMao, Aug 18, 2024)
- `8b0aa57` fixed empty param hadler parsing (CharlieJCJ, Aug 19, 2024)
- `ffa740a` use bfloat16 for OSS model response generation (HuanzhiMao, Aug 19, 2024)
- `3e468d0` update dataset entry (HuanzhiMao, Aug 19, 2024)
- `9898d35` update changelog (HuanzhiMao, Aug 19, 2024)
74 changes: 43 additions & 31 deletions berkeley-function-call-leaderboard/README.md
@@ -104,6 +104,8 @@ Below is *a table of models we support* to run our leaderboard evaluation against
|gpt-4o-2024-08-06 | Prompt|
|gpt-4o-2024-05-13-FC | Function Calling|
|gpt-4o-2024-05-13| Prompt|
|gpt-4o-mini-2024-07-18-FC | Function Calling|
|gpt-4o-mini-2024-07-18 | Prompt|
|google/gemma-7b-it 💻| Prompt|
|meetkai/functionary-medium-v3.1-FC| Function Calling|
|meetkai/functionary-small-{v3.1,v3.2}-FC| Function Calling|
@@ -145,33 +147,42 @@ For `Databrick-DBRX-instruct`, you need to create a Databricks Azure workspace an
### Available Test Category
In the following two sections, the optional `--test-category` parameter can be used to specify the category of tests to run. You can specify multiple categories separated by spaces. Available options include:

- `all`: Run all test categories.
- This is the default option if no test category is provided.
- `ast`: Abstract Syntax Tree tests.
- `executable`: Executable code evaluation tests.
- `python`: Tests specific to Python code.
- `non-python`: Tests for code in languages other than Python, such as Java and JavaScript.
- `python-ast`: Python Abstract Syntax Tree tests.
- Individual test categories:
- `simple`: Simple function calls.
- `parallel_function`: Multiple function calls in parallel.
- `multiple_function`: Multiple function calls in sequence.
- `parallel_multiple_function`: Multiple function calls in parallel and in sequence.
- `executable_simple`: Executable function calls.
- `executable_parallel_function`: Executable multiple function calls in parallel.
- `executable_multiple_function`: Executable multiple function calls in sequence.
- `executable_parallel_multiple_function`: Executable multiple function calls in parallel and in sequence.
- `java`: Java function calls.
- `javascript`: JavaScript function calls.
- `rest`: REST API function calls.
- `relevance`: Function calls with irrelevant function documentation.
- If no test category is provided, the script will run all available test categories. (same as `all`)

> If you want to run the `all` or `executable` or `python` category, make sure to register your REST API keys in `function_credential_config.json`. This is because Gorilla Openfunctions Leaderboard wants to test model's generated output on real world API!

> If you do not wish to provide API keys for REST API testing, set `test-category` to `ast` or any non-executable category.

> By setting the `--api-sanity-check` flag, or `-c` for short, if the test categories include `executable`, the evaluation process will perform the REST API sanity check first to ensure that all the API endpoints involved during the execution evaluation process are working properly. If any of them are not behaving as expected, we will flag those in the console and continue execution.
* Available test groups:
* `all`: All test categories.
* This is the default option if no test category is provided.
* `live`: All user-contributed live test categories.
* `non_live`: All non-user-contributed test categories (i.e., everything except `live`).
* `ast`: Abstract Syntax Tree tests.
* `executable`: Executable code evaluation tests.
* `python`: Tests specific to Python code.
* `non_python`: Tests for code in languages other than Python, such as Java and JavaScript.
* `python_ast`: Python Abstract Syntax Tree tests.
* Available individual test categories:
* `simple`: Simple function calls.
* `parallel`: Multiple function calls in parallel.
* `multiple`: Multiple function calls in sequence.
* `parallel_multiple`: Multiple function calls in parallel and in sequence.
* `java`: Java function calls.
* `javascript`: JavaScript function calls.
* `exec_simple`: Executable function calls.
* `exec_parallel`: Executable multiple function calls in parallel.
* `exec_multiple`: Executable multiple function calls in sequence.
* `exec_parallel_multiple`: Executable multiple function calls in parallel and in sequence.
* `rest`: REST API function calls.
* `irrelevance`: Function calls with irrelevant function documentation.
* `live_simple`: User-contributed simple function calls.
* `live_multiple`: User-contributed multiple function calls in sequence.
* `live_parallel`: User-contributed multiple function calls in parallel.
* `live_parallel_multiple`: User-contributed multiple function calls in parallel and in sequence.
* `live_irrelevance`: User-contributed function calls with irrelevant function documentation.
* `live_relevance`: User-contributed function calls with relevant function documentation.
* If no test category is provided, the script runs all available test categories (same as `all`).

> If you want to run the `all`, `non_live`, `executable`, or `python` categories, make sure to register your REST API keys in `function_credential_config.json`. This is because the executable test categories evaluate the model's generated output against real-world APIs.

> If you do not wish to provide API keys for REST API testing, set `test-category` to any non-executable category.

> If the `--api-sanity-check` flag (`-c` for short) is set and the test categories include any executable category (e.g., any test name containing `exec`), the evaluation process will first perform a REST API sanity check to ensure that all API endpoints involved in the execution evaluation are working properly. Any endpoints that are not behaving as expected are flagged in the console, and execution continues.
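The group options above are shorthand for sets of individual categories. A minimal sketch of how such a group-to-category expansion could work (the mapping and function names below are illustrative, reconstructed from the list above, and are not the actual constants in the codebase):

```python
# Illustrative expansion of test-category groups into individual categories.
# The group contents mirror the README list above; the names are assumptions.
GROUP_MAPPING = {
    "live": [
        "live_simple", "live_multiple", "live_parallel",
        "live_parallel_multiple", "live_irrelevance", "live_relevance",
    ],
    "executable": [
        "exec_simple", "exec_parallel", "exec_multiple",
        "exec_parallel_multiple", "rest",
    ],
    "ast": [
        "simple", "parallel", "multiple", "parallel_multiple",
        "java", "javascript",
    ],
}

def resolve_categories(args):
    """Expand group names into individual categories, de-duplicated, order kept."""
    resolved = []
    for arg in args:
        for category in GROUP_MAPPING.get(arg, [arg]):
            if category not in resolved:
                resolved.append(category)
    return resolved

print(resolve_categories(["live", "java"]))
```

Passing a group name alongside one of its members is harmless in this sketch, since duplicates are dropped while the original order is preserved.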


## Evaluating the LLM generations
@@ -181,7 +192,7 @@ In the following two sections, the optional `--test-category` parameter can be u
Navigate to the `gorilla/berkeley-function-call-leaderboard/eval_checker` directory and run the `eval_runner.py` script with the desired parameters. The basic syntax is as follows:

```bash
python eval_runner.py --model MODEL_NAME --test-category {TEST_CATEGORY,all,ast,executable,python,non-python}
python eval_runner.py --model MODEL_NAME --test-category TEST_CATEGORY
```

For available options for `MODEL_NAME` and `TEST_CATEGORY`, please refer to the [Models Available](#models-available) and [Available Test Category](#available-test-category) section.
@@ -202,16 +213,16 @@ If you want to evaluate all offline tests (do not require RapidAPI keys) for Ope
python eval_runner.py --model gpt-3.5-turbo-0125 --test-category ast
```

If you want to run `rest` tests for a few Claude models, you can use the following command:
If you want to run the `rest` tests for a few Claude models, you can use the following command:

```bash
python eval_runner.py --model claude-3-5-sonnet-20240620 claude-3-opus-20240229 claude-3-sonnet-20240229 --test-category rest
```

If you want to run `rest` and `javascript` tests for a few models and `gorilla-openfunctions-v2`, you can use the following command:
If you want to run `live_simple` and `javascript` tests for a few models and `gorilla-openfunctions-v2`, you can use the following command:

```bash
python eval_runner.py --model gorilla-openfunctions-v2 claude-3-5-sonnet-20240620 gpt-4-0125-preview gemini-1.5-pro-preview-0514 --test-category rest javascript
python eval_runner.py --model gorilla-openfunctions-v2 claude-3-5-sonnet-20240620 gpt-4-0125-preview gemini-1.5-pro-preview-0514 --test-category live_simple javascript
```
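Since `--model` and `--test-category` both accept multiple values, a command like the one above evaluates every (model, category) pair. A hypothetical sketch of that fan-out (not the actual `eval_runner.py` internals):

```python
from itertools import product

def plan_runs(models, categories):
    """Enumerate every (model, category) evaluation pair."""
    return list(product(models, categories))

models = ["gorilla-openfunctions-v2", "claude-3-5-sonnet-20240620"]
categories = ["live_simple", "javascript"]

for model, category in plan_runs(models, categories):
    print(f"evaluating {model} on {category}")
```

With two models and two categories, this plans four evaluation runs in total.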

### Model-Specific Optimization
@@ -221,6 +232,7 @@ Some companies have proposed some optimization strategies in their models' handl

## Changelog

* [August 19, 2024] [#580](https://github.com/ShishirPatil/gorilla/pull/580): Introduce BFCL V2 Live dataset, featuring user-contributed live prompts and function docs. To read more about the composition and construction of this dataset, please refer to our [blog](https://gorilla.cs.berkeley.edu/blogs/12_bfcl_v2_live.html). All CLI commands have been updated to support the new dataset.
* [August 8, 2024] [#574](https://github.com/ShishirPatil/gorilla/pull/574): Set temperature to 0.001 for all models for consistency and reproducibility.
* [August 7, 2024] [#571](https://github.com/ShishirPatil/gorilla/pull/571): Support parallel inference for hosted models. User can specify the number of threads to use for parallel inference by setting the `--num-threads` flag. The default is 1, which means no parallel inference.
* [August 6, 2024] [#569](https://github.com/ShishirPatil/gorilla/pull/569), [#570](https://github.com/ShishirPatil/gorilla/pull/570), [#573](https://github.com/ShishirPatil/gorilla/pull/573): Add the following new models to the leaderboard: