docs: add support for JSON dataset #212

Merged
merged 2 commits into from
May 6, 2024
5 changes: 5 additions & 0 deletions .changeset/clever-guests-appear.md
@@ -0,0 +1,5 @@
---

---

docs: add support for JSON dataset
103 changes: 53 additions & 50 deletions docs/dataset/basics.mdx
Expand Up @@ -42,35 +42,26 @@ The LLM prompt can use this value in the prompt through the `{{user_message}}` p
}
```

## Import from JSONL file

Specify a path to the JSONL file. Each line of the file should be a valid JSON object.
On import, the keys of this JSON will be converted into inputs of the sample.

If using relative paths, the path is treated relative to the configuration file.
## Import from Google Sheets
Specify a path to the Google sheet in the `empiricalrc.json` file.

```json
```json empiricalrc.json
"dataset": {
"path": "HumanEval.jsonl"
"path": "https://docs.google.com/spreadsheets/d/1AsMekKCG74m1PbBZQN_sEJgaW0b9Xarg4ms4mhG3i5k"
}
```
Refer to our [chatbot example](https://github.com/empirical-run/empirical/tree/main/examples/chatbot) which uses this dataset.

## Import from CSV
Specify a path to the CSV file in the `empiricalrc.json`. If using relative paths, the path is treated relative to the configuration file.
The sheet should contain column headers.
The rows of the file are converted into dataset inputs with column header names as the name of the parameter. For example:

```json
"dataset": {
"path": "foo.csv"
}
```md
| name | age |
| ---- | --- |
| John | 25 |
```

The CSV file should contain headers.
The lines of the file are converted into dataset inputs with column header names as the name of the parameter. For example:
```csv foo.csv
name,age
John,25
```
The above CSV gets converted into the following dataset object:
The above table in the sheet gets converted into the following dataset object:
```json
"dataset": {
"samples": [
Expand All @@ -84,33 +75,58 @@ The above CSV gets converted into the following dataset object:
}
```

The above conversion enables you to create a prompt with placeholders. For example:
```json
The above conversion enables you to create prompt with placeholders. For example:
```json empiricalrc.json
{
"prompt": "Your name is {{name}} and you are a helpful assistant..."
}
```

## Import from Google Sheets
Specify a path to the Google sheet in the `empiricalrc.json` file.
> If you wish to extract data from a specific sheet of Google Sheet, make sure to navigate to the desired sheet and copy the browser URL into `empiricalrc.json`.
```json empiricalrc.json
## Import from JSONL file

Specify a path to the JSONL file. Each line of the file should be a valid JSON object.
On import, the keys of this JSON will be converted into inputs of the sample.

If using relative paths, the path is treated relative to the configuration file.

```json
"dataset": {
"path": "https://docs.google.com/spreadsheets/d/1AsMekKCG74m1PbBZQN_sEJgaW0b9Xarg4ms4mhG3i5k"
"path": "HumanEval.jsonl"
}
```
Refer to our [chatbot example](https://github.com/empirical-run/empirical/tree/main/examples/chatbot) which uses this dataset.

The sheet should contain column headers.
The rows of the file are converted into dataset inputs with column header names as the name of the parameter. For example:
## Import from JSON

```md
| name | age |
| ---- | --- |
| John | 25 |
Specify a path to the JSON file. The file should contain an array of objects.
On import, the object keys will be converted into inputs of the sample.

If using relative paths, the path is treated relative to the configuration file.

```json
"dataset": {
"path": "dataset.json"
}
```
Refer to [tool call example](https://github.com/empirical-run/empirical/tree/main/examples/tool_calls) which uses this dataset.
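
As a rough illustration of the import rule described above, the following Python sketch maps a JSON array of objects to dataset samples. It is not Empirical's implementation: the helper name `samples_from_json_array` is hypothetical, and the `{"inputs": {...}}` sample shape is an assumption based on the statement that object keys become inputs of the sample.

```python
import json


def samples_from_json_array(text: str) -> dict:
    """Hypothetical helper: turn a JSON array of objects into dataset samples."""
    records = json.loads(text)
    if not isinstance(records, list) or not all(isinstance(r, dict) for r in records):
        raise ValueError("expected a JSON array of objects")
    # Assumed shape: each object's keys become the input names of one sample.
    return {"samples": [{"inputs": dict(r)} for r in records]}


example = '[{"name": "John", "age": 25}, {"name": "Jane", "age": 31}]'
dataset = samples_from_json_array(example)
```

A JSONL file differs only in that each line is parsed as its own object instead of the whole file being parsed as one array.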

## Import from CSV
Specify a path to the CSV file in the `empiricalrc.json`. If using relative paths, the path is treated relative to the configuration file.

The above table in the sheet gets converted into the following dataset object:
```json
"dataset": {
"path": "foo.csv"
}
```

The CSV file should contain headers.
The lines of the file are converted into dataset inputs with column header names as the name of the parameter. For example:
```csv foo.csv
name,age
John,25
```
The above CSV gets converted into the following dataset object:
```json
"dataset": {
"samples": [
Expand All @@ -124,27 +140,14 @@ The above table in the sheet gets converted into the following dataset object:
}
```

The above conversion enables you to create prompt with placeholders. For example:
```json empiricalrc.json
The above conversion enables you to create a prompt with placeholders. For example:
```json
{
"prompt": "Your name is {{name}} and you are a helpful assistant..."
}
```
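
The CSV conversion and the placeholder substitution can be sketched together in Python. This is a minimal sketch under stated assumptions, not the project's code: `samples_from_csv` and `fill_prompt` are hypothetical names, the `{"inputs": {...}}` sample shape is assumed, and values are kept as strings because that is what Python's `csv` module returns (how Empirical types them is not stated here).

```python
import csv
import io
import re


def samples_from_csv(text: str) -> dict:
    """Hypothetical helper: the header row supplies the input names for every sample."""
    reader = csv.DictReader(io.StringIO(text))
    return {"samples": [{"inputs": dict(row)} for row in reader]}


def fill_prompt(template: str, inputs: dict) -> str:
    """Replace each {{key}} with its input value; unknown keys are left untouched."""
    return re.sub(
        r"\{\{(\w+)\}\}",
        lambda m: str(inputs.get(m.group(1), m.group(0))),
        template,
    )


dataset = samples_from_csv("name,age\nJohn,25\n")
prompt = fill_prompt(
    "Your name is {{name}} and you are a helpful assistant...",
    dataset["samples"][0]["inputs"],
)
```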


> If you wish to extract data from a specific sheet of Google Sheet, make sure to navigate to the desired sheet and copy the browser URL into `empiricalrc.json`.
## Import Empirical JSON format

If your dataset follows the Empirical JSON format, you can import that from
a file or HTTP endpoint.

```json
"dataset": {
"path": "https://assets.empirical.run/datasets/json/spider-tiny.json"
}
```