
Eval: Probability Questions Sourced From Actuarial Exam P and University Statistics Courses #263

Merged: 1 commit into openai:main on May 17, 2023

Conversation

@steven-luabase (Contributor) commented Mar 16, 2023

Thank you for contributing an eval! ♥️

🚨 Please make sure your PR follows these guidelines, failure to follow the guidelines below will result in the PR being closed automatically. Note that even if the criteria are met, that does not guarantee the PR will be merged nor GPT-4 access granted. 🚨

PLEASE READ THIS:

In order for a PR to be merged, it must fail on GPT-4. We are aware that right now, users do not have access, so you will not be able to tell whether the eval fails or not. Please run your eval with GPT-3.5-Turbo, but keep in mind that when we run the eval, if GPT-4 scores higher than 90%, we will likely reject the submission since GPT-4 is already capable of completing the task.

We plan to roll out a way for users submitting evals to see the eval performance on GPT-4 soon. Stay tuned! Until then, you will not be able to see the eval performance on GPT-4. We encourage partial PRs with ~5-10 examples that we can then run the evals on and share the results with you, so you know how your eval does with GPT-4 before you write all 100 examples.

Eval details 📑

Eval name

Probability Questions

Eval description

Tests the model's ability to understand and answer probability questions. Questions are sourced from Society of Actuaries Exam P sample questions and practice problems/exams from statistics classes at MIT, UPenn, California State University, Durham University, the University of Connecticut, and other sources. The full list of questions and sources (in the same order as in the .jsonl files) can be found in this Google sheet: https://docs.google.com/spreadsheets/d/1TU_4VPhIce9JtLV5gLy619WNibVjiWB-dtiwqkBtCrU/edit?usp=sharing

What makes this a useful eval?

Tests the model's ability to understand worded probability questions, bring in concepts such as probability distributions, and then reason through to a correct answer.

Criteria for a good eval ✅

Below are some of the criteria we look for in a good eval. In general, we are seeking cases where the model does not do a good job despite being capable of generating a good response (note that there are some things large language models cannot do, so those would not make good evals).

Your eval should be:

  • Be thematically consistent: we'd like to see a number of prompts all demonstrating some particular failure mode. For example, we can create an eval on cases where the model fails to reason about the physical world.
  • Contain failures where a human can do the task but GPT-4 or GPT-3.5-Turbo cannot.
  • Include good signal around what the right behavior is. This means either a correct answer for Basic evals or the Fact Model-graded eval, or an exhaustive rubric for evaluating answers for the Criteria Model-graded eval.
  • Include at least 100 high-quality examples (it is okay to contribute only 5-10 meaningful examples and have us test them with GPT-4 before adding all 100).

If there is anything else that makes your eval worth including, please document it below.

Unique eval value

Using the match grading criteria, GPT-3.5-Turbo scored an accuracy of 0.07 ({'accuracy': 0.07}).
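
For context, match-style grading compares the model's completion against the ideal answer as a string, which is why the system prompt pins the output format to a bare number rounded to 4 decimal places. Below is a minimal sketch of that idea; it is illustrative only, not the actual Match class in openai/evals:

```python
# Illustrative exact-match grading: the completion must literally equal one
# of the "ideal" strings. This mirrors the idea behind match-based grading,
# not the actual implementation in openai/evals.
def grade(completion: str, ideal: list[str]) -> bool:
    return completion.strip() in ideal

print(grade("0.1111", ["0.1111"]))              # True
print(grade("The answer is 1/9.", ["0.1111"]))  # False, despite being correct
```

Under this criterion, answers that are mathematically right but formatted differently are marked wrong, which partly explains the low GPT-3.5-Turbo score.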

Eval structure 🏗️

Your eval should

  • Check that your data is in evals/registry/data/{name}
  • Check that your yaml is registered at evals/registry/evals/{name}.yaml
  • Ensure you have the right to use the data you submit via this eval

(For now, we will only be approving evals that use one of the existing eval classes. You may still write custom eval classes for your own cases, and we may consider merging them in the future.)
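
For reference, registering a match-based eval pairs the data file with a registry YAML entry along the lines of the sketch below; the eval name and paths here are illustrative assumptions, not taken from this PR:

```yaml
# Hypothetical registry file: evals/registry/evals/probability_questions.yaml
probability-questions:
  id: probability-questions.dev.v0
  description: Probability questions from Exam P and university statistics courses.
  metrics: [accuracy]

probability-questions.dev.v0:
  class: evals.elsuite.basic.match:Match
  args:
    samples_jsonl: probability_questions/samples.jsonl
```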

Final checklist 👀

Submission agreement

By contributing to Evals, you are agreeing to make your evaluation logic and data available under the same MIT license as this repository. You must have adequate rights to upload any data used in an Eval. OpenAI reserves the right to use this data in future service improvements to our product. Contributions to OpenAI Evals will be subject to our usual Usage Policies (https://platform.openai.com/docs/usage-policies).

  • I agree that my submission will be made available under an MIT license and complies with OpenAI's usage policies.

Email address validation

If your submission is accepted, we will be granting GPT-4 access to a limited number of contributors. Access will be given to the email address associated with the merged pull request.

  • I acknowledge that GPT-4 access will only be granted, if applicable, to the email address used for my merged pull request.

Limited availability acknowledgement

We know that you might be excited to contribute to OpenAI's mission, help improve our models, and gain access to GPT-4. However, due to the requirements mentioned above and the high volume of submissions, we will not be able to accept all submissions, and thus cannot grant GPT-4 access to everyone who opens a PR. We know this is disappointing, but we hope to set the right expectations before you open this PR.

  • I understand that opening a PR, even if it meets the requirements above, does not guarantee the PR will be merged nor GPT-4 access granted.

Submit eval

  • I have filled out all required fields in the evals PR form
  • (Ignore if not submitting code) I have run pip install pre-commit; pre-commit install and have verified that black, isort, and autoflake are running when I commit and push

Failure to fill out all required fields will result in the PR being closed.

Eval JSON data

Since we are using Git LFS, we ask eval submitters to add a number of Eval Samples (at least 5) from their contribution here:

View evals in JSON

Eval

{"input": [{"role": "system", "content": "You are a helpful statistician. Answer the questions with only the numerical answer rounded to 4 decimal places. Provide no explanation."}, {"role": "user", "content": "A pair of fair, standard dice are rolled. What is the probability the sum of the dice is 5"}], "ideal": ["0.1111"]}
{"input": [{"role": "system", "content": "You are a helpful statistician. Answer the questions with only the numerical answer rounded to 4 decimal places. Provide no explanation."}, {"role": "user", "content": "An airplane is built to be able to fly on one engine. If the plane's two engines operate independently, and each has a 1% chance of failing in any given four-hour flight, what is the chance the plane will fail to complete a four-hour flight to Oklahoma due to engine failure?"}], "ideal": ["0.0001"]}
{"input": [{"role": "system", "content": "You are a helpful statistician. Answer the questions with only the numerical answer rounded to 4 decimal places. Provide no explanation."}, {"role": "user", "content": "A 1-inch-diameter coin is thrown on a table covered with a grid of lines two inches apart. What is the probability the coin lands in a square without touching any of the lines of the grid?"}], "ideal": ["0.2500"]}
{"input": [{"role": "system", "content": "You are a helpful statistician. Answer the questions with only the numerical answer rounded to 4 decimal places. Provide no explanation."}, {"role": "user", "content": "Of the 50 students in a certain class, 5 speak French. Two students of the class will be selected at random. Which of the following is closest to the probability that neither of the students selected will speak French?"}], "ideal": ["0.8100"]}
{"input": [{"role": "system", "content": "You are a helpful statistician. Answer the questions with only the numerical answer rounded to 4 decimal places. Provide no explanation."}, {"role": "user", "content": "Of the 10 marbles in a box, 2 are green. A person will select 2 marbles simultaneously and at random from the box. What is the probability that neither of the marbles selected will be green?"}], "ideal": ["0.6222"]}

@luqman-openai (Collaborator) left a comment:

Thanks for submitting this eval! This PR looks interesting. I'm approving this PR.

@andrew-openai merged commit 8e276ea into openai:main on May 17, 2023
arbreton pushed a commit to arbreton/evals that referenced this pull request Jul 8, 2023
Eval: Probability Questions Sourced From Actuarial Exam P and University Statistics Courses (openai#263)

Linmj-Judy pushed a commit to TablewareBox/evals that referenced this pull request Feb 27, 2024
Eval: Probability Questions Sourced From Actuarial Exam P and University Statistics Courses (openai#263)

inwaves added a commit to inwaves/evals that referenced this pull request Mar 19, 2024
commit af58ab97de097ab65cf8c5bbf4b58c1abf645f6c
Author: Andrei Alexandru <inwaves@users.noreply.github.com>
Date:   Tue Mar 19 03:14:40 2024 +0000

    Remove Ainu dataset from skill acquisition (#345)

    * Remove Ainu dataset

    * Remove Ainu dataset

commit 5317164a5091db464772da76e18c5806523a7cba
Author: Oliver Jaffe <oliver.jaffe@hotmail.co.uk>
Date:   Fri Mar 15 14:37:48 2024 +0000

    Multithreading for GeminiSolver (#330)

    * GeminiSolver supports multithreading

    * add documentation for glm client setup

    * better doc

    * only share client between solver copies

    * implement backoff

    * add model version property

commit ac9024921855cff4e698718d10de11b91cab19b4
Author: Giulio Starace <giulio.starace@gmail.com>
Date:   Fri Mar 15 15:24:32 2024 +0100

    AnthropicSolver (#331)

    * add anthropic to pyproject toml

    * anthropic solver skeleton

    * name property

    * mvp; need to fix backoff/rate limiting; need to test cot

    * forgot to commit registry entry

    * implement backoff

    * implement alternating roles to support CoT

    * fill in the rest of the yaml with remaining anthropic models

    * log usage

    * make it a static method

    * anthropic solver pytests

    * update docs; implement model_version

    * dont use abbreviation

    Co-authored-by: Oliver Jaffe <oliver.jaffe@hotmail.co.uk>

    * let _solve accept kwargs

    * include haiku (came out today)

    * switch ordering of haiku and sonnet

    * reqs handled by pyproject.toml

    ---------

    Co-authored-by: Oliver Jaffe <oliver.jaffe@hotmail.co.uk>

commit 209fa98a10604b3d7645ab64dec148cb167abc16
Author: Dane <danesherbs@users.noreply.github.com>
Date:   Sat Mar 16 00:25:59 2024 +1100

    Refactor MLAB v2 (#340)

commit 081d74af1701fef72c8c0ad39e847e05a168289f
Author: Oliver Jaffe <oliver.jaffe@hotmail.co.uk>
Date:   Fri Mar 15 10:26:42 2024 +0000

    correct eval names for error recovery (#341)

commit 66684a555b0abe51d1be3c09cf33278bf67a7f76
Author: Dane <danesherbs@users.noreply.github.com>
Date:   Fri Mar 15 19:56:27 2024 +1100

    Update MLAB LICENSE.md (#338)

commit f2a900a6aa6c356305817fb0282073b97deaa291
Author: Chan Jun Shern <JunShern@users.noreply.github.com>
Date:   Fri Mar 15 06:33:07 2024 +0000

    Final cleanup for release (#339)

    * Remove redundant httpx logging settings

    * Minor bugged tools readme

    * minor cleanup and precommit icrl

    * Cleanup fdeduct todos

commit 7f5f8c6e44b22d11f51ccc506ebee0bd161b8d3d
Author: James Aung <129281094+james-aung@users.noreply.github.com>
Date:   Thu Mar 14 14:05:49 2024 -0700

    Update contribution statements in READMEs (#337)

commit cadff08108a7c81c15561d968fb8ebc3ab7832c5
Author: James Aung <129281094+james-aung@users.noreply.github.com>
Date:   Thu Mar 14 05:40:14 2024 -0700

    LICENSE files for Error Recover and 20 Questions (#335)

    * 20Q and Error Recover dataset Licenses

    * Create LICENSE files for 20Q and ER datasets

commit b7cffc0cf06b2cd3fd42a9491f0f6fbffa15c242
Author: Oliver Jaffe <oliver.jaffe@hotmail.co.uk>
Date:   Thu Mar 14 12:11:17 2024 +0000

    skip test (#334)

commit ffd004e65c7b7dda4f726916dc1fd453d7368347
Author: Dane <danesherbs@users.noreply.github.com>
Date:   Thu Mar 14 19:54:07 2024 +1100

    Add licenses to MLAB v2 (#333)

    * Create LICENSE

    * Update LICENSE.md

commit 5815222c56f8e0b186bbe8ba69f189e704db17e6
Author: Dane <danesherbs@users.noreply.github.com>
Date:   Thu Mar 14 19:42:26 2024 +1100

    Update LICENSE.md (#332)

commit 3a00e556ea1a1e18eb7063321e14d8534bf0e48d
Author: Giulio Starace <giulio.starace@gmail.com>
Date:   Thu Mar 14 08:53:16 2024 +0100

    Coherence - final PR (#325)

    * get rid of .dev alias

    * ast precommit checks+fixes

    * tts precommit checks+fixes

commit c71db16e4fafa93d44ec6dec58a24cbd5ad82259
Author: Dane <danesherbs@users.noreply.github.com>
Date:   Thu Mar 14 02:06:33 2024 +1100

    Update MLAB v2 scripts (#323)

commit 7399673ad004ca32a2bf668df1d7a3cdf11df5fa
Author: Giulio Starace <giulio.starace@gmail.com>
Date:   Wed Mar 13 14:27:21 2024 +0100

    wordnet license (#329)

commit d3bf8c96c5aca7a3733dcee985de7f1b77ba32d4
Author: Giulio Starace <giulio.starace@gmail.com>
Date:   Wed Mar 13 12:53:48 2024 +0100

    AST - Use wordnet corpus rather than brown corpus (#328)

    * wordnet, not brown

    * fix path

    * wordnet words, not brown words

    * wordnet corpus, not brown

commit 024ee14e7d06bf8f1d3f5a83f6e0ce84794e4992
Author: Ian McKenzie <140545726+ianmckenzie-oai@users.noreply.github.com>
Date:   Tue Mar 12 23:17:23 2024 -0700

    Multi-Step Web Tasks (formerly Bucket of Tasks) (#49)

    * add partial webarena

    * very basic task runnable (for real)

    * replace BoTSolver with Solver

    * change TaskState, change log, remove some redundant code/files

    * change logging debug to info

    * change to using local rather than global docker_client

    * delete unnecessary files

    * set up skeleton of BashBrowserEnv

    * add BashBrowser auxiliary classes

    * add more BashBrowser stuff, and fix running multiple tasks in sequence

    * simple task that uses bash and browser working!

    * setup and teardown code for simple-web working

    * abstract more container loading logic, downloading is untested

    * fix downloading, improve container setup

    * move Session class to separate file

    * add (untested) setups for webarena containers

    * basic setup working for all environments

    * remove accidentally committed scratch file

    * add gpt-3.5 solvers

    * refactor container setup, add networking

    * register bash with session, set up networking

    * remove input statement, add more waiting

    * move bash setup into Session and enable building from Dockerfile

    * add bugged internet disabling

    * disabling internet access seems to work

    * delete unused agent code

    * add try..except around Session setup

    * fix stopping condition to include early stopping

    * set up BoTTaskState such that default Solvers can try the task

    * add the three easy tasks

    * slightly edit the agent prompt to see if it helps

    * fix easy Python task!

    * change to gpt-4-32k

    * fix BashObservation to have data property, not method

    * more prompt tuning

    * add way to run environments in repro

    * start adding medium/hard tasks (not complete)

    * add homepage docker container (first draft)

    * remove unused apps from homepage

    * add homepage to Session

    * add medium tasks, fix evaluator

    * add individual tasks for testing

    * fix ProgramHTML evaluation

    * replace 'match' with 'elif's

    * fixes to make medium and hard tasks run

    * clean up datasets and yaml

    * fix '|' mistakes in match-statement-replacement elifs

    * add gitlab url fix

    * remove unused field from json

    * add reproducibility run_experiments.sh script

    * solvers and 3.9 compat: replacing '| None' with 'Optional' and 'match' with 'elif'

    * more 3.9 compat changes

    * add longer timeout to ready check (for simple-web)

    * small task fixes

    * some logging tweaks

    * some prompt tweaks

    * change to using StrongSolver in run_experiments.sh

    * re: Dane's review, add timeout in constants, fix elif

    * update gitlab task to say 'main'

    * update prompt to emphasize homepage

    * remove redundant network from bash env

    * fix homepage links

    * save final report as dictionaries with task_ids

    * add some scratch code to run the bash container too

    * fix task 7 to use different repo

    * add sleep and empty check to avoid issues with browser env

    * add hack to prevent 'goto' accessing the internet

    * use 0613 checkpoints

    * add homepage to all tasks

    * use model context length to choose observation history

    * change to using 0613 snapshots

    * change message fitting to cut long observations

    * have Session log errors that cause it to shut down

    * switch default action splitter to single backtick

    * add first draft of README

    * add explicit check for chat model

    * Revert "switch default action splitter to single backtick"

    This reverts commit 813d832b26ba3aefebc6144a12812adf0f9968f0.

    * add action splitter to previous actions in prompt

    * reduce chars per token estimate from 3.2 to 3

    * update wikipedia task to use Lorentz rather than Croatian election

    * change logging file extensions to match other evals

    * update StrongSolver to use tiktoken to fit context length

    * start on data parsing for plots

    * first version of plotting

    * change wikipedia task to not be answerable by gpt-4

    * modify prompts based on jun's feedback

    * remove vscode file

    * update import and remove redundant todo

    * make output dir automatically in make_plots, add to run_experiments.sh

    * tidy up plot, add descriptive labels

    * changes from Dane/Jun comments

    * add contribution statement to README

    * change Eval -> SolverEval

    * add cleanup script to reproducibility

    * remove old task file

    * remove 'BoTSolverResult'

    * remove unused 'browser_early_stop' fn

    * move requirements into pyproject.toml and add setup instructions

    * improve setup instructions in README and improve error handling in session.py

    * initial version of playwright-flask app

    * two options: exec or define all functions

    * have a draft of basic structure of client and server code down

    * remove unused 'run_function' methods

    * debugged some issues with return values

    * first draft of dockerizing api (untested)

    * more incremental progress on flask-playwright api

    * incremental: remove iptables, debug more commands

    * change cleanup to prune networks, and add wikipedia to run_environments

    * refactor to use 'Forwarder's everywhere, and change how ports are used

    * add separation between client urls and server urls

    * add flask-playwright to CLEANUP

    * end-to-end run of task 1 successful

    * add slightly better logging to failed actions

    * remove 'bridge' network, containers only accessible via 'bucket-of-tasks_network' now

    * fix task urls

    * update strong solver prompt (fix typo, add http://homepage.com)

    * change gitlab container to use 'http://gitlab.com' as its url

    * add better resetting, try to debug Execution context error

    * just wait after navigation commands

    * fix bugs with retrying and base urls inside containers

    * fix issue with quoting in page.evaluate and fix bug with retry logging

    * remove redundant url fixing method and continue after invalid url in goto

    * better error logging in evaluators.py

    * change urls on homepage to be docker internal

    * change to using homepage as start url for all tasks

    * update debugging script

    * remove hardcoded hack to ensure urls start with 127.0.0.1

    * revert change to gitlab internal url (should be gitlab:8023 for git clone)

    * change curl command logging info -> debug to reduce spam

    * move BoT reproducibility to elsuite

    * fix output dir in run_experiments

    * remove messy debugging scripts

    * remove linux restriction now that we use playwright container

    * remove exposed ports from container setup

    * change cache dir to be more portable

    * rename bucket-of-tasks to multistep-web-tasks

    * rename bucket of tasks in README

    * change default task to simple-web for CI, fix minor bugs

    * Remove commented-out code

    * Add spacing for docstrings

    ---------

    Co-authored-by: Dane <danesherbs@gmail.com>

commit c4c9517bb9ef2eee705a2d1ad00b9a805f2ab33c
Author: Giulio Starace <giulio.starace@gmail.com>
Date:   Mon Mar 11 12:48:51 2024 +0100

    Coherence - Readmes + minor cleanup (#316)

commit ebf453c16aeace5d9884aa809aae3c0ebc5f3a99
Author: Oliver Jaffe <oliver.jaffe@hotmail.co.uk>
Date:   Mon Mar 11 10:37:37 2024 +0000

    Bugged Tools - minor fixes (#315)

    * don't use judge with DummySolver

    * basic error handling around tools

commit b1c5cc0339d66202efcf4a1868f07b96025fa6aa
Author: Chan Jun Shern <JunShern@users.noreply.github.com>
Date:   Mon Mar 11 08:50:57 2024 +0000

    Function Deduction plotting code (#305)

    * Redo plotting code

    * Make colors less jarring

    * Fix missing text annotations bug

    * Update evals/elsuite/function_deduction/scripts/make_plots.py

    Co-authored-by: James Aung <129281094+james-aung@users.noreply.github.com>

    * Update evals/elsuite/function_deduction/scripts/make_plots.py

    Co-authored-by: James Aung <129281094+james-aung@users.noreply.github.com>

    ---------

    Co-authored-by: James Aung <129281094+james-aung@users.noreply.github.com>

commit ce40f1eafb653fa3da30c003f64b8132cbd0bb8b
Author: Oliver Jaffe <oliver.jaffe@hotmail.co.uk>
Date:   Mon Mar 11 08:27:30 2024 +0000

    don't attempt to sum None (#314)

commit c92857de8df82cc74861b493954c28dd239ab0ef
Author: Dane <danesherbs@users.noreply.github.com>
Date:   Mon Mar 11 19:20:18 2024 +1100

    Standardize terminology in MLAB v2 baselines (#321)

    * Update baselines to report "return" instead of "reward"

    * Update bipedal walker baseline

    * Update random sampling seed in naive baselines

commit e47ce920b4290209bf77216da1b5532eae788fed
Merge: e3b7360b 7efc3f65
Author: Chan Jun Shern <JunShern@users.noreply.github.com>
Date:   Mon Mar 11 08:17:48 2024 +0000

    Merge pull request #322 from openai/jun/change-steg-datasets

    Drop two datasets from steganography

commit e3b7360b13de77c9b901d3e33d2eff939ffcac7e
Author: James Aung <129281094+james-aung@users.noreply.github.com>
Date:   Mon Mar 11 01:02:30 2024 -0700

    Add InContext RL READMEs (#318)

    * Create README

    * Add ICRL to main README

    * Add Dataset section

    ---------

    Co-authored-by: Chan Jun Shern <chanjunshern@gmail.com>

commit 40b5f144325c75dfbf3087692348a3e214490129
Merge: 7e85428d 030b2324
Author: Chan Jun Shern <JunShern@users.noreply.github.com>
Date:   Mon Mar 11 06:15:55 2024 +0000

    Merge pull request #320 from openai/jun/sync-oss-20240311

    Jun/sync oss 20240311

commit 7efc3f6573182d039cd7fa5de729c195e26e8cbb
Author: Chan Jun Shern <chanjunshern@gmail.com>
Date:   Mon Mar 11 12:50:58 2024 +0800

    Drop two datasets from steganography

commit 030b2324798b81a043c0a56fa0b75e33317a72bf
Merge: 7e85428d 82ec660e
Author: Chan Jun Shern <chanjunshern@gmail.com>
Date:   Mon Mar 11 12:12:22 2024 +0800

    Merge remote-tracking branch 'public/main' into jun/sync-oss-20240311

commit 7e85428db2a1587f9a824be23543be32f78cd13e
Author: James Aung <129281094+james-aung@users.noreply.github.com>
Date:   Sun Mar 10 20:20:48 2024 -0700

    In-Context RL plotting (#280)

    * init template scripts

    * plotting code

    * correct plots

    * plotting code improvement

    * Got plotting working in notebook

    * Updated line styles for baselines

    * Change opacity of lines

    * Un-messup merge

    * Update anti-cot solver for 4-turbo and the new 3.5

    * Run experiments

    * Enable printing of each command before execution in run_experiments.sh script

    * New plotting code

    * Plotting code working

    * Delete old files

    * Add final average reward as a metric

    * Fix threading issue?

    * fix explanations arg name

    * change what solver we use to be generation/direct

    * no longer run sequentially thanks to threadding fix

    * Change qlearning baseline to train for max steps instead of max episodes

    * fix too many messages

    * new plotting code

    * qlearning for 1m steps

    * Add loop to run experiments multiple times

    * pretty names

    * Add filter script for log file processing

    * adjust fig size and add labels

    * New custom map for FrozenLake

    * Change max_steps to 200

    * add evaluation function

    * simplify plotting 😌

    * Fix saving plot to correct directory

    * per-env window sized and prettify

    * annotate lines with final values

    * invalid response rate plots

    * Update qlearning_baseline.ipynb

    * add labels

    * Catch broader class of GoogleAPIError

    * Widen catch for response.text errors

    * adjust invalid action message

    * new episode reward metrics

    * del plotting playground

    * Fix _calculate_episode_rewards method signature

    * Add handler for response.parts as well

    * cleanup plotting

    * update qlearning baselien for correct custom map

    * add short variant

    * max rolling avg in json

    * remove filter notebook

    * fix max reward of windows

    * Fix final settings for experiment script

    * Clean up prettifying functions to be dicts instead

    * clenn up episode rewards calculation in eval.py

    ---------

    Co-authored-by: Chan Jun Shern <chanjunshern@gmail.com>

commit 25aa67e6babcb32dfd6ac92ab1e87cb9387f321b
Author: Dane <danesherbs@users.noreply.github.com>
Date:   Sat Mar 9 20:48:28 2024 +1100

    Fix spelling mistake in HRMLAB prompts (#319)

commit caa46568eafbc6e7fc7d7ccd662e70d141fe02eb
Author: Oliver Jaffe <oliver.jaffe@hotmail.co.uk>
Date:   Fri Mar 8 16:28:12 2024 +0000

    prevent division by zero (#317)

commit d872aeb3bb326790078dc24fe9498e25ee802926
Author: Oliver Jaffe <oliver.jaffe@hotmail.co.uk>
Date:   Fri Mar 8 11:20:08 2024 +0000

    CDTA - readme (#313)

    * add eval README

    * compute headline metric

    * add metrics + description to eval yaml. Add shorter

    * add shorter version of eval

    * fix dataset dupliation bug

    * readme updates

    * point users towards prompts in readme

    * comment on custom chess solvers

    * add CDTA to root README

    * Update evals/registry/evals/cant_do_that_anymore.yaml

    Co-authored-by: Giulio Starace <giulio.starace@gmail.com>

    * replace subsequent -> special

    * Update evals/elsuite/cant_do_that_anymore/README.md

    Co-authored-by: Giulio Starace <giulio.starace@gmail.com>

    * mention more explicitly that each solver has a unique dataset

    * Update evals/elsuite/cant_do_that_anymore/README.md

    Co-authored-by: Giulio Starace <giulio.starace@gmail.com>

    * include both input and output tokens

    * include all metrics in eval yaml

    ---------

    Co-authored-by: Giulio Starace <giulio.starace@gmail.com>

commit 90f0bc6f95985243cb31107cbee5747772ea182a
Author: Dane <danesherbs@users.noreply.github.com>
Date:   Fri Mar 8 21:21:07 2024 +1100

    Add OpenAI Gym environments to HR-MLAB (#242)

    * Merge commit with main

    * Update plotting code

    * Update default max steps to be `30`

    * Add gym to HR-MLAB requirements

    * Tidy up vectorization's `train.py`

    * Add the BipedalWalker-v3 environment

    * Add the CartPole-v1 environment

    * Add partially-implemented llama inference task

    * Update Cart Pole baselines

    * Update baselines and scoring for BipedalWalker

    * Update Cart Pole baseline and grading

    * Update bipedal walker baselines and grading

    * Add the inverted pendulum env

    * Fix human baseline for inverted pendulum

    * Fix bipedal walker baselines

    * Add pusher environment

    * Add task descriptions and update yaml

    * Add the `Ant-v4` env

    * Add `Ant-v4` data and yaml

    * Add `Pong-ramDeterministic-v4` env

    * Make Ant normalization fn clearer

    * Update Ant grading

    * Add `Humanoid-v4` env

    * Update Cart Pole grading

    * Update inverted pendulum grading

    * Updating `Pusher-v4` grading

    * Update Pong human baseline and grading

    * Update yaml and add humanoid jsonl

    * Update humanoid grading

    * Update humanoid human baseline and grading

    * Update the bipedal walker grading script

    * Update bipedal walker grading docstring

    * Update `Cart Pole` time limit and baselines

    * Update time limits for inverted pendulum

    * Update bipedal walker max time limit

    * Update yaml file

    * Update `Ant-v4` environment

    * Update `BipedalWalker-v3` env

    * Update `CartPole-v1` env

    * Update `Humanoid-v4` env

    * Update `InvertedPendulum-v4` env

    * Update `PongNoFrameskip-v4` env

    * Update `Pusher-v4` env

    * Update .gitignore

    * Add time limit to attempt

    * Update the `Ant-v4` env

    * Update the `Humanoid-v3` env

    * Refactor get_baseline_score function to allow for additional files and saving checkpoints

    * Refactor human baseline script and add checkpoint file

    * Remove unnecessary code for file copying

    * Add cache decorator to score calculation functions

    * Add cache decorator to score calculation functions

    * Update CartPole human baseline

    * Cache baselines for humanoid env

    * Cache baselines for inverted pendulum env

    * Cache baselines for pong

    * Cache baselines for pusher

    * Update experiment scripts

    * Update timeout error message in environment.py

    * Create logic to stop eval on out of context error

    * Update .gitignore

    * Add time and steps remaining reminder for model

    * Make normalization functions linear

    * Handle context length exceeded in solver

    * Update README

    * Merge pyproject.toml

    * Add `max_time` parameter to v1 jsonl files

    * Remove default max steps etc. and improve instructions

    * Remove `shell=True` from file execution fn

    * Remove video recorder from baselines

    * Add script to calculate token estimates

    * Apply hooks

    * Refactor error message for invalid action input

    * Update task descriptions for llama inference and vectorization

    * Refactor solver class name to SimpleActionAgent

    * Update assertion error message in baseline solver

    * Remove commented-out code in autoeval.py

    * Update run_experiments.py

    * Update Bipedal Walker task description

    * Simplify token consumption estimate

    * Update README.md

    * Handle edge case for unknown completion fn

    * Change baseline solver to use `OpenAISolver` instead of `OpenAIChatCompletionFn`

    * Re-add files with Git LFS tracking

commit 2e0676e14616895baa5456a72c822476feef906a
Author: Chan Jun Shern <JunShern@users.noreply.github.com>
Date:   Fri Mar 8 09:55:40 2024 +0000

    Final plots for error rec (#312)

commit 1fa967fa8d470f88da0ea780666c250069a5a717
Author: Ian McKenzie <140545726+ianmckenzie-oai@users.noreply.github.com>
Date:   Fri Mar 8 01:48:48 2024 -0800

    [Skill Acquisition] Changes from rerun (#300)

    * Fix copy bug for subclasses

    * Add retries with backoff

    * Add task_description to record_sampling

    * Add SkillAcqAssistantsSolver along with yaml entries

    * Log token usage in final report

    * reduce yaml + run exps script to just solvers used

    * changes for rerun

    * add back some solvers

    * remove my absolute path

    * Update plotting script

    * Remove commented out code

    ---------

    Co-authored-by: Chan Jun Shern <chanjunshern@gmail.com>

commit e775967ecd1d9c639593d78d2ed5f612fafb1a5a
Author: James Aung <129281094+james-aung@users.noreply.github.com>
Date:   Thu Mar 7 07:58:52 2024 -0800

    error recovery README

commit 752432e3112f2eb387fa6ff6213ee651c5c193f6
Author: James Aung <129281094+james-aung@users.noreply.github.com>
Date:   Thu Mar 7 07:58:23 2024 -0800

    Update README.md (#309)

commit b9938df47d4ddbc3baffbef859e0d18f5cdd5e29
Author: Oliver Jaffe <oliver.jaffe@hotmail.co.uk>
Date:   Thu Mar 7 15:53:35 2024 +0000

    Mode Collapse - Variants + Polish (#307)

    * add option to create control dataset

    * rename dataset creation script

    * various readability + documentation improvements

    * log how many games have been examined during dataset creation process

    * only update tqdm bar + dataset if more than one move is found

    * remove unnecessary any_previous_move_found var

    * add option to save dataset throughout creation

    * remove unused n_threads arg for dataset_creation.py

    * remove unused recorder

    * skeleton structure for creating diagonal dataset

    * logic for finding diagonal moves

    * add support for diagonal experiment

    * remove unused plot. Render headline metrics

    * add plot for diagonal variant. Plot performance vs. num previous bishop moves

    * update progress bar with correct num. new examples

    * fixed bug where dataset would contain duplicate examples

    * run reproducability multiple times to compute SE

    * reproducability script to run diagonal variant

    * wrap single move in main+control datasets in list

    * seed changes on each different eval run

    * combine run_experiments scripts

    * remove unnecessary sort

    * add chess to pyproject.toml. Also correct typing

    * rename eval dir mode_collapse -> cant_do_that_anymore

    * rename mode collapse -> cant do that anymore

    * add diagonal dataset.

    * add required base solvers

    * up reproducability n_samples to 1000

    * remove marker from plots. Provide correct path for saved figure

    * update datasets to use new keys

    * use default dataset for DummySolver

commit 7f5f2a61d530485892a1e9a006e82006a3c3a7a3
Author: Giulio Starace <giulio.starace@gmail.com>
Date:   Thu Mar 7 13:36:28 2024 +0100

    Coherence reproducibility (#261)

    * repro script for already_said_that

    * make it exectuable

    * solvers on outer loop

    * use aliases rather than specific snapshot

    * autoformat

    * missing gpt-4-base solver

    * temporarily suspend cot solvers

    * switch to 10 threads

    * run_experiments script for track the stat

    * temporarily disable gpt-4-base

    * .log; temporarily skip gpt-4-base

    * specify seed

    * additional helper method for specs

    * distractorless baseline

    * formatting

    * script for making track_the_stat plots

    * already_said_that plotting

    * increase thread count to 100

    * enable gpt-4-base

    * explicit gpt-4-base was missing

    * get gpt-4-base too!

    * 50 threads; 100 causes hanging issues

    * indefinite (or long) hanging iwth more than 10 threads. sorry

    do gpt-4 first

    * use first letter rather than find-letter

    * clarify task even more

    * switch to first letter

    * all seeds of one solver before moving to the next

    * first letters (plural)

    * first letters plural

    * fix label; make results_dict init more general

    * use a global var for MODELS; correct first-letters to plural

    * fix legends

    * rename ideal to max

    * divide up plots and json

    * update cot solver names; only run gpt-4-turbo-preview cot by default

    * prepare plotting for gpt-4-base and cot 4 turbo

    * n_samples to match what ive been running (250 rather than 500) and reproducibility

    * make distractorless a bar

    * implement token counting

    * token counts in track the stat

    * handle CoT models

    * styling

    * run exps wrapping up

    * integrate gemini

    * tts explicit state for gemini and together

    * allow specifying the role of the explicit state message

    * integrate gemini in tts run exps

    * random baseline

    * Catch broader class of GoogleAPIError

    * Widen catch for response.text errors

    * integrate random baseline

    * remove extra comment

    * add direct solvers for together models

    * integrate together models in tts run exps

    * fix shortsight in comment

    * Add handler for response.parts as well

    * move gemini to bottom

    * handle new models

    * integrate random baseline

    * move main to the bottom

    * all stats in json stats

    * handle single seeds; third party models

    * plot random and human baseline

    * adjust plot size

    * switch to 0.05

    * snapshot name

    * move main to bottom

    * dont hardcode num repeats

    * gemini pro 1.0

    * better labels

    * turns, not steps

    * annotate n_samples as Optional

    * contract array definition/declaration

    * add diagnostic echo statements in loop

    * add explanatory comment

    * no longer need stat to legend loc

    * dont hardcode the length of the samples to 500

    ---------

    Co-authored-by: Chan Jun Shern <chanjunshern@gmail.com>

commit 1b85e50f1bfdc76ce8258054700c3d17bba473ba
Author: Giulio Starace <giulio.starace@gmail.com>
Date:   Thu Mar 7 10:40:20 2024 +0100

    Id vars - support for mixtral/gemini/llama in plots (#304)

    * autoformat

    * integrate third party solvers

    * clarify that third party is corrset only

    * tell users why were skipping

commit 0acb25cc4a79b5a22e4e674030a1cee6dd8c7ecf
Author: Oliver Jaffe <oliver.jaffe@hotmail.co.uk>
Date:   Thu Mar 7 09:20:45 2024 +0000

    Mode Collapse - Automatically Create Dataset (#294)

    * add model_version property to solvers

    * solvers that aren't nested now have str model_version

    * rename model -> solver. Initialise solver before calling get_solver_predictions

    * optionally create dataset for evaluated solver if doesn't exist

    * if solver isn't nested solver, clone to force temp=0

    * special_move_creation now doesn't compute solver predictions

    * update datasets

    * update args for reproducability script

    * move model prediction logic outside of special_move_creation.py

    * don't attempt to remake solver with temp=0

    * update default eval args to match standard setup

    * use dummy recorder for dataset creation

    * add special moves dataset to git lfs

    * move funcs in mode_collapse/scripts/utils.py -> mode_collapse/utils.py

    * add warning when generating dataset

    * up num. samples to 1000 to match default setup

commit 751004bf500cca60ee581f1e7fee98579abe4289
Author: Andrei Alexandru <inwaves@users.noreply.github.com>
Date:   Thu Mar 7 09:09:46 2024 +0000

    20 questions: add readme, final update to plotting code (#306)

    * WIP plotting and repro code

    * Add dataset generation script and generate datasets

    * Remove plurals from lexicon dataset

    * Update repro & plotting code, add explicit max_questions arg to specs

    * Uncomment other solvers

    * Remove sequential run, uncomment standard

    * Remove dataset creation script, other datasets, update YAML and repro

    * Add final dataset, accept lowercase guesses for proper nouns

    * Fix word difficulty constant in plotting code

    * Include variant as plot title

    * Less verbose plot titles, update prompt, temperature value

    * Add baseline, labels

    * Add readme, final updates to plotting code

    * uppercase 'they' in prompt

    * Swap names in contrb statement

commit 2ca19c356ec79a9f24cf0ed37c041e97512bbc83
Author: Oliver Jaffe <oliver.jaffe@hotmail.co.uk>
Date:   Thu Mar 7 08:56:42 2024 +0000

    Add model_version to solvers (#288)

    * add model_version property to solvers

    * solvers that aren't nested now have str model_version

commit 3df5e01870829d641aad22edf7771d6d34bf1377
Author: Oliver Jaffe <oliver.jaffe@hotmail.co.uk>
Date:   Wed Mar 6 13:51:09 2024 +0000

    Mode Collapse - Tweaks & Violations (#293)

    * update task description. Include more information on notation

    * raise error if notation parser is given incorrect input

    * construct controller for all solvers

    * measure and log violations

    * log std of previous move length

    * remove random solver

    * tidy incorrect notation handling

    * split variable init

    * tidy violation metric calculations

    * better naming of specific violation

    * move get_binary_avg to mode_collapse/utils.py, add typing

commit cff5aba866cc85fd1e541de2afcd1be2594e05b7
Author: Chan Jun Shern <JunShern@users.noreply.github.com>
Date:   Wed Mar 6 03:11:04 2024 +0000

    AssistantsSolver for SkillAcq (#291)

    * Fix copy bug for subclasses

    * Add retries with backoff

    * Add task_description to record_sampling

    * Add SkillAcqAssistantsSolver along with yaml entries

    * Place files under current_state['files']

    * Include files on every message in thread instead of only one; remove redundant self.all_uploaded_files variable

    * Expand SkillAcqAssistantsSolver to SkillAcquisitionAssistantsSolver

commit 037b1765e8c795d332261b022964e0834d0f10a5
Author: Chan Jun Shern <JunShern@users.noreply.github.com>
Date:   Wed Mar 6 03:10:28 2024 +0000

    Log token usage in final report (#298)

    * Log token usage in final report

    * Print comma-delimited numbers
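
Python's format mini-language covers the comma-delimited output described in this commit. A minimal sketch, using hypothetical counter names rather than the eval's actual report fields:

```python
# Hypothetical token counters; the real report fields may differ.
usage = {"prompt_tokens": 1234567, "completion_tokens": 89012}

for name, count in usage.items():
    # ':,' inserts thousands separators, e.g. 1,234,567
    print(f"{name}: {count:,}")
```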

commit 827f906736be5f2f55f43a2f0a8320e7f6ec4f4a
Author: Chan Jun Shern <JunShern@users.noreply.github.com>
Date:   Tue Mar 5 03:03:27 2024 +0000

    Jun/gemini bugfixes (#302)

    * Catch broader class of GoogleAPIError

    * Widen catch for response.text errors

    * Add handler for response.parts as well
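
These fixes widen the error handling around the Gemini client. A rough sketch of the pattern, assuming the google-generativeai package with an already-configured API key; the handling details are illustrative, not the repo's exact code:

```python
import google.generativeai as genai
from google.api_core.exceptions import GoogleAPIError

# Assumes genai.configure(api_key=...) has already been called.
model = genai.GenerativeModel("gemini-pro")

try:
    response = model.generate_content("Hello")
    # response.text raises ValueError when no valid candidate part was
    # returned (e.g. the response was blocked), so guard it as well.
    text = response.text
except GoogleAPIError as err:
    text = f"API error: {err}"
except ValueError:
    text = ""
```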

commit 038e43530f7dbf1c154a48952d3af6198f0144b1
Author: Giulio Starace <giulio.starace@gmail.com>
Date:   Mon Mar 4 12:06:01 2024 +0100

    Coherence - solver config for third party models (gemini/together) (#301)

    * tts explicit state for gemini and together

    * allow specifying the role of the explicit state message

commit 60be685819f24e44bab014fde330ecefc6c492b9
Author: Giulio Starace <giulio.starace@gmail.com>
Date:   Fri Mar 1 17:39:41 2024 +0100

    Coherence - Random baseline solvers (#266)

    * random baseline solver for track the stat

    * random baseline solver for already said that

    * add registry=None to inits so that they dont crash

    see #267

    * more consistent rounding

commit 9b6ee76d463683cd8974f34e4f03b6f866567d2e
Author: Ian McKenzie <140545726+ianmckenzie-oai@users.noreply.github.com>
Date:   Fri Mar 1 01:26:43 2024 -0800

    [Solvers] Add 'stop' sequences to base model calls (#295)

    * use msg seps as stop seqs for base model api

    * log warning if too many stop sequences are used
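
The idea is that when a chat conversation is flattened into a single prompt for a base model, the rendered role separators make natural stop sequences. A sketch under assumed separator strings (the OpenAI completions API accepts at most four stop sequences, hence the warning):

```python
import logging

logger = logging.getLogger(__name__)

MAX_STOP_SEQUENCES = 4  # OpenAI completions API limit

def build_stop_sequences(roles=("System:", "User:", "Assistant:")):
    # Stop at the next rendered role marker so the base model does not
    # continue the transcript by hallucinating further turns.
    stops = ["\n" + role for role in roles]
    if len(stops) > MAX_STOP_SEQUENCES:
        logger.warning(
            "Requested %d stop sequences; keeping only the first %d.",
            len(stops), MAX_STOP_SEQUENCES,
        )
        stops = stops[:MAX_STOP_SEQUENCES]
    return stops
```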

commit aa7431deb3b493d72626699c3fda5ae30f2e23d0
Author: Ian McKenzie <140545726+ianmckenzie-oai@users.noreply.github.com>
Date:   Thu Feb 29 14:43:04 2024 -0800

    [Solvers] Add open-source models from together.ai (#284)

    * initial version of together.ai solver

    * modify messages in TogetherSolver to fit format

    * add mixtral and 70b

    * add scripts for running on os models

    * add script for running evals over weekend

    * add sleep between evals

    * fix for loop in bugged_tools script

    * use 10 threads

    * make skill acq scripts executable

    * handle together.ai context length errors

    * add cot and custom os solvers

    * switch to using cot solvers

    * add run_os_experiments for error recovery

    * change run_os_experiments to executable

    * fix custom solvers to use correct model names

    * use n_repeat=1 to match gemini

    * remove print statement

    * fix bugged_tools log naming

    * fix skill_acquisition log naming

    * os plotting changes for fdeduct and ivariables

    * add os/gemini models to error recovery plots

    * revert os plotting/experiment scripts

    * add default api base url

    * remove unused completion_fn config for together

    * add optional message merging and tests
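
The optional message merging in the last item likely works around chat APIs that reject consecutive messages with the same role. A minimal sketch, assuming messages are plain role/content dicts:

```python
def merge_consecutive_messages(messages):
    """Merge runs of same-role messages into one, joining content with
    newlines, for APIs that require strictly alternating roles."""
    merged = []
    for msg in messages:
        if merged and merged[-1]["role"] == msg["role"]:
            merged[-1]["content"] += "\n" + msg["content"]
        else:
            merged.append(dict(msg))
    return merged
```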

commit 0e8b05d18b8bd737a39c48107c606b8c0dee91fb
Author: Chan Jun Shern <JunShern@users.noreply.github.com>
Date:   Thu Feb 29 19:56:49 2024 +0000

    [Solvers] Add GeminiSolver (#263)

    * Add initial working prototype for GeminiSolver

    * Add basic tests, add solvers/gemini.yaml

    * Add google-generativeai to pyproject.toml

    * Relax safety settings; handle API errors gracefully

    * Remove redundancy in model name

    * Revert to resp.text instead of longer version

    * Make test case less ambiguous

    * Update docstrings

    * Add postprocessors and CoT solver

    * Log messages in non-google format for logviz compatibility

    * Explicitly require EVALS_SEQUENTIAL while we haven't figured out threading

    * Catch known error about  while we haven't figured it out

    * Drop postprocs that were causing issues with CoT

    * Fix failing test

    * Register gemini for fdeduct and skillacq eval-specific solvers

    * point to github issue for gemini threading todo

    ---------

    Co-authored-by: Ian <ian.mckenzie@c-openai.com>

commit 0912197f18b0d4a27a2fb1d1764f5e52e2cfbccc
Author: Chan Jun Shern <JunShern@users.noreply.github.com>
Date:   Thu Feb 29 18:17:04 2024 +0000

    [Skill Acquisition] Jun + Ian bug fixes (#289)

    * Fix bug in get_average_bleu_score

    * Compute acc for non-translation q's only

    * Log more metrics

    * Use full paths for files in current_state

    * Fix wrong key name

    * Log number of translation and non-translation samples in final report

    * fix typo in 'wrong_section' prompt

    * stop removing quotes from model outputs

    ---------

    Co-authored-by: Ian McKenzie <ian.mckenzie@c-openai.com>

commit 19aa0f48759fb6881486b8670dcdd2edb445f740
Author: Chan Jun Shern <JunShern@users.noreply.github.com>
Date:   Mon Feb 26 15:19:50 2024 +0000

    Add handler for oai max messages exceeded (#290)

commit 5706f315c94748e41e48ea02f71ce3da3b5657a7
Author: Oliver Jaffe <oliver.jaffe@hotmail.co.uk>
Date:   Fri Feb 23 09:42:49 2024 +0000

    Mode Collapse - reproducibility (#276)

    * select dataset depending on evaluated model

    * add random solver

    * record avg. number of previous moves in dataset

    * reproducibility code for running and plotting main experiment

    * dataset creation follows models defined in eval.py

    * tqdm has correct target length

    * fix plotting bug, check if value is True correctly

    * sample dataset consistently between runs

    * remove preprocessors

    * add strip to solver outputs since postprocessors no longer used

    * only create controller if necessary

    * pass current state as None if no legal moves are passed

    * rename plot_experiments.py -> make_plots.py

    * add TODOs for remaining work

    * use snapshot for gpt-4-turbo

commit f6f6a765477eda6391083cd92fa16a0f42797000
Author: Andrei Alexandru <inwaves@users.noreply.github.com>
Date:   Fri Feb 23 10:01:00 2024 +0200

    20 questions: datasets + repro/plotting code (#258)

    * WIP plotting and repro code

    * Add dataset generation script and generate datasets

    * Remove plurals from lexicon dataset

    * Update repro & plotting code, add explicit max_questions arg to specs

    * Uncomment other solvers

    * Remove sequential run, uncomment standard

    * Remove dataset creation script, other datasets, update YAML and repro

    * Add final dataset, accept lowercase guesses for proper nouns

    * Fix word difficulty constant in plotting code

    * Include variant as plot title

commit c6f24dca6788185af403834ee0b1a86008518064
Author: James Aung <129281094+james-aung@users.noreply.github.com>
Date:   Fri Feb 23 07:50:07 2024 +0000

    String explanations and frozenlake variant for InContext RL (#283)

    * refactor main loop to use a single for loop

    * Change logging severity level

    * Switch out keys for an explanations string

    * Add FrozenLake variant with custom map

    * Update evals/elsuite/incontext_rl/defaults.py

    * Update use_explanations variable name

commit 6f83b69be681b3fd7a79fd8ac7c8caf9a325c2b5
Author: Ian McKenzie <140545726+ianmckenzie-oai@users.noreply.github.com>
Date:   Thu Feb 22 21:27:40 2024 -0800

    [Error recovery] Final eval + plot changes (#269)

    * partial progress on plots

    * progress on plots (still more to go)

    * small tweaks after putting plots in report

    * put step plots on same fig

    * make small changes to plots

    * messy own vs other plotting

    * change to have models next to each other, and refactor cli

    * add option to have reasoning in user message

    * add option to have answer prompt be user or system

    * parse mark_as_own_reasoning better

    * fix up run_experiments.sh

    * clean up own_reasoning vs other_reasoning

    * add gpt-4-base solver that continues assistant message

    * Remove unused function

    ---------

    Co-authored-by: Chan Jun Shern <JunShern@users.noreply.github.com>

commit 82fb4401b2b88acda979dcc9df715777274dd426
Author: James Aung <129281094+james-aung@users.noreply.github.com>
Date:   Thu Feb 22 17:29:54 2024 +0000

    Clean CurrentState for InContext RL (#282)

    * refactor main loop to use a single for loop

    * Change logging severity level

    * Clean CurrentState to have fewer properties

    * Fix random solver

commit 1d0fcf608729b5ea4830faae72ca2f2698ab952f
Author: James Aung <129281094+james-aung@users.noreply.github.com>
Date:   Thu Feb 22 15:55:26 2024 +0000

    Refactor main loop of in-context RL to use a single for loop (#281)

    * refactor main loop to use a single for loop

    * Change logging severity level
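
With gymnasium's step API, a single loop over a fixed step budget that resets on episode end is the natural structure. An illustrative sketch (environment, seed, and budget are assumptions, and a random policy stands in for the solver):

```python
import gymnasium as gym

env = gym.make("FrozenLake-v1")
obs, info = env.reset(seed=0)

for step in range(50):
    action = env.action_space.sample()  # stand-in for the solver's action
    obs, reward, terminated, truncated, info = env.step(action)
    if terminated or truncated:
        # Start a new episode within the same overall step budget.
        obs, info = env.reset()
```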

commit 2961e233dac2b5908a181c23d899413b0d53daeb
Author: Giulio Starace <giulio.starace@gmail.com>
Date:   Thu Feb 22 14:55:28 2024 +0100

    Coherence - missing solvers (#287)

    * missing gpt-4-base solver

    * CoT solvers for ast

commit c65dac28ce91403ff4810623c00ea7540ad97927
Author: Chan Jun Shern <JunShern@users.noreply.github.com>
Date:   Thu Feb 22 08:25:46 2024 +0000

    Rolling back postprocessors :( (#286)

    * rollback postproc

    * Make private interaction mismatch error msg more informative

commit 819657351194b01048fd50ee14e20415015cdd6c
Author: James Aung <129281094+james-aung@users.noreply.github.com>
Date:   Thu Feb 22 07:55:50 2024 +0000

    Add anti-cot solver to In-Context RL (#279)

commit 44dd342c3ab5bea59be36cec41e21cc818281f04
Author: James Aung <129281094+james-aung@users.noreply.github.com>
Date:   Wed Feb 21 14:40:44 2024 +0000

    Correct token counts for SkillAcq README (#277)

commit d9a8da7a9042adef3336c47b1720021f6689bf78
Author: James Aung <129281094+james-aung@users.noreply.github.com>
Date:   Wed Feb 21 14:40:29 2024 +0000

    Correct token counts for IV README (#277)

commit 3b741a791ec2724eca7405d539ab3c5fb9e006d8
Merge: 7dcd3a64 23ec5f9d
Author: James Aung <129281094+james-aung@users.noreply.github.com>
Date:   Wed Feb 21 11:03:33 2024 +0000

    Merge pull request #248 from openai/james/icrl

    First working version of InContext RL

commit 7dcd3a6493f0e4f81a71974dde6f5d3487dfc868
Author: Oliver Jaffe <oliver.jaffe@hotmail.co.uk>
Date:   Wed Feb 21 08:03:35 2024 +0000

    OpenAISolver generate from prefix (#264)

    * allow OpenAISolver to continue generating from prefix

    * return raw_completion_result in OpenAISolver SolverResult

    * rename start -> prefix. Check msgs length to avoid IndexError

    * correct typing

    * simplify prefix, store in fixed_start

    * add spaces around prefix

commit 9c0804cfb9d55b5e47e644df0b0b50951e0148da
Author: Giulio Starace <giulio.starace@gmail.com>
Date:   Wed Feb 21 08:47:21 2024 +0100

    Replace `find-letter` with `first-letter` (#274)

    * use first letter rather than find-letter

    * first letters (plural)

commit bd6f1850582e2c312ae546d3da0082384f4f34c5
Author: Chan Jun Shern <JunShern@users.noreply.github.com>
Date:   Wed Feb 21 07:45:20 2024 +0000

    Bugfix: Postproc for empty/short strings (#275)

    * Apply handling for edge cases where string is empty or shorter than required

    * Add tests

    * Simplify RemoveQuotes using string.strip args

    Co-authored-by: Giulio Starace <giulio.starace@gmail.com>

    * Simplify RemovePeriod using string.strip args

    Co-authored-by: Giulio Starace <giulio.starace@gmail.com>

    * Revert RemoveQuotes to only remove matching pairs; extend tests

    ---------

    Co-authored-by: Giulio Starace <giulio.starace@gmail.com>
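
The end state described here, removing quotes only when they form a matching pair while tolerating empty or very short strings, reduces to a small check. A sketch as a plain function (the repo's postprocessors appear to be class-based):

```python
def remove_quotes(s: str) -> str:
    # Strip quotes only when they form a matching pair, guarding against
    # strings that are empty or shorter than two characters.
    if len(s) >= 2 and s[0] == s[-1] and s[0] in {"'", '"'}:
        return s[1:-1]
    return s
```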

commit 0d05bdb3472a982050049d3e2361308a65ae3a93
Author: Giulio Starace <giulio.starace@gmail.com>
Date:   Wed Feb 21 04:57:33 2024 +0100

    HumanCLI solver convenience wrapper for Coherence evals (#262)

    * include task name in current state

    * allow for custom input prompt

    (default behaviour is unchanged)

    * human cli solver for track the stat

    * humancli wrapper for already said that

    * input_prompt is always a string, we provide a default

    * no need for `is None` check now

    Co-authored-by: Chan Jun Shern <JunShern@users.noreply.github.com>

    * autoformat with black

    * my solvers no longer need to override input prompt

    * need to provide at least one SolverSpec arg for it to be recognised as one

    ---------

    Co-authored-by: Chan Jun Shern <JunShern@users.noreply.github.com>

commit 2f29882a14d61cca682773e0ac5adeef805c30ff
Author: Giulio Starace <giulio.starace@gmail.com>
Date:   Wed Feb 21 04:28:04 2024 +0100

    Address evals hanging on certain samples (#260)

    * dont do nested multithreading

    * more aggressive timeout

    * switch back to 40 seconds

commit 53085bf9c6ce76489d824556d982d885c1a78d93
Author: Chan Jun Shern <JunShern@users.noreply.github.com>
Date:   Wed Feb 21 03:10:37 2024 +0000

    add guards for solvers that subclass Solver in diff ways (#270)

commit 23ec5f9dc1883088a3a5153b7b8660cfd46119f1
Author: james-aung <james.aung@c-openai.com>
Date:   Tue Feb 20 15:16:30 2024 +0000

    create variant which only runs on built in gymnasium environments

commit 78975fb5428855076fd3df19b09381ebe09fe6c1
Author: james-aung <james.aung@c-openai.com>
Date:   Tue Feb 20 15:11:13 2024 +0000

    Add gymnasium as a dependency

commit fcab6456b5b6ca7aa91083d016bebb0e6d7c77ed
Author: james-aung <james.aung@c-openai.com>
Date:   Tue Feb 20 15:08:38 2024 +0000

    Add Q-table initialization in QlearningSolver
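
A conventional Q-table initialization lazily maps unseen states to zero-valued action arrays; the QlearningSolver plausibly resembles the following sketch (the action count and all hyperparameters are illustrative):

```python
import random
from collections import defaultdict

n_actions = 4                      # from env.action_space.n in practice
alpha, gamma, epsilon = 0.1, 0.99, 0.1

# Q-table: unseen states start with all-zero action values.
q_table = defaultdict(lambda: [0.0] * n_actions)

def choose_action(state):
    if random.random() < epsilon:
        return random.randrange(n_actions)      # explore
    values = q_table[state]
    return values.index(max(values))            # exploit

def update(state, action, reward, next_state):
    # Standard Q-learning update rule.
    best_next = max(q_table[next_state])
    q_table[state][action] += alpha * (reward + gamma * best_next - q_table[state][action])
```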

commit f263ba676c147d7d7aaa73e81245849c1cbaae2d
Author: james-aung <james.aung@c-openai.com>
Date:   Tue Feb 20 14:54:55 2024 +0000

    remove action and observation space counts from sample

commit e850311a692e725035331f5511d88ac596c9f35c
Author: Giulio Starace <giulio.starace@gmail.com>
Date:   Tue Feb 20 15:17:01 2024 +0100

    clarify task even more (#273)

commit c0416c152a93992e2f318b5b38a4fc56334dbb79
Author: Oliver Jaffe <oliver.jaffe@hotmail.co.uk>
Date:   Tue Feb 20 12:12:28 2024 +0000

    Mode Collapse - Main Experiment (#265)

    * skeleton eval

    * add solver prediction evaluation

    * add measuring probability of predicting moves

    * remove probability calculation

    * add eval yaml, add n_samples param

    * add dataset creation script

    * add gpt-3.5-turbo dataset

    * add other model datasets

    * fix failing test, update eval call

    * add documentation for variant rules

    * improve task desc, define rules of all pieces

    * simplify message construction

    Co-authored-by: Giulio Starace <giulio.starace@gmail.com>

    * rename prop -> proportion

    * strip now handled by postprocessors

    * replace dataset creation notebook with script

    * update datasets

    * remove unused SolverEval arg

    * pass jsonl_dir rather than relying on global args

    * get_model_predictions returns rather than dumping

    * improve documentation, make clear which rules of chess apply

    * default args as False

    * fix typo

    * changed large list to set

    * rename n_samples -> n_special_moves

    * fix previous move filtering logic

    ---------

    Co-authored-by: Giulio Starace <giulio.starace@gmail.com>

commit 2814496a01da52c1c365741134d58e15fc2c8b24
Author: Chan Jun Shern <JunShern@users.noreply.github.com>
Date:   Tue Feb 20 09:20:37 2024 +0000

    Update gpt-3.5-turbo-0125 to just gpt-3.5-turbo (#271)

commit 45aeb778dc5ec9637d8160e971bfe182b4a16fcb
Author: Giulio Starace <giulio.starace@gmail.com>
Date:   Mon Feb 19 17:03:34 2024 +0100

    Coherence - Task description adjustments (#268)

    * remove notion of distractors from track_the_stat

    * take a word from the distractor question to use in the example transcript

commit 27bd87811d0b46e07f9822fbd66dd14b0c5d31a0
Author: Giulio Starace <giulio.starace@gmail.com>
Date:   Mon Feb 19 15:04:42 2024 +0100

    Already said that - mvp (#259)

    * implement build message

    * implement build_base_task_message

    * implement build_distractor_question_message

    * implement parse_solver_output

    * minor fixes

    * fix old implementations; distractor also evaluated now

    * mark violations as mistakes

    * no longer need rng

    * fill out yaml

    * actually use self.task_description 🤦

    * clarify instructions

    * adjust weighting

    * fix parse solver output

    * clearer docstring

    * modularize a bit

    * track fp and fn rate

    * add adversarial flag

    * make it extra clear when a sample is for the main task

    * replace rectangles and next-val-series with find-letter and which-is-heavier

    * balance how often we show new vs old words

    * .strip() handled by postprocessor; .lower() handled if match

    * switch to dict

    * use oneliner

commit 2e5c35437484629327d8bd037c1fceecd2fa52ab
Author: James Aung <129281094+james-aung@users.noreply.github.com>
Date:   Mon Feb 19 07:49:42 2024 +0000

    Update evals/registry/solvers/incontext_rl.yaml

    Co-authored-by: Chan Jun Shern <JunShern@users.noreply.github.com>

commit 7d71c63b5b44673488514e6ae88de81d3b70e21a
Author: Chan Jun Shern <JunShern@users.noreply.github.com>
Date:   Mon Feb 19 07:15:38 2024 +0000

    Add postprocessors for Solver outputs (#245)

    * basic postprocessors working

    * Add postprocessors to subclass constructors

    * add to defaults

    * Add test for combination of postprocs

    * Add docstrings

    * Change import from just classname to full_path:classname

    * Log postprocessor events

    * Add README

    * Fix missing recorder in tests
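
The `full_path:classname` import format mentioned above is a common Python pattern that importlib resolves directly. A minimal sketch:

```python
import importlib

def import_by_path(spec: str):
    """Resolve 'package.module:ClassName' to the named class."""
    module_path, _, class_name = spec.partition(":")
    module = importlib.import_module(module_path)
    return getattr(module, class_name)

# Example with a stdlib class:
OrderedDict = import_by_path("collections:OrderedDict")
```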

commit 5259ff8bb65c1668cdb8a7fba80d1e157a73e0e7
Author: james-aung <james.aung@c-openai.com>
Date:   Fri Feb 16 16:59:09 2024 +0000

    cleanup

commit edffa345885775aa792dc1d5b707e7d941c550cb
Author: james-aung <james.aung@c-openai.com>
Date:   Fri Feb 16 16:31:27 2024 +0000

    update qlearning baseline workbook

commit e4a77cda0db30b8d351082e1ed8702207848295a
Author: james-aung <james.aung@c-openai.com>
Date:   Fri Feb 16 15:51:52 2024 +0000

    attempt at qlearning solver

commit 26156168a2fcc64d9058ffe18c36edd8f67cbf05
Author: Giulio Starace <giulio.starace@gmail.com>
Date:   Fri Feb 16 15:33:36 2024 +0100

    Already said that (coherence) - Dataset (#254)

    * move data generation to scripts/ subdir

    * implement bulk of high-level of data generation

    * only missing distractor corpora

    * the distractors dont live in the dataset

    they already exist as their own separate datasets

    * move distractor stuff to its own module

    * num words as an arg

    * fix hanging

    * script and words dataset

    * make it executable

    * distractorsample dataclass

    * skeleton/scaffolding

    * missing samples_jsonl from yaml

    * wrong import

    * leverage evals.registry and evals.data

    * make it more robust

    * implement rectangles

    * implement next-val-series

    * implement ambig sentences

    * implement reverse word sort

    * make ideal just a string, not a set of strings

    * show distractor-specific example

    * add note about ifmain

    * remove TODO

    * assume running from repo root

    * add distractors tests

commit ecd011dff7d58cbd235fa7cd8c56707703fbbf5d
Author: james-aung <james.aung@c-openai.com>
Date:   Fri Feb 16 14:17:20 2024 +0000

    implement random solver

commit 58ff0b1bf49d1cea9b4c4c1db2101a3493fdee08
Author: james-aung <james.aung@c-openai.com>
Date:   Fri Feb 16 13:40:28 2024 +0000

    update yaml

commit 68377f4f259bfd1049d8ec78f1cc2b771b36ecd8
Author: james-aung <james.aung@c-openai.com>
Date:   Fri Feb 16 13:39:57 2024 +0000

    keep CurrentState up to date

commit efd57bd44306c6605060bb14e98400b5e238235e
Author: james-aung <james.aung@c-openai.com>
Date:   Fri Feb 16 13:17:46 2024 +0000

    update samples

commit 5d7fd725ada83c563a66f7db7d588e83293adb51
Author: james-aung <james.aung@c-openai.com>
Date:   Fri Feb 16 13:17:31 2024 +0000

    major refactor

commit e20bb6e6bb4f7d69a34e8b1e7e38b70f5bc4e06a
Author: james-aung <james.aung@c-openai.com>
Date:   Fri Feb 16 13:17:25 2024 +0000

    add more linebreaks to prompts

commit a10afc384b3a6a844481a8d6ac69e2502a6c4c9a
Merge: 6ef1e812 96a6bf34
Author: james-aung <james.aung@c-openai.com>
Date:   Fri Feb 16 11:41:38 2024 +0000

    Merge branch 'main' of github.com:openai/dangerous-capability-evaluations into james/icrl
    merge changes from main

commit 96a6bf3401baa78fdd676bad9dc1ee9648cf123d
Merge: e807f2dc d6e3e915
Author: James Aung <129281094+james-aung@users.noreply.github.com>
Date:   Fri Feb 16 11:41:32 2024 +0000

    Merge pull request #255 from openai/feature/20Questions

    Small improvements to human CLI usage

commit 6ef1e8122f7d190da3f6520c2d7ee70238ac4820
Merge: 12b6b191 e807f2dc
Author: james-aung <james.aung@c-openai.com>
Date:   Fri Feb 16 11:40:19 2024 +0000

    Merge branch 'main' of github.com:openai/dangerous-capability-evaluations into james/icrl
    merge main

commit e807f2dc4e85d3c8ea9c0207028ccb03caa06e25
Author: Giulio Starace <giulio.starace@gmail.com>
Date:   Fri Feb 16 12:05:34 2024 +0100

    simplify median state (#257)

commit 12b6b191985ec82512c2f0293a411124b266148e
Author: james-aung <james.aung@c-openai.com>
Date:   Fri Feb 16 10:59:53 2024 +0000

    Refactor environment setup and task description generation

commit 96442b5bc254f2d3e4f9ae463d99f0de601511cb
Author: james-aung <james.aung@c-openai.com>
Date:   Fri Feb 16 10:43:55 2024 +0000

    add baselines notebook

commit 471fc9be934eb12da52e76a358cfb0603d801027
Author: Oliver Jaffe <oliver.jaffe@hotmail.co.uk>
Date:   Fri Feb 16 09:32:19 2024 +0000

    Add gpt-3.5-turbo-instruct, remove trailing whitespace from completion prompts (#256)

    * support gpt-3.5-turbo-instruct completion model

    * remove trailing whitespace from end of prompt

    * correct spelling of completion model

    * use chat_to_prefixes for rendering completion text, but strip whitespace

    * add documentation about removing trailing whitespace

commit d6e3e915fcb9cb341538737f01d98103c30b050c
Author: Andrei Alexandru <inwaves@live.com>
Date:   Fri Feb 16 09:47:18 2024 +0200

    Small improvements to human CLI usage

commit c8dc96a9ddf13a11e5fa890896032ee446c8f868
Author: Andrei Alexandru <inwaves@users.noreply.github.com>
Date:   Fri Feb 16 09:00:31 2024 +0200

    20 questions: add shortlist variant (#253)

    * Add shortlist variant

    * Add a shortlist variant for the 'full' spec in the YAML file

    * Address feedback for shortlist variant

    * Replace how we extract the guess with regular expression

    * Address feedback for regex

commit ca4773f0e624f66a0a8c15202b4b2328aea7d3a5
Author: Chan Jun Shern <JunShern@users.noreply.github.com>
Date:   Thu Feb 15 17:11:09 2024 +0000

    Suppress excessive 'HTTP/1.1 200 OK' logs from openai library (#252)

commit b9014652eec22e2ef2e5d4b540b10c175d3c3d31
Author: Giulio Starace <giulio.starace@gmail.com>
Date:   Thu Feb 15 14:13:54 2024 +0100

    Coherence - Already said that distractors eval - Skeleton (#251)

    * scaffolding

    * further scaffolding

    * [wip] more skeleton

    * cleaner skeleton

    * more skeleton, add utils

    * fix name

    * implement agg metrics

    * set constant for num turns

    * add docstrings

    * add mocks

    * fix flow to enable tracking perf on distractor task

    * rename violation to violation_occurred

    * move convo loop to its own func

    * make elif more readable

    * remove redundant empty messages

    * address constants

commit b4875c129c61b35aea0d1cf7c3d1cac262c4654d
Author: Andrei Alexandru <inwaves@users.noreply.github.com>
Date:   Thu Feb 15 10:11:25 2024 +0200

    20 questions: adding features I (#250)

    * Add features: logit bias, 'skip' option, handling incorrect guesses and rule violations

    * Add counter for gamemaster 'skip' replies

    * Remove finished TODOs
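
The logit-bias feature in the first bullet presumably nudges the gamemaster towards clean yes/no replies. A sketch of how logit bias is passed to the OpenAI chat API, with an illustrative bias value and token set rather than the eval's actual configuration:

```python
import tiktoken
from openai import OpenAI

enc = tiktoken.encoding_for_model("gpt-4")

# Bias the 'Yes'/'No' tokens, with and without a leading space.
bias = {tok: 10 for word in ("Yes", "No", " Yes", " No") for tok in enc.encode(word)}

client = OpenAI()
response = client.chat.completions.create(
    model="gpt-4",
    messages=[{"role": "user", "content": "Is it a living thing?"}],
    logit_bias=bias,  # token id -> bias in [-100, 100]
)
print(response.choices[0].message.content)
```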

commit b97ec783ec75483148f63c5e5c109fba7e9430c9
Author: james-aung <james.aung@c-openai.com>
Date:   Wed Feb 14 15:37:35 2024 +0000

    calculate rolling reward

commit 8797064be11d6fda3ab37dc529e81fb409321729
Author: james-aung <james.aung@c-openai.com>
Date:   Wed Feb 14 14:12:12 2024 +0000

    track at which steps episodes end

commit 1b7661bea9c41902b5575a4ce05c8a13b338b535
Author: james-aung <james.aung@c-openai.com>
Date:   Wed Feb 14 14:04:44 2024 +0000

    reintroduce backup max_steps

commit 2466a1df070f7bbec90c26624692344bd6ea699c
Author: james-aung <james.aung@c-openai.com>
Date:   Wed Feb 14 13:58:14 2024 +0000

    sample now runs until context limit is reached

commit 557ddddb33914a6e5e778d684b9a76f5a014c8ea
Author: james-aung <james.aung@c-openai.com>
Date:   Wed Feb 14 12:50:38 2024 +0000

    remove n_trials concept

commit 5ee3fff6f1f04b535058da26de342401af1c6726
Author: james-aung <james.aung@c-openai.com>
Date:   Wed Feb 14 12:05:53 2024 +0000

    adjust task description based on feedback

commit a236da4c8bd8924343b1818d261951349c02b365
Author: james-aung <james.aung@c-openai.com>
Date:   Tue Feb 13 17:30:12 2024 +0000

    tidy up

commit 9a5a6f27a49e608df693c0631c0788d6744e2966
Author: James Aung <129281094+james-aung@users.noreply.github.com>
Date:   Tue Feb 13 16:59:42 2024 +0000

    Update evals/elsuite/incontext_rl/requirements.txt

commit 2f2609043b5293ffdc6fe8e47be33c2c71b4ae97
Author: Oliver Jaffe <oliver.jaffe@hotmail.co.uk>
Date:   Tue Feb 13 12:34:08 2024 +0000

    Mode Collapse - Chess Framework (#247)

    * chess game skeleton logic

    * implemented algebraic notation parser

    * logic for moving pieces

    * prevent moves that threaten the king

    * add support for castling moves

    * add support for en passant moves

    * add support for pawn double moves

    * fix pawn moves, don't allow them to capture forwards

    * add support for promotion moves

    * add method to find piece on board

    * compute normal moves before promotions, otherwise no promotions are possible

    * add testing for our framework vs. python-chess

    * verbose option for test

    * add missing typing

    * skip board testing to stop CI failing

    * prevent pawns promoting to kings

    * simplify notation parsing. Assume full start and end position is given

    * rename notation function to intended name

    * rename board function to more descriptive name

    * rename possible_moves -> possible_transformations, since wasn't using Move object

    * move import within test to prevent CI failing

    * Piece object now calculates its possible moves

    * add file containing move variants

    * rename pieces -> piece_id_to_instance

    * add running test upon executing file. rename testing vars

    * notation parser no longer requires player_id

    * rename notation parser functions to reflect that move object is being processed

    * removed unused initial_board and _find_pieces func

    * add documentation for _update_board

    * remove unnecessary validation of move

    * split some get_piece_moves logic into separate function

    * rename functions to indicate if transformation or move is returned

    * refactor to avoid circular imports. now Board isn't passed to notation parser or Piece

    * add documentation to board classes

    * nit: update typing
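
With the simplification above, where moves always spell out the full start and end squares, the notation parser reduces to coordinate arithmetic. An illustrative sketch (promotion suffixes omitted):

```python
def parse_square(square: str) -> tuple[int, int]:
    # 'e2' -> (4, 1): files 'a'..'h' map to 0..7, ranks '1'..'8' to 0..7.
    return ord(square[0]) - ord("a"), int(square[1]) - 1

def parse_move(move: str) -> tuple[tuple[int, int], tuple[int, int]]:
    """Parse a move written as start and end squares, e.g. 'e2e4'."""
    if len(move) < 4:
        raise ValueError(f"incorrect notation: {move!r}")
    return parse_square(move[:2]), parse_square(move[2:4])
```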

commit 36c7faac0ab15e53e523f14d801fef1306f7ce99
Author: Giulio Starace <giulio.starace@gmail.com>
Date:   Tue Feb 13 11:27:25 2024 +0100

    Coherence - Explicit State baseline solver (#249)

    * implement eval side

    * fix serialization issue

    * make it a dict explicitly

    * implement explicitstate solver

    * idk how this crept back in

    * rename to track the stat

    * fix imports and ids

    * move helper functions to utils.py

    * keep logger, not logging

commit b10f23c436b64562068c3c880dbd3494ece7da95
Author: Giulio Starace <giulio.starace@gmail.com>
Date:   Tue Feb 13 10:53:35 2024 +0100

    Coherence - Implicit State Tracking (#246)

    * implement turns, add task desc

    * append messages

    * implement task_fn and parse_solver_output

    * singleturn was missing Message construction

    * fill in prompts

    * remove fulfilled todos

    * 500 samples

    * avoid info logs from httpx

    * integrate singleturn task desc

    * use stateless solver to avoid state leak in singleturn

    * make a submodule for the prompts

    * further modularize prompts

    * remove extra space in warning

    * fix list rendering in singleturn example

    * more metrics

    * switch to arithmetic-mean median

    * strip notion of singleturn/multiturn dichotomy

    * fix serialization issue

    * rename this to implicit state tracking

    * get existing logger and use that instead of logging

    * add note to todo

    * rename to track_the_stat

    * finish renaming
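
For the tracked statistic, the "arithmetic-mean median" presumably means averaging the two middle values on even-length sequences; a sketch (equivalent to statistics.median):

```python
def median(values: list[float]) -> float:
    s = sorted(values)
    mid = len(s) // 2
    if len(s) % 2 == 1:
        return s[mid]
    # Even length: arithmetic mean of the two middle values.
    return (s[mid - 1] + s[mid]) / 2
```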

commit 066a6bb011e50041e93b72ecf82cb9d6446e691a
Author: Andrei Alexandru <inwaves@users.noreply.github.com>
Date:   Mon Feb 12 10:47:34 2024 +0200

    Initial PR for 20 questions (#240)

    * Twenty questions MVP

    * Refactor to conversation loop

    * Fix tests to use Message class

    * Some questions have clarifications after the ?, like (eg.)

    * Log word complexity metric

    * Tweak guesser prompt

    * Add lexicon from Maddela and Xu, 2018

    * Add dev5, remove completed TODOs

    * Add score hint in guesser prompt

    * Add flexible max score to prompt

    * Consistent naming for max_questions

    * Move from accuracy to score

    * Add scoring, TODOs, fix dev5 num_samples

    * Remove test file, it's no longer needed

    * Shuffle before we sample

    * Move from specific gpt-4 version to -turbo alias

    * Add fall-back condition

commit f525f70c3f55533c9a328feef3fa69e59c05a03f
Author: Chan Jun Shern <JunShern@users.noreply.github.com>
Date:   Mon Feb 12 08:41:13 2024 +0000

    Add "System: " prefix when rendering system messages for base model (#244)

    * Modify base model msg rendering; OAISolver uses 'System:' prefix

    * Fix failing tests

commit d99e6c46e512ab306b800050692a041f4179eb0c
Author: james-aung <james.aung@c-openai.com>
Date:   Sun Feb 11 20:19:35 2024 +0000

    change to 50 steps

commit c71cb3a6b096a15612cdc54b6e5254c73d1fcdd0
Author: james-aung <james.aung@c-openai.com>
Date:   Sun Feb 11 19:22:26 2024 +0000

    adjust variants

commit c82ee84b33f4ac821275d00c169674075eca4642
Author: james-aung <james.aung@c-openai.com>
Date:   Sun Feb 11 19:21:49 2024 +0000

    Cast cumulative_reward to float in InContextRl eval

commit 9dd1ea31720e973f73ff9f3ae1498b835cda26a7
Author: james-aung <james.aung@c-openai.com>
Date:   Sun Feb 11 18:36:44 2024 +0000

    fixed dataset by removing duplicate sample

commit e17e8e87c4d93e8268d526708dee6a137a04720e
Author: james-aung <james.aung@c-openai.com>
Date:   Sun Feb 11 18:19:03 2024 +0000

    split out an explanations variant

commit a0e2278fc1988b9eae4d483a9b4bf1b5fbe90bf9
Author: james-aung <james.aung@c-openai.com>
Date:   Sun Feb 11 18:18:37 2024 +0000

    add cliffwalking and 10armed bandit envs

commit d5725e76a69cd141e684619212dda0b8444d3193
Author: james-aung <james.aung@c-openai.com>
Date:   Sun Feb 11 18:18:10 2024 +0000

    pass through explained observation to step messages

commit 20268cbc85977238aeeda3a24a87a80af4a54fd9
Author: james-aung <james.aung@c-openai.com>
Date:   Sun Feb 11 18:17:43 2024 +0000

    update prompts

commit 8a399ea54121b6f592aae5344a7d1e6347c09992
Author: james-aung <james.aung@c-openai.com>
Date:   Sat Feb 10 19:33:45 2024 +0000

    Update to 60 steps

commit f99a6130366fda51dcb45f3e1703666e1a688dc6
Author: james-aung <james.aung@c-openai.com>
Date:   Sat Feb 10 19:33:32 2024 +0000

    Suppress annoying logging

commit 13c05dc5b4956e10d26c3947510974d39ed1aaf5
Author: james-aung <james.aung@c-openai.com>
Date:   Sat Feb 10 19:11:12 2024 +0000

    cleanup

commit ea7d93f875e154c64ed0b12f800eb7a1bba6cea5
Author: james-aung <james.aung@c-openai.com>
Date:   Sat Feb 10 15:12:36 2024 +0000

    add support for multiple trials per env

commit 8dd4ebe98b0795a611f95a6501d799fa6c414bd8
Author: james-aung <james.aung@c-openai.com>
Date:   Sat Feb 10 15:12:18 2024 +0000

    toggle explanations on

commit 300902e1233318deda0b467bb979ab2d98fa4ce3
Author: james-aung <james.aung@c-openai.com>
Date:   Sat Feb 10 15:12:06 2024 +0000

    update samples

commit e1a5499394394fb6a62ab4c4759f5ee024598adb
Author: james-aung <james.aung@c-openai.com>
Date:   Sat Feb 10 14:05:10 2024 +0000

    Update incontext_rl.yaml with new n_steps value

commit 1a773de8876be16e5f9715f39adb1eaa1563b5c1
Author: james-aung <james.aung@c-openai.com>
Date:   Sat Feb 10 14:05:04 2024 +0000

    add explanations to dataset

commit d9dd017a75eec5e387624ae00323d10130393325
Author: james-aung <james.aung@c-openai.com>
Date:   Sat Feb 10 14:04:57 2024 +0000

    allow for explanations

commit bfd80b7f864d8388e17798c0822c3b68867020f1
Author: james-aung <james.aung@c-openai.com>
Date:   Sat Feb 10 13:57:08 2024 +0000

    fix spacing

commit 7188886ec34ceb57f84e362400019786554eb775
Author: james-aung <james.aung@c-openai.com>
Date:   Sat Feb 10 13:47:52 2024 +0000

    add action parsing

commit 678957f66d2253d0740c2bffa111a66a19d45867
Author: james-aung <james.aung@c-openai.com>
Date:   Sat Feb 10 13:47:42 2024 +0000

    remove old notebook

commit 21132e552a46466efd5f97ca6aac5a3c5cbccf2a
Author…