
Add customizable evaluation dimensions #256

Merged
merged 10 commits into from
Dec 8, 2024

Conversation

@bugsz (Contributor) commented Nov 28, 2024

This PR provides a way for people to customize the evaluation dimensions they want.
Currently, CustomEvaluationDimension specifies a single dimension, and CustomEvaluationDimensionList groups dimensions together.
To create a dimension, one can either use a dictionary directly or compose existing metrics by specifying their names.
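A minimal sketch of the two creation paths described above — a dimension defined from a plain dictionary, and a list that groups metrics by name. Only `range_low` appears in this PR's diff; the other field names (`name`, `description`, `range_high`) and the `dimension_names` key are assumptions for illustration:

```python
# Hypothetical sketch (not the sotopia API): field names other than
# range_low, which appears in this PR's diff, are assumptions.
transparency = {
    "name": "transparency",
    "description": "Rate how openly the agent shares its intentions.",
    "range_low": 1,
    "range_high": 10,
}

# Grouping: a list model could reference existing metrics purely by name.
custom_list = {
    "name": "my_dimensions",
    "dimension_names": [transparency["name"], "goal"],
}

print(custom_list["dimension_names"])
```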

TODOs:

  • Move the file to evaluator.py
  • Figure out the best practice for storing and using these dimensions (is specifying them by name the best way?)

Closes #

📑 Description

✅ Checks

  • My pull request adheres to the code style of this project
  • My code requires changes to the documentation
  • I have updated the documentation as required
  • All the tests have passed
  • Branch name follows type/descript (e.g. feature/add-llm-agents)
  • Ready for code review

ℹ Additional Information

codecov bot commented Nov 28, 2024

Codecov Report

Attention: Patch coverage is 38.09524% with 39 lines in your changes missing coverage. Please review.

Project coverage is 35.94%. Comparing base (dbd8294) to head (6155292).
Report is 1 commit behind head on main.

| Files with missing lines | Patch % | Lines |
| --- | --- | --- |
| sotopia/database/evaluation_dimensions.py | 37.09% | 39 Missing ⚠️ |

❗ There is a different number of reports uploaded between BASE (dbd8294) and HEAD (6155292): HEAD has 1 upload less than BASE.

| Flag | BASE (dbd8294) | HEAD (6155292) |
| --- | --- | --- |
|  | 3 | 2 |
@@             Coverage Diff             @@
##             main     #256       +/-   ##
===========================================
- Coverage   74.47%   35.94%   -38.54%     
===========================================
  Files          61       50       -11     
  Lines        3162     2557      -605     
===========================================
- Hits         2355      919     -1436     
- Misses        807     1638      +831     
| Files with missing lines | Coverage Δ |
| --- | --- |
| sotopia/database/__init__.py | 95.45% <100.00%> (-4.55%) ⬇️ |
| sotopia/database/evaluation_dimensions.py | 37.09% <37.09%> (ø) |

... and 32 files with indirect coverage changes

@XuhuiZhou (Member) left a comment:

Great job, @bugsz can you also add relevant docs?

examples/experiment_eval.py (outdated; resolved)
return validator

@staticmethod
def generate_dimension_model(dimension_ids: list[str]) -> Type[BaseModel]:
Member:
what is this function used for?

Contributor (author):
create a validator for the evaluation metric?

Member:
why name it generate, then? Also, consider adding a docstring explanation here?
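Per the reply in this thread, the function creates a validator; the diff shows it returns a pydantic model type (`-> Type[BaseModel]`). A minimal stand-alone sketch of the range-checking idea — the dimension names and ranges are hypothetical, and this stand-in returns a plain function rather than a pydantic model:

```python
# Hypothetical sketch, not the sotopia implementation: build a validator from
# a set of dimensions, where each dimension maps a name to its
# (range_low, range_high) bounds.
def generate_dimension_model(dimensions):
    def validate(scores):
        for name, (low, high) in dimensions.items():
            value = scores[name]
            if not low <= value <= high:
                raise ValueError(f"{name}={value} is outside [{low}, {high}]")
        return scores
    return validate

validate = generate_dimension_model({"goal": (1, 10)})
validate({"goal": 7})  # within range, passes
```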

)

@staticmethod
def generate_dimension_model_from_name(
Member:
Rename this? Also, can we get rid of the printing here? And add the existing names to the doc?

sotopia/database/evaluation_dimensions.py (two outdated threads; resolved)
range_low: int = Field(index=True)


class CustomEvaluationDimensionList(JsonModel):
Member:
Can we use this as a model to save a set of dimensions? E.g., given the name sotopia, it would automatically retrieve all the original sotopia dimensions, ready to use.

Contributor (author):
Yes that is what I am thinking. Do we want to allow different evaluation metrics to have the same name?

Member:
What do you mean? E.g., ?

Contributor (author):
For example, there could be an original sotopia dimension and a refined one with the same name, say:
[old] goal: provide a goal score of 1-10, where a higher score indicates higher completion
[new] goal: provide a goal score of 1-10, where 1-3: xxx, 4-6: yyy

I think this is something we do not want to see, but sometimes we might need to have these two at the same time?
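One way the duplicate-name question could be resolved is to enforce uniqueness at registration time; a hypothetical sketch (not sotopia code — allowing both the [old] and [new] "goal" variants would instead require versioning or namespacing the name):

```python
# Hypothetical registry illustrating the trade-off discussed above:
# reject duplicate dimension names outright.
def register_dimension(registry, dimension):
    name = dimension["name"]
    if name in registry:
        raise ValueError(f"dimension {name!r} already exists")
    registry[name] = dimension
    return registry

registry = {}
register_dimension(registry, {"name": "goal", "description": "old goal prompt"})
```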

@ProKil (Member) commented Dec 4, 2024

@bugsz Could you fix the mypy tests first?

@bugsz (Contributor, author) commented Dec 4, 2024

> @bugsz Could you fix the mypy tests first?

Fixed. Will update the doc

@XuhuiZhou (Member) left a comment:

  • Add pytests, and be sure to only merge into /demo instead of /main

  • Make sure to update the API doc?

  • This change would break a lot of places!! Please update them accordingly (basically, check anywhere mentioning ReachGoalLLMEvaluator?)
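Per the last bullet, one way to audit the breakage is a repository-wide search for the evaluator name; a minimal shell sketch (the temporary directory and sample file are illustrative stand-ins for a real checkout):

```shell
# Create an illustrative file, then search it the way one would audit a
# real checkout for every mention of ReachGoalLLMEvaluator.
mkdir -p /tmp/sotopia_audit
cat > /tmp/sotopia_audit/sample.py <<'EOF'
evaluator = ReachGoalLLMEvaluator(model_name="gpt-4o")
EOF
grep -rln "ReachGoalLLMEvaluator" /tmp/sotopia_audit
```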


return validator

@staticmethod
def generate_dimension_model(dimension_ids: list[str]) -> Type[BaseModel]:

Member:
remove this then?

@@ -108,6 +108,20 @@ def _iterate_env_agent_combo_not_in_db(
env_ids: list[str] = [],
tag: str | None = None,
) -> Generator[EnvAgentCombo[Observation, AgentAction], None, None]:
# method 1 for loading evaluation metric
evaluation_dimensions = (
Member:
It is not a good idea to write comments like this (remove "method 1"?). Move it as an example to the doc?

@XuhuiZhou XuhuiZhou changed the base branch from main to demo December 7, 2024 22:14
@@ -0,0 +1,92 @@
## Overview

Evaluation dimensions are used to evaluate the quality of social interactions.
Member:

Make sure to mention that they can use SotopiaDimension here as well, and that people don't need to initialize the database when using it.

Member:
Either move the content from docs/pages/concepts/evaluation_dimension.md to here, or remove this file?

Member:
And don't forget to update the API doc? You can use ChatGPT to do that for you, but it's good that we have it.

)
custom_dimension.save()
pk = custom_dimension.pk
dimension = CustomEvaluationDimension(uuid_str=pk)
Member:

CustomEvaluationDimension(uuid_str=pk) is not how you fetch data from the Redis database.
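The later fix commit in this PR switches the test to `CustomEvaluationDimension.get(pk)`; a minimal in-memory stand-in of that save/get contract (`FakeDimension` is hypothetical and only mimics the round trip, not the sotopia class or redis_om):

```python
# redis_om JsonModel subclasses are fetched with Model.get(pk), not by
# passing the pk back into the constructor. FakeDimension is a hypothetical
# in-memory stand-in that mimics only the save()/get(pk) round trip.
class FakeDimension:
    _store = {}

    def __init__(self, **fields):
        self.__dict__.update(fields)
        self.pk = f"pk-{len(FakeDimension._store)}"

    def save(self):
        FakeDimension._store[self.pk] = self
        return self

    @classmethod
    def get(cls, pk):
        return cls._store[pk]  # raises KeyError if the pk was never saved

custom_dimension = FakeDimension(name="transparency", range_low=1, range_high=10)
custom_dimension.save()
dimension = FakeDimension.get(custom_dimension.pk)  # correct fetch pattern
```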

XuhuiZhou and others added 3 commits December 8, 2024 15:02
…(pk) (#262)

Co-authored-by: openhands <openhands@all-hands.dev>
* Fix test_create_custom_dimension to use CustomEvaluationDimension.get(pk)

* Update documentation for SotopiaDimension and EvaluationDimensionBuilder

* [autofix.ci] apply automated fixes

* Add API documentation for evaluation dimensions

* Refine API documentation for evaluation_dimensions.py to match style

* [autofix.ci] apply automated fixes

---------

Co-authored-by: openhands <openhands@all-hands.dev>
Co-authored-by: autofix-ci[bot] <114827586+autofix-ci[bot]@users.noreply.github.com>
@XuhuiZhou XuhuiZhou changed the base branch from demo to main December 8, 2024 20:43
@XuhuiZhou XuhuiZhou changed the base branch from main to demo December 8, 2024 20:49
@XuhuiZhou XuhuiZhou merged commit 5a9f4b7 into demo Dec 8, 2024
1 check passed