Add customizable evaluation dimensions #256
Conversation
Codecov Report

Attention: Patch coverage is

```
@@            Coverage Diff            @@
##             main     #256       +/-   ##
===========================================
- Coverage   74.47%   35.94%   -38.54%
===========================================
  Files          61       50       -11
  Lines        3162     2557      -605
===========================================
- Hits         2355      919     -1436
- Misses        807     1638      +831
```
Great job, @bugsz! Can you also add relevant docs?
```python
        return validator

    @staticmethod
    def generate_dimension_model(dimension_ids: list[str]) -> Type[BaseModel]:
```
What is this function used for?
Create a validator for the evaluation metric?
Why name it `generate`, then?
Also consider adding a docstring explanation here?
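For context, a builder like this typically constructs a pydantic model at runtime. Below is a minimal sketch of what `generate_dimension_model` might do, assuming each dimension is scored as a (reasoning, score) tuple and fetched from Redis by primary key; the import path and the field names (`name`, `description`) are assumptions, not the PR's exact code:

```python
from typing import Type

from pydantic import BaseModel, Field, create_model

from sotopia.database import CustomEvaluationDimension  # import path assumed


def generate_dimension_model(dimension_ids: list[str]) -> Type[BaseModel]:
    """Build a pydantic model whose fields are the requested dimensions."""
    fields: dict = {}
    for pk in dimension_ids:
        dim = CustomEvaluationDimension.get(pk)  # fetch stored dimension by pk
        # Each dimension becomes a (reasoning, score) tuple field; the
        # description tells the LLM judge how to score it.
        fields[dim.name] = (
            tuple[str, int],
            Field(..., description=dim.description),
        )
    return create_model("EvaluationDimensions", **fields)
```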
```python
        )

    @staticmethod
    def generate_dimension_model_from_name(
```
rename this? Also can we get rid of the printing here? And add the existing names to the doc?
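Continuing the sketch above, a lookup-by-name variant could use redis-om's indexed query and raise on a miss instead of printing (the `.find(...)` query API is standard redis-om; everything else is assumed):

```python
from typing import Type

from pydantic import BaseModel


def generate_dimension_model_from_name(
    dimension_names: list[str],
) -> Type[BaseModel]:
    # Resolve each name to a stored dimension; raise instead of printing
    # so callers get a clear error for unknown names.
    dimension_ids: list[str] = []
    for name in dimension_names:
        matches = CustomEvaluationDimension.find(
            CustomEvaluationDimension.name == name
        ).all()
        if not matches:
            raise ValueError(f"Unknown evaluation dimension: {name!r}")
        dimension_ids.append(matches[0].pk)
    return generate_dimension_model(dimension_ids)
```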
```python
    range_low: int = Field(index=True)


class CustomEvaluationDimensionList(JsonModel):
```
Can we use this as a model to save a set of dimensions? E.g., given the name sotopia, it automatically retrieves all the original Sotopia dimensions and is ready to use.
Yes that is what I am thinking. Do we want to allow different evaluation metrics to have the same name?
What do you mean? E.g., ?
For example, there is an original sotopia dimension and a refined one, say:
[old] goal: provide a goal score of 1-10, where a higher score indicates higher completion
[new] goal: provide a goal score of 1-10, where 1-3: xxx, 4-6: yyy
I think this is something we do not want to see, but sometimes we might need to have these two at the same time?
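One way to make that trade-off explicit (a sketch of a possible guard, not the PR's actual behavior): refuse duplicate names at save time unless the caller opts in.

```python
def save_dimension(
    dimension: CustomEvaluationDimension, allow_duplicate: bool = False
) -> None:
    # Refuse a second dimension with an existing name unless explicitly
    # allowed, e.g. keeping an old and a refined "goal" side by side.
    existing = CustomEvaluationDimension.find(
        CustomEvaluationDimension.name == dimension.name
    ).all()
    if existing and not allow_duplicate:
        raise ValueError(
            f"A dimension named {dimension.name!r} already exists; "
            "pass allow_duplicate=True to keep both."
        )
    dimension.save()
```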
@bugsz Could you fix the mypy tests first?
Fixed. Will update the doc.
- Add pytests and be sure to only merge to /demo instead of /main
- Make sure to update the API doc?
- This change would break a lot of places!! Please update them accordingly (basically check anywhere mentioning ReachGoalLLMEvaluator?)
docs/pages/concepts/evaluation.mdx (Outdated)
Remove this then?
examples/experiment_eval.py (Outdated)
```
@@ -108,6 +108,20 @@ def _iterate_env_agent_combo_not_in_db(
    env_ids: list[str] = [],
    tag: str | None = None,
) -> Generator[EnvAgentCombo[Observation, AgentAction], None, None]:
    # method 1 for loading evaluation metric
    evaluation_dimensions = (
```
It is not a good idea to write comments like this (remove "method 1"?). Move it as an example to the doc?
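As a doc example, the loading code could be shown roughly like this (a sketch; `EvaluationDimensionBuilder` appears in the PR's commit messages, but the method name and signature here are assumptions):

```python
from sotopia.database import EvaluationDimensionBuilder  # import path assumed

# Build the evaluation model from a stored dimension list, by list name
# (the list name "sotopia" is illustrative).
evaluation_dimensions = (
    EvaluationDimensionBuilder.select_existing_dimension_model_by_list_name(
        "sotopia"
    )
)
```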
```
@@ -0,0 +1,92 @@
## Overview

Evaluation dimensions are used to evaluate the quality of social interactions.
```
Make sure to mention that they can use SotopiaDimension here as well, and that people don't need to initialize the database when using it.
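For instance, the doc could include something like the following, assuming the built-in dimensions are exported as `SotopiaDimensions` from `sotopia.envs.evaluators` (they are plain pydantic models, so no Redis setup is required):

```python
from sotopia.envs.evaluators import (
    EvaluationForTwoAgents,
    ReachGoalLLMEvaluator,
    SotopiaDimensions,
)

# Built-in dimensions are defined in code, so no database initialization
# is needed before constructing the evaluator.
terminal_evaluator = ReachGoalLLMEvaluator(
    "gpt-4o",
    EvaluationForTwoAgents[SotopiaDimensions],
)
```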
docs/pages/concepts/evaluation.mdx (Outdated)
Either move the content from docs/pages/concepts/evaluation_dimension.md to here, or remove this file?
And don't forget to update the API doc? You can use ChatGPT to do that for ya, but it's good that we have it.
tests/database/test_database.py (Outdated)
```python
    )
    custom_dimension.save()
    pk = custom_dimension.pk
    dimension = CustomEvaluationDimension(uuid_str=pk)
```
`CustomEvaluationDimension(uuid_str=pk)` is not how you fetch data from the Redis database.
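The follow-up commit (#262) switches the test to `CustomEvaluationDimension.get(pk)`, redis-om's fetch-by-primary-key call; roughly:

```python
custom_dimension.save()
pk = custom_dimension.pk
# Fetch the saved record back from Redis by primary key.
dimension = CustomEvaluationDimension.get(pk)
assert dimension.name == custom_dimension.name
```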
Fix test_create_custom_dimension to use CustomEvaluationDimension.get(pk) (#262)

* Fix test_create_custom_dimension to use CustomEvaluationDimension.get(pk)
* Update documentation for SotopiaDimension and EvaluationDimensionBuilder
* [autofix.ci] apply automated fixes
* Add API documentation for evaluation dimensions
* Refine API documentation for evaluation_dimensions.py to match style
* [autofix.ci] apply automated fixes

Co-authored-by: openhands <openhands@all-hands.dev>
Co-authored-by: autofix-ci[bot] <114827586+autofix-ci[bot]@users.noreply.github.com>
Provide a way for people to customize the evaluation dimensions they want. Currently we use `CustomEvaluationDimension` to specify a dimension, and `CustomEvaluationDimensionList` to group them. To create a dimension, one can directly use a dictionary, or compose the existing metrics by specifying their names.
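A sketch of both creation paths; `range_low` comes from the diff above, while `range_high`, `name`, `description`, and the import paths are assumptions:

```python
from sotopia.database import (  # import paths assumed
    CustomEvaluationDimension,
    EvaluationDimensionBuilder,
)

# Path 1: create a dimension directly from a dictionary.
transactivity = CustomEvaluationDimension(
    **{
        "name": "transactivity",
        "description": "Rate how much the agents build on each other's ideas (0-10).",
        "range_low": 0,
        "range_high": 10,
    }
)
transactivity.save()

# Path 2: compose existing metrics by specifying their names.
EvalModel = EvaluationDimensionBuilder.generate_dimension_model_from_name(
    ["transactivity", "goal"]
)
```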
TODOs:
evaluator.py
Closes #

📑 Description

✅ Checks

type/descript (e.g. feature/add-llm-agents)

ℹ Additional Information