Proposal: Add free-form data fields to Instance #3057

Open
yifanmai opened this issue Oct 10, 2024 · 1 comment

yifanmai commented Oct 10, 2024

Background

The current schema for Instance is very limited: it allows only a single Input and a List[Reference]. This is appropriate for most question answering scenarios and traditional NLP tasks, but other scenarios need to store additional structured data for each instance.

Examples:

  • AIR-Bench, HarmBench, and SimpleSafetyTests need to store the taxonomy of categories. They currently put these in references.
  • BBQ needs to store whether the question is ambiguous or not. It currently puts this in the tags of every reference (the same "ambiguous" / "non-ambiguous" tag is duplicated across references).
  • XSTest needs to store whether the question is safe or not. It currently uses the last reference for this.
  • MedAlign in Medhelm #3038 needs to store the EHR context separately from the question, because it needs to truncate the EHR context in a special way before it is interpolated into the prompt.
  • MTBench, which is not yet implemented, needs to store the category of the instance, which is used to determine the prompt for the LLM-as-judge.

Proposal

Add the field data: Dict[str, Any] to Instance. The value of data can contain nested dictionaries and arrays, but its leaf values must be str or number values, and the whole object should be serializable to JSON.

Scenarios can use data however they wish, e.g. by setting data to {"category": ""} or {"ehr": "...", "question": "..."}.
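As a rough illustration of the proposal, the sketch below adds a data field to a simplified stand-in for Instance and checks the stated constraints. The Input, Reference, and Instance classes here are reduced placeholders rather than the real HELM definitions, and is_valid_data is a hypothetical helper. Rejecting bool and None leaves is a strict reading of "str and number" and is an assumption, not something the proposal settles.

```python
import json
from dataclasses import dataclass, field
from typing import Any, Dict, List


# Simplified stand-ins for the real classes (assumption: the real ones have more fields).
@dataclass(frozen=True)
class Input:
    text: str = ""


@dataclass(frozen=True)
class Reference:
    text: str
    tags: List[str] = field(default_factory=list)


@dataclass(frozen=True)
class Instance:
    input: Input
    references: List[Reference]
    # Proposed free-form field; defaults to an empty dict so existing scenarios are unaffected.
    data: Dict[str, Any] = field(default_factory=dict)


def is_valid_data(value: Any) -> bool:
    """Hypothetical check: only dicts with str keys, lists, and str/number leaves are allowed."""
    if isinstance(value, bool):
        # bool is a subclass of int in Python; excluded here under a strict reading (assumption).
        return False
    if isinstance(value, (str, int, float)):
        return True
    if isinstance(value, list):
        return all(is_valid_data(item) for item in value)
    if isinstance(value, dict):
        return all(isinstance(k, str) and is_valid_data(v) for k, v in value.items())
    return False


instance = Instance(
    input=Input(text="..."),
    references=[],
    data={"ehr": "...", "question": "..."},
)
```

With this shape, a scenario could validate data once at construction time and rely on json.dumps(instance.data) succeeding later.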

The frontend can traverse the data object and render a list of key-value pairs, where each key is the JSON path to the corresponding leaf value.
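The traversal could look roughly like the following. This is a Python sketch of the idea only, not the frontend code, and iter_json_paths is a hypothetical name. The "$.key[index]" path notation is also an assumption, since the proposal does not fix a path syntax.

```python
from typing import Any, Iterator, Tuple


def iter_json_paths(value: Any, path: str = "$") -> Iterator[Tuple[str, Any]]:
    """Yield (json_path, leaf_value) pairs for every leaf in a nested dict/list structure."""
    if isinstance(value, dict):
        for key, child in value.items():
            yield from iter_json_paths(child, f"{path}.{key}")
    elif isinstance(value, list):
        for index, child in enumerate(value):
            yield from iter_json_paths(child, f"{path}[{index}]")
    else:
        # Anything that is not a dict or list is treated as a leaf.
        yield path, value


pairs = list(iter_json_paths({"category": "", "labels": ["a", "b"], "meta": {"n": 1}}))
```

Each pair can then be rendered as one row, with the path as the key and the leaf as the value.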

Open Questions

  • Should there be two separate fields, data and metadata?
  • Should the field be named extra_data instead of data?

Alternatives Considered

  • Allow subclassing Input. The subclass can define new fields needed by the scenario. This allows better type-checking than the proposal. However, this doesn't work because subclasses of Input are not preserved after a round trip through JSON using dacite or cattrs: the extra fields defined in the subclass are dropped. We need to perform this round trip in various places, e.g. helm-summarize.
  • Make Instance itself a free-form dict. We have too many existing Instance field accesses, so this would require too many changes and potentially break other things.
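The round-trip problem with the subclassing alternative can be illustrated with the standard library alone. In the sketch below, EhrInput is a hypothetical subclass, and from_dict_ignoring_unknown_keys is a simplified stand-in for a deserializer that builds the declared type (Input) and ignores keys it does not know; it is not the dacite or cattrs API.

```python
from dataclasses import asdict, dataclass, fields
from typing import Any, Dict


@dataclass(frozen=True)
class Input:
    text: str = ""


@dataclass(frozen=True)
class EhrInput(Input):
    # Hypothetical scenario-specific subclass with one extra field.
    ehr: str = ""


def from_dict_ignoring_unknown_keys(data: Dict[str, Any]) -> Input:
    """Stand-in deserializer: builds the declared type and drops unknown keys."""
    known = {f.name for f in fields(Input)}
    return Input(**{k: v for k, v in data.items() if k in known})


original = EhrInput(text="...", ehr="...")
serialized = asdict(original)  # the extra field is still present in the dict
restored = from_dict_ignoring_unknown_keys(serialized)  # but it is lost on the way back
```

After the round trip, restored is a plain Input without the ehr field, which is the loss described in the first alternative.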
@yifanmai yifanmai changed the title Proposal: Add free-form data field to Instance Proposal: Add free-form data fields to Instance Oct 17, 2024
@yifanmai
Collaborator Author

MMLU and GPQA both need this because they both need to store an extra chain-of-thought annotation. See: #3088
