Proposal: Add free-form data fields to Instance #3057

yifanmai · 2024-10-10T17:52:17Z

Background

The current schema for Instance is very limited: it allows only a single Input and a List[Reference]. This is appropriate for most question answering scenarios and traditional NLP tasks, but other scenarios need to store additional structured data for each instance.

Examples:

AIR-Bench, HarmBench, and SimpleSafetyTests need to store the taxonomy of categories. They currently put these in references.
BBQ needs to store whether the question is ambiguous or not. It currently puts this in the tags of every reference (the same "ambiguous" / "non-ambiguous" tag is duplicated across references).
XSTest needs to store whether the question is safe or not. It currently uses the last reference for this.
MedAlign in MedHELM #3038 needs to store the EHR context separately from the question, because it needs to truncate the EHR context in a special way before it is interpolated into the prompt.
MTBench, which is not yet implemented, needs to store the category of the instance, which is used to determine the prompt for the LLM-as-judge.

Proposal

Add the field data: Dict[str, Any] to Instance. The value of data can contain nested dictionaries and arrays, but its leaf values must be str or number values, and the whole value should be serializable to JSON.

Scenarios can use data however they wish, e.g. setting data to {"category": ""} or {"ehr": "...", "question": "..."}.
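A minimal sketch of the proposed shape, for illustration only. The Input, Reference, and Instance classes here are simplified stand-ins (the real classes have more fields), and validate_data is a hypothetical helper that is not part of the proposal. Rejecting bool leaves is this sketch's strict reading of "str and number", not something the proposal states.

```python
import json
from dataclasses import dataclass, field
from typing import Any, Dict, List


@dataclass(frozen=True)
class Input:
    # Simplified stand-in for the real Input class.
    text: str = ""


@dataclass(frozen=True)
class Reference:
    # Simplified stand-in for the real Reference class.
    text: str
    tags: List[str] = field(default_factory=list)


@dataclass(frozen=True)
class Instance:
    input: Input
    references: List[Reference]
    # Proposed free-form field; whether to call it `data` is an open question.
    data: Dict[str, Any] = field(default_factory=dict)


def validate_data(value: Any) -> None:
    """Hypothetical helper: accept only dicts, lists, str, and numbers."""
    if isinstance(value, bool):
        # bool is a subclass of int in Python; this sketch rejects it (assumption).
        raise TypeError("bool leaves are not allowed in this sketch")
    if isinstance(value, (str, int, float)):
        return
    if isinstance(value, dict):
        for key, child in value.items():
            if not isinstance(key, str):
                raise TypeError(f"non-string key: {key!r}")
            validate_data(child)
        return
    if isinstance(value, list):
        for child in value:
            validate_data(child)
        return
    raise TypeError(f"unsupported value: {value!r}")


# Usage: a scenario attaches its own structured data to an instance.
instance = Instance(
    input=Input(text="..."),
    references=[],
    data={"ehr": "...", "question": "..."},
)
validate_data(instance.data)
# The data survives a JSON round trip unchanged.
assert json.loads(json.dumps(instance.data)) == instance.data
```

Keeping data as a plain dict gives up static type-checking of its contents, so a check like validate_data would be one way to catch non-serializable values early.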

The frontend can traverse the data object and render a list of key-value pairs, where each key is the JSON path of the corresponding leaf value.
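The traversal could look roughly like the sketch below. It is written in Python for consistency with the rest of this issue, although the frontend would implement the equivalent in its own language. The function name flatten_to_json_paths and the path notation ("$" root, dots for keys, brackets for array indices) are assumptions; the proposal does not fix a path syntax.

```python
from typing import Any, List, Tuple


def flatten_to_json_paths(value: Any, path: str = "$") -> List[Tuple[str, Any]]:
    """Return (path, leaf value) pairs in traversal order."""
    if isinstance(value, dict):
        pairs: List[Tuple[str, Any]] = []
        for key, child in value.items():
            pairs.extend(flatten_to_json_paths(child, f"{path}.{key}"))
        return pairs
    if isinstance(value, list):
        pairs = []
        for index, child in enumerate(value):
            pairs.extend(flatten_to_json_paths(child, f"{path}[{index}]"))
        return pairs
    # Anything else is treated as a leaf (str or number under the proposal).
    return [(path, value)]


data = {"category": "coding", "context": {"ehr": "...", "visits": [1, 2]}}
for json_path, leaf in flatten_to_json_paths(data):
    print(f"{json_path}: {leaf}")
# $.category: coding
# $.context.ehr: ...
# $.context.visits[0]: 1
# $.context.visits[1]: 2
```

In this sketch, empty dictionaries and arrays produce no rows, and keys that themselves contain dots would yield ambiguous paths; a real implementation would need to decide how to handle both cases.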

Open Questions

Should there be two separate fields, data and metadata?
Should the field be named extra_data instead of data?

Alternatives Considered

Allow subclassing Input. The subclass can define new fields needed by the scenario. This allows better type-checking than the proposal. However, this doesn't work because subclasses of Input are not preserved after a round trip to JSON using dacite or cattrs: the extra fields defined in the subclass are dropped. We need to perform this round trip in various places, e.g. helm-summarize.
Make Instance itself a free-form dict. We have too many existing Instance field accesses, so this would require too many changes and potentially break other things.
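To illustrate why the subclassing alternative loses data, here is a stdlib-only sketch. It does not call dacite or cattrs; instance_from_dict is a hypothetical stand-in that mimics a schema-driven loader, which constructs the declared field type (Input) rather than the runtime subclass. EhrInput and the simplified Input and Instance classes are also assumptions made for the example.

```python
from dataclasses import asdict, dataclass, fields
from typing import Any, Dict


@dataclass(frozen=True)
class Input:
    # Simplified stand-in for the real Input class.
    text: str = ""


@dataclass(frozen=True)
class EhrInput(Input):
    # Hypothetical scenario-specific subclass with an extra field.
    ehr: str = ""


@dataclass(frozen=True)
class Instance:
    input: Input  # the declared type is the base class


def instance_from_dict(raw: Dict[str, Any]) -> Instance:
    """Stand-in loader: rebuilds only what the declared types describe."""
    declared = {f.name for f in fields(Input)}
    input_kwargs = {k: v for k, v in raw["input"].items() if k in declared}
    return Instance(input=Input(**input_kwargs))


original = Instance(input=EhrInput(text="question", ehr="record"))
serialized = asdict(original)  # still contains the "ehr" key
restored = instance_from_dict(serialized)

print(type(restored.input).__name__)   # Input
print(hasattr(restored.input, "ehr"))  # False
```

A free-form dict field avoids this problem because its contents are carried through serialization as-is, with no type to reconstruct.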


yifanmai · 2024-10-24T00:34:11Z

MMLU and GPQA both need this, because each needs to store an extra chain-of-thought annotation. See: #3088

yifanmai changed the title ~~Proposal: Add free-form data field to Instance~~ Proposal: Add free-form data fields to Instance Oct 17, 2024

yifanmai mentioned this issue Oct 24, 2024

Add extra_data field to Instance #3094

Merged
