Background
The current schema for Instance is very limited: it allows a single Input and a List[Reference]. This is appropriate for most question answering scenarios and traditional NLP tasks, but other scenarios contain additional structured data for each instance.
Examples:
AIR-Bench, HarmBench, and SimpleSafetyTests need to store the taxonomy of categories. They currently put these in references.
BBQ needs to store whether the question is ambiguous or not. It currently puts this in the tags of every reference (the same "ambiguous" / "non-ambiguous" tag is duplicated across references).
XSTest needs to store whether the question is safe or not. It currently uses the last reference for this.
MedAlign in Medhelm #3038 needs to store the EHR context separately from the question, because it needs to truncate the EHR context in a special way before it is interpolated into the prompt.
MTBench, which is not yet implemented, needs to store the category of the instance, which is used to determine the prompt for the LLM-as-judge.
Proposal
Add the field data: Dict[str, Any] to Instance. The value of data can contain nested dictionaries and arrays, but its leaf values must be str or number values, and the whole object should be serializable to JSON.
Scenarios can use data however they wish, e.g. setting data to {"category": ""} or {"ehr": "...", "question": "..."}
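A minimal sketch of what the proposed field could look like. The Input, Reference, and Instance classes below are simplified stand-ins rather than HELM's actual definitions, and is_valid_data is a hypothetical helper. Rejecting bool is also an assumption, since the proposal only lists str and number as leaf types.

```python
import json
from dataclasses import dataclass, field
from typing import Any, Dict, List


@dataclass(frozen=True)
class Input:
    """Simplified stand-in for the real Input class."""
    text: str = ""


@dataclass(frozen=True)
class Reference:
    """Simplified stand-in for the real Reference class."""
    text: str
    tags: List[str] = field(default_factory=list)


def is_valid_data(value: Any) -> bool:
    """Hypothetical check: only str-keyed dicts, lists, and str/number leaves."""
    if isinstance(value, bool):
        # bool is a subclass of int in Python; excluding it is an assumption.
        return False
    if isinstance(value, (str, int, float)):
        return True
    if isinstance(value, list):
        return all(is_valid_data(v) for v in value)
    if isinstance(value, dict):
        return all(isinstance(k, str) and is_valid_data(v) for k, v in value.items())
    return False


@dataclass(frozen=True)
class Instance:
    input: Input
    references: List[Reference]
    # Proposed free-form field; defaults to empty so existing scenarios are unaffected.
    data: Dict[str, Any] = field(default_factory=dict)

    def __post_init__(self) -> None:
        if not is_valid_data(self.data):
            raise ValueError("data may only contain dicts, lists, str, and numbers")


# A scenario could then keep the EHR context apart from the question.
instance = Instance(
    input=Input(text="..."),
    references=[],
    data={"ehr": "...", "question": "..."},
)
serialized = json.dumps(instance.data)
```

Validating at construction time is one possible design; the check could equally live in a test or in the serialization step.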
The frontend can traverse the data object and render a list of key-value pairs, where each key is the JSON path to a leaf value.
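The traversal described above could be sketched as follows. It is written in Python for illustration only, since the real frontend would do this in its own language. The function name flatten_json_paths and the path syntax ("$" root, dotted keys, bracketed indices) are assumptions; the proposal does not fix a path format.

```python
from typing import Any, List, Tuple


def flatten_json_paths(value: Any, path: str = "$") -> List[Tuple[str, Any]]:
    """Return (path, leaf) pairs for every leaf value in a nested JSON-like object."""
    if isinstance(value, dict):
        pairs: List[Tuple[str, Any]] = []
        for key, child in value.items():
            pairs.extend(flatten_json_paths(child, f"{path}.{key}"))
        return pairs
    if isinstance(value, list):
        pairs = []
        for index, child in enumerate(value):
            pairs.extend(flatten_json_paths(child, f"{path}[{index}]"))
        return pairs
    # Anything else is treated as a leaf value.
    return [(path, value)]


rows = flatten_json_paths(
    {"category": "math", "turns": ["a", "b"], "meta": {"safe": 1}}
)
# rows == [("$.category", "math"), ("$.turns[0]", "a"),
#          ("$.turns[1]", "b"), ("$.meta.safe", 1)]
```

In this sketch, empty dictionaries and lists produce no rows, and keys that themselves contain dots would yield ambiguous paths, so a real implementation might need escaping.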
Open Questions
Should there be two separate fields, data and metadata?
Should the field be named extra_data instead of data?
Alternatives Considered
Allow subclassing Input. The subclass can define new fields needed by the scenario. This allows better type-checking than the proposal. However, this doesn't work because subclasses of Input are not preserved after a round trip to JSON using dacite or cattrs: the extra fields defined in the subclass are removed. We need to perform this round trip in various places, e.g. helm-summarize.
Make Instance itself a free-form dict. We have too many existing Instance field accesses, so this would require too many changes and potentially break other things.
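The round-trip problem with the subclassing alternative can be illustrated with a small standard-library sketch. It does not call dacite or cattrs; instance_from_dict is a hypothetical deserializer that rebuilds objects from the declared field type, which mimics the behavior described above. EHRInput and the simplified Input and Instance classes are also assumptions made for illustration.

```python
from dataclasses import asdict, dataclass, fields
from typing import Any, Dict


@dataclass(frozen=True)
class Input:
    """Simplified stand-in for the real Input class."""
    text: str = ""


@dataclass(frozen=True)
class EHRInput(Input):
    """Hypothetical scenario-specific subclass with one extra field."""
    ehr: str = ""


@dataclass(frozen=True)
class Instance:
    input: Input  # The declared type is the base class.


def instance_from_dict(raw: Dict[str, Any]) -> Instance:
    """Rebuild an Instance using only the declared field type (Input)."""
    known = {f.name for f in fields(Input)}
    input_kwargs = {k: v for k, v in raw["input"].items() if k in known}
    return Instance(input=Input(**input_kwargs))


original = Instance(input=EHRInput(text="question", ehr="record"))
as_dict = asdict(original)  # The "ehr" key is still present here.
restored = instance_from_dict(as_dict)  # The subclass and its field are gone.
```

Because the deserializer only sees the annotation Input, it has no way to know that an EHRInput was originally stored, so the extra field is dropped.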
yifanmai changed the title from "Proposal: Add free-form data field to Instance" to "Proposal: Add free-form data fields to Instance" on Oct 17, 2024