Challenge

How can we automatically align models to datasets? Specifically, how can we most effectively align elements of a model to features within datasets?
Currently, models and datasets are profiled separately by MIT and SKEMA. Both datasets and models end up having (optional) groundings which, for each feature of the data or model, tie it to an element in the TA2 Domain Knowledge Graph (DKG). As far as I know, DKG code lives here.
For example, a model may have a compartment called `infected` which is grounded to one DKG term. Let's say there is a dataset that has a feature called `infections` which is grounded to a different DKG term. There is no intersection between these groundings, but clearly there is a relationship between the `infected` compartment in the model and the `infections` feature in the dataset. This makes it potentially challenging to identify relevant data to use for model calibration/simulation, since for calibration you must match data to specific model compartments/elements.
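For illustration, the two groundings might look roughly like this; the field names and identifiers below are assumptions made up for the example, not actual TA1/TA2 output:

```python
# Hypothetical groundings for the model compartment and the dataset feature.
# Identifiers and structure are illustrative only.
model_compartment = {
    "name": "infected",
    "grounding": {"identifiers": {"ido": "0000511"}},  # assumed DKG term for "infected population"
}

dataset_feature = {
    "name": "infections",
    "grounding": {"identifiers": {"cemo": "number_of_new_infections"}},  # assumed, different DKG term
}

# The identifier sets do not intersect, even though the concepts are clearly related.
model_ids = set(model_compartment["grounding"]["identifiers"].items())
feature_ids = set(dataset_feature["grounding"]["identifiers"].items())
assert not model_ids & feature_ids
```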
Potential Solutions
1. Embed the groundings for both models and datasets and enable users to perform semantic search over both. This would include embedding dataset and model descriptions. When a user is searching for data relevant to their model, they would use free-text search powered by a semantic backend to surface the most useful data.
2. Create an `/align_data_to_model` endpoint which, for a given `model_id`, attempts to find relevant data features on a model-element-to-data-feature basis. For example, an SIR model's `susceptible`, `infected`, and `recovered` compartments would be automatically matched and ranked against features (potentially from multiple datasets) based on groundings or whatever other information we can efficiently use.
The first approach fits best inside TDS and is something we may want to do anyway. Vector/semantic search over content besides papers seems quite useful; we could even support semantic code search, which could be very valuable.
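As a rough sketch of how that semantic search could work (the embedding model, the example documents, and the library choice are assumptions, not an agreed design):

```python
# Sketch of semantic search over dataset features, assuming sentence-transformers.
# The embedding model name and the example documents are placeholders.
from sentence_transformers import SentenceTransformer, util

encoder = SentenceTransformer("all-MiniLM-L6-v2")

# One text document per dataset feature, built from its name, description,
# and grounding labels (illustrative strings only).
feature_docs = [
    "infections: daily count of new reported infections",
    "hospitalizations: current number of hospitalized patients",
]
feature_embeddings = encoder.encode(feature_docs, convert_to_tensor=True)

def search(query: str, top_k: int = 5):
    """Return the dataset features most semantically similar to a free-text query."""
    query_embedding = encoder.encode(query, convert_to_tensor=True)
    hits = util.semantic_search(query_embedding, feature_embeddings, top_k=top_k)[0]
    return [(feature_docs[hit["corpus_id"]], hit["score"]) for hit in hits]

print(search("infected compartment of an SIR model"))
```

Trying a different embedding model would just mean changing the `SentenceTransformer` argument, which also speaks to the "try different embedding models" comment further down.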
The second approach will fit best inside this repository since it mirrors some of the existing endpoints (e.g. aligning a model to its paper).
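A minimal sketch of what that endpoint could look like, assuming a FastAPI service like the repository's other endpoints; `fetch_model`, `fetch_dataset_features`, `score`, and the AMR field names are hypothetical, not existing code:

```python
# Hypothetical /align_data_to_model endpoint; helpers and ranking strategy are placeholders.
from fastapi import FastAPI

app = FastAPI()

def fetch_model(model_id: str) -> dict:
    """Placeholder: pull the AMR for model_id from TDS."""
    raise NotImplementedError

def fetch_dataset_features() -> list[dict]:
    """Placeholder: pull candidate dataset features (with groundings) from TDS."""
    raise NotImplementedError

def score(model_element: dict, feature: dict) -> float:
    """Placeholder: similarity based on shared groundings and/or embeddings."""
    raise NotImplementedError

@app.get("/align_data_to_model")
def align_data_to_model(model_id: str, top_k: int = 5) -> dict:
    model = fetch_model(model_id)
    features = fetch_dataset_features()
    # Rank candidate features independently for each model element,
    # e.g. the susceptible, infected, and recovered compartments of an SIR model.
    return {
        element["name"]: [
            f["name"] for f in sorted(features, key=lambda f: score(element, f), reverse=True)
        ][:top_k]
        for element in model["elements"]  # "elements" is a stand-in for the AMR's state variables
    }
```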
Considerations
It is likely that we will need multiple examples of models and datasets for testing and development. Here is an example model in the format often referred to as an `AMR` (ASKEM Model Representation).
Here is an example data card, but note that this data card is not in the canonical dataset format for TDS. We can generate/pull some in the appropriate format, but for now at least this helps give a sense of how DKG groundings roughly appear for data.
During the TA1 working group there were comments that:

- It would be nice to be able to try different embedding models. This is fairly easy, but I can make it even easier.
- There is a need for benchmarks; we could think of this as a fairly fine-grained and difficult problem.
Some other conversation points:

- Is there even a good grounding, and if so, how much does it help? And if so, can we actually get at the grounding?
- Linking the HMI workflow to grounding usage, to help the grounding team.
- More complicated model testing.
Implementing this as an endpoint requires generating embeddings over models and datasets, which will first be addressed by this TDS issue, so this is currently blocked.