This repository has been archived by the owner on Jul 23, 2024. It is now read-only.

[Task Submission] BLM_tasks (blm_tasks) #14

Status: Open. Wants to merge 4 commits into base: main.
5 changes: 5 additions & 0 deletions src/genbench/tasks/blm_tasks/__init__.py
@@ -0,0 +1,5 @@
from genbench import TaskDict


class BlmTasks(TaskDict):
pass
54 changes: 54 additions & 0 deletions src/genbench/tasks/blm_tasks/agr_f/config.jsonnet
@@ -0,0 +1,54 @@
{
name: 'BLM_tasks (agr_f)',


description: 'BLM_tasks (agr_f) aims to measure the detection of rules related to subject-verb agreement (in French) in neural networks. The dataset was automatically generated from manually collected seeds and predefined patterns, using overlapping generation rules that combine different numbers of attractors and grammatical numbers for NPs and verbs.',

keywords: [
'rule-like generalization',
'underlying problem structure',
'grammatical phenomena',
'subject-verb agreement',
'French',
],

authors: [
'Paola Merlo',
'Chunyang Jiang',
'Aixiu An',
'Maria A. Rodriguez',
'Vivi Nastase',
],

data_source: {
type: 'manual',
train: 'https://raw.githubusercontent.com/CLCL-Geneva/GenBench/main/BLMs/agrF_train.jsonl',
test: 'https://raw.githubusercontent.com/CLCL-Geneva/GenBench/main/BLMs/agrF_test.jsonl',
},

has_validation_set: false,
has_train_set: true,

task_type: 'multiple_choice',

field_mapping: {
input: 'input',
target: 'target',
target_options: 'target_options',
},

evaluation_metrics: [
{
hf_id: 'f1',
git_commit_sha: '3a4c40f7397dcd7d9dccf0659616dc6b14072dcb',
best_score: 1.0,
},
],

preparation_strategies: {
finetuning: {
objective: 'maximum_likelihood',
},
},

}
60 changes: 60 additions & 0 deletions src/genbench/tasks/blm_tasks/agr_f/doc.md
@@ -0,0 +1,60 @@
# BLM_tasks (agr_f)

## Abstract

BLM-AgrF is an instance of Blackbird's Language Matrices (BLM). This novel linguistic dataset is generatively constructed to support investigations into representation learning of grammatical rules. Each instance, consisting of a sequence of sentences and a candidate answer set, was built using a combination of rules, to provide a layered and structured dataset for learning more complex models. The various layers of the dataset allow for a variety of explorations, from disentangled sentence representations that capture structure and regularities within a sentence, to modular architectures that capture structure and regularities across sentence sequences. The purposefully built candidate answers support more in-depth analyses of a system's behaviour and provide insights into the source of prediction errors.

The sentence structure is constructed to illustrate several underlying generative rules that describe different aspects of the linguistic phenomenon. These rules must be identified and disentangled to generalize correctly and thus identify the correct answer. The sequence structure was designed in a manner similar to visual IQ tests, and follows a generative process of overlapping rules. The task is multiple choice: the correct answer is the sentence that correctly continues the input sequence with respect to the dataset's generation rules.

## Examples
BLM-AgrF (agr_f) is a dataset capturing subject-verb agreement in French:

Input:

| # | Subject   | First PP         | Second PP       | Verb   |
|---|-----------|------------------|-----------------|--------|
| 1 | The vase  | with the flower  |                 | leaks. |
| 2 | The vases | with the flower  |                 | leak.  |
| 3 | The vase  | with the flowers |                 | leaks. |
| 4 | The vases | with the flowers |                 | leak.  |
| 5 | The vase  | with the flower  | from the garden | leaks. |
| 6 | The vases | with the flower  | from the garden | leak.  |
| 7 | The vase  | with the flowers | from the garden | leaks. |
| 8 | ???       |                  |                 |        |

Choices:

| Candidate answer                                          | Label   |
|-----------------------------------------------------------|---------|
| The vase with the flower and the garden leaks.            | Coord   |
| **The vases with the flowers from the garden leak.**      | Correct |
| The vase with the flower leaks.                           | WNA     |
| The vase with the flower from the garden leak.            | AE      |
| The vases with the flower from the garden leak.           | WN1     |
| The vases with the flowers from the gardens leak.         | WN2     |

## Usage
The task is formatted as multiple choice. The input consists of a sequence of 7 sentences, separated by the end-of-sentence marker (`</s>`). The options are provided as a list of sentences, and the index of the correct one is specified as the target:

    {
      "input": "Les soirées dans l'appartement ont gêné les résidents . </s> La soirée dans l'appartement a gêné les invités . </s> Les visites aux artistes approchent . </s> La menace de les attaquer inquiète les médecins . </s> Les visites au village des artisanats approchent . </s> La soirée dans l'appartement des propriétaires a gêné les voisins . </s> Les dangers de les réformes dans les écoles inquiètent les médecins .",
      "target": 5,
      "target_options": ["L'avion pour le vol au-dessus des canyon s'écrase .", "L'avion pour les vols au-dessus des canyon s'écrasent .", "L'avion pour les vols au-dessus des canyon dans les réserves indiennes s'écrase .", "L'avion pour les vols et des canyon s'écrasent .", "Les instruments pour les vols au-dessus des canyon s'écrase .", "L'avion pour les vols au-dessus des canyon s'écrase ."]
    }
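To illustrate the format, here is a minimal sketch of scoring a model's answer against such an instance. The instance dict below is hypothetical (shortened placeholder sentences, not real dataset content); only the field names match the usage example above:

```python
# Hypothetical instance, shaped like the usage example above.
instance = {
    "input": "sentence 1 </s> sentence 2 </s> sentence 3",
    "target": 2,
    "target_options": ["option A", "option B", "option C"],
}

# A model that returns the predicted sentence as a string can be scored by
# mapping the prediction back to its index in target_options.
predicted_sentence = "option C"
predicted_index = instance["target_options"].index(predicted_sentence)

# Compare against the gold index stored in "target".
is_correct = predicted_index == instance["target"]
```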

## Data Source
The dataset was automatically generated based on manually selected seeds and predefined sentence templates.

## Limitations and Bias
The sentences and the sequence of sentences for each dataset have a prescribed structure.

## GenBench Eval card

- *Generalisation type* The generalisation type evaluated is 'compositional', because the dataset is generated with overlapping (and compositional) rules that a system should detect.
- *Motivation* The motivation is both 'intrinsic' and 'cognitive': 'cognitive' because the dataset tests whether a system detects the kind of information humans perceive in the provided data; 'intrinsic' because if a system can learn to detect specific linguistic information, the model could be adjusted to detect different types of information.
- *Shift source* The data is automatically generated from manually collected seeds by applying prespecified (but naturalistic) templates.
- *Shift locus* The locus is 'pretrained-trained', because we expect a system to use representations from a pretrained model to address the task of identifying specific linguistic phenomena.
- *Shift type*


![GenBench Eval Card](GenBench_eval_card.png)
121 changes: 121 additions & 0 deletions src/genbench/tasks/blm_tasks/agr_f/task.py
@@ -0,0 +1,121 @@
from collections import OrderedDict
from typing import Any, Dict, List, Mapping

import evaluate
from datasets import Dataset

from genbench import Task
from genbench.api import EvaluationResult, TaskType
from genbench.utils.logging import get_logger


logger = get_logger(__name__)


def make_list(N, ind_1):
    """Return a one-hot list of length N with a 1 at index ind_1."""
    binary_list = [0] * N
    binary_list[ind_1] = 1
    return binary_list


class BlmTasksAgrF(Task):
    def evaluate_predictions(
        self,
        *,
        predictions: List[Mapping[str, Any]] = None,
        gold: Dataset = None,
    ) -> EvaluationResult:
        result = OrderedDict()
        for metric_config in self.config.evaluation_metrics:
            hf_id = metric_config.hf_id
            if isinstance(hf_id, str):
                hf_id = [hf_id]

            metric = evaluate.load(*hf_id, revision=metric_config.git_commit_sha)

            refs_lst = [g["target"] for g in gold]
            preds_lst = [pred["target"] for pred in predictions]

            ref_type = type(refs_lst[0])
            pred_type = type(preds_lst[0])
            if pred_type != ref_type:
                if self.config.task_type != TaskType.MULTIPLE_CHOICE:
                    raise ValueError(
                        f"Predictions and references have different types: preds: {pred_type} and refs: {ref_type}. "
                    )
                # Convert predictions to the same type as the references
                if pred_type == str and ref_type == int:
                    logger.warning("Predictions are strings, but references are ints. Converting predictions to ints.")
                    converted_preds = []
                    for pred, ref in zip(preds_lst, gold):
                        assert "target_options" in ref
                        converted_preds.append(ref["target_options"].index(pred))
                    preds_lst = converted_preds
                elif pred_type == int and ref_type == str:
                    logger.warning("Predictions are ints, but references are strings. Converting references to ints.")
                    converted_refs = []
                    for pred, ref in zip(preds_lst, gold):
                        assert "target_options" in ref
                        converted_refs.append(ref["target_options"].index(ref["target"]))
                    refs_lst = converted_refs
            else:
                if self.config.task_type == TaskType.MULTIPLE_CHOICE and pred_type != int:
                    # Convert both predictions and references to flat one-hot lists
                    logger.warning(
                        "Predictions and references have the same type, but it is not int. Converting both to int."
                    )

                    N = len(gold[0]["target_options"])

                    converted_preds = []
                    converted_refs = []
                    for pred, ref in zip(preds_lst, gold):
                        assert "target_options" in ref
                        # Use extend for both so predictions and references stay
                        # aligned as flat binary lists of equal length.
                        converted_preds.extend(make_list(N, ref["target_options"].index(pred)))
                        converted_refs.extend(make_list(N, ref["target_options"].index(ref["target"])))

                    preds_lst = converted_preds
                    refs_lst = converted_refs

            extra_kwargs = metric_config.compute_extra_kwargs or {}
            output: dict = metric.compute(predictions=preds_lst, references=refs_lst, **extra_kwargs)

            if output is None:
                raise ValueError(
                    f"Metric {metric_config.hf_id} returned None. Please check the metric implementation."
                )

            # Update output keys to include the metric id
            metric_id = "_".join(hf_id)
            output = {f"hf_{metric_id}__{k}": v for k, v in output.items()}

            result.update(output)

        return result

    def format_example(self, example: Dict[str, Any]) -> Dict[str, Any]:
        """Perform preprocessing/formatting on an example-level.

        By default, this method does nothing more than mapping original data source
        fields to the expected fields.

        `example` directly comes from the data source (e.g. downloaded HF dataset),
        and it may contain fields such as `question` or `answer`. This method should
        prepare the example used in the task, i.e. it should create fields `input`,
        `target`, `target_scores`, or `target_labels` depending on the task type.

        Args:
            example: A dictionary containing key-value pairs for an example from the source dataset.

        Returns:
            A dictionary containing key-value pairs for the preprocessed/formatted example.
            The dictionary should contain keys `input`, `target`, `target_scores`, or `target_label`
            depending on the task type.
        """
        return {"input": example["input"], "target": example["target"], "target_options": example["target_options"]}
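The one-hot conversion used for the multiple-choice branch above can be exercised standalone. A minimal sketch, with made-up gold examples and predictions (not real dataset content), showing how string answers become aligned flat binary lists before being passed to the f1 metric:

```python
def make_list(N, ind_1):
    # One-hot list of length N with a 1 at index ind_1.
    binary_list = [0] * N
    binary_list[ind_1] = 1
    return binary_list

# Made-up gold examples and string predictions.
gold = [
    {"target": "b", "target_options": ["a", "b", "c"]},
    {"target": "a", "target_options": ["a", "b", "c"]},
]
preds = ["b", "c"]  # second prediction is wrong

N = len(gold[0]["target_options"])
preds_lst, refs_lst = [], []
for pred, ref in zip(preds, gold):
    # extend (not append) keeps both lists flat and aligned element-wise.
    preds_lst.extend(make_list(N, ref["target_options"].index(pred)))
    refs_lst.extend(make_list(N, ref["target_options"].index(ref["target"])))

# preds_lst and refs_lst are flat binary lists of length len(gold) * N.
```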
54 changes: 54 additions & 0 deletions src/genbench/tasks/blm_tasks/agr_f_type_I_train/config.jsonnet
@@ -0,0 +1,54 @@
{
name: 'BLM_tasks (agr_f_type_I_train)',


description: 'BLM_tasks (agr_f_type_I_train) aims to measure the detection of rules related to subject-verb agreement (in French) in neural networks. The dataset was automatically generated from manually collected seeds and predefined patterns, using overlapping generation rules that combine different numbers of attractors and grammatical numbers for NPs and verbs. Compared to the agr_f task, the training data for this subtask has minimal lexical variation, both among the sentences in the input sequence and between the input and answer sentences.',

keywords: [
'rule-like generalization',
'underlying problem structure',
'grammatical phenomena',
'subject-verb agreement',
'French',
],

authors: [
'Paola Merlo',
'Chunyang Jiang',
'Aixiu An',
'Maria A. Rodriguez',
'Vivi Nastase',
],

data_source: {
type: 'manual',
train: 'https://raw.githubusercontent.com/CLCL-Geneva/GenBench/main/BLMs/agrF_typeI_train.jsonl',
test: 'https://raw.githubusercontent.com/CLCL-Geneva/GenBench/main/BLMs/agrF_test.jsonl',
},

has_validation_set: false,
has_train_set: true,

task_type: 'multiple_choice',

field_mapping: {
input: 'input',
target: 'target',
target_options: 'target_options',
},

evaluation_metrics: [
{
hf_id: 'f1',
git_commit_sha: '3a4c40f7397dcd7d9dccf0659616dc6b14072dcb',
best_score: 1.0,
},
],

preparation_strategies: {
finetuning: {
objective: 'maximum_likelihood',
},
},

}
60 changes: 60 additions & 0 deletions src/genbench/tasks/blm_tasks/agr_f_type_I_train/doc.md
@@ -0,0 +1,60 @@
# BLM_tasks (agr_f_type_I_train)

## Abstract

BLM-AgrF is an instance of Blackbird's Language Matrices (BLM). This novel linguistic dataset is generatively constructed to support investigations into representation learning of grammatical rules. Each instance, consisting of a sequence of sentences and a candidate answer set, was built using a combination of rules, to provide a layered and structured dataset for learning more complex models. The various layers of the dataset allow for a variety of explorations, from disentangled sentence representations that capture structure and regularities within a sentence, to modular architectures that capture structure and regularities across sentence sequences. The purposefully built candidate answers support more in-depth analyses of a system's behaviour and provide insights into the source of prediction errors.

The sentence structure is constructed to illustrate several underlying generative rules that describe different aspects of the linguistic phenomenon. These rules must be identified and disentangled to generalize correctly and thus identify the correct answer. The sequence structure was designed in a manner similar to visual IQ tests, and follows a generative process of overlapping rules. The task is multiple choice: the correct answer is the sentence that correctly continues the input sequence with respect to the dataset's generation rules.

## Examples
BLM-AgrF (agr_f_type_I_train) is a dataset capturing subject-verb agreement in French:

Input:

| # | Subject   | First PP         | Second PP       | Verb   |
|---|-----------|------------------|-----------------|--------|
| 1 | The vase  | with the flower  |                 | leaks. |
| 2 | The vases | with the flower  |                 | leak.  |
| 3 | The vase  | with the flowers |                 | leaks. |
| 4 | The vases | with the flowers |                 | leak.  |
| 5 | The vase  | with the flower  | from the garden | leaks. |
| 6 | The vases | with the flower  | from the garden | leak.  |
| 7 | The vase  | with the flowers | from the garden | leaks. |
| 8 | ???       |                  |                 |        |

Choices:

| Candidate answer                                          | Label   |
|-----------------------------------------------------------|---------|
| The vase with the flower and the garden leaks.            | Coord   |
| **The vases with the flowers from the garden leak.**      | Correct |
| The vase with the flower leaks.                           | WNA     |
| The vase with the flower from the garden leak.            | AE      |
| The vases with the flower from the garden leak.           | WN1     |
| The vases with the flowers from the gardens leak.         | WN2     |

## Usage
The task is formatted as multiple choice. The input consists of a sequence of 7 sentences, separated by the end-of-sentence marker (`</s>`). The options are provided as a list of sentences, and the index of the correct one is specified as the target:

    {
      "input": "Les soirées dans l'appartement ont gêné les résidents . </s> La soirée dans l'appartement a gêné les invités . </s> Les visites aux artistes approchent . </s> La menace de les attaquer inquiète les médecins . </s> Les visites au village des artisanats approchent . </s> La soirée dans l'appartement des propriétaires a gêné les voisins . </s> Les dangers de les réformes dans les écoles inquiètent les médecins .",
      "target": 5,
      "target_options": ["L'avion pour le vol au-dessus des canyon s'écrase .", "L'avion pour les vols au-dessus des canyon s'écrasent .", "L'avion pour les vols au-dessus des canyon dans les réserves indiennes s'écrase .", "L'avion pour les vols et des canyon s'écrasent .", "Les instruments pour les vols au-dessus des canyon s'écrase .", "L'avion pour les vols au-dessus des canyon s'écrase ."]
    }

## Data Source
The dataset was automatically generated based on manually selected seeds and predefined sentence templates. Compared to the 'agr_f' task, the training data for this subtask has minimal lexical variation both among the sentences in the input sequence, and between the input and output sentences.

## Limitations and Bias
The sentences and the sequence of sentences for each dataset have a prescribed structure.

## GenBench Eval card

- *Generalisation type* The generalisation type evaluated is 'compositional', because the dataset is generated with overlapping (and compositional) rules that a system should detect.
- *Motivation* The motivation is both 'intrinsic' and 'cognitive': 'cognitive' because the dataset tests whether a system detects the kind of information humans perceive in the provided data; 'intrinsic' because if a system can learn to detect specific linguistic information, the model could be adjusted to detect different types of information.
- *Shift source* The data is automatically generated from manually collected seeds by applying prespecified (but naturalistic) templates.
- *Shift locus* The locus is 'pretrained-trained', because we expect a system to use representations from a pretrained model to address the task of identifying specific linguistic phenomena.
- *Shift type* There is a difference in lexical distribution between the training and test data: the training instances have minimal lexical variation, whereas the test set has maximal lexical variation.


![GenBench Eval Card](GenBench_eval_card.png)