This repository has been archived by the owner on Jul 23, 2024. It is now read-only.

[Task Submission] BLM_tasks (blm_tasks) #14

Status: Open. Wants to merge 4 commits into base: main.
5 changes: 5 additions & 0 deletions src/genbench/tasks/blm_tasks/__init__.py
@@ -0,0 +1,5 @@
from genbench import TaskDict


class BlmTasks(TaskDict):
pass
54 changes: 54 additions & 0 deletions src/genbench/tasks/blm_tasks/agr_f/config.jsonnet
@@ -0,0 +1,54 @@
{
name: 'BLM_tasks (agr_f)',


description: 'BLM_tasks (agr_f) aims to measure the detection of rules related to subject-verb agreement (in French) in neural networks. The dataset was automatically generated from manually collected seeds and predefined patterns, using overlapping generation rules that combine different numbers of attractors and grammatical numbers for NPs and verbs.',

keywords: [
'rule-like generalization',
'underlying problem structure',
'grammatical phenomena',
'subject-verb agreement',
'French',
],

authors: [
'Paola Merlo',
'Chunyang Jiang',
'Aixiu An',
'Maria A. Rodriguez',
'Vivi Nastase',
],

data_source: {
type: 'manual',
train: 'https://raw.githubusercontent.com/CLCL-Geneva/GenBench/main/BLMs/agrF_train.jsonl',
test: 'https://raw.githubusercontent.com/CLCL-Geneva/GenBench/main/BLMs/agrF_test.jsonl',
},

has_validation_set: false,
has_train_set: true,

task_type: 'multiple_choice',

field_mapping: {
input: 'input',
target: 'target',
target_options: 'target_options',
},

evaluation_metrics: [
{
hf_id: 'f1',
git_commit_sha: '3a4c40f7397dcd7d9dccf0659616dc6b14072dcb',
best_score: 1.0,
},
],

preparation_strategies: {
finetuning: {
objective: 'maximum_likelihood',
},
},

}
60 changes: 60 additions & 0 deletions src/genbench/tasks/blm_tasks/agr_f/doc.md
@@ -0,0 +1,60 @@
# BLM_tasks (agr_f)

## Abstract

BLM-AgrF is an instance of Blackbird's Language Matrices (BLM). This novel linguistic dataset is generatively constructed to support investigations into representation learning of grammatical rules. Each instance, consisting of a sequence of sentences and a candidate answer set, was built using a combination of rules, to provide a layered and structured dataset for learning more complex models. The various layers of the dataset allow for a variety of explorations, from disentangled sentence representations that capture structure and regularities within a sentence, to modular architectures that capture structure and regularities across sentence sequences. The purposefully built candidate answers support more in-depth analyses of a system's behaviour and provide insights into the source of prediction errors.

The sentence structure is constructed to illustrate several underlying generative rules that describe different aspects of the linguistic phenomenon. These rules must be identified and disentangled to generalize correctly and thus identify the correct answer. The sequence structure was designed in a manner similar to visual IQ tests, and follows a generative process of overlapping rules. The task is multiple choice: the correct answer is the sentence that correctly continues the input sequence with respect to the dataset's generation rules.

## Examples
BLM-AgrF (agr_f) is a dataset capturing subject-verb agreement in French:

Input:

| # | Subject   | First PP         | Second PP       | Verb   |
|---|-----------|------------------|-----------------|--------|
| 1 | The vase  | with the flower  |                 | leaks. |
| 2 | The vases | with the flower  |                 | leak.  |
| 3 | The vase  | with the flowers |                 | leaks. |
| 4 | The vases | with the flowers |                 | leak.  |
| 5 | The vase  | with the flower  | from the garden | leaks. |
| 6 | The vases | with the flower  | from the garden | leak.  |
| 7 | The vase  | with the flowers | from the garden | leaks. |
| 8 | ???       |                  |                 |        |

Choices:

| Candidate answer                                          | Label   |
|-----------------------------------------------------------|---------|
| The vase with the flower and the garden leaks.            | Coord   |
| **The vases with the flowers from the garden leak.**      | Correct |
| The vase with the flower leaks.                           | WNA     |
| The vase with the flower from the garden leak.            | AE      |
| The vases with the flower from the garden leak.           | WN1     |
| The vases with the flowers from the gardens leak.         | WN2     |

## Usage
The task is formatted as multiple choice. The input consists of a sequence of 7 sentences, separated by the end-of-sentence marker (`</s>`). The options are provided as a list of sentences, and the index of the correct one is specified as the target:

    {
      "input": "Les soirées dans l'appartement ont gêné les résidents . </s> La soirée dans l'appartement a gêné les invités . </s> Les visites aux artistes approchent . </s> La menace de les attaquer inquiète les médecins . </s> Les visites au village des artisanats approchent . </s> La soirée dans l'appartement des propriétaires a gêné les voisins . </s> Les dangers de les réformes dans les écoles inquiètent les médecins .",
      "target": 5,
      "target_options": ["L'avion pour le vol au-dessus des canyon s'écrase .", "L'avion pour les vols au-dessus des canyon s'écrasent .", "L'avion pour les vols au-dessus des canyon dans les réserves indiennes s'écrase .", "L'avion pour les vols et des canyon s'écrasent .", "Les instruments pour les vols au-dessus des canyon s'écrase .", "L'avion pour les vols au-dessus des canyon s'écrase ."]
    }
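To illustrate the format, here is a minimal sketch of scoring a model's answer against such an instance. The instance dict below is hypothetical (shortened placeholder sentences, not real dataset content); only the field names match the usage example above:

```python
# Hypothetical instance, shaped like the usage example above.
instance = {
    "input": "sentence 1 </s> sentence 2 </s> sentence 3",
    "target": 2,
    "target_options": ["option A", "option B", "option C"],
}

# A model that returns the predicted sentence as a string can be scored by
# mapping the prediction back to its index in target_options.
predicted_sentence = "option C"
predicted_index = instance["target_options"].index(predicted_sentence)

# Compare against the gold index stored in "target".
is_correct = predicted_index == instance["target"]
```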

## Data Source
The dataset was automatically generated based on manually selected seeds and predefined sentence templates.

## Limitations and Bias
The sentences and the sequence of sentences for each dataset have a prescribed structure.

## GenBench Eval card

- *Generalisation type* The generalisation type evaluated is 'compositional', because the dataset is generated with overlapping (and compositional) rules that a system should detect.
- *Motivation* The motivation is both 'intrinsic' and 'cognitive': 'cognitive' because the dataset tests whether a system detects the kind of information humans perceive in the provided data; 'intrinsic' because if a system can learn to detect specific linguistic information, the model could be adjusted to detect different types of information.
- *Shift source* The data is automatically generated from manually collected seeds by applying prespecified (but naturalistic) templates.
- *Shift locus* The locus is 'pretrained-trained', because we expect a system to use representations from a pretrained model to address the task of identifying specific linguistic phenomena.
- *Shift type*


![GenBench Eval Card](GenBench_eval_card.png)
121 changes: 121 additions & 0 deletions src/genbench/tasks/blm_tasks/agr_f/task.py
@@ -0,0 +1,121 @@
from collections import OrderedDict
from typing import Any, Dict, List, Mapping

import evaluate
from datasets import Dataset

from genbench import Task
from genbench.api import EvaluationResult, TaskType
from genbench.utils.logging import get_logger


logger = get_logger(__name__)


def make_list(N, ind_1):
    """Return a one-hot list of length N with a 1 at index ind_1."""
    binary_list = [0] * N
    binary_list[ind_1] = 1
    return binary_list


class BlmTasksAgrF(Task):
    def evaluate_predictions(
        self,
        *,
        predictions: List[Mapping[str, Any]] = None,
        gold: Dataset = None,
    ) -> EvaluationResult:
        result = OrderedDict()
        for metric_config in self.config.evaluation_metrics:
            hf_id = metric_config.hf_id
            if isinstance(hf_id, str):
                hf_id = [hf_id]

            metric = evaluate.load(*hf_id, revision=metric_config.git_commit_sha)

            refs_lst = [g["target"] for g in gold]
            preds_lst = [pred["target"] for pred in predictions]

            ref_type = type(refs_lst[0])
            pred_type = type(preds_lst[0])
            if pred_type != ref_type:
                if self.config.task_type != TaskType.MULTIPLE_CHOICE:
                    raise ValueError(
                        f"Predictions and references have different types: preds: {pred_type} and refs: {ref_type}. "
                    )
                # Convert predictions to the same type as the references
                if pred_type == str and ref_type == int:
                    logger.warning("Predictions are strings, but references are ints. Converting predictions to ints.")
                    converted_preds = []
                    for pred, ref in zip(preds_lst, gold):
                        assert "target_options" in ref
                        converted_preds.append(ref["target_options"].index(pred))
                    preds_lst = converted_preds
                elif pred_type == int and ref_type == str:
                    logger.warning("Predictions are ints, but references are strings. Converting references to ints.")
                    converted_refs = []
                    for pred, ref in zip(preds_lst, gold):
                        assert "target_options" in ref
                        converted_refs.append(ref["target_options"].index(ref["target"]))
                    refs_lst = converted_refs
            else:
                if self.config.task_type == TaskType.MULTIPLE_CHOICE and pred_type != int:
                    # Convert both predictions and references to flat one-hot lists
                    logger.warning(
                        "Predictions and references have the same type, but it is not int. Converting both to int."
                    )

                    N = len(gold[0]["target_options"])

                    converted_preds = []
                    converted_refs = []
                    for pred, ref in zip(preds_lst, gold):
                        assert "target_options" in ref
                        # Use extend for both so predictions and references stay
                        # aligned as flat binary lists of equal length.
                        converted_preds.extend(make_list(N, ref["target_options"].index(pred)))
                        converted_refs.extend(make_list(N, ref["target_options"].index(ref["target"])))

                    preds_lst = converted_preds
                    refs_lst = converted_refs

            extra_kwargs = metric_config.compute_extra_kwargs or {}
            output: dict = metric.compute(predictions=preds_lst, references=refs_lst, **extra_kwargs)

            if output is None:
                raise ValueError(
                    f"Metric {metric_config.hf_id} returned None. Please check the metric implementation."
                )

            # Update output keys to include the metric id
            metric_id = "_".join(hf_id)
            output = {f"hf_{metric_id}__{k}": v for k, v in output.items()}

            result.update(output)

        return result

    def format_example(self, example: Dict[str, Any]) -> Dict[str, Any]:
        """Perform preprocessing/formatting on an example-level.

        By default, this method does nothing more than mapping original data source
        fields to the expected fields.

        `example` directly comes from the data source (e.g. downloaded HF dataset),
        and it may contain fields such as `question` or `answer`. This method should
        prepare the example used in the task, i.e. it should create fields `input`,
        `target`, `target_scores`, or `target_labels` depending on the task type.

        Args:
            example: A dictionary containing key-value pairs for an example from the source dataset.

        Returns:
            A dictionary containing key-value pairs for the preprocessed/formatted example.
            The dictionary should contain keys `input`, `target`, `target_scores`, or `target_label`
            depending on the task type.
        """
        return {"input": example["input"], "target": example["target"], "target_options": example["target_options"]}
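The one-hot conversion used for the multiple-choice branch above can be exercised standalone. A minimal sketch, with made-up gold examples and predictions (not real dataset content), showing how string answers become aligned flat binary lists before being passed to the f1 metric:

```python
def make_list(N, ind_1):
    # One-hot list of length N with a 1 at index ind_1.
    binary_list = [0] * N
    binary_list[ind_1] = 1
    return binary_list

# Made-up gold examples and string predictions.
gold = [
    {"target": "b", "target_options": ["a", "b", "c"]},
    {"target": "a", "target_options": ["a", "b", "c"]},
]
preds = ["b", "c"]  # second prediction is wrong

N = len(gold[0]["target_options"])
preds_lst, refs_lst = [], []
for pred, ref in zip(preds, gold):
    # extend (not append) keeps both lists flat and aligned element-wise.
    preds_lst.extend(make_list(N, ref["target_options"].index(pred)))
    refs_lst.extend(make_list(N, ref["target_options"].index(ref["target"])))

# preds_lst and refs_lst are flat binary lists of length len(gold) * N.
```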
54 changes: 54 additions & 0 deletions src/genbench/tasks/blm_tasks/agr_f_type_I_train/config.jsonnet
@@ -0,0 +1,54 @@
{
name: 'BLM_tasks (agr_f_type_I_train)',


description: 'BLM_tasks (agr_f_type_I_train) aims to measure the detection of rules related to subject-verb agreement (in French) in neural networks. The dataset was automatically generated from manually collected seeds and predefined patterns, using overlapping generation rules that combine different numbers of attractors and grammatical numbers for NPs and verbs. Compared to the agr_f task, the training data for this subtask has minimal lexical variation, both among the sentences in the input sequence and between the input and answer sentences.',

keywords: [
'rule-like generalization',
'underlying problem structure',
'grammatical phenomena',
'subject-verb agreement',
'French',
],

authors: [
'Paola Merlo',
'Chunyang Jiang',
'Aixiu An',
'Maria A. Rodriguez',
'Vivi Nastase',
],

data_source: {
type: 'manual',
train: 'https://raw.githubusercontent.com/CLCL-Geneva/GenBench/main/BLMs/agrF_typeI_train.jsonl',
test: 'https://raw.githubusercontent.com/CLCL-Geneva/GenBench/main/BLMs/agrF_test.jsonl',
},

has_validation_set: false,
has_train_set: true,

task_type: 'multiple_choice',

field_mapping: {
input: 'input',
target: 'target',
target_options: 'target_options',
},

evaluation_metrics: [
{
hf_id: 'f1',
git_commit_sha: '3a4c40f7397dcd7d9dccf0659616dc6b14072dcb',
best_score: 1.0,
},
],

preparation_strategies: {
finetuning: {
objective: 'maximum_likelihood',
},
},

}
60 changes: 60 additions & 0 deletions src/genbench/tasks/blm_tasks/agr_f_type_I_train/doc.md
@@ -0,0 +1,60 @@
# BLM_tasks (agr_f_type_I_train)

## Abstract

BLM-AgrF is an instance of Blackbird's Language Matrices (BLM). This novel linguistic dataset is generatively constructed to support investigations into representation learning of grammatical rules. Each instance, consisting of a sequence of sentences and a candidate answer set, was built using a combination of rules, to provide a layered and structured dataset for learning more complex models. The various layers of the dataset allow for a variety of explorations, from disentangled sentence representations that capture structure and regularities within a sentence, to modular architectures that capture structure and regularities across sentence sequences. The purposefully built candidate answers support more in-depth analyses of a system's behaviour and provide insights into the source of prediction errors.

The sentence structure is constructed to illustrate several underlying generative rules that describe different aspects of the linguistic phenomenon. These rules must be identified and disentangled to generalize correctly and thus identify the correct answer. The sequence structure was designed in a manner similar to visual IQ tests, and follows a generative process of overlapping rules. The task is multiple choice: the correct answer is the sentence that correctly continues the input sequence with respect to the dataset's generation rules.

## Examples
BLM-AgrF (agr_f_type_I_train) is a dataset capturing subject-verb agreement in French:

Input:

| # | Subject   | First PP         | Second PP       | Verb   |
|---|-----------|------------------|-----------------|--------|
| 1 | The vase  | with the flower  |                 | leaks. |
| 2 | The vases | with the flower  |                 | leak.  |
| 3 | The vase  | with the flowers |                 | leaks. |
| 4 | The vases | with the flowers |                 | leak.  |
| 5 | The vase  | with the flower  | from the garden | leaks. |
| 6 | The vases | with the flower  | from the garden | leak.  |
| 7 | The vase  | with the flowers | from the garden | leaks. |
| 8 | ???       |                  |                 |        |

Choices:

| Candidate answer                                          | Label   |
|-----------------------------------------------------------|---------|
| The vase with the flower and the garden leaks.            | Coord   |
| **The vases with the flowers from the garden leak.**      | Correct |
| The vase with the flower leaks.                           | WNA     |
| The vase with the flower from the garden leak.            | AE      |
| The vases with the flower from the garden leak.           | WN1     |
| The vases with the flowers from the gardens leak.         | WN2     |

## Usage
The task is formatted as multiple choice. The input consists of a sequence of 7 sentences, separated by the end-of-sentence marker (`</s>`). The options are provided as a list of sentences, and the index of the correct one is specified as the target:

    {
      "input": "Les soirées dans l'appartement ont gêné les résidents . </s> La soirée dans l'appartement a gêné les invités . </s> Les visites aux artistes approchent . </s> La menace de les attaquer inquiète les médecins . </s> Les visites au village des artisanats approchent . </s> La soirée dans l'appartement des propriétaires a gêné les voisins . </s> Les dangers de les réformes dans les écoles inquiètent les médecins .",
      "target": 5,
      "target_options": ["L'avion pour le vol au-dessus des canyon s'écrase .", "L'avion pour les vols au-dessus des canyon s'écrasent .", "L'avion pour les vols au-dessus des canyon dans les réserves indiennes s'écrase .", "L'avion pour les vols et des canyon s'écrasent .", "Les instruments pour les vols au-dessus des canyon s'écrase .", "L'avion pour les vols au-dessus des canyon s'écrase ."]
    }

## Data Source
The dataset was automatically generated based on manually selected seeds and predefined sentence templates. Compared to the 'agr_f' task, the training data for this subtask has minimal lexical variation both among the sentences in the input sequence, and between the input and output sentences.

## Limitations and Bias
The sentences and the sequence of sentences for each dataset have a prescribed structure.

## GenBench Eval card

- *Generalisation type* The generalisation type evaluated is 'compositional', because the dataset is generated with overlapping (and compositional) rules that a system should detect.
- *Motivation* The motivation is both 'intrinsic' and 'cognitive': 'cognitive' because the dataset tests whether a system detects the kind of information humans perceive in the provided data; 'intrinsic' because if a system can learn to detect specific linguistic information, the model could be adjusted to detect different types of information.
- *Shift source* The data is automatically generated from manually collected seeds by applying prespecified (but naturalistic) templates.
- *Shift locus* The locus is 'pretrained-trained', because we expect a system to use representations from a pretrained model to address the task of identifying specific linguistic phenomena.
- *Shift type* There is a difference in lexical distribution between the training and test data: the training instances have minimal lexical variation, whereas the test set has maximal lexical variation.


![GenBench Eval Card](GenBench_eval_card.png)