feat: add required & include parameter & support for span_getter in eds.contextual_matcher
aphp · May 12, 2024 · db55239
1 parent 0c6361b
Showing 8 changed files with 381 additions and 236 deletions.
diff --git a/changelog.md b/changelog.md
@@ -6,6 +6,8 @@
 
 - Expose the defaults patterns of `eds.negation`, `eds.hypothesis`, `eds.family`, `eds.history` and `eds.reported_speech` under a `eds.negation.default_patterns` attribute
 - Added a `context_getter` SpanGetter argument to the `eds.matcher` class to only retrieve entities inside the spans returned by the getter
+- Added a `filter_expr` parameter to scorers to filter the documents to score
+- Added a new `required` field to `eds.contextual_matcher` assign patterns to only match if the required field has been found, and an `include` parameter (similar to `exclude`) to search for required patterns without assigning them to the entity
 
 ## v0.11.2
 

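The `required` and `include` entries in the changelog hunk above can be read as filters applied to a candidate match. The sketch below is a library-independent illustration of that reading, not the edsnlp implementation: `context`, `keep_match`, the rule dictionaries, and the character-based `window` are all hypothetical simplifications (the edsnlp documentation counts windows in words).

```python
import re
from typing import Dict, List, Optional, Tuple


def context(text: str, start: int, end: int, window: int) -> str:
    """Hypothetical helper: `window` characters after the match (window > 0)
    or before it (window < 0)."""
    if window >= 0:
        return text[end : end + window]
    return text[max(0, start + window) : start]


def keep_match(
    text: str,
    span: Tuple[int, int],
    include: Optional[List[dict]] = None,
    exclude: Optional[List[dict]] = None,
    assign: Optional[List[dict]] = None,
) -> Optional[Dict[str, str]]:
    """Return the assigned values if the match is kept, else None."""
    start, end = span
    # exclude: drop the match as soon as one pattern is found in its context
    for rule in exclude or []:
        if re.search(rule["regex"], context(text, start, end, rule["window"])):
            return None
    # include: drop the match if a pattern is NOT found; nothing is assigned
    for rule in include or []:
        if not re.search(rule["regex"], context(text, start, end, rule["window"])):
            return None
    assigned: Dict[str, str] = {}
    for rule in assign or []:
        found = re.search(rule["regex"], context(text, start, end, rule["window"]))
        if found:
            assigned[rule["name"]] = found.group(1)
        elif rule.get("required", False):
            # required: a missing assign value discards the whole match
            return None
    return assigned


text = "neonatal jaundice, no transfusion"
span = (9, 17)  # character offsets of "jaundice"
stage_rule = dict(name="stage", regex=r"(neonatal)", window=-20, required=True)
result = keep_match(text, span, assign=[stage_rule])
```

In this reading, `include` and a `required` assign both veto a match when their pattern is absent; the difference is that only the assign rule stores the captured value on the result.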
diff --git a/docs/assets/stylesheets/extra.css b/docs/assets/stylesheets/extra.css
@@ -166,3 +166,20 @@ body, input {
 .md-typeset code a:not(.md-annotation__index) {
     border-bottom: 1px dashed var(--md-typeset-a-color);
 }
+
+.doc-param-details .subdoc {
+    padding: 0;
+    box-shadow: none;
+    border-color: var(--md-typeset-table-color);
+}
+
+.doc-param-details .subdoc > div > div > div>  table {
+    padding: 0;
+    box-shadow: none;
+    border: none;
+}
+
+.doc-param-details .subdoc > summary {
+    margin: 0;
+    font-weight: normal;
+}
diff --git a/docs/pipes/core/contextual-matcher.md b/docs/pipes/core/contextual-matcher.md
@@ -206,74 +206,6 @@ Let us see what we can get from this pipeline with a few examples
 
 However, most of the configuration is provided in the `patterns` key, as a **pattern dictionary** or a **list of pattern dictionaries**
 
-## The pattern dictionary
-
-### Description
-
-A patterr is a nested dictionary with the following keys:
-
-=== "`source`"
-
-    A label describing the pattern
-
-=== "`regex`"
-
-    A single Regex or a list of Regexes
-
-=== "`regex_attr`"
-
-    An attributes to overwrite the given `attr` when matching with Regexes.
-
-=== "`terms`"
-
-    A single term or a list of terms (for exact matches)
-
-=== "`exclude`"
-
-    A dictionary (or list of dictionaries) to define exclusion rules. Exclusion rules are given as Regexes, and if a
-    match is found in the surrounding context of an extraction, the extraction is removed. Each dictionary should have the following keys:
-
-    === "`window`"
-
-        Size of the context to use (in number of words). You can provide the window as:
-
-        - A positive integer, in this case the used context will be taken **after** the extraction
-        - A negative integer, in this case the used context will be taken **before** the extraction
-        - A tuple of integers `(start, end)`, in this case the used context will be the snippet from `start` tokens before the extraction to `end` tokens after the extraction
-
-    === "`regex`"
-
-        A single Regex or a list of Regexes.
-
-=== "`assign`"
-
-    A dictionary to refine the extraction. Similarily to the `exclude` key, you can provide a dictionary to
-    use on the context **before** and **after** the extraction.
-
-    === "`name`"
-
-        A name (string)
-
-    === "`window`"
-
-        Size of the context to use (in number of words). You can provide the window as:
-
-        - A positive integer, in this case the used context will be taken **after** the extraction
-        - A negative integer, in this case the used context will be taken **before** the extraction
-        - A tuple of integers `(start, end)`, in this case the used context will be the snippet from `start` tokens before the extraction to `end` tokens after the extraction
-
-    === "`regex`"
-
-        A dictionary where keys are labels and values are **Regexes with a single capturing group**
-
-    === "`replace_entity`"
-
-        If set to `True`, the match from the corresponding assign key will be used as entity, instead of the main match. See [this paragraph][the-replace_entity-parameter]
-
-    === "`reduce_mode`"
-
-        Set how multiple assign matches are handled. See the documentation of the [`reduce_mode` parameter][the-reduce_mode-parameter]
-
 ### A full pattern dictionary example
 
 ```python
@@ -300,6 +232,8 @@ dict(
             regex=r"(neonatal)",
             expand_entity=True,
             window=3,
+            # keep the extraction only if neonatal is found
+            required=True,
         ),
         dict(
             name="trans",

diff --git a/edsnlp/matchers/regex.py b/edsnlp/matchers/regex.py
@@ -1,6 +1,6 @@
 import re
 from bisect import bisect_left, bisect_right
-from typing import Any, Dict, List, Optional, Tuple, Union
+from typing import Any, Dict, Iterator, List, Optional, Tuple, Union
 
 from loguru import logger
 from spacy.tokens import Doc, Span
@@ -465,7 +465,7 @@ def __call__(
         doclike: Union[Doc, Span],
         as_spans=False,
         return_groupdict=False,
-    ) -> Union[Span, Tuple[Span, Dict[str, Any]]]:
+    ) -> Iterator[Union[Span, Tuple[Span, Dict[str, Any]]]]:
         """
         Performs matching. Yields matches.