From ee159b8543c8d882dfa9dfad5f946269b6ff2a2c Mon Sep 17 00:00:00 2001
From: broaddeep <43122784+broaddeep@users.noreply.github.com>
Date: Thu, 8 Apr 2021 17:10:14 +0900
Subject: [PATCH] Support match alignments (#7321)

* Support match alignments

* change naming from match_alignments to with_alignments, add conditional flow
  if with_alignments is given, validate with_alignments, add related test case

* remove added errors, utilize bint type, cleanup whitespace

* fix no new line in end of file

* Minor formatting

* Skip alignments processing if as_spans is set

* Add with_alignments to Matcher API docs

* Update website/docs/api/matcher.md

Co-authored-by: Sofie Van Landeghem

Co-authored-by: Adriane Boyd
Co-authored-by: Sofie Van Landeghem
---
 .github/contributors/broaddeep.md         | 106 +++++++++++++++++
 spacy/matcher/matcher.pxd                 |   6 +
 spacy/matcher/matcher.pyx                 | 135 ++++++++++++++++++----
 spacy/tests/matcher/test_matcher_logic.py |  87 ++++++++++++++
 website/docs/api/matcher.md               |  15 +--
 5 files changed, 321 insertions(+), 28 deletions(-)
 create mode 100644 .github/contributors/broaddeep.md

diff --git a/.github/contributors/broaddeep.md b/.github/contributors/broaddeep.md
new file mode 100644
index 00000000000..d6c4b3cf303
--- /dev/null
+++ b/.github/contributors/broaddeep.md
@@ -0,0 +1,106 @@
+# spaCy contributor agreement
+
+This spaCy Contributor Agreement (**"SCA"**) is based on the
+[Oracle Contributor Agreement](http://www.oracle.com/technetwork/oca-405177.pdf).
+The SCA applies to any contribution that you make to any product or project
+managed by us (the **"project"**), and sets out the intellectual property rights
+you grant to us in the contributed materials. The term **"us"** shall mean
+[ExplosionAI GmbH](https://explosion.ai/legal). The term
+**"you"** shall mean the person or entity identified below.
+
+If you agree to be bound by these terms, fill in the information requested
+below and include the filled-in version with your first pull request, under the
+folder [`.github/contributors/`](/.github/contributors/). The name of the file
+should be your GitHub username, with the extension `.md`. For example, the user
+example_user would create the file `.github/contributors/example_user.md`.
+
+Read this agreement carefully before signing. These terms and conditions
+constitute a binding legal agreement.
+
+## Contributor Agreement
+
+1. The term "contribution" or "contributed materials" means any source code,
+object code, patch, tool, sample, graphic, specification, manual,
+documentation, or any other material posted or submitted by you to the project.
+
+2. With respect to any worldwide copyrights, or copyright applications and
+registrations, in your contribution:
+
+    * you hereby assign to us joint ownership, and to the extent that such
+      assignment is or becomes invalid, ineffective or unenforceable, you hereby
+      grant to us a perpetual, irrevocable, non-exclusive, worldwide, no-charge,
+      royalty-free, unrestricted license to exercise all rights under those
+      copyrights. This includes, at our option, the right to sublicense these same
+      rights to third parties through multiple levels of sublicensees or other
+      licensing arrangements;
+
+    * you agree that each of us can do all things in relation to your
+      contribution as if each of us were the sole owners, and if one of us makes
+      a derivative work of your contribution, the one who makes the derivative
+      work (or has it made) will be the sole owner of that derivative work;
+
+    * you agree that you will not assert any moral rights in your contribution
+      against us, our licensees or transferees;
+
+    * you agree that we may register a copyright in your contribution and
+      exercise all ownership rights associated with it; and
+
+    * you agree that neither of us has any duty to consult with, obtain the
+      consent of, pay or render an accounting to the other for any use or
+      distribution of your contribution.
+
+3. With respect to any patents you own, or that you can license without payment
+to any third party, you hereby grant to us a perpetual, irrevocable,
+non-exclusive, worldwide, no-charge, royalty-free license to:
+
+    * make, have made, use, sell, offer to sell, import, and otherwise transfer
+      your contribution in whole or in part, alone or in combination with or
+      included in any product, work or materials arising out of the project to
+      which your contribution was submitted, and
+
+    * at our option, to sublicense these same rights to third parties through
+      multiple levels of sublicensees or other licensing arrangements.
+
+4. Except as set out above, you keep all right, title, and interest in your
+contribution. The rights that you grant to us under these terms are effective
+on the date you first submitted a contribution to us, even if your submission
+took place before the date you sign these terms.
+
+5. You covenant, represent, warrant and agree that:
+
+    * Each contribution that you submit is and shall be an original work of
+      authorship and you can legally grant the rights set out in this SCA;
+
+    * to the best of your knowledge, each contribution will not violate any
+      third party's copyrights, trademarks, patents, or other intellectual
+      property rights; and
+
+    * each contribution shall be in compliance with U.S. export control laws and
+      other applicable export and import laws. You agree to notify us if you
+      become aware of any circumstance which would make any of the foregoing
+      representations inaccurate in any respect. We may publicly disclose your
+      participation in the project, including the fact that you have signed the SCA.
+
+6. This SCA is governed by the laws of the State of California and applicable
+U.S. Federal law. Any choice of law rules will not apply.
+
+7. Please place an “x” on one of the applicable statements below. Please do NOT
+mark both statements:
+
+    * [x] I am signing on behalf of myself as an individual and no other person
+      or entity, including my employer, has or will have rights with respect to my
+      contributions.
+
+    * [ ] I am signing on behalf of my employer or a legal entity and I have the
+      actual authority to contractually bind that entity.
+
+## Contributor Details
+
+| Field                          | Entry                |
+|------------------------------- | -------------------- |
+| Name                           | Dongjun Park         |
+| Company name (if applicable)   |                      |
+| Title or role (if applicable)  |                      |
+| Date                           | 2021-03-06           |
+| GitHub username                | broaddeep            |
+| Website (optional)             |                      |
diff --git a/spacy/matcher/matcher.pxd b/spacy/matcher/matcher.pxd
index 52a30d94cca..455f978cc3e 100644
--- a/spacy/matcher/matcher.pxd
+++ b/spacy/matcher/matcher.pxd
@@ -46,6 +46,12 @@ cdef struct TokenPatternC:
     int32_t nr_py
     quantifier_t quantifier
     hash_t key
+    int32_t token_idx
+
+
+cdef struct MatchAlignmentC:
+    int32_t token_idx
+    int32_t length
 
 
 cdef struct PatternStateC:
diff --git a/spacy/matcher/matcher.pyx b/spacy/matcher/matcher.pyx
index 26dca05eb94..dae12c3f66e 100644
--- a/spacy/matcher/matcher.pyx
+++ b/spacy/matcher/matcher.pyx
@@ -196,7 +196,7 @@ cdef class Matcher:
         else:
             yield doc
 
-    def __call__(self, object doclike, *, as_spans=False, allow_missing=False):
+    def __call__(self, object doclike, *, as_spans=False, allow_missing=False, with_alignments=False):
         """Find all token sequences matching the supplied pattern.
 
         doclike (Doc or Span): The document to match over.
@@ -204,10 +204,16 @@ cdef class Matcher:
             start, end) tuples.
         allow_missing (bool): Whether to skip checks for missing annotation for
             attributes included in patterns. Defaults to False.
+        with_alignments (bool): Return match alignment information, a
+            `List[int]` with the same length as the matched span. Each entry
+            denotes the index of the corresponding token pattern. If as_spans
+            is set to True, this setting is ignored.
         RETURNS (list): A list of `(match_id, start, end)` tuples,
             describing the matches. A match tuple describes a span
             `doc[start:end]`. The `match_id` is an integer. If as_spans is set
             to True, a list of Span objects is returned.
+            If with_alignments is set to True and as_spans is set to False,
+            a list of `(match_id, start, end, alignments)` tuples is returned.
         """
         if isinstance(doclike, Doc):
             doc = doclike
@@ -217,6 +223,9 @@ cdef class Matcher:
             length = doclike.end - doclike.start
         else:
             raise ValueError(Errors.E195.format(good="Doc or Span", got=type(doclike).__name__))
+        # Skip alignments calculations if as_spans is set
+        if as_spans:
+            with_alignments = False
         cdef Pool tmp_pool = Pool()
         if not allow_missing:
             for attr in (TAG, POS, MORPH, LEMMA, DEP):
@@ -232,18 +241,20 @@ cdef class Matcher:
                         error_msg = Errors.E155.format(pipe=pipe, attr=self.vocab.strings.as_string(attr))
                         raise ValueError(error_msg)
         matches = find_matches(&self.patterns[0], self.patterns.size(), doclike, length,
-                               extensions=self._extensions, predicates=self._extra_predicates)
+                               extensions=self._extensions, predicates=self._extra_predicates, with_alignments=with_alignments)
         final_matches = []
         pairs_by_id = {}
-        # For each key, either add all matches, or only the filtered, non-overlapping ones
-        for (key, start, end) in matches:
+        # For each key, either add all matches, or only the filtered,
+        # non-overlapping ones. Each `match` is either (start, end) or
+        # (start, end, alignments), depending on the `with_alignments` option.
+        for key, *match in matches:
             span_filter = self._filter.get(key)
             if span_filter is not None:
                 pairs = pairs_by_id.get(key, [])
-                pairs.append((start,end))
+                pairs.append(match)
                 pairs_by_id[key] = pairs
             else:
-                final_matches.append((key, start, end))
+                final_matches.append((key, *match))
         matched = <char*>tmp_pool.alloc(length, sizeof(char))
         empty = <char*>tmp_pool.alloc(length, sizeof(char))
         for key, pairs in pairs_by_id.items():
@@ -255,14 +266,18 @@ cdef class Matcher:
                 sorted_pairs = sorted(pairs, key=lambda x: (x[1]-x[0], -x[0]), reverse=True) # reverse sort by length
             else:
                 raise ValueError(Errors.E947.format(expected=["FIRST", "LONGEST"], arg=span_filter))
-            for (start, end) in sorted_pairs:
+            for match in sorted_pairs:
+                start, end = match[:2]
                 assert 0 <= start < end  # Defend against segfaults
                 span_len = end-start
                 # If no tokens in the span have matched
                 if memcmp(&matched[start], &empty[start], span_len * sizeof(matched[0])) == 0:
-                    final_matches.append((key, start, end))
+                    final_matches.append((key, *match))
                     # Mark tokens that have matched
                     memset(&matched[start], 1, span_len * sizeof(matched[0]))
+        if with_alignments:
+            final_matches_with_alignments = final_matches
+            final_matches = [(key, start, end) for key, start, end, alignments in final_matches]
         # perform the callbacks on the filtered set of results
         for i, (key, start, end) in enumerate(final_matches):
             on_match = self._callbacks.get(key, None)
@@ -270,6 +285,22 @@ cdef class Matcher:
                 on_match(self, doc, i, final_matches)
         if as_spans:
             return [Span(doc, start, end, label=key) for key, start, end in final_matches]
+        elif with_alignments:
+            # Convert each alignment from List[Dict[str, int]] to List[int]
+            final_matches = []
+            # When multiple alignments are found for the same length,
+            # keep the alignment with the largest token_idx
+            for key, start, end, alignments in final_matches_with_alignments:
+                sorted_alignments = sorted(alignments, key=lambda x: (x['length'], x['token_idx']), reverse=False)
+                alignments = [0] * (end-start)
+                for align in sorted_alignments:
+                    if align['length'] >= end-start:
+                        continue
+                    # Since alignments are sorted by (length, token_idx), entries
+                    # with equal length overwrite smaller token_idx values
+                    alignments[align['length']] = align['token_idx']
+                final_matches.append((key, start, end, alignments))
+            return final_matches
         else:
             return final_matches
@@ -288,9 +319,9 @@ def unpickle_matcher(vocab, patterns, callbacks):
     return matcher
 
 
-cdef find_matches(TokenPatternC** patterns, int n, object doclike, int length, extensions=None, predicates=tuple()):
+cdef find_matches(TokenPatternC** patterns, int n, object doclike, int length, extensions=None, predicates=tuple(), bint with_alignments=0):
     """Find matches in a doc, with a compiled array of patterns. Matches are
-    returned as a list of (id, start, end) tuples.
+    returned as a list of (id, start, end) tuples or, if with_alignments != 0, of (id, start, end, alignments) tuples.
 
     To augment the compiled patterns, we optionally also take two Python lists.
@@ -302,6 +333,8 @@ cdef find_matches(TokenPatternC** patterns, int n, object doclike, int length, e
     """
     cdef vector[PatternStateC] states
     cdef vector[MatchC] matches
+    cdef vector[vector[MatchAlignmentC]] align_states
+    cdef vector[vector[MatchAlignmentC]] align_matches
     cdef PatternStateC state
     cdef int i, j, nr_extra_attr
     cdef Pool mem = Pool()
@@ -328,12 +361,14 @@ cdef find_matches(TokenPatternC** patterns, int n, object doclike, int length, e
     for i in range(length):
         for j in range(n):
            states.push_back(PatternStateC(patterns[j], i, 0))
-        transition_states(states, matches, predicate_cache,
-            doclike[i], extra_attr_values, predicates)
+        if with_alignments != 0:
+            align_states.resize(states.size())
+        transition_states(states, matches, align_states, align_matches, predicate_cache,
+            doclike[i], extra_attr_values, predicates, with_alignments)
         extra_attr_values += nr_extra_attr
         predicate_cache += len(predicates)
     # Handle matches that end in 0-width patterns
-    finish_states(matches, states)
+    finish_states(matches, states, align_matches, align_states, with_alignments)
     seen = set()
     for i in range(matches.size()):
         match = (
@@ -346,16 +381,22 @@ cdef find_matches(TokenPatternC** patterns, int n, object doclike, int length, e
         # first .?, or the second .? -- it doesn't matter, it's just one match.
         # Skip 0-length matches. (TODO: fix algorithm)
         if match not in seen and matches[i].length > 0:
-            output.append(match)
+            if with_alignments != 0:
+                # align_matches has the same length as matches, so index 'i' can be shared
+                output.append(match + (align_matches[i],))
+            else:
+                output.append(match)
             seen.add(match)
     return output
 
 
 cdef void transition_states(vector[PatternStateC]& states, vector[MatchC]& matches,
+                            vector[vector[MatchAlignmentC]]& align_states, vector[vector[MatchAlignmentC]]& align_matches,
                             int8_t* cached_py_predicates,
-        Token token, const attr_t* extra_attrs, py_predicates) except *:
+        Token token, const attr_t* extra_attrs, py_predicates, bint with_alignments) except *:
     cdef int q = 0
     cdef vector[PatternStateC] new_states
+    cdef vector[vector[MatchAlignmentC]] align_new_states
     cdef int nr_predicate = len(py_predicates)
     for i in range(states.size()):
         if states[i].pattern.nr_py >= 1:
@@ -370,23 +411,39 @@ cdef void transition_states(vector[PatternStateC]& states, vector[MatchC]& match
             # it in the states list, because q doesn't advance.
             state = states[i]
             states[q] = state
+            # Kept separate from `states` so users who don't need alignments
+            # pay no cost; `align_states` always corresponds to `states` 1:1.
+            if with_alignments != 0:
+                align_state = align_states[i]
+                align_states[q] = align_state
         while action in (RETRY, RETRY_ADVANCE, RETRY_EXTEND):
+            # Update the alignment before this state transitions.
+            # 'MatchAlignmentC' pairs a pattern token's original index with the current match length.
+            if with_alignments != 0:
+                align_states[q].push_back(MatchAlignmentC(states[q].pattern.token_idx, states[q].length))
             if action == RETRY_EXTEND:
                 # This handles the 'extend'
                 new_states.push_back(
                     PatternStateC(pattern=states[q].pattern, start=state.start,
                                   length=state.length+1))
+                if with_alignments != 0:
+                    align_new_states.push_back(align_states[q])
             if action == RETRY_ADVANCE:
                 # This handles the 'advance'
                 new_states.push_back(
                     PatternStateC(pattern=states[q].pattern+1, start=state.start,
                                   length=state.length+1))
+                if with_alignments != 0:
+                    align_new_states.push_back(align_states[q])
             states[q].pattern += 1
             if states[q].pattern.nr_py != 0:
                 update_predicate_cache(
                     cached_py_predicates, states[q].pattern, token, py_predicates)
             action = get_action(states[q], token.c, extra_attrs,
                                 cached_py_predicates)
+        # Update the alignment before this state transitions.
+        if with_alignments != 0:
+            align_states[q].push_back(MatchAlignmentC(states[q].pattern.token_idx, states[q].length))
         if action == REJECT:
             pass
         elif action == ADVANCE:
@@ -399,29 +456,50 @@ cdef void transition_states(vector[PatternStateC]& states, vector[MatchC]& match
             matches.push_back(
                 MatchC(pattern_id=ent_id, start=state.start,
                        length=state.length+1))
+            # `align_matches` always corresponds to `matches` 1:1
+            if with_alignments != 0:
+                align_matches.push_back(align_states[q])
         elif action == MATCH_DOUBLE:
             # push match without last token if length > 0
             if state.length > 0:
                 matches.push_back(
                     MatchC(pattern_id=ent_id, start=state.start,
                            length=state.length))
+                # MATCH_DOUBLE emits the match twice, so push the alignment
+                # one more time as well to keep the 1:1 relationship
+                if with_alignments != 0:
+                    align_matches.push_back(align_states[q])
             # push match with last token
             matches.push_back(
                 MatchC(pattern_id=ent_id, start=state.start,
                        length=state.length+1))
+            # `align_matches` always corresponds to `matches` 1:1
+            if with_alignments != 0:
+                align_matches.push_back(align_states[q])
         elif action == MATCH_REJECT:
             matches.push_back(
                 MatchC(pattern_id=ent_id, start=state.start,
                        length=state.length))
+            # `align_matches` always corresponds to `matches` 1:1
+            if with_alignments != 0:
+                align_matches.push_back(align_states[q])
         elif action == MATCH_EXTEND:
             matches.push_back(
                 MatchC(pattern_id=ent_id, start=state.start,
                        length=state.length))
+            # `align_matches` always corresponds to `matches` 1:1
+            if with_alignments != 0:
+                align_matches.push_back(align_states[q])
             states[q].length += 1
             q += 1
     states.resize(q)
     for i in range(new_states.size()):
         states.push_back(new_states[i])
+    # `align_states` always corresponds to `states` 1:1
+    if with_alignments != 0:
+        align_states.resize(q)
+        for i in range(align_new_states.size()):
+            align_states.push_back(align_new_states[i])
@@ -444,15 +522,27 @@ cdef int update_predicate_cache(int8_t* cache,
                 raise ValueError(Errors.E125.format(value=result))
 
 
-cdef void finish_states(vector[MatchC]& matches, vector[PatternStateC]& states) except *:
+cdef void finish_states(vector[MatchC]& matches, vector[PatternStateC]& states,
+                        vector[vector[MatchAlignmentC]]& align_matches,
+                        vector[vector[MatchAlignmentC]]& align_states,
+                        bint with_alignments) except *:
     """Handle states that end in zero-width patterns."""
     cdef PatternStateC state
+    cdef vector[MatchAlignmentC] align_state
     for i in range(states.size()):
         state = states[i]
+        if with_alignments != 0:
+            align_state = align_states[i]
         while get_quantifier(state) in (ZERO_PLUS, ZERO_ONE):
+            # Update the alignment before this state transitions.
+            if with_alignments != 0:
+                align_state.push_back(MatchAlignmentC(state.pattern.token_idx, state.length))
             is_final = get_is_final(state)
             if is_final:
                 ent_id = get_ent_id(state.pattern)
+                # `align_matches` always corresponds to `matches` 1:1
+                if with_alignments != 0:
+                    align_matches.push_back(align_state)
                 matches.push_back(
                     MatchC(pattern_id=ent_id, start=state.start, length=state.length))
                 break
@@ -607,7 +697,7 @@ cdef int8_t get_quantifier(PatternStateC state) nogil:
 cdef TokenPatternC* init_pattern(Pool mem, attr_t entity_id, object token_specs) except NULL:
     pattern = <TokenPatternC*>mem.alloc(len(token_specs) + 1, sizeof(TokenPatternC))
     cdef int i, index
-    for i, (quantifier, spec, extensions, predicates) in enumerate(token_specs):
+    for i, (quantifier, spec, extensions, predicates, token_idx) in enumerate(token_specs):
         pattern[i].quantifier = quantifier
         # Ensure attrs refers to a null pointer if nr_attr == 0
         if len(spec) > 0:
@@ -628,6 +718,7 @@ cdef TokenPatternC* init_pattern(Pool mem, attr_t entity_id, object token_specs)
             pattern[i].py_predicates[j] = index
         pattern[i].nr_py = len(predicates)
         pattern[i].key = hash64(pattern[i].attrs, pattern[i].nr_attr * sizeof(AttrValueC), 0)
+        pattern[i].token_idx = token_idx
     i = len(token_specs)
     # Use quantifier to identify final ID pattern node (rather than previous
     # uninitialized quantifier == 0/ZERO + nr_attr == 0 + non-zero-length attrs)
@@ -638,6 +729,7 @@ cdef TokenPatternC* init_pattern(Pool mem, attr_t entity_id, object token_specs)
     pattern[i].nr_attr = 1
     pattern[i].nr_extra_attr = 0
     pattern[i].nr_py = 0
+    pattern[i].token_idx = -1
     return pattern
@@ -655,7 +747,7 @@ def _preprocess_pattern(token_specs, vocab, extensions_table, extra_predicates):
     """This function interprets the pattern, converting the various bits of
     syntactic sugar before we compile it into a struct with init_pattern.
 
-    We need to split the pattern up into three parts:
+    We need to split the pattern up into four parts:
     * Normal attribute/value pairs, which are stored on either the token or
       lexeme, can be handled directly.
     * Extension attributes are handled specially, as we need to prefetch the
@@ -664,13 +756,14 @@ def _preprocess_pattern(token_specs, vocab, extensions_table, extra_predicates):
       functions and store them. So we store these specially as well.
     * Extension attributes that have extra predicates are stored within the
       extra_predicates.
+    * The token index in the original pattern that each token spec belongs to.
""" tokens = [] string_store = vocab.strings - for spec in token_specs: + for token_idx, spec in enumerate(token_specs): if not spec: # Signifier for 'any token' - tokens.append((ONE, [(NULL_ATTR, 0)], [], [])) + tokens.append((ONE, [(NULL_ATTR, 0)], [], [], token_idx)) continue if not isinstance(spec, dict): raise ValueError(Errors.E154.format()) @@ -679,7 +772,7 @@ def _preprocess_pattern(token_specs, vocab, extensions_table, extra_predicates): extensions = _get_extensions(spec, string_store, extensions_table) predicates = _get_extra_predicates(spec, extra_predicates, vocab) for op in ops: - tokens.append((op, list(attr_values), list(extensions), list(predicates))) + tokens.append((op, list(attr_values), list(extensions), list(predicates), token_idx)) return tokens diff --git a/spacy/tests/matcher/test_matcher_logic.py b/spacy/tests/matcher/test_matcher_logic.py index 5f4c2991a97..9f575fe053a 100644 --- a/spacy/tests/matcher/test_matcher_logic.py +++ b/spacy/tests/matcher/test_matcher_logic.py @@ -204,3 +204,90 @@ def test_matcher_remove(): # removing again should throw an error with pytest.raises(ValueError): matcher.remove("Rule") + + +def test_matcher_with_alignments_greedy_longest(en_vocab): + cases = [ + ("aaab", "a* b", [0, 0, 0, 1]), + ("baab", "b a* b", [0, 1, 1, 2]), + ("aaab", "a a a b", [0, 1, 2, 3]), + ("aaab", "a+ b", [0, 0, 0, 1]), + ("aaba", "a+ b a+", [0, 0, 1, 2]), + ("aabaa", "a+ b a+", [0, 0, 1, 2, 2]), + ("aaba", "a+ b a*", [0, 0, 1, 2]), + ("aaaa", "a*", [0, 0, 0, 0]), + ("baab", "b a* b b*", [0, 1, 1, 2]), + ("aabb", "a* b* a*", [0, 0, 1, 1]), + ("aaab", "a+ a+ a b", [0, 1, 2, 3]), + ("aaab", "a+ a+ a+ b", [0, 1, 2, 3]), + ("aaab", "a+ a a b", [0, 1, 2, 3]), + ("aaab", "a+ a a", [0, 1, 2]), + ("aaab", "a+ a a?", [0, 1, 2]), + ("aaaa", "a a a a a?", [0, 1, 2, 3]), + ("aaab", "a+ a b", [0, 0, 1, 2]), + ("aaab", "a+ a+ b", [0, 0, 1, 2]), + ] + for string, pattern_str, result in cases: + matcher = Matcher(en_vocab) + doc = Doc(matcher.vocab, words=list(string)) + pattern = [] + for part in pattern_str.split(): + if part.endswith("+"): + pattern.append({"ORTH": part[0], "OP": "+"}) + elif part.endswith("*"): + pattern.append({"ORTH": part[0], "OP": "*"}) + elif part.endswith("?"): + pattern.append({"ORTH": part[0], "OP": "?"}) + else: + pattern.append({"ORTH": part}) + matcher.add("PATTERN", [pattern], greedy="LONGEST") + matches = matcher(doc, with_alignments=True) + n_matches = len(matches) + + _, s, e, expected = matches[0] + + assert expected == result, (string, pattern_str, s, e, n_matches) + + +def test_matcher_with_alignments_nongreedy(en_vocab): + cases = [ + (0, "aaab", "a* b", [[0, 1], [0, 0, 1], [0, 0, 0, 1], [1]]), + (1, "baab", "b a* b", [[0, 1, 1, 2]]), + (2, "aaab", "a a a b", [[0, 1, 2, 3]]), + (3, "aaab", "a+ b", [[0, 1], [0, 0, 1], [0, 0, 0, 1]]), + (4, "aaba", "a+ b a+", [[0, 1, 2], [0, 0, 1, 2]]), + (5, "aabaa", "a+ b a+", [[0, 1, 2], [0, 0, 1, 2], [0, 0, 1, 2, 2], [0, 1, 2, 2] ]), + (6, "aaba", "a+ b a*", [[0, 1], [0, 0, 1], [0, 0, 1, 2], [0, 1, 2]]), + (7, "aaaa", "a*", [[0], [0, 0], [0, 0, 0], [0, 0, 0, 0]]), + (8, "baab", "b a* b b*", [[0, 1, 1, 2]]), + (9, "aabb", "a* b* a*", [[1], [2], [2, 2], [0, 1], [0, 0, 1], [0, 0, 1, 1], [0, 1, 1], [1, 1]]), + (10, "aaab", "a+ a+ a b", [[0, 1, 2, 3]]), + (11, "aaab", "a+ a+ a+ b", [[0, 1, 2, 3]]), + (12, "aaab", "a+ a a b", [[0, 1, 2, 3]]), + (13, "aaab", "a+ a a", [[0, 1, 2]]), + (14, "aaab", "a+ a a?", [[0, 1], [0, 1, 2]]), + (15, "aaaa", "a a a a a?", [[0, 1, 2, 3]]), + (16, "aaab", "a+ a b", [[0, 1, 2], 
+        (17, "aaab", "a+ a+ b", [[0, 1, 2], [0, 0, 1, 2]]),
+    ]
+    for case_id, string, pattern_str, results in cases:
+        matcher = Matcher(en_vocab)
+        doc = Doc(matcher.vocab, words=list(string))
+        pattern = []
+        for part in pattern_str.split():
+            if part.endswith("+"):
+                pattern.append({"ORTH": part[0], "OP": "+"})
+            elif part.endswith("*"):
+                pattern.append({"ORTH": part[0], "OP": "*"})
+            elif part.endswith("?"):
+                pattern.append({"ORTH": part[0], "OP": "?"})
+            else:
+                pattern.append({"ORTH": part})
+
+        matcher.add("PATTERN", [pattern])
+        matches = matcher(doc, with_alignments=True)
+        n_matches = len(matches)
+
+        for _, s, e, expected in matches:
+            assert expected in results, (case_id, string, pattern_str, s, e, n_matches)
+            assert len(expected) == e - s
diff --git a/website/docs/api/matcher.md b/website/docs/api/matcher.md
index 95a76586af7..c15ee7a47ef 100644
--- a/website/docs/api/matcher.md
+++ b/website/docs/api/matcher.md
@@ -120,13 +120,14 @@ Find all token sequences matching the supplied patterns on the `Doc` or `Span`.
 > matches = matcher(doc)
 > ```
 
-| Name                                       | Description |
-| ------------------------------------------ | ----------- |
-| `doclike`                                  | The `Doc` or `Span` to match over. ~~Union[Doc, Span]~~ |
-| _keyword-only_                             | |
-| `as_spans` <Tag variant="new">3</Tag>      | Instead of tuples, return a list of [`Span`](/api/span) objects of the matches, with the `match_id` assigned as the span label. Defaults to `False`. ~~bool~~ |
-| `allow_missing` <Tag variant="new">3</Tag> | Whether to skip checks for missing annotation for attributes included in patterns. Defaults to `False`. ~~bool~~ |
-| **RETURNS**                                | A list of `(match_id, start, end)` tuples, describing the matches. A match tuple describes a span `doc[start:end`]. The `match_id` is the ID of the added match pattern. If `as_spans` is set to `True`, a list of `Span` objects is returned instead. ~~Union[List[Tuple[int, int, int]], List[Span]]~~ |
+| Name                                            | Description |
+| ----------------------------------------------- | ----------- |
+| `doclike`                                       | The `Doc` or `Span` to match over. ~~Union[Doc, Span]~~ |
+| _keyword-only_                                  | |
+| `as_spans` <Tag variant="new">3</Tag>           | Instead of tuples, return a list of [`Span`](/api/span) objects of the matches, with the `match_id` assigned as the span label. Defaults to `False`. ~~bool~~ |
+| `allow_missing` <Tag variant="new">3</Tag>      | Whether to skip checks for missing annotation for attributes included in patterns. Defaults to `False`. ~~bool~~ |
+| `with_alignments` <Tag variant="new">3.1</Tag>  | Return match alignment information as part of the match tuple as `List[int]` with the same length as the matched span. Each entry denotes the index of the corresponding token pattern. If `as_spans` is set to `True`, this setting is ignored. Defaults to `False`. ~~bool~~ |
+| **RETURNS**                                     | A list of `(match_id, start, end)` tuples, describing the matches. A match tuple describes a span `doc[start:end]`. The `match_id` is the ID of the added match pattern. If `as_spans` is set to `True`, a list of `Span` objects is returned instead. ~~Union[List[Tuple[int, int, int]], List[Span]]~~ |
 
 ## Matcher.\_\_len\_\_ {#len tag="method" new="2"}
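
---

Usage sketch (editor's addition, not part of the patch): a minimal example of the new option, assuming spaCy >= 3.1 with this change applied. The doc, pattern, and expected alignment mirror the `("aaab", "a+ b", [0, 0, 0, 1])` case from the new `test_matcher_with_alignments_greedy_longest` test.

```python
import spacy
from spacy.matcher import Matcher
from spacy.tokens import Doc

nlp = spacy.blank("en")
# Four single-character tokens: "a", "a", "a", "b"
doc = Doc(nlp.vocab, words=list("aaab"))

matcher = Matcher(nlp.vocab)
# Pattern token 0 matches "a" one or more times; pattern token 1 matches "b"
matcher.add("PATTERN", [[{"ORTH": "a", "OP": "+"}, {"ORTH": "b"}]], greedy="LONGEST")

# With with_alignments=True, each match tuple gains a fourth element:
# one pattern-token index per matched doc token
for match_id, start, end, alignments in matcher(doc, with_alignments=True):
    print(doc[start:end].text, alignments)
# Per the test case above, this prints: a a a b [0, 0, 0, 1]
```

Reading the alignment list: the first three matched tokens were consumed by pattern token 0 (the `a+` piece) and the final token by pattern token 1 (`b`).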