From ee159b8543c8d882dfa9dfad5f946269b6ff2a2c Mon Sep 17 00:00:00 2001
From: broaddeep <43122784+broaddeep@users.noreply.github.com>
Date: Thu, 8 Apr 2021 17:10:14 +0900
Subject: [PATCH] Support match alignments (#7321)

* Support match alignments

* change naming from match_alignments to with_alignments, add conditional flow
  if with_alignments is given, validate with_alignments, add related test case

* remove added errors, utilize bint type, cleanup whitespace

* fix no new line in end of file

* Minor formatting

* Skip alignments processing if as_spans is set

* Add with_alignments to Matcher API docs

* Update website/docs/api/matcher.md

Co-authored-by: Sofie Van Landeghem

Co-authored-by: Adriane Boyd
Co-authored-by: Sofie Van Landeghem
---
 .github/contributors/broaddeep.md         | 106 +++++++++++++++++
 spacy/matcher/matcher.pxd                 |   6 +
 spacy/matcher/matcher.pyx                 | 135 ++++++++++++++++++----
 spacy/tests/matcher/test_matcher_logic.py |  87 ++++++++++++++
 website/docs/api/matcher.md               |  15 +--
 5 files changed, 321 insertions(+), 28 deletions(-)
 create mode 100644 .github/contributors/broaddeep.md

diff --git a/.github/contributors/broaddeep.md b/.github/contributors/broaddeep.md
new file mode 100644
index 00000000000..d6c4b3cf303
--- /dev/null
+++ b/.github/contributors/broaddeep.md
@@ -0,0 +1,106 @@
+# spaCy contributor agreement
+
+This spaCy Contributor Agreement (**"SCA"**) is based on the
+[Oracle Contributor Agreement](http://www.oracle.com/technetwork/oca-405177.pdf).
+The SCA applies to any contribution that you make to any product or project
+managed by us (the **"project"**), and sets out the intellectual property rights
+you grant to us in the contributed materials. The term **"us"** shall mean
+[ExplosionAI GmbH](https://explosion.ai/legal). The term
+**"you"** shall mean the person or entity identified below.
+
+If you agree to be bound by these terms, fill in the information requested
+below and include the filled-in version with your first pull request, under the
+folder [`.github/contributors/`](/.github/contributors/). The name of the file
+should be your GitHub username, with the extension `.md`. For example, the user
+example_user would create the file `.github/contributors/example_user.md`.
+
+Read this agreement carefully before signing. These terms and conditions
+constitute a binding legal agreement.
+
+## Contributor Agreement
+
+1. The term "contribution" or "contributed materials" means any source code,
+object code, patch, tool, sample, graphic, specification, manual,
+documentation, or any other material posted or submitted by you to the project.
+
+2. With respect to any worldwide copyrights, or copyright applications and
+registrations, in your contribution:
+
+    * you hereby assign to us joint ownership, and to the extent that such
+      assignment is or becomes invalid, ineffective or unenforceable, you hereby
+      grant to us a perpetual, irrevocable, non-exclusive, worldwide, no-charge,
+      royalty-free, unrestricted license to exercise all rights under those
+      copyrights. This includes, at our option, the right to sublicense these same
+      rights to third parties through multiple levels of sublicensees or other
+      licensing arrangements;
+
+    * you agree that each of us can do all things in relation to your
+      contribution as if each of us were the sole owners, and if one of us makes
+      a derivative work of your contribution, the one who makes the derivative
+      work (or has it made) will be the sole owner of that derivative work;
+
+    * you agree that you will not assert any moral rights in your contribution
+      against us, our licensees or transferees;
+
+    * you agree that we may register a copyright in your contribution and
+      exercise all ownership rights associated with it; and
+
+    * you agree that neither of us has any duty to consult with, obtain the
+      consent of, pay or render an accounting to the other for any use or
+      distribution of your contribution.
+
+3. With respect to any patents you own, or that you can license without payment
+to any third party, you hereby grant to us a perpetual, irrevocable,
+non-exclusive, worldwide, no-charge, royalty-free license to:
+
+    * make, have made, use, sell, offer to sell, import, and otherwise transfer
+      your contribution in whole or in part, alone or in combination with or
+      included in any product, work or materials arising out of the project to
+      which your contribution was submitted, and
+
+    * at our option, to sublicense these same rights to third parties through
+      multiple levels of sublicensees or other licensing arrangements.
+
+4. Except as set out above, you keep all right, title, and interest in your
+contribution. The rights that you grant to us under these terms are effective
+on the date you first submitted a contribution to us, even if your submission
+took place before the date you sign these terms.
+
+5. You covenant, represent, warrant and agree that:
+
+    * Each contribution that you submit is and shall be an original work of
+      authorship and you can legally grant the rights set out in this SCA;
+
+    * to the best of your knowledge, each contribution will not violate any
+      third party's copyrights, trademarks, patents, or other intellectual
+      property rights; and
+
+    * each contribution shall be in compliance with U.S. export control laws and
+      other applicable export and import laws. You agree to notify us if you
+      become aware of any circumstance which would make any of the foregoing
+      representations inaccurate in any respect. We may publicly disclose your
+      participation in the project, including the fact that you have signed the SCA.
+
+6. This SCA is governed by the laws of the State of California and applicable
+U.S. Federal law. Any choice of law rules will not apply.
+
+7. Please place an “x” on one of the applicable statements below. Please do NOT
+mark both statements:
+
+    * [x] I am signing on behalf of myself as an individual and no other person
+      or entity, including my employer, has or will have rights with respect to my
+      contributions.
+
+    * [ ] I am signing on behalf of my employer or a legal entity and I have the
+      actual authority to contractually bind that entity.
+
+## Contributor Details
+
+| Field                          | Entry                |
+|------------------------------- | -------------------- |
+| Name                           | Dongjun Park         |
+| Company name (if applicable)   |                      |
+| Title or role (if applicable)  |                      |
+| Date                           | 2021-03-06           |
+| GitHub username                | broaddeep            |
+| Website (optional)             |                      |
diff --git a/spacy/matcher/matcher.pxd b/spacy/matcher/matcher.pxd
index 52a30d94cca..455f978cc3e 100644
--- a/spacy/matcher/matcher.pxd
+++ b/spacy/matcher/matcher.pxd
@@ -46,6 +46,12 @@ cdef struct TokenPatternC:
     int32_t nr_py
     quantifier_t quantifier
     hash_t key
+    int32_t token_idx
+
+
+cdef struct MatchAlignmentC:
+    int32_t token_idx
+    int32_t length
 
 
 cdef struct PatternStateC:
diff --git a/spacy/matcher/matcher.pyx b/spacy/matcher/matcher.pyx
index 26dca05eb94..dae12c3f66e 100644
--- a/spacy/matcher/matcher.pyx
+++ b/spacy/matcher/matcher.pyx
@@ -196,7 +196,7 @@ cdef class Matcher:
         else:
             yield doc
 
-    def __call__(self, object doclike, *, as_spans=False, allow_missing=False):
+    def __call__(self, object doclike, *, as_spans=False, allow_missing=False, with_alignments=False):
         """Find all token sequences matching the supplied pattern.
 
         doclike (Doc or Span): The document to match over.
@@ -204,10 +204,16 @@ cdef class Matcher:
             start, end) tuples.
         allow_missing (bool): Whether to skip checks for missing annotation for
             attributes included in patterns. Defaults to False.
+        with_alignments (bool): Return match alignment information, a
+            `List[int]` with the same length as the matched span. Each entry
+            denotes the index of the corresponding token pattern. If as_spans
+            is set to True, this setting is ignored.
         RETURNS (list): A list of `(match_id, start, end)` tuples,
             describing the matches. A match tuple describes a span
             `doc[start:end]`. The `match_id` is an integer. If as_spans is set
             to True, a list of Span objects is returned.
+            If with_alignments is set to True and as_spans is set to False,
+            a list of `(match_id, start, end, alignments)` tuples is returned.
         """
         if isinstance(doclike, Doc):
             doc = doclike
@@ -217,6 +223,9 @@ cdef class Matcher:
             length = doclike.end - doclike.start
         else:
             raise ValueError(Errors.E195.format(good="Doc or Span", got=type(doclike).__name__))
+        # Skip alignments calculations if as_spans is set
+        if as_spans:
+            with_alignments = False
         cdef Pool tmp_pool = Pool()
         if not allow_missing:
             for attr in (TAG, POS, MORPH, LEMMA, DEP):
@@ -232,18 +241,20 @@ cdef class Matcher:
                         error_msg = Errors.E155.format(pipe=pipe, attr=self.vocab.strings.as_string(attr))
                         raise ValueError(error_msg)
         matches = find_matches(&self.patterns[0], self.patterns.size(), doclike, length,
-                               extensions=self._extensions, predicates=self._extra_predicates)
+                               extensions=self._extensions, predicates=self._extra_predicates, with_alignments=with_alignments)
         final_matches = []
         pairs_by_id = {}
-        # For each key, either add all matches, or only the filtered, non-overlapping ones
-        for (key, start, end) in matches:
+        # For each key, either add all matches, or only the filtered,
+        # non-overlapping ones. Each `match` is either (start, end) or
+        # (start, end, alignments), depending on the `with_alignments` option.
+        for key, *match in matches:
             span_filter = self._filter.get(key)
             if span_filter is not None:
                 pairs = pairs_by_id.get(key, [])
-                pairs.append((start,end))
+                pairs.append(match)
                 pairs_by_id[key] = pairs
             else:
-                final_matches.append((key, start, end))
+                final_matches.append((key, *match))
         matched = <char*>tmp_pool.alloc(length, sizeof(char))
         empty = <char*>tmp_pool.alloc(length, sizeof(char))
         for key, pairs in pairs_by_id.items():
@@ -255,14 +266,18 @@ cdef class Matcher:
                 sorted_pairs = sorted(pairs, key=lambda x: (x[1]-x[0], -x[0]), reverse=True) # reverse sort by length
             else:
                 raise ValueError(Errors.E947.format(expected=["FIRST", "LONGEST"], arg=span_filter))
-            for (start, end) in sorted_pairs:
+            for match in sorted_pairs:
+                start, end = match[:2]
                 assert 0 <= start < end  # Defend against segfaults
                 span_len = end-start
                 # If no tokens in the span have matched
                 if memcmp(&matched[start], &empty[start], span_len * sizeof(matched[0])) == 0:
-                    final_matches.append((key, start, end))
+                    final_matches.append((key, *match))
                     # Mark tokens that have matched
                     memset(&matched[start], 1, span_len * sizeof(matched[0]))
+        if with_alignments:
+            final_matches_with_alignments = final_matches
+            final_matches = [(key, start, end) for key, start, end, alignments in final_matches]
         # perform the callbacks on the filtered set of results
         for i, (key, start, end) in enumerate(final_matches):
             on_match = self._callbacks.get(key, None)
@@ -270,6 +285,22 @@ cdef class Matcher:
                 on_match(self, doc, i, final_matches)
         if as_spans:
             return [Span(doc, start, end, label=key) for key, start, end in final_matches]
+        elif with_alignments:
+            # Convert each alignment from List[Dict[str, int]] to List[int]
+            final_matches = []
+            # When multiple alignments are found for the same length,
+            # keep the alignment with the largest token_idx
+            for key, start, end, alignments in final_matches_with_alignments:
+                sorted_alignments = sorted(alignments, key=lambda x: (x['length'], x['token_idx']), reverse=False)
+                alignments = [0] * (end-start)
+                for align in sorted_alignments:
+                    if align['length'] >= end-start:
+                        continue
+                    # Since alignments are sorted by (length, token_idx), entries
+                    # with equal length overwrite smaller token_idx values
+                    alignments[align['length']] = align['token_idx']
+                final_matches.append((key, start, end, alignments))
+            return final_matches
         else:
             return final_matches
@@ -288,9 +319,9 @@ def unpickle_matcher(vocab, patterns, callbacks):
     return matcher
 
 
-cdef find_matches(TokenPatternC** patterns, int n, object doclike, int length, extensions=None, predicates=tuple()):
+cdef find_matches(TokenPatternC** patterns, int n, object doclike, int length, extensions=None, predicates=tuple(), bint with_alignments=0):
     """Find matches in a doc, with a compiled array of patterns. Matches are
-    returned as a list of (id, start, end) tuples.
+    returned as a list of (id, start, end) tuples or, if with_alignments != 0, of (id, start, end, alignments) tuples.
 
     To augment the compiled patterns, we optionally also take two Python lists.
@@ -302,6 +333,8 @@ cdef find_matches(TokenPatternC** patterns, int n, object doclike, int length, e
     """
     cdef vector[PatternStateC] states
     cdef vector[MatchC] matches
+    cdef vector[vector[MatchAlignmentC]] align_states
+    cdef vector[vector[MatchAlignmentC]] align_matches
     cdef PatternStateC state
     cdef int i, j, nr_extra_attr
     cdef Pool mem = Pool()
@@ -328,12 +361,14 @@ cdef find_matches(TokenPatternC** patterns, int n, object doclike, int length, e
     for i in range(length):
         for j in range(n):
            states.push_back(PatternStateC(patterns[j], i, 0))
-        transition_states(states, matches, predicate_cache,
-            doclike[i], extra_attr_values, predicates)
+        if with_alignments != 0:
+            align_states.resize(states.size())
+        transition_states(states, matches, align_states, align_matches, predicate_cache,
+            doclike[i], extra_attr_values, predicates, with_alignments)
         extra_attr_values += nr_extra_attr
         predicate_cache += len(predicates)
     # Handle matches that end in 0-width patterns
-    finish_states(matches, states)
+    finish_states(matches, states, align_matches, align_states, with_alignments)
     seen = set()
     for i in range(matches.size()):
         match = (
@@ -346,16 +381,22 @@ cdef find_matches(TokenPatternC** patterns, int n, object doclike, int length, e
         # first .?, or the second .? -- it doesn't matter, it's just one match.
         # Skip 0-length matches. (TODO: fix algorithm)
         if match not in seen and matches[i].length > 0:
-            output.append(match)
+            if with_alignments != 0:
+                # align_matches has the same length as matches, so index 'i' can be shared
+                output.append(match + (align_matches[i],))
+            else:
+                output.append(match)
             seen.add(match)
     return output
 
 
 cdef void transition_states(vector[PatternStateC]& states, vector[MatchC]& matches,
+                            vector[vector[MatchAlignmentC]]& align_states, vector[vector[MatchAlignmentC]]& align_matches,
                             int8_t* cached_py_predicates,
-        Token token, const attr_t* extra_attrs, py_predicates) except *:
+        Token token, const attr_t* extra_attrs, py_predicates, bint with_alignments) except *:
     cdef int q = 0
     cdef vector[PatternStateC] new_states
+    cdef vector[vector[MatchAlignmentC]] align_new_states
     cdef int nr_predicate = len(py_predicates)
     for i in range(states.size()):
         if states[i].pattern.nr_py >= 1:
@@ -370,23 +411,39 @@ cdef void transition_states(vector[PatternStateC]& states, vector[MatchC]& match
             # it in the states list, because q doesn't advance.
             state = states[i]
             states[q] = state
+            # Kept separate from `states` so users who don't need alignments
+            # pay no cost; `align_states` always corresponds to `states` 1:1.
+            if with_alignments != 0:
+                align_state = align_states[i]
+                align_states[q] = align_state
         while action in (RETRY, RETRY_ADVANCE, RETRY_EXTEND):
+            # Update the alignment before this state transitions.
+            # 'MatchAlignmentC' pairs a pattern token's original index with the current match length.
+            if with_alignments != 0:
+                align_states[q].push_back(MatchAlignmentC(states[q].pattern.token_idx, states[q].length))
             if action == RETRY_EXTEND:
                 # This handles the 'extend'
                 new_states.push_back(
                     PatternStateC(pattern=states[q].pattern, start=state.start,
                                   length=state.length+1))
+                if with_alignments != 0:
+                    align_new_states.push_back(align_states[q])
             if action == RETRY_ADVANCE:
                 # This handles the 'advance'
                 new_states.push_back(
                     PatternStateC(pattern=states[q].pattern+1, start=state.start,
                                   length=state.length+1))
+                if with_alignments != 0:
+                    align_new_states.push_back(align_states[q])
             states[q].pattern += 1
             if states[q].pattern.nr_py != 0:
                 update_predicate_cache(
                     cached_py_predicates, states[q].pattern, token, py_predicates)
             action = get_action(states[q], token.c, extra_attrs,
                                 cached_py_predicates)
+        # Update the alignment before this state transitions.
+        if with_alignments != 0:
+            align_states[q].push_back(MatchAlignmentC(states[q].pattern.token_idx, states[q].length))
         if action == REJECT:
             pass
         elif action == ADVANCE:
@@ -399,29 +456,50 @@ cdef void transition_states(vector[PatternStateC]& states, vector[MatchC]& match
             matches.push_back(
                 MatchC(pattern_id=ent_id, start=state.start,
                        length=state.length+1))
+            # `align_matches` always corresponds to `matches` 1:1
+            if with_alignments != 0:
+                align_matches.push_back(align_states[q])
         elif action == MATCH_DOUBLE:
             # push match without last token if length > 0
             if state.length > 0:
                 matches.push_back(
                     MatchC(pattern_id=ent_id, start=state.start,
                            length=state.length))
+                # MATCH_DOUBLE emits the match twice, so push the alignment
+                # one more time as well to keep the 1:1 relationship
+                if with_alignments != 0:
+                    align_matches.push_back(align_states[q])
             # push match with last token
             matches.push_back(
                 MatchC(pattern_id=ent_id, start=state.start,
                        length=state.length+1))
+            # `align_matches` always corresponds to `matches` 1:1
+            if with_alignments != 0:
+                align_matches.push_back(align_states[q])
         elif action == MATCH_REJECT:
             matches.push_back(
                 MatchC(pattern_id=ent_id, start=state.start,
                        length=state.length))
+            # `align_matches` always corresponds to `matches` 1:1
+            if with_alignments != 0:
+                align_matches.push_back(align_states[q])
         elif action == MATCH_EXTEND:
             matches.push_back(
                 MatchC(pattern_id=ent_id, start=state.start,
                        length=state.length))
+            # `align_matches` always corresponds to `matches` 1:1
+            if with_alignments != 0:
+                align_matches.push_back(align_states[q])
             states[q].length += 1
             q += 1
     states.resize(q)
     for i in range(new_states.size()):
         states.push_back(new_states[i])
+    # `align_states` always corresponds to `states` 1:1
+    if with_alignments != 0:
+        align_states.resize(q)
+        for i in range(align_new_states.size()):
+            align_states.push_back(align_new_states[i])
@@ -444,15 +522,27 @@ cdef int update_predicate_cache(int8_t* cache,
                 raise ValueError(Errors.E125.format(value=result))
 
 
-cdef void finish_states(vector[MatchC]& matches, vector[PatternStateC]& states) except *:
+cdef void finish_states(vector[MatchC]& matches, vector[PatternStateC]& states,
+                        vector[vector[MatchAlignmentC]]& align_matches,
+                        vector[vector[MatchAlignmentC]]& align_states,
+                        bint with_alignments) except *:
     """Handle states that end in zero-width patterns."""
     cdef PatternStateC state
+    cdef vector[MatchAlignmentC] align_state
     for i in range(states.size()):
         state = states[i]
+        if with_alignments != 0:
+            align_state = align_states[i]
         while get_quantifier(state) in (ZERO_PLUS, ZERO_ONE):
+            # Update the alignment before this state transitions.
+            if with_alignments != 0:
+                align_state.push_back(MatchAlignmentC(state.pattern.token_idx, state.length))
             is_final = get_is_final(state)
             if is_final:
                 ent_id = get_ent_id(state.pattern)
+                # `align_matches` always corresponds to `matches` 1:1
+                if with_alignments != 0:
+                    align_matches.push_back(align_state)
                 matches.push_back(
                     MatchC(pattern_id=ent_id, start=state.start, length=state.length))
                 break
@@ -607,7 +697,7 @@ cdef int8_t get_quantifier(PatternStateC state) nogil:
 cdef TokenPatternC* init_pattern(Pool mem, attr_t entity_id, object token_specs) except NULL:
     pattern = <TokenPatternC*>mem.alloc(len(token_specs) + 1, sizeof(TokenPatternC))
     cdef int i, index
-    for i, (quantifier, spec, extensions, predicates) in enumerate(token_specs):
+    for i, (quantifier, spec, extensions, predicates, token_idx) in enumerate(token_specs):
         pattern[i].quantifier = quantifier
         # Ensure attrs refers to a null pointer if nr_attr == 0
         if len(spec) > 0:
@@ -628,6 +718,7 @@ cdef TokenPatternC* init_pattern(Pool mem, attr_t entity_id, object token_specs)
             pattern[i].py_predicates[j] = index
         pattern[i].nr_py = len(predicates)
         pattern[i].key = hash64(pattern[i].attrs, pattern[i].nr_attr * sizeof(AttrValueC), 0)
+        pattern[i].token_idx = token_idx
     i = len(token_specs)
     # Use quantifier to identify final ID pattern node (rather than previous
     # uninitialized quantifier == 0/ZERO + nr_attr == 0 + non-zero-length attrs)
@@ -638,6 +729,7 @@ cdef TokenPatternC* init_pattern(Pool mem, attr_t entity_id, object token_specs)
     pattern[i].nr_attr = 1
     pattern[i].nr_extra_attr = 0
     pattern[i].nr_py = 0
+    pattern[i].token_idx = -1
     return pattern
@@ -655,7 +747,7 @@ def _preprocess_pattern(token_specs, vocab, extensions_table, extra_predicates):
     """This function interprets the pattern, converting the various bits of
     syntactic sugar before we compile it into a struct with init_pattern.
 
-    We need to split the pattern up into three parts:
+    We need to split the pattern up into four parts:
     * Normal attribute/value pairs, which are stored on either the token or
       lexeme, can be handled directly.
     * Extension attributes are handled specially, as we need to prefetch the
@@ -664,13 +756,14 @@ def _preprocess_pattern(token_specs, vocab, extensions_table, extra_predicates):
       functions and store them. So we store these specially as well.
     * Extension attributes that have extra predicates are stored within the
       extra_predicates.
+    * The token index in the original pattern that each token spec belongs to.
""" tokens = [] string_store = vocab.strings - for spec in token_specs: + for token_idx, spec in enumerate(token_specs): if not spec: # Signifier for 'any token' - tokens.append((ONE, [(NULL_ATTR, 0)], [], [])) + tokens.append((ONE, [(NULL_ATTR, 0)], [], [], token_idx)) continue if not isinstance(spec, dict): raise ValueError(Errors.E154.format()) @@ -679,7 +772,7 @@ def _preprocess_pattern(token_specs, vocab, extensions_table, extra_predicates): extensions = _get_extensions(spec, string_store, extensions_table) predicates = _get_extra_predicates(spec, extra_predicates, vocab) for op in ops: - tokens.append((op, list(attr_values), list(extensions), list(predicates))) + tokens.append((op, list(attr_values), list(extensions), list(predicates), token_idx)) return tokens diff --git a/spacy/tests/matcher/test_matcher_logic.py b/spacy/tests/matcher/test_matcher_logic.py index 5f4c2991a97..9f575fe053a 100644 --- a/spacy/tests/matcher/test_matcher_logic.py +++ b/spacy/tests/matcher/test_matcher_logic.py @@ -204,3 +204,90 @@ def test_matcher_remove(): # removing again should throw an error with pytest.raises(ValueError): matcher.remove("Rule") + + +def test_matcher_with_alignments_greedy_longest(en_vocab): + cases = [ + ("aaab", "a* b", [0, 0, 0, 1]), + ("baab", "b a* b", [0, 1, 1, 2]), + ("aaab", "a a a b", [0, 1, 2, 3]), + ("aaab", "a+ b", [0, 0, 0, 1]), + ("aaba", "a+ b a+", [0, 0, 1, 2]), + ("aabaa", "a+ b a+", [0, 0, 1, 2, 2]), + ("aaba", "a+ b a*", [0, 0, 1, 2]), + ("aaaa", "a*", [0, 0, 0, 0]), + ("baab", "b a* b b*", [0, 1, 1, 2]), + ("aabb", "a* b* a*", [0, 0, 1, 1]), + ("aaab", "a+ a+ a b", [0, 1, 2, 3]), + ("aaab", "a+ a+ a+ b", [0, 1, 2, 3]), + ("aaab", "a+ a a b", [0, 1, 2, 3]), + ("aaab", "a+ a a", [0, 1, 2]), + ("aaab", "a+ a a?", [0, 1, 2]), + ("aaaa", "a a a a a?", [0, 1, 2, 3]), + ("aaab", "a+ a b", [0, 0, 1, 2]), + ("aaab", "a+ a+ b", [0, 0, 1, 2]), + ] + for string, pattern_str, result in cases: + matcher = Matcher(en_vocab) + doc = Doc(matcher.vocab, words=list(string)) + pattern = [] + for part in pattern_str.split(): + if part.endswith("+"): + pattern.append({"ORTH": part[0], "OP": "+"}) + elif part.endswith("*"): + pattern.append({"ORTH": part[0], "OP": "*"}) + elif part.endswith("?"): + pattern.append({"ORTH": part[0], "OP": "?"}) + else: + pattern.append({"ORTH": part}) + matcher.add("PATTERN", [pattern], greedy="LONGEST") + matches = matcher(doc, with_alignments=True) + n_matches = len(matches) + + _, s, e, expected = matches[0] + + assert expected == result, (string, pattern_str, s, e, n_matches) + + +def test_matcher_with_alignments_nongreedy(en_vocab): + cases = [ + (0, "aaab", "a* b", [[0, 1], [0, 0, 1], [0, 0, 0, 1], [1]]), + (1, "baab", "b a* b", [[0, 1, 1, 2]]), + (2, "aaab", "a a a b", [[0, 1, 2, 3]]), + (3, "aaab", "a+ b", [[0, 1], [0, 0, 1], [0, 0, 0, 1]]), + (4, "aaba", "a+ b a+", [[0, 1, 2], [0, 0, 1, 2]]), + (5, "aabaa", "a+ b a+", [[0, 1, 2], [0, 0, 1, 2], [0, 0, 1, 2, 2], [0, 1, 2, 2] ]), + (6, "aaba", "a+ b a*", [[0, 1], [0, 0, 1], [0, 0, 1, 2], [0, 1, 2]]), + (7, "aaaa", "a*", [[0], [0, 0], [0, 0, 0], [0, 0, 0, 0]]), + (8, "baab", "b a* b b*", [[0, 1, 1, 2]]), + (9, "aabb", "a* b* a*", [[1], [2], [2, 2], [0, 1], [0, 0, 1], [0, 0, 1, 1], [0, 1, 1], [1, 1]]), + (10, "aaab", "a+ a+ a b", [[0, 1, 2, 3]]), + (11, "aaab", "a+ a+ a+ b", [[0, 1, 2, 3]]), + (12, "aaab", "a+ a a b", [[0, 1, 2, 3]]), + (13, "aaab", "a+ a a", [[0, 1, 2]]), + (14, "aaab", "a+ a a?", [[0, 1], [0, 1, 2]]), + (15, "aaaa", "a a a a a?", [[0, 1, 2, 3]]), + (16, "aaab", "a+ a b", [[0, 1, 2], 
+        (17, "aaab", "a+ a+ b", [[0, 1, 2], [0, 0, 1, 2]]),
+    ]
+    for case_id, string, pattern_str, results in cases:
+        matcher = Matcher(en_vocab)
+        doc = Doc(matcher.vocab, words=list(string))
+        pattern = []
+        for part in pattern_str.split():
+            if part.endswith("+"):
+                pattern.append({"ORTH": part[0], "OP": "+"})
+            elif part.endswith("*"):
+                pattern.append({"ORTH": part[0], "OP": "*"})
+            elif part.endswith("?"):
+                pattern.append({"ORTH": part[0], "OP": "?"})
+            else:
+                pattern.append({"ORTH": part})
+
+        matcher.add("PATTERN", [pattern])
+        matches = matcher(doc, with_alignments=True)
+        n_matches = len(matches)
+
+        for _, s, e, expected in matches:
+            assert expected in results, (case_id, string, pattern_str, s, e, n_matches)
+            assert len(expected) == e - s
diff --git a/website/docs/api/matcher.md b/website/docs/api/matcher.md
index 95a76586af7..c15ee7a47ef 100644
--- a/website/docs/api/matcher.md
+++ b/website/docs/api/matcher.md
@@ -120,13 +120,14 @@ Find all token sequences matching the supplied patterns on the `Doc` or `Span`.
 > matches = matcher(doc)
 > ```
 
-| Name                                       | Description |
-| ------------------------------------------ | ----------- |
-| `doclike`                                  | The `Doc` or `Span` to match over. ~~Union[Doc, Span]~~ |
-| _keyword-only_                             | |
-| `as_spans` <Tag variant="new">3</Tag>      | Instead of tuples, return a list of [`Span`](/api/span) objects of the matches, with the `match_id` assigned as the span label. Defaults to `False`. ~~bool~~ |
-| `allow_missing` <Tag variant="new">3</Tag> | Whether to skip checks for missing annotation for attributes included in patterns. Defaults to `False`. ~~bool~~ |
-| **RETURNS**                                | A list of `(match_id, start, end)` tuples, describing the matches. A match tuple describes a span `doc[start:end`]. The `match_id` is the ID of the added match pattern. If `as_spans` is set to `True`, a list of `Span` objects is returned instead. ~~Union[List[Tuple[int, int, int]], List[Span]]~~ |
+| Name                                            | Description |
+| ----------------------------------------------- | ----------- |
+| `doclike`                                       | The `Doc` or `Span` to match over. ~~Union[Doc, Span]~~ |
+| _keyword-only_                                  | |
+| `as_spans` <Tag variant="new">3</Tag>           | Instead of tuples, return a list of [`Span`](/api/span) objects of the matches, with the `match_id` assigned as the span label. Defaults to `False`. ~~bool~~ |
+| `allow_missing` <Tag variant="new">3</Tag>      | Whether to skip checks for missing annotation for attributes included in patterns. Defaults to `False`. ~~bool~~ |
+| `with_alignments` <Tag variant="new">3.1</Tag>  | Return match alignment information as part of the match tuple as `List[int]` with the same length as the matched span. Each entry denotes the index of the corresponding token pattern. If `as_spans` is set to `True`, this setting is ignored. Defaults to `False`. ~~bool~~ |
+| **RETURNS**                                     | A list of `(match_id, start, end)` tuples, describing the matches. A match tuple describes a span `doc[start:end]`. The `match_id` is the ID of the added match pattern. If `as_spans` is set to `True`, a list of `Span` objects is returned instead. ~~Union[List[Tuple[int, int, int]], List[Span]]~~ |
 
 ## Matcher.\_\_len\_\_ {#len tag="method" new="2"}
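
---

Usage sketch (editor's addition, not part of the patch): a minimal example of the new option, assuming spaCy >= 3.1 with this change applied. The doc, pattern, and expected alignment mirror the `("aaab", "a+ b", [0, 0, 0, 1])` case from the new `test_matcher_with_alignments_greedy_longest` test.

```python
import spacy
from spacy.matcher import Matcher
from spacy.tokens import Doc

nlp = spacy.blank("en")
# Four single-character tokens: "a", "a", "a", "b"
doc = Doc(nlp.vocab, words=list("aaab"))

matcher = Matcher(nlp.vocab)
# Pattern token 0 matches "a" one or more times; pattern token 1 matches "b"
matcher.add("PATTERN", [[{"ORTH": "a", "OP": "+"}, {"ORTH": "b"}]], greedy="LONGEST")

# With with_alignments=True, each match tuple gains a fourth element:
# one pattern-token index per matched doc token
for match_id, start, end, alignments in matcher(doc, with_alignments=True):
    print(doc[start:end].text, alignments)
# Per the test case above, this prints: a a a b [0, 0, 0, 1]
```

Reading the alignment list: the first three matched tokens were consumed by pattern token 0 (the `a+` piece) and the final token by pattern token 1 (`b`).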