multi token patterns #19

Closed
koaning opened this issue Oct 7, 2022 · 12 comments
Labels: documentation, enhancement

koaning (Contributor) commented Oct 7, 2022

I might be working on a tutorial on this project, so I figured I'd double-check explicitly: are multi-token phrases supported? My impression is that they're not, and that's totally fine, but I just wanted to make sure.

This example:

import spacy
from spacy import displacy
import concise_concepts

data = {
    "fruit": ["apple", "pear", "orange"],
    "vegetable": ["broccoli", "spinach", "tomato"],
    "meat": ["beef", "pork", "fish", "lamb"],
    "utensil": ["large oven", "warm stove", "big knife"]
}

text = """
    Heat the oil in a large pan and add the Onion, celery and carrots.
    Then, cook over a medium–low heat for 10 minutes, or until softened.
    Add the courgette, garlic, red peppers and oregano and cook for 2–3 minutes.
    Later, add some oranges and chickens. """

nlp = spacy.load("en_core_web_lg", disable=["ner"])

# ent_score for entity confidence scoring
nlp.add_pipe("concise_concepts", config={"data": data, "ent_score": True})
doc = nlp(text)

options = {"colors": {"fruit": "darkorange", "vegetable": "limegreen", "meat": "salmon", "utensil": "gray"},
           "ents": ["fruit", "vegetable", "meat", "utensil"]}

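# Append each entity's confidence score to its label so displaCy shows it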
ents = doc.ents
for ent in ents:
    new_label = f"{ent.label_} ({float(ent._.ent_score):.0%})"
    options["colors"][new_label] = options["colors"].get(ent.label_.lower(), None)
    options["ents"].append(new_label)
    ent.label_ = new_label
doc.ents = ents

displacy.render(doc, style="ent", options=options)

Yields this error:

word ´large oven´ from key ´utensil´ not present in vector model
word ´warm stove´ from key ´utensil´ not present in vector model
word ´big knife´ from key ´utensil´ not present in vector model
---------------------------------------------------------------------------
AssertionError                            Traceback (most recent call last)
Input In [4], in <cell line: 21>()
     18 nlp = spacy.load("en_core_web_lg", disable=["ner"])
     20 # ent_score for entity confidence scoring
---> 21 nlp.add_pipe("concise_concepts", config={"data": data, "ent_score": True})
     22 doc = nlp(text)
     24 options = {"colors": {"fruit": "darkorange", "vegetable": "limegreen", "meat": "salmon", "utensil": "gray"},
     25            "ents": ["fruit", "vegetable", "meat", "utensil"]}

File ~/Development/prodigy-demos/venv/lib/python3.8/site-packages/spacy/language.py:795, in Language.add_pipe(self, factory_name, name, before, after, first, last, source, config, raw_config, validate)
    787     if not self.has_factory(factory_name):
    788         err = Errors.E002.format(
    789             name=factory_name,
    790             opts=", ".join(self.factory_names),
   (...)
    793             lang_code=self.lang,
    794         )
--> 795     pipe_component = self.create_pipe(
    796         factory_name,
    797         name=name,
    798         config=config,
    799         raw_config=raw_config,
    800         validate=validate,
    801     )
    802 pipe_index = self._get_pipe_index(before, after, first, last)
    803 self._pipe_meta[name] = self.get_factory_meta(factory_name)

File ~/Development/prodigy-demos/venv/lib/python3.8/site-packages/spacy/language.py:674, in Language.create_pipe(self, factory_name, name, config, raw_config, validate)
    671 cfg = {factory_name: config}
    672 # We're calling the internal _fill here to avoid constructing the
    673 # registered functions twice
--> 674 resolved = registry.resolve(cfg, validate=validate)
    675 filled = registry.fill({"cfg": cfg[factory_name]}, validate=validate)["cfg"]
    676 filled = Config(filled)

File ~/Development/prodigy-demos/venv/lib/python3.8/site-packages/thinc/config.py:747, in registry.resolve(cls, config, schema, overrides, validate)
    738 @classmethod
    739 def resolve(
    740     cls,
   (...)
    745     validate: bool = True,
    746 ) -> Dict[str, Any]:
--> 747     resolved, _ = cls._make(
    748         config, schema=schema, overrides=overrides, validate=validate, resolve=True
    749     )
    750     return resolved

File ~/Development/prodigy-demos/venv/lib/python3.8/site-packages/thinc/config.py:796, in registry._make(cls, config, schema, overrides, resolve, validate)
    794 if not is_interpolated:
    795     config = Config(orig_config).interpolate()
--> 796 filled, _, resolved = cls._fill(
    797     config, schema, validate=validate, overrides=overrides, resolve=resolve
    798 )
    799 filled = Config(filled, section_order=section_order)
    800 # Check that overrides didn't include invalid properties not in config

File ~/Development/prodigy-demos/venv/lib/python3.8/site-packages/thinc/config.py:868, in registry._fill(cls, config, schema, validate, resolve, parent, overrides)
    865     getter = cls.get(reg_name, func_name)
    866     # We don't want to try/except this and raise our own error
    867     # here, because we want the traceback if the function fails.
--> 868     getter_result = getter(*args, **kwargs)
    869 else:
    870     # We're not resolving and calling the function, so replace
    871     # the getter_result with a Promise class
    872     getter_result = Promise(
    873         registry=reg_name, name=func_name, args=args, kwargs=kwargs
    874     )

File ~/Development/prodigy-demos/venv/lib/python3.8/site-packages/concise_concepts/__init__.py:47, in make_concise_concepts(nlp, name, data, topn, model_path, word_delimiter, ent_score, exclude_pos, exclude_dep, include_compound_words, case_sensitive)
      9 @Language.factory(
     10     "concise_concepts",
     11     default_config={
   (...)
     45     case_sensitive: bool,
     46 ):
---> 47     return Conceptualizer(
     48         nlp=nlp,
     49         name=name,
     50         data=data,
     51         topn=topn,
     52         model_path=model_path,
     53         word_delimiter=word_delimiter,
     54         ent_score=ent_score,
     55         exclude_pos=exclude_pos,
     56         exclude_dep=exclude_dep,
     57         include_compound_words=include_compound_words,
     58         case_sensitive=case_sensitive,
     59     )

File ~/Development/prodigy-demos/venv/lib/python3.8/site-packages/concise_concepts/conceptualizer/Conceptualizer.py:95, in Conceptualizer.__init__(self, nlp, name, data, topn, model_path, word_delimiter, ent_score, exclude_pos, exclude_dep, include_compound_words, case_sensitive)
     93 else:
     94     self.match_key = "LEMMA"
---> 95 self.run()
     96 self.data_upper = {k.upper(): v for k, v in data.items()}

File ~/Development/prodigy-demos/venv/lib/python3.8/site-packages/concise_concepts/conceptualizer/Conceptualizer.py:101, in Conceptualizer.run(self)
     99 self.determine_topn()
    100 self.set_gensim_model()
--> 101 self.verify_data()
    102 self.expand_concepts()
    103 self.verify_data(verbose=False)

File ~/Development/prodigy-demos/venv/lib/python3.8/site-packages/concise_concepts/conceptualizer/Conceptualizer.py:193, in Conceptualizer.verify_data(self, verbose)
    188                 logger.warning(
    189                     f"word ´{word}´ from key ´{key}´ not present in vector"
    190                     " model"
    191                 )
    192     verified_data[key] = verified_values
--> 193     assert len(
    194         verified_values
    195     ), f"None of the entries for key {key} are present in the vector model"
    196 self.data = deepcopy(verified_data)
    197 self.original_data = deepcopy(self.data)

AssertionError: None of the entries for key utensil are present in the vector model

koaning (Contributor, Author) commented Oct 7, 2022

I'm also thinking of a postprocessing trick now: if a token is detected as an entity but is part of a noun chunk, we could also attempt to highlight the entire noun chunk (see the sketch below).

This would be for a separate tutorial, but I'm curious what you think of the idea.
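
For concreteness, a rough sketch of that trick; expand_to_noun_chunks is a hypothetical helper (not part of concise-concepts) that widens each detected entity to the noun chunk containing it:

from spacy.tokens import Span

def expand_to_noun_chunks(doc):
    # If an entity falls inside a noun chunk, replace it with a span
    # covering the whole chunk, keeping the original label.
    expanded = []
    for ent in doc.ents:
        span = ent
        for chunk in doc.noun_chunks:
            if chunk.start <= ent.start and ent.end <= chunk.end:
                span = Span(doc, chunk.start, chunk.end, label=ent.label_)
                break
        expanded.append(span)
    doc.ents = expanded  # raises ValueError if widened spans overlap
    return doc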

davidberenstein1957 (Owner) commented

They are, but the n-grams do need to be present in the embedding model. If they aren't, the algorithm doesn't have any input to expand over.

I can see two solutions (both sketched below):

  • I could average the embeddings of the individual tokens in the n-gram. For "big knife" this would likely give far less accurate results.
  • I could include only the token of the n-gram that is most likely to align with the label. For "big knife" this would mean "knife" aligns with "utensil", so the embedding of "knife" would be used as the fallback.
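
For illustration, a minimal sketch of both options on top of a gensim KeyedVectors model; ngram_vector and best_aligned_token are hypothetical helpers, not part of this package:

import numpy as np

def ngram_vector(kv, phrase):
    # Option 1: fall back to the mean of the individual token vectors.
    if phrase in kv:
        return kv[phrase]
    tokens = [t for t in phrase.split() if t in kv]
    if not tokens:
        raise KeyError(f"no token of {phrase!r} is in the vector model")
    return np.mean([kv[t] for t in tokens], axis=0)

def best_aligned_token(kv, phrase, label):
    # Option 2: keep only the token most similar to the label,
    # e.g. "knife" for phrase "big knife" and label "utensil".
    tokens = [t for t in phrase.split() if t in kv]
    return max(tokens, key=lambda t: kv.similarity(t, label))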

davidberenstein1957 (Owner) commented Oct 7, 2022

Additionally, there is an include_compound_words flag, which should allow the model to detect "big knife" based on only having an initial similarity result for "knife".

This is also one of the features that isn't properly covered in the documentation.

Besides that, exclude_pos and exclude_dep aren't documented either.
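
Based on that description, enabling the flag in the example at the top of this issue would presumably look like this:

nlp.add_pipe(
    "concise_concepts",
    config={"data": data, "ent_score": True, "include_compound_words": True},
)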

davidberenstein1957 (Owner) commented

I generally like to compose the behaviour of the patterns with the help of your rule-based Matcher Explorer: https://demos.explosion.ai/matcher.
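
For comparison, the same kind of multi-token phrase expressed as a plain spaCy Matcher rule, which is the sort of pattern the explorer lets you prototype:

import spacy
from spacy.matcher import Matcher

nlp = spacy.load("en_core_web_sm")
matcher = Matcher(nlp.vocab)
# One token pattern per token in the phrase.
matcher.add("UTENSIL", [[{"LOWER": "big"}, {"LOWER": "knife"}]])

doc = nlp("Grab the big knife from the drawer.")
for match_id, start, end in matcher(doc):
    print(doc[start:end].text)  # "big knife"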

koaning (Contributor, Author) commented Oct 7, 2022

Yeah, averaging the embeddings of inputs seems like it'll result in a bad time.

But it was indeed probably the include_compound_words feature that was missing from my initial trial.

There is also a third option, one that (hopefully) will get announced next week on our YouTube channel.

@davidberenstein1957 davidberenstein1957 added the documentation Improvements or additions to documentation label Oct 7, 2022
@davidberenstein1957 davidberenstein1957 self-assigned this Oct 7, 2022
@davidberenstein1957 davidberenstein1957 added the enhancement New feature or request label Oct 7, 2022
davidberenstein1957 (Owner) commented

Now you've got me curious about the third option.

But cool that you are working on a tutorial. Let me know if there are any hiccups or features you might think of.

davidberenstein1957 added a commit that referenced this issue Oct 9, 2022
#17 duplicate logging - #19 handling of error within missing tokens in model
davidberenstein1957 (Owner) commented

@koaning I closed this for now. Will review the solution after your blog post.

koaning (Contributor, Author) commented Oct 9, 2022

It will be a two-part thing; the first part will be on YouTube. The thing about the solution, though, is that it is already implemented in another library 😉

davidberenstein1957 (Owner) commented

That library being? 😅 Or are you talking about the doc.noun_chunks part?

koaning (Contributor, Author) commented Oct 11, 2022

davidberenstein1957 (Owner) commented

Cool. I'll do some testing and look into a way to integrate this.

koaning (Contributor, Author) commented Oct 11, 2022

There are likely some other integrations inbound, but yeah, s2v is a great trick.
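
Assuming s2v here refers to explosion's sense2vec package, a minimal standalone query looks roughly like this (the vector path and query key are illustrative):

from sense2vec import Sense2Vec

s2v = Sense2Vec().from_disk("/path/to/s2v_reddit_2015_md")
# Keys pair a (possibly multi-word) phrase with a coarse POS tag.
most_similar = s2v.most_similar("natural_language_processing|NOUN", n=3)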
