multi token patterns #19

Closed
koaning opened this issue Oct 7, 2022 · 12 comments
Labels: documentation, enhancement

koaning (Contributor) commented Oct 7, 2022

I might be working on a tutorial on this project, so I figured I'd double-check explicitly: are multi-token phrases supported? My impression is that they're not, and that's totally fine, but I just wanted to make sure.

This example:

import spacy
from spacy import displacy
import concise_concepts

data = {
    "fruit": ["apple", "pear", "orange"],
    "vegetable": ["broccoli", "spinach", "tomato"],
    "meat": ["beef", "pork", "fish", "lamb"],
    "utensil": ["large oven", "warm stove", "big knife"]
}

text = """
    Heat the oil in a large pan and add the Onion, celery and carrots.
    Then, cook over a medium–low heat for 10 minutes, or until softened.
    Add the courgette, garlic, red peppers and oregano and cook for 2–3 minutes.
    Later, add some oranges and chickens. """

nlp = spacy.load("en_core_web_lg", disable=["ner"])

# ent_score for entity confidence scoring
nlp.add_pipe("concise_concepts", config={"data": data, "ent_score": True})
doc = nlp(text)

options = {"colors": {"fruit": "darkorange", "vegetable": "limegreen", "meat": "salmon", "utensil": "gray"},
           "ents": ["fruit", "vegetable", "meat", "utensil"]}

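# Append each entity's confidence score to its label so displaCy shows it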
ents = doc.ents
for ent in ents:
    new_label = f"{ent.label_} ({float(ent._.ent_score):.0%})"
    options["colors"][new_label] = options["colors"].get(ent.label_.lower(), None)
    options["ents"].append(new_label)
    ent.label_ = new_label
doc.ents = ents

displacy.render(doc, style="ent", options=options)

Yields this error:

word ´large oven´ from key ´utensil´ not present in vector model
word ´warm stove´ from key ´utensil´ not present in vector model
word ´big knife´ from key ´utensil´ not present in vector model
---------------------------------------------------------------------------
AssertionError                            Traceback (most recent call last)
Input In [4], in <cell line: 21>()
     18 nlp = spacy.load("en_core_web_lg", disable=["ner"])
     20 # ent_score for entity confidence scoring
---> 21 nlp.add_pipe("concise_concepts", config={"data": data, "ent_score": True})
     22 doc = nlp(text)
     24 options = {"colors": {"fruit": "darkorange", "vegetable": "limegreen", "meat": "salmon", "utensil": "gray"},
     25            "ents": ["fruit", "vegetable", "meat", "utensil"]}

File ~/Development/prodigy-demos/venv/lib/python3.8/site-packages/spacy/language.py:795, in Language.add_pipe(self, factory_name, name, before, after, first, last, source, config, raw_config, validate)
    787     if not self.has_factory(factory_name):
    788         err = Errors.E002.format(
    789             name=factory_name,
    790             opts=", ".join(self.factory_names),
   (...)
    793             lang_code=self.lang,
    794         )
--> 795     pipe_component = self.create_pipe(
    796         factory_name,
    797         name=name,
    798         config=config,
    799         raw_config=raw_config,
    800         validate=validate,
    801     )
    802 pipe_index = self._get_pipe_index(before, after, first, last)
    803 self._pipe_meta[name] = self.get_factory_meta(factory_name)

File ~/Development/prodigy-demos/venv/lib/python3.8/site-packages/spacy/language.py:674, in Language.create_pipe(self, factory_name, name, config, raw_config, validate)
    671 cfg = {factory_name: config}
    672 # We're calling the internal _fill here to avoid constructing the
    673 # registered functions twice
--> 674 resolved = registry.resolve(cfg, validate=validate)
    675 filled = registry.fill({"cfg": cfg[factory_name]}, validate=validate)["cfg"]
    676 filled = Config(filled)

File ~/Development/prodigy-demos/venv/lib/python3.8/site-packages/thinc/config.py:747, in registry.resolve(cls, config, schema, overrides, validate)
    738 @classmethod
    739 def resolve(
    740     cls,
   (...)
    745     validate: bool = True,
    746 ) -> Dict[str, Any]:
--> 747     resolved, _ = cls._make(
    748         config, schema=schema, overrides=overrides, validate=validate, resolve=True
    749     )
    750     return resolved

File ~/Development/prodigy-demos/venv/lib/python3.8/site-packages/thinc/config.py:796, in registry._make(cls, config, schema, overrides, resolve, validate)
    794 if not is_interpolated:
    795     config = Config(orig_config).interpolate()
--> 796 filled, _, resolved = cls._fill(
    797     config, schema, validate=validate, overrides=overrides, resolve=resolve
    798 )
    799 filled = Config(filled, section_order=section_order)
    800 # Check that overrides didn't include invalid properties not in config

File ~/Development/prodigy-demos/venv/lib/python3.8/site-packages/thinc/config.py:868, in registry._fill(cls, config, schema, validate, resolve, parent, overrides)
    865     getter = cls.get(reg_name, func_name)
    866     # We don't want to try/except this and raise our own error
    867     # here, because we want the traceback if the function fails.
--> 868     getter_result = getter(*args, **kwargs)
    869 else:
    870     # We're not resolving and calling the function, so replace
    871     # the getter_result with a Promise class
    872     getter_result = Promise(
    873         registry=reg_name, name=func_name, args=args, kwargs=kwargs
    874     )

File ~/Development/prodigy-demos/venv/lib/python3.8/site-packages/concise_concepts/__init__.py:47, in make_concise_concepts(nlp, name, data, topn, model_path, word_delimiter, ent_score, exclude_pos, exclude_dep, include_compound_words, case_sensitive)
      9 @Language.factory(
     10     "concise_concepts",
     11     default_config={
   (...)
     45     case_sensitive: bool,
     46 ):
---> 47     return Conceptualizer(
     48         nlp=nlp,
     49         name=name,
     50         data=data,
     51         topn=topn,
     52         model_path=model_path,
     53         word_delimiter=word_delimiter,
     54         ent_score=ent_score,
     55         exclude_pos=exclude_pos,
     56         exclude_dep=exclude_dep,
     57         include_compound_words=include_compound_words,
     58         case_sensitive=case_sensitive,
     59     )

File ~/Development/prodigy-demos/venv/lib/python3.8/site-packages/concise_concepts/conceptualizer/Conceptualizer.py:95, in Conceptualizer.__init__(self, nlp, name, data, topn, model_path, word_delimiter, ent_score, exclude_pos, exclude_dep, include_compound_words, case_sensitive)
     93 else:
     94     self.match_key = "LEMMA"
---> 95 self.run()
     96 self.data_upper = {k.upper(): v for k, v in data.items()}

File ~/Development/prodigy-demos/venv/lib/python3.8/site-packages/concise_concepts/conceptualizer/Conceptualizer.py:101, in Conceptualizer.run(self)
     99 self.determine_topn()
    100 self.set_gensim_model()
--> 101 self.verify_data()
    102 self.expand_concepts()
    103 self.verify_data(verbose=False)

File ~/Development/prodigy-demos/venv/lib/python3.8/site-packages/concise_concepts/conceptualizer/Conceptualizer.py:193, in Conceptualizer.verify_data(self, verbose)
    188                 logger.warning(
    189                     f"word ´{word}´ from key ´{key}´ not present in vector"
    190                     " model"
    191                 )
    192     verified_data[key] = verified_values
--> 193     assert len(
    194         verified_values
    195     ), f"None of the entries for key {key} are present in the vector model"
    196 self.data = deepcopy(verified_data)
    197 self.original_data = deepcopy(self.data)

AssertionError: None of the entries for key utensil are present in the vector model

koaning (Contributor, Author) commented Oct 7, 2022

I'm also thinking of a postprocessing trick now: if a token is detected as an entity but is part of a noun chunk, we could also attempt to highlight the entire noun chunk (see the sketch below).

This would be for a separate tutorial, but I'm curious what you think of the idea.
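
For concreteness, a rough sketch of that trick; expand_to_noun_chunks is a hypothetical helper (not part of concise-concepts) that widens each detected entity to the noun chunk containing it:

from spacy.tokens import Span

def expand_to_noun_chunks(doc):
    # If an entity falls inside a noun chunk, replace it with a span
    # covering the whole chunk, keeping the original label.
    expanded = []
    for ent in doc.ents:
        span = ent
        for chunk in doc.noun_chunks:
            if chunk.start <= ent.start and ent.end <= chunk.end:
                span = Span(doc, chunk.start, chunk.end, label=ent.label_)
                break
        expanded.append(span)
    doc.ents = expanded  # raises ValueError if widened spans overlap
    return doc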

davidberenstein1957 (Owner) commented

They are, but the n-grams do need to be present in the embedding model. If they aren't, the algorithm doesn't have any input to expand over.

I can see two solutions (both sketched below):

  • I could average the embeddings of the individual tokens in the n-gram. For "big knife" this would likely give far less accurate results.
  • I could include only the token of the n-gram that is most likely to align with the label. For "big knife" this would mean "knife" aligns with "utensil", so the embedding of "knife" would be used as the fallback.
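
For illustration, a minimal sketch of both options on top of a gensim KeyedVectors model; ngram_vector and best_aligned_token are hypothetical helpers, not part of this package:

import numpy as np

def ngram_vector(kv, phrase):
    # Option 1: fall back to the mean of the individual token vectors.
    if phrase in kv:
        return kv[phrase]
    tokens = [t for t in phrase.split() if t in kv]
    if not tokens:
        raise KeyError(f"no token of {phrase!r} is in the vector model")
    return np.mean([kv[t] for t in tokens], axis=0)

def best_aligned_token(kv, phrase, label):
    # Option 2: keep only the token most similar to the label,
    # e.g. "knife" for phrase "big knife" and label "utensil".
    tokens = [t for t in phrase.split() if t in kv]
    return max(tokens, key=lambda t: kv.similarity(t, label))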

davidberenstein1957 (Owner) commented Oct 7, 2022

Additionally, there is an include_compound_words flag, which should allow the model to detect "big knife" based on only having an initial similarity result for "knife".

This is also one of the features that isn't properly covered in the documentation.

Besides that, exclude_pos and exclude_dep aren't documented either.
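
Based on that description, enabling the flag in the example at the top of this issue would presumably look like this:

nlp.add_pipe(
    "concise_concepts",
    config={"data": data, "ent_score": True, "include_compound_words": True},
)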

davidberenstein1957 (Owner) commented

I generally like to compose the behaviour of the patterns with the help of your rule-based Matcher Explorer: https://demos.explosion.ai/matcher.
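
For comparison, the same kind of multi-token phrase expressed as a plain spaCy Matcher rule, which is the sort of pattern the explorer lets you prototype:

import spacy
from spacy.matcher import Matcher

nlp = spacy.load("en_core_web_sm")
matcher = Matcher(nlp.vocab)
# One token pattern per token in the phrase.
matcher.add("UTENSIL", [[{"LOWER": "big"}, {"LOWER": "knife"}]])

doc = nlp("Grab the big knife from the drawer.")
for match_id, start, end in matcher(doc):
    print(doc[start:end].text)  # "big knife"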

koaning (Contributor, Author) commented Oct 7, 2022

Yeah, averaging the embeddings of inputs seems like it'll result in a bad time.

But it was indeed probably the include_compound_words feature that was missing from my initial trial.

There is also a third option, one that (hopefully) will get announced next week on our YouTube channel.

@davidberenstein1957 davidberenstein1957 added the documentation Improvements or additions to documentation label Oct 7, 2022
@davidberenstein1957 davidberenstein1957 self-assigned this Oct 7, 2022
@davidberenstein1957 davidberenstein1957 added the enhancement New feature or request label Oct 7, 2022
davidberenstein1957 (Owner) commented

Now you've got me curious about the third option.

But cool that you are working on a tutorial. Let me know if there are any hiccups or features you might think of.

davidberenstein1957 added a commit that referenced this issue Oct 9, 2022
#17 duplicate logging - #19 handling of error within missing tokens in model
davidberenstein1957 (Owner) commented

@koaning I closed this for now. Will review the solution after your blog post.

koaning (Contributor, Author) commented Oct 9, 2022

It will be a two-part thing; the first part will be on YouTube. The thing about the solution, though, is that it is already implemented in another library 😉

davidberenstein1957 (Owner) commented

That library being? 😅 Or are you talking about the doc.noun_chunks part?

koaning (Contributor, Author) commented Oct 11, 2022

davidberenstein1957 (Owner) commented

Cool. I'll do some testing and look into a way to integrate this.

koaning (Contributor, Author) commented Oct 11, 2022

There are likely some other integrations inbound, but yeah, s2v is a great trick.
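
Assuming s2v here refers to explosion's sense2vec package, a minimal standalone query looks roughly like this (the vector path and query key are illustrative):

from sense2vec import Sense2Vec

s2v = Sense2Vec().from_disk("/path/to/s2v_reddit_2015_md")
# Keys pair a (possibly multi-word) phrase with a coarse POS tag.
most_similar = s2v.most_similar("natural_language_processing|NOUN", n=3)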
