How to Named Entity Recognize using Data Programming in Snorkel? #838

Closed
wenfeixiang1991 opened this issue Nov 27, 2017 · 3 comments

@wenfeixiang1991

Hi,

My goal is to extract two entity types (Industry and Company) from raw Chinese text (or sentences), where each entity consists of a few Chinese characters. My modeling strategy is LSTM + CRF, so the key is training data in which every single character is tagged. I would therefore like to obtain labeled training data via data programming, using the candidate extractor and labeling functions that Snorkel provides.

After reading the intro and cdr tutorials and issues #599 and #810, I have some questions about how to do NER with Snorkel:

  1. @ajratner mentioned in "Training data for training a NER model" (#599) that a paper on entity tagging "will be posted very soon...". Where can I get it now?
  2. As @ajratner answered, if I use
    Industry = candidate_subclass('Industry', ['industry'])
    and
    Company = candidate_subclass('Company', ['company']), then do I just get every sentence labeled as Industry/Company or not, in separate notebooks? Is the target then sentence classification? In NER, don't we need to tag every word in the sentence?
  3. @jason-fries suggested in "Tagging sequence markup for entity extraction" (#810) to "treat every word as a candidate and then use categorical variables...". That makes sense given Snorkel's features, but if I do it this way, every single character gets a label. How can I then distinguish between sentences, given that one Chinese character can have different meanings in different sentences? Besides, it does not seem like the right way to do NER with Snorkel.
  4. The official intro and cdr tutorials seem to use spaCy and TaggerOne to recognize person names and medical entities respectively, and then classify the relationship (the relationship type is already known, such as spouse or cure). But how can I do NER and relation classification at the same time in Snorkel, or just NER on its own? Would it be possible to provide an official tutorial?
  5. One last question is about the viewer, which I am interested in. It seems that labeling candidates in the viewer is binary only. But with categorical classes such as
    Relationship = candidate_subclass('Relationship', ['person1', 'person2'], values=['Married', 'Employs', False]),
    can we not label candidates with multiple classes in the viewer? Is there a manual for the viewer?

Thank you very much!

@fsonntag

fsonntag commented Dec 7, 2017

Hey,
I'm not part of the research group, but I'm also using Snorkel for NER, so maybe I can answer some of your questions.
2. I also have two different entity types, but I combined them in one program. You are right, Snorkel does not do classical sequence tagging. Basically, for each candidate, the labels generated by the label functions decide whether it is an entity or not.
3. Not every single character will get a label. You still have to do tokenization first, which for Chinese means word segmentation. From those tokens you create Candidates, and only the valid Candidates get labels.
4. You observed that correctly. So you will have to omit or ignore the NER step in corpus parsing and write a candidate generator instead.
For candidate generation you then have to be more creative, but a simple start is to take all nouns:

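from snorkel.candidates import CandidateExtractor, Ngrams
from snorkel.matchers import RegexMatchSpan

# Match n-grams (up to 5 tokens) by their POS tags; Industry is the
# candidate_subclass defined in the question above.
simple_matcher = RegexMatchSpan(attrib='pos_tags', rgx='NN.*')
cand_extractor = CandidateExtractor(Industry,
                                    Ngrams(n_max=5),
                                    simple_matcher,
                                    symmetric_relations=False)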
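
As a rough illustration of what noun-based candidate generation amounts to, here is a plain-Python sketch that runs without Snorkel. It is not Snorkel's implementation: the function name `noun_ngram_candidates`, the example sentence, and its POS tags are made up for illustration, and it assumes a span is kept only if every token's tag matches the pattern, which may differ from how `RegexMatchSpan` treats multi-token spans.

```python
import re


def noun_ngram_candidates(tokens, pos_tags, n_max=5, pos_rgx=r"NN.*"):
    """Return (start, end, text) for each n-gram of at most n_max tokens
    whose POS tags all match pos_rgx. `end` is exclusive.

    Hypothetical helper, not part of Snorkel.
    """
    pattern = re.compile(pos_rgx)
    candidates = []
    for start in range(len(tokens)):
        last_end = min(start + n_max, len(tokens))
        for end in range(start + 1, last_end + 1):
            if not all(pattern.fullmatch(tag) for tag in pos_tags[start:end]):
                # A longer span from this start would contain the same
                # non-matching token, so stop extending it.
                break
            # Chinese tokens are joined without spaces.
            candidates.append((start, end, "".join(tokens[start:end])))
    return candidates


# Made-up segmented sentence with illustrative POS tags.
tokens = ["阿里巴巴", "是", "电子", "商务", "公司"]
pos_tags = ["NN", "VC", "NN", "NN", "NN"]

for candidate in noun_ngram_candidates(tokens, pos_tags):
    print(candidate)
```

A generator like this deliberately over-produces spans (for example both "电子" and "电子商务"); deciding which of them are real entities is left to the label functions.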

A label function for such a candidate only sees one context (span) per candidate, not two. So the label functions that you see in the tutorials

def LF(c):
    return 1 if condition(c[0]) and condition(c[1]) else 0

would change to something like this

def LF(c):
    return 1 if condition(c[0]) else 0
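
To make the single-span pattern concrete, here is a small sketch that runs without Snorkel. `Span` and `IndustryCandidate` are hypothetical stand-ins for Snorkel's span and candidate objects (only indexing with `c[0]` and a `text` attribute are mimicked), and the suffix list is purely illustrative. It assumes the binary convention where 1 is a positive vote, -1 a negative vote, and 0 an abstain.

```python
from collections import namedtuple

# Hypothetical stand-ins for a Snorkel span and a one-argument candidate.
Span = namedtuple("Span", ["text"])
IndustryCandidate = namedtuple("IndustryCandidate", ["industry"])

# Illustrative suffixes only; a real list would come from domain knowledge.
INDUSTRY_SUFFIXES = ("业", "行业")


def LF_industry_suffix(c):
    """Vote positive if the single span ends with an industry-like suffix."""
    return 1 if c[0].text.endswith(INDUSTRY_SUFFIXES) else 0


def LF_too_short(c):
    """Vote negative for one-character spans, abstain otherwise."""
    return -1 if len(c[0].text) < 2 else 0


candidate = IndustryCandidate(Span("制造业"))
print(LF_industry_suffix(candidate), LF_too_short(candidate))
```

Each function looks only at `c[0]`, since a unary candidate has no second span to compare against.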

@ajratner

@fsonntag thanks for the response!! Just tacking on for @wenfeixiang1991:

  1. You will have to ask @jason-fries, he's the boss there!

  2. There are two separate issues here: (a) binary vs. categorical, and (b) independent vs. structured prediction.

  • (a) gets to your question of whether two separate labels need to be handled in two separate notebooks, i.e. as two separate binary classification problems. The answer is no: Snorkel has categorical support, meaning you can classify a candidate as one of k labels in one notebook / model. See the tutorial under tutorials/advanced!
  • (b) Currently Snorkel is focused on classifying independent objects: in this case not the sentence, but each Candidate phrase that might be a named entity mention. We plan to extend Snorkel to the structured prediction setting, where e.g. you would model each sentence as a sequence of words/characters rather than as a bag of candidate mentions. Stay tuned!

  3. No super easy answer here, but per @fsonntag's response you have to find some decent heuristic for extracting Candidates that make sense to then classify with Snorkel.

  4. Right now we do these tasks separately; stay tuned though!

  5. Currently you might have to subclass the Viewer class, similar to the SentenceNgramViewer subclass.
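
For the original goal of per-character training tags, one possible workaround is to project categorically labeled candidate spans back onto the sentence. The sketch below is plain Python and not part of Snorkel: the label encoding (0 = abstain, 1 = Industry, 2 = Company), the helper names, and the example spans are all assumptions for illustration. Snorkel itself fits a generative model over the label function outputs; a simple majority vote is used here only as a stand-in.

```python
from collections import Counter

ABSTAIN = 0
LABELS = {1: "IND", 2: "COM"}  # hypothetical encoding: 1 = Industry, 2 = Company


def majority_label(votes):
    """Return the most common non-abstain vote, or ABSTAIN on no votes or a tie."""
    counts = Counter(v for v in votes if v != ABSTAIN)
    if not counts:
        return ABSTAIN
    ranked = counts.most_common(2)
    if len(ranked) == 2 and ranked[0][1] == ranked[1][1]:
        return ABSTAIN
    return ranked[0][0]


def spans_to_bio(sentence, labeled_spans):
    """Turn (char_start, char_end_exclusive, label) spans into per-character BIO tags.

    Longer spans are placed first; a span overlapping an already tagged
    region is skipped.
    """
    tags = ["O"] * len(sentence)
    for start, end, label in sorted(labeled_spans, key=lambda s: s[0] - s[1]):
        if label == ABSTAIN or any(tag != "O" for tag in tags[start:end]):
            continue
        name = LABELS[label]
        tags[start] = "B-" + name
        for i in range(start + 1, end):
            tags[i] = "I-" + name
    return tags


sentence = "阿里巴巴是电子商务公司"
# (char_start, char_end, votes from three hypothetical label functions)
voted_spans = [(0, 4, [2, 2, 0]), (5, 9, [1, 0, 1]), (5, 7, [1, 2, 0])]
labeled = [(s, e, majority_label(votes)) for s, e, votes in voted_spans]
for char, tag in zip(sentence, spans_to_bio(sentence, labeled)):
    print(char, tag)
```

This treats each candidate independently, in line with point (b) above, so the resulting tag sequence is only as consistent as the overlap rule makes it.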

Hope some of this helps!

@wenfeixiang1991

Thank you very much!
