How to Named Entity Recognize using Data Programming in Snorkel? #838

wenfeixiang1991 · 2017-11-27T08:29:03Z

Hi,

My goal is to extract two entity types (Industry and Company) from raw Chinese text (or sentences); each entity consists of a few Chinese characters. The modeling strategy is LSTM + CRF, but the key is training data in which every single character is tagged. So I want to obtain labeled training data the data programming way, using the Candidate Extractor and labeling functions that Snorkel provides.

After reading the intro and cdr tutorials and issues #599 and #810, I have some questions about how to do NER with Snorkel:

1. @ajratner mentioned in "Training data for training a NER model" (#599) that a paper on entity tagging "will be posted very soon...". Where can I get it now?
2. As @ajratner answered, if I use
Industry = candidate_subclass('Industry', ['industry'])
and
Company = candidate_subclass('Company', ['company']),
do I then just get every sentence labeled as Industry/Company or not, in separate notebooks, so that the target is sentence classification? But in NER, don't we need to tag every word in the sentence?
3. @jason-fries mentioned in "Tagging sequence markup for entity extraction" (#810) that one could "treat every word as a candidate and then use categorical variables...". That makes sense given Snorkel's features, but if I do it this way, every single character gets a label, so how can I distinguish between sentences? A single Chinese character can have very different meanings in different sentences. Besides, it just doesn't seem like the right way to do NER with Snorkel.
4. The official intro and cdr tutorials seem to use spaCy and TaggerOne to recognize person names and medical entities, respectively, and then classify the relationship (the relationship types are already known, such as spouse and cure). But how do I do NER and relation classification at the same time in Snorkel, or just NER on its own? Would it be possible to provide an official tutorial?
5. One last question is about the viewer, which I am interested in. It seems that the viewer only supports binary labeling of candidates. But with categorical classes such as
Relationship = candidate_subclass('Relationship', ['person1', 'person2'], values=['Married', 'Employs', False]),
can we not label candidates with multiple classes in the viewer? What is the right way to use the viewer?

Thank you very much!


fsonntag · 2017-12-07T14:21:33Z

Hey,
I'm not part of the research group, but I'm also using Snorkel for NER, so maybe I can answer some of your questions.
2. I also have two different entity types, but I combined them in one program. You are right, Snorkel isn't doing classical sequence tagging. Basically, for each candidate, the labels generated by the labeling functions determine whether it is an entity or not.
3. Not every single character will get a label. You still have to do tokenization first (for Chinese, word segmentation). From those tokens you create Candidates, and only the valid Candidates get labels.
4. You observed that correctly. So you will have to omit or ignore the NER step in corpus parsing and write a candidate generator instead.
For candidate generation you then have to be more creative, but a simple start is taking all nouns:

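from snorkel.candidates import Ngrams, CandidateExtractor
from snorkel.matchers import RegexMatchSpan

# Match spans whose POS-tag sequence matches NN.* (nouns)
simple_matcher = RegexMatchSpan(attrib='pos_tags', rgx='NN.*')
# Consider all n-grams of up to 5 tokens as candidate spans for the Industry class
cand_extractor = CandidateExtractor(Industry,
                                    Ngrams(n_max=5),
                                    simple_matcher,
                                    symmetric_relations=False)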
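
To make the idea concrete without depending on Snorkel itself, here is a minimal plain-Python sketch of what "take all noun n-grams of up to 5 tokens" can amount to. The sentence, its POS tags, and the helper name `noun_ngram_spans` are made up for illustration. This is not Snorkel's implementation, and requiring every token in the span to be a noun is a design choice of this sketch.

```python
import re
from typing import List, Tuple

def noun_ngram_spans(pos_tags: List[str], n_max: int = 5) -> List[Tuple[int, int]]:
    """Return (start, end) token spans, end exclusive, in which every tag matches NN.*"""
    noun = re.compile(r"NN.*")
    spans = []
    for start in range(len(pos_tags)):
        for end in range(start + 1, min(start + n_max, len(pos_tags)) + 1):
            if all(noun.fullmatch(tag) for tag in pos_tags[start:end]):
                spans.append((start, end))
            else:
                # Any longer span from this start would also contain the non-noun.
                break
    return spans

# Hypothetical sentence that has already been segmented and POS-tagged.
words = ["阿里巴巴", "是", "电子", "商务", "公司"]
tags = ["NNP", "VC", "NN", "NN", "NN"]

# Pair each candidate's surface text with its token span.
candidates = [("".join(words[s:e]), (s, e)) for s, e in noun_ngram_spans(tags)]
```

Every span in `candidates` would then be a candidate that the labeling functions vote on; most of them will not be real entities, which is expected at this stage.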

A labeling function then only sees one context per candidate, not two. So the labeling functions that you see in the tutorials

def LF(c):
    return 1 if condition(c[0]) and condition(c[1]) else 0

would change to something like this:

def LF(c):
    return 1 if condition(c[0]) else 0
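
As a rough, library-free illustration of the unary case, the sketch below uses a made-up `SpanCandidate` stand-in (not Snorkel's Candidate class) and two toy labeling functions for Company mentions. It follows the 1 / -1 / 0 (positive / negative / abstain) convention used in the Snorkel tutorials; the suffix list and the set of generic nouns are assumptions.

```python
from collections import namedtuple

# Hypothetical stand-in for a single-span candidate; Snorkel's real objects differ.
SpanCandidate = namedtuple("SpanCandidate", ["text", "sentence"])

COMPANY_SUFFIXES = ("公司", "集团", "银行")  # assumed cue words for company names
GENERIC_NOUNS = {"公司", "行业", "市场"}      # assumed: too generic to be a name

def lf_company_suffix(c):
    # Vote positive if the span ends with a typical company suffix, else abstain.
    return 1 if c.text.endswith(COMPANY_SUFFIXES) else 0

def lf_generic_noun(c):
    # Vote negative if the span is just a generic noun, else abstain.
    return -1 if c.text in GENERIC_NOUNS else 0

LFS = [lf_company_suffix, lf_generic_noun]

def label_row(c):
    """Apply every labeling function to one candidate."""
    return [lf(c) for lf in LFS]

sentence = "腾讯公司是一家互联网公司"
rows = [label_row(SpanCandidate(text, sentence)) for text in ("腾讯公司", "公司", "互联网")]
```

Here the bare span "公司" gets conflicting votes from the two functions; resolving that kind of disagreement is the job of the model trained on top of the labeling functions.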

ajratner · 2017-12-13T08:23:09Z

@fsonntag thanks for the response!! Just tacking on for @wenfeixiang1991:

1. Have to ask @jason-fries, he's the boss there!
2. There are two separate issues here: (a) binary vs. categorical, and (b) independent vs. structured prediction.

(a) gets to your question of whether two separate labels need to be in two separate notebooks, i.e. be two separate binary classification problems. The answer is no: Snorkel has categorical support, meaning you can classify each candidate as one of k labels in one notebook / model. See the tutorial under tutorials/advanced!
(b) Currently Snorkel is just focused on classifying independent objects: in this case, not the sentence, but each Candidate phrase which might be a named entity mention. We have plans to extend Snorkel to the structured prediction setting, where e.g. you would model each sentence as a sequence of words/characters rather than just as a bag of candidate mentions; stay tuned!

3. No super easy answer here, but per @fsonntag's response you have to find some decent heuristic for extracting Candidates that make sense, and then learn to classify them with Snorkel.
4. Right now we do these tasks separately; stay tuned though!
5. Currently you might have to subclass the Viewer class, similar to the SentenceNgramViewer subclass.
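
For the original use case, one possible way to combine the points above is to classify each candidate span into one of k classes and then project the accepted spans back onto the sentence as per-character BIO tags for an LSTM + CRF model. The sketch below is plain Python, not Snorkel code: a simple majority vote stands in for Snorkel's generative model, and the label encoding (0 = abstain, 1 = Industry, 2 = Company), the function names, and the example votes are all assumptions.

```python
from collections import Counter

CLASSES = {1: "IND", 2: "COM"}  # assumed encoding; 0 means "abstain"

def majority_vote(votes):
    """Return the most common non-abstain label, or 0 if there are no votes or a tie."""
    counts = Counter(v for v in votes if v != 0).most_common(2)
    if not counts:
        return 0
    if len(counts) == 2 and counts[0][1] == counts[1][1]:
        return 0
    return counts[0][0]

def spans_to_bio(sentence, labeled_spans):
    """Project (char_start, char_end, label) spans, end exclusive, onto per-character BIO tags."""
    tags = ["O"] * len(sentence)
    # Visit longer spans first so overlapping shorter spans cannot overwrite them.
    for start, end, label in sorted(labeled_spans, key=lambda s: s[0] - s[1]):
        if label == 0 or any(t != "O" for t in tags[start:end]):
            continue
        tags[start] = "B-" + CLASSES[label]
        for i in range(start + 1, end):
            tags[i] = "I-" + CLASSES[label]
    return tags

sentence = "腾讯是互联网公司"
# (char_start, char_end, votes from three hypothetical labeling functions)
candidate_votes = [(0, 2, [2, 2, 0]), (3, 6, [1, 0, 1]), (6, 8, [2, 1, 0])]
labeled = [(s, e, majority_vote(v)) for s, e, v in candidate_votes]
bio = spans_to_bio(sentence, labeled)
```

Each candidate is still classified independently here, so this does not give the structured prediction described in (b); it only turns candidate-level labels into the tag format a sequence model expects.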

Hope some of this helps!

wenfeixiang1991 · 2018-02-13T09:13:19Z

Thank you very much!

ajratner added the Q&A label Dec 13, 2017

ajratner self-assigned this Dec 14, 2017

arturomp mentioned this issue Jan 26, 2018

Training data for training a NER model #599

Closed

wenfeixiang1991 closed this as completed Feb 13, 2018

cbockman mentioned this issue Jun 26, 2018

Snorkel structured predictions? #961

Closed

stephenbach mentioned this issue Jun 27, 2018

Generative model (de-noising component) for seq-2-seq datasets? #869

Closed

jbkoh mentioned this issue Aug 13, 2018

Applying Snorkel for sequence learning? #997

Closed

Mageswaran1989 mentioned this issue Aug 9, 2019

How to create training data for NER task using snorkel ? #1254

Open
