Role Tagging for Named Entities
hankcs edited this page Mar 16, 2018 · 1 revision
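Named entity recognition in HanLP is currently implemented mainly with an HMM role-tagging model. Because this is a complete theoretical framework, HanLP provides a general-purpose abstract tool and uses subclasses of it to train the person-name, place-name, and organization-name models.
This page describes that abstract tool and how to extend it to train common named entity recognition models.
The tool is designed to train a model in text form (a dictionary). Users mainly need to implement the following two methods: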
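/**
 * Adds entries to the dictionary. Subclasses may apply custom filtering here, which keeps the tool flexible.
 * @param sentenceList
 */
abstract protected void addToDictionary(List<List<IWord>> sentenceList);

/**
 * Role tagging. Subclasses that need to adjust labels or add new head/tail markers can do so here.
 */
abstract protected void roleTag(List<List<IWord>> sentenceList);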
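The division of labour between the two hooks can be illustrated with a minimal, self-contained sketch. It does not use HanLP classes: `Token`, `Trainer`, `ToyPersonTrainer`, and the `train` driver are hypothetical stand-ins, and calling `roleTag` before `addToDictionary` is an assumption based on the order in which this page presents them.

```java
import java.util.ArrayList;
import java.util.List;

public class RoleTaggingSketch {
    /** Hypothetical stand-in for a corpus word: surface form plus a mutable label. */
    static class Token {
        String value;
        String label;
        Token(String value, String label) { this.value = value; this.label = label; }
        @Override public String toString() { return value + "/" + label; }
    }

    /** Hypothetical base class that fixes the order of the two hooks (assumed order). */
    static abstract class Trainer {
        final List<String> dictionary = new ArrayList<>();

        final void train(List<List<Token>> sentences) {
            roleTag(sentences);          // rewrite the labels in place
            addToDictionary(sentences);  // then decide which words to keep
        }

        abstract void roleTag(List<List<Token>> sentences);
        abstract void addToDictionary(List<List<Token>> sentences);
    }

    /** Toy subclass: every word that is not a person name (nr) becomes role A. */
    static class ToyPersonTrainer extends Trainer {
        @Override void roleTag(List<List<Token>> sentences) {
            for (List<Token> sentence : sentences)
                for (Token t : sentence)
                    if (!t.label.equals("nr")) t.label = "A";
        }

        @Override void addToDictionary(List<List<Token>> sentences) {
            for (List<Token> sentence : sentences)
                for (Token t : sentence)
                    if (!t.label.equals("A")) dictionary.add(t.toString());
        }
    }

    static List<String> demo() {
        List<Token> sentence = new ArrayList<>();
        sentence.add(new Token("记者", "n"));
        sentence.add(new Token("王小明", "nr"));
        sentence.add(new Token("报道", "v"));
        List<List<Token>> corpus = new ArrayList<>();
        corpus.add(sentence);

        ToyPersonTrainer trainer = new ToyPersonTrainer();
        trainer.train(corpus);
        return trainer.dictionary;
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```

Keeping the tagging rules and the filtering rules in separate hooks means a subclass can change what counts as a useful dictionary entry without touching the label conversion.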
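In the roleTag method, the user converts the annotations of the People's Daily corpus into a role-tagging scheme of their own design, for example: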
@Override
protected void roleTag(List<List<IWord>> sentenceList)
{
    logger.info("Start role tagging");
    int i = 0;
    for (List<IWord> wordList : sentenceList)
    {
        logger.info(++i + " / " + sentenceList.size());
        if (verbose) System.out.println("Raw corpus " + wordList);
        // First tag A (any other role) and K (context preceding a name)
        IWord pre = new Word("##始##", "begin");
        ListIterator<IWord> listIterator = wordList.listIterator();
        while (listIterator.hasNext())
        {
            IWord word = listIterator.next();
            if (!word.getLabel().equals(Nature.nr.toString()))
            {
                word.setLabel(NR.A.toString());
            }
            else
            {
                if (!pre.getLabel().equals(Nature.nr.toString()))
                {
                    pre.setLabel(NR.K.toString());
                }
            }
            pre = word;
        }
        if (verbose) System.out.println("Tagged A and K " + wordList);
        // Then tag L (context following a name) and M (component between two names),
        // walking backwards with the same iterator, which now sits at the end of the list
        IWord next = new Word("##末##", "end");
        while (listIterator.hasPrevious())
        {
            IWord word = listIterator.previous();
            if (word.getLabel().equals(Nature.nr.toString()))
            {
                String label = next.getLabel();
                if (label.equals("A")) next.setLabel("L");
                else if (label.equals("K")) next.setLabel("M");
            }
            next = word;
        }
        if (verbose) System.out.println("Tagged L and M " + wordList);
        // Split names into per-character roles
        listIterator = wordList.listIterator();
        while (listIterator.hasNext())
        {
            IWord word = listIterator.next();
            if (word.getLabel().equals(Nature.nr.toString()))
            {
                switch (word.getValue().length())
                {
                    case 2:
                        // Prefix (F) + surname (B), e.g. names starting with 大, 老 or 小
                        if (word.getValue().startsWith("大")
                            || word.getValue().startsWith("老")
                            || word.getValue().startsWith("小")
                            )
                        {
                            listIterator.add(new Word(word.getValue().substring(1, 2), NR.B.toString()));
                            word.setValue(word.getValue().substring(0, 1));
                            word.setLabel(NR.F.toString());
                        }
                        // Surname (B) + suffix (G), e.g. names ending with 哥, 公, 姐, ...
                        else if (word.getValue().endsWith("哥")
                            || word.getValue().endsWith("公")
                            || word.getValue().endsWith("姐")
                            || word.getValue().endsWith("老")
                            || word.getValue().endsWith("某")
                            || word.getValue().endsWith("嫂")
                            || word.getValue().endsWith("氏")
                            || word.getValue().endsWith("总")
                            )
                        {
                            listIterator.add(new Word(word.getValue().substring(1, 2), NR.G.toString()));
                            word.setValue(word.getValue().substring(0, 1));
                            word.setLabel(NR.B.toString());
                        }
                        // Surname (B) + single-character given name (E)
                        else
                        {
                            listIterator.add(new Word(word.getValue().substring(1, 2), NR.E.toString()));
                            word.setValue(word.getValue().substring(0, 1));
                            word.setLabel(NR.B.toString());
                        }
                        break;
                    case 3:
                        // Surname (B) + first (C) and last (D) character of a two-character given name
                        listIterator.add(new Word(word.getValue().substring(1, 2), NR.C.toString()));
                        listIterator.add(new Word(word.getValue().substring(2, 3), NR.D.toString()));
                        word.setValue(word.getValue().substring(0, 1));
                        word.setLabel(NR.B.toString());
                        break;
                }
            }
        }
        if (verbose) System.out.println("Names split " + wordList);
        // Preceding context forms a word with the surname (role U)
        listIterator = wordList.listIterator();
        pre = new Word("##始##", "begin");
        while (listIterator.hasNext())
        {
            IWord word = listIterator.next();
            if (word.getLabel().equals(NR.B.toString()))
            {
                String combine = pre.getValue() + word.getValue();
                if (dictionary.contains(combine))
                {
                    pre.setValue(combine);
                    pre.setLabel("U");
                    listIterator.remove();
                }
            }
            pre = word;
        }
        if (verbose) System.out.println("Context + surname word " + wordList);
        // Surname forms a word with the first character of a two-character given name (role X)
        // or with a single-character given name (role Y)
        next = new Word("##末##", "end");
        while (listIterator.hasPrevious())
        {
            IWord word = listIterator.previous();
            if (word.getLabel().equals(NR.B.toString()))
            {
                String combine = word.getValue() + next.getValue();
                if (dictionary.contains(combine))
                {
                    next.setValue(combine);
                    next.setLabel(next.getLabel().equals(NR.C.toString()) ? NR.X.toString() : NR.Y.toString());
                    listIterator.remove();
                }
            }
            next = word;
        }
        if (verbose) System.out.println("Name head word " + wordList);
        // The given name itself forms a word (role Z)
        pre = new Word("##始##", "begin");
        while (listIterator.hasNext())
        {
            IWord word = listIterator.next();
            if (word.getLabel().equals(NR.D.toString()))
            {
                String combine = pre.getValue() + word.getValue();
                if (dictionary.contains(combine))
                {
                    pre.setValue(combine);
                    pre.setLabel(NR.Z.toString());
                    listIterator.remove();
                }
            }
            pre = word;
        }
        if (verbose) System.out.println("Name tail word " + wordList);
        // Last character of the name forms a word with the following context (role V)
        next = new Word("##末##", "end");
        while (listIterator.hasPrevious())
        {
            IWord word = listIterator.previous();
            if (word.getLabel().equals(NR.D.toString()))
            {
                String combine = word.getValue() + next.getValue();
                if (dictionary.contains(combine))
                {
                    next.setValue(combine);
                    next.setLabel(NR.V.toString());
                    listIterator.remove();
                }
            }
            next = word;
        }
        if (verbose) System.out.println("Name end + context word " + wordList);
        // Add the sentence-start and sentence-end markers
        LinkedList<IWord> wordLinkedList = (LinkedList<IWord>) wordList;
        wordLinkedList.addFirst(new Word(Predefine.TAG_BIGIN, "S"));
        wordLinkedList.addLast(new Word(Predefine.TAG_END, "A"));
        if (verbose) System.out.println("Added head and tail " + wordList);
    }
}
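The code above follows the specification in Zhang Huaping's paper 《基于角色标注的中国人名自动识别研究》 ("Research on Automatic Chinese Person Name Recognition Based on Role Tagging") and uses a set of rules to rewrite the label of every word. Because the conversion is done in place on the original linked list, the method does not need to return anything.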
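To make the effect of the rules concrete, here is a small self-contained sketch of the first three passes on a toy sentence. It is a simplification under stated assumptions: `Token` is a hypothetical stand-in for HanLP's `IWord`, roles are plain strings, the prefix/suffix rules (F and G) are omitted, and the later dictionary-dependent passes (U, X, Y, Z, V) are left out entirely.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.ListIterator;

public class ContextRoleDemo {
    /** Hypothetical stand-in for IWord: surface form plus a mutable label. */
    static class Token {
        String value;
        String label;
        Token(String value, String label) { this.value = value; this.label = label; }
        @Override public String toString() { return value + "/" + label; }
    }

    static void tag(List<Token> words) {
        // Pass 1 (forward): non-names become A; the word right before a name becomes K.
        Token pre = new Token("##始##", "begin");
        ListIterator<Token> it = words.listIterator();
        while (it.hasNext()) {
            Token word = it.next();
            if (!word.label.equals("nr")) word.label = "A";
            else if (!pre.label.equals("nr")) pre.label = "K";
            pre = word;
        }
        // Pass 2 (backward, same iterator): the word right after a name becomes L,
        // or M if it was already K, i.e. it sits between two names.
        Token next = new Token("##末##", "end");
        while (it.hasPrevious()) {
            Token word = it.previous();
            if (word.label.equals("nr")) {
                if (next.label.equals("A")) next.label = "L";
                else if (next.label.equals("K")) next.label = "M";
            }
            next = word;
        }
        // Pass 3 (simplified): split names into B+E (two characters) or B+C+D (three).
        it = words.listIterator();
        while (it.hasNext()) {
            Token word = it.next();
            if (!word.label.equals("nr")) continue;
            String v = word.value;
            if (v.length() == 2) {
                it.add(new Token(v.substring(1, 2), "E"));
            } else if (v.length() == 3) {
                it.add(new Token(v.substring(1, 2), "C"));
                it.add(new Token(v.substring(2, 3), "D"));
            } else {
                continue; // other lengths are left untouched, as in the code above
            }
            word.value = v.substring(0, 1);
            word.label = "B";
        }
    }

    static String demo() {
        List<Token> words = new ArrayList<>(Arrays.asList(
                new Token("记者", "n"), new Token("张三", "nr"), new Token("和", "c"),
                new Token("李四", "nr"), new Token("说", "v")));
        tag(words);
        return words.toString();
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```

Running `main` prints `[记者/K, 张/B, 三/E, 和/M, 李/B, 四/E, 说/L]`: the word before the first name is K, the conjunction between the two names is M, and the verb after the second name is L.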
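Next, the user implements addToDictionary. This method lets the user decide, according to their own business logic, which words the model needs and which it does not. For example: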
@Override
protected void addToDictionary(List<List<IWord>> sentenceList)
{
    logger.warning("Start building the dictionary");
    // Keep every word whose role is not A
    for (List<IWord> wordList : sentenceList)
    {
        for (IWord word : wordList)
        {
            if (!word.getLabel().equals(NR.A.toString()))
            {
                dictionaryMaker.add(word);
            }
        }
    }
    // Build the n-gram dictionary from adjacent word pairs
    for (List<IWord> wordList : sentenceList)
    {
        IWord pre = null;
        for (IWord word : wordList)
        {
            if (pre != null)
            {
                nGramDictionaryMaker.addPair(pre, word);
            }
            pre = word;
        }
    }
}
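The bookkeeping that this method drives can be pictured with plain maps. The sketch below is not HanLP's `dictionaryMaker` or `nGramDictionaryMaker`; it assumes, for illustration only, that the first accumulates per-word role frequencies and the second accumulates adjacent word pairs. The class name, the `first@second` key format, and the `sentence` helper are all hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class DictionaryCountSketch {
    /** Hypothetical stand-in for IWord. */
    static class Token {
        final String value;
        final String label;
        Token(String value, String label) { this.value = value; this.label = label; }
    }

    // word -> (role -> frequency); role A is filtered out
    final Map<String, Map<String, Integer>> roleDictionary = new TreeMap<>();
    // "first@second" -> frequency of the adjacent pair (key format is an assumption)
    final Map<String, Integer> bigrams = new TreeMap<>();

    void addToDictionary(List<List<Token>> sentences) {
        for (List<Token> sentence : sentences) {
            Token pre = null;
            for (Token word : sentence) {
                if (!word.label.equals("A")) {
                    roleDictionary.computeIfAbsent(word.value, k -> new TreeMap<>())
                                  .merge(word.label, 1, Integer::sum);
                }
                if (pre != null) {
                    bigrams.merge(pre.value + "@" + word.value, 1, Integer::sum);
                }
                pre = word;
            }
        }
    }

    /** Builds a sentence from "word/role" strings. */
    static List<Token> sentence(String... items) {
        List<Token> result = new ArrayList<>();
        for (String item : items) {
            String[] parts = item.split("/");
            result.add(new Token(parts[0], parts[1]));
        }
        return result;
    }

    static DictionaryCountSketch demo() {
        List<List<Token>> corpus = new ArrayList<>();
        corpus.add(sentence("记者/K", "张/B", "三/E", "说/L"));
        corpus.add(sentence("张/B", "华/C", "平/D", "说/L", "了/A"));
        DictionaryCountSketch maker = new DictionaryCountSketch();
        maker.addToDictionary(corpus);
        return maker;
    }

    public static void main(String[] args) {
        DictionaryCountSketch maker = demo();
        System.out.println(maker.roleDictionary);
        System.out.println(maker.bigrams);
    }
}
```

Filtering role A out of the word-role dictionary keeps it small, since most corpus words never take part in a name; the pair counts are collected for every adjacent pair regardless of role.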
The user can then call saveTxtTo to save the model to disk.
As with the word segmentation model, this produces three files:
- SomeName.txt: the word-role dictionary
- SomeName.ngram.txt: the bigram dictionary (not used in named entity recognition)
- SomeName.tr.txt: the role transition matrix
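As a rough picture of what a role transition matrix contains, the sketch below counts how often one role follows another in tagged sequences and normalizes each row into probabilities. It is only an illustration: the class and method names are hypothetical, and it does not reproduce the layout of the .tr.txt file that HanLP writes.

```java
import java.util.Arrays;
import java.util.List;

public class TransitionMatrixSketch {
    /** counts[i][j] = number of times role j directly follows role i. */
    static int[][] count(List<List<String>> roleSequences, List<String> roles) {
        int n = roles.size();
        int[][] counts = new int[n][n];
        for (List<String> sequence : roleSequences) {
            for (int k = 1; k < sequence.size(); k++) {
                // Every role in a sequence must be listed in `roles`.
                int from = roles.indexOf(sequence.get(k - 1));
                int to = roles.indexOf(sequence.get(k));
                counts[from][to]++;
            }
        }
        return counts;
    }

    /** Row-normalizes counts into transition probabilities; empty rows stay all zero. */
    static double[][] normalize(int[][] counts) {
        double[][] p = new double[counts.length][counts.length];
        for (int i = 0; i < counts.length; i++) {
            int total = 0;
            for (int c : counts[i]) total += c;
            if (total == 0) continue;
            for (int j = 0; j < counts[i].length; j++) {
                p[i][j] = (double) counts[i][j] / total;
            }
        }
        return p;
    }

    static final List<String> ROLES = Arrays.asList("S", "K", "B", "E", "L", "A");

    static List<List<String>> sample() {
        return Arrays.asList(
                Arrays.asList("S", "K", "B", "E", "L", "A"),
                Arrays.asList("S", "B", "E", "L", "A"));
    }

    public static void main(String[] args) {
        double[][] p = normalize(count(sample(), ROLES));
        for (int i = 0; i < ROLES.size(); i++) {
            System.out.println(ROLES.get(i) + " " + Arrays.toString(p[i]));
        }
    }
}
```

In an HMM these row-normalized values play the part of the transition probabilities between hidden states, here the roles.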
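If you used the same training code, you can load the model simply by changing the corresponding paths in the configuration file. Otherwise, write the loading logic modeled on com/hankcs/hanlp/dictionary/nr/NRDictionary.java and com/hankcs/hanlp/dictionary/nr/PersonDictionary.java, and write the recognition logic modeled on com/hankcs/hanlp/recognition/nr/PersonRecognition.java.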
HanLP: Han Language Processing - Natural Language Processing for the next decade