corpus structure and format
jidasheng edited this page Dec 6, 2019 · 2 revisions
- corpus structure

      corpus_dir
          vocab.json
          tags.json
          dataset.txt

- all files are UTF-8 encoded
- vocab.json
  - a list of the unique CHARs or WORDs that define the vocabulary
  - chars/words that are not in the vocabulary will be replaced by `UNKNOWN`
  - examples
    - CHAR-based: `["市", "领", "导", "到", "成", "都", ...]`
    - WORD-based: `["市", "领导", "到", "成都", ...]`
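As a rough illustration of how such a vocabulary list could be produced, the sketch below collects the unique tokens from a handful of sentences and serializes them as JSON. The helper `build_vocab` and its `min_count` parameter are hypothetical names introduced here; they are not part of `bi_lstm_crf`.

```python
import json
from collections import Counter


def build_vocab(sentences, min_count=1):
    """Return the unique tokens of `sentences`, most frequent first.

    Hypothetical helper, not part of bi_lstm_crf. Each sentence is either
    a string (CHAR-based) or a list of strings (WORD-based).
    """
    counts = Counter()
    for sentence in sentences:
        # Iterating a str yields chars; iterating a list yields words.
        counts.update(sentence)
    return [token for token, n in counts.most_common() if n >= min_count]


char_vocab = build_vocab(["市领导到成都", "到成都"])
word_vocab = build_vocab([["市", "领导", "到", "成都"]])

# ensure_ascii=False keeps the characters readable instead of \uXXXX escapes,
# which suits a UTF-8 encoded vocab.json.
vocab_json = json.dumps(char_vocab, ensure_ascii=False)
```

Writing `vocab_json` to `corpus_dir/vocab.json` with `open(path, "w", encoding="utf-8")` would then match the UTF-8 requirement stated above.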
- tags.json
  - a list of tags
    - the tags can be any tag set, in any order, with no constraints
    - the only point that needs attention comes up when predicting sequences:

          from bi_lstm_crf.app import WordsTagger

          model = WordsTagger(model_dir="xxx")
          tags, sequences = model(["市领导到成都..."], begin_tags="BS")
          print(tags)
          # [["B", "B", "I", "B", "B-LOC", "I-LOC", "I-LOC", "I-LOC", "I-LOC", "B", "I", "B", "I"]]
          print(sequences)
          # [['市', '领导', '到', ('成都', 'LOC'), ...]]

      - the `begin_tags` argument is used to convert the tags into sequences
      - most of the time the default value `"BS"` is right, but:
        - when you use the `BMEWO` format (B (Begin), M (Middle), E (End), W (Word), O (Outside)), `begin_tags` should be set to `"BW"`
  - examples
    - WORD SEGMENTATION: `["B", "I"]`
    - NER: `["O", "B-ORG", "I-ORG", ...]`
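To make the role of `begin_tags` more concrete, here is a simplified sketch of the idea: a tag whose first letter appears in `begin_tags` opens a new segment, and any other tag extends the current one. The function `tags_to_sequences` is a hypothetical illustration, not the library's implementation; it ignores labels such as `-LOC` and gives `O` no special treatment.

```python
def tags_to_sequences(tokens, tags, begin_tags="BS"):
    """Group tokens into segments (hypothetical helper, not the library's code).

    A tag whose first letter is in `begin_tags` starts a new segment;
    every other tag extends the current segment.
    """
    sequences = []
    for token, tag in zip(tokens, tags):
        if tag[0] in begin_tags or not sequences:
            sequences.append(token)
        else:
            sequences[-1] += token
    return sequences


chars = "市领导到成都"

# B/I tags: the default begin_tags="BS" splits on every "B".
bi_result = tags_to_sequences(chars, ["B", "B", "I", "B", "B", "I"])

# BMEWO tags: "W" marks a single-token word, so it must also open a segment.
bmewo_tags = ["W", "B", "E", "W", "B", "E"]
bmewo_result = tags_to_sequences(chars, bmewo_tags, begin_tags="BW")

# With the default "BS", the "W" on "到" is glued onto the previous word.
bmewo_wrong = tags_to_sequences(chars, bmewo_tags)
```

In this sketch `bi_result` and `bmewo_result` both come out as `['市', '领导', '到', '成都']`, while `bmewo_wrong` merges `到` into `领导到`, which is the kind of mistake that setting `begin_tags="BW"` avoids.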
- dataset.txt
  - format: one sample per line, `[sentence][\tab][tags]`, where `[\tab]` is a tab character
    - the `[sentence]` should be a string or a list of strings
    - for CHAR-based, a string is enough to represent a sentence
  - examples
    - CHAR-based

          市领导到成都...	["B", "B", "I", "B", "B", "I", ...]
          ...

    - WORD-based

          ["市", "领导", "到", "成都", ...]	["B", "B", "I", "B", "B", "I", ...]
          ...
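The line format above can be read and written with a few lines of Python. The sketch below is one possible approach, under two assumptions that the page does not state explicitly: the tags (and a WORD-based sentence) are JSON-encoded lists, and a sentence field starting with `[` is a list rather than a plain string. `parse_line` and `format_line` are hypothetical helpers, not part of `bi_lstm_crf`.

```python
import json


def format_line(sentence, tags):
    """Build one dataset.txt line: sentence, a tab, then the tags."""
    if not isinstance(sentence, str):
        # WORD-based: serialize the list of words as JSON (assumption).
        sentence = json.dumps(sentence, ensure_ascii=False)
    return sentence + "\t" + json.dumps(tags, ensure_ascii=False)


def parse_line(line):
    """Split one dataset.txt line back into (sentence, tags)."""
    sentence, tags = line.rstrip("\n").split("\t")
    # Assumption: a field starting with "[" is a JSON list of words;
    # anything else is a CHAR-based sentence kept as a plain string.
    if sentence.startswith("["):
        sentence = json.loads(sentence)
    return sentence, json.loads(tags)


char_line = format_line("市领导", ["B", "B", "I"])
word_line = format_line(["市", "领导"], ["B", "B"])

char_sample = parse_line(char_line)
word_sample = parse_line(word_line)
```

The round trip keeps a CHAR-based sentence as a string and a WORD-based sentence as a list, matching the two example styles shown above. When reading or writing the file itself, passing `encoding="utf-8"` to `open` keeps it consistent with the encoding requirement.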