JioNLP: A Python Library for Chinese NLP Preprocessing & Parsing
Installation: pip install jionlp
Working on an NLP task and need to clean and filter a corpus? Use JioNLP.
Working on an NLP task and need to extract key information? Use JioNLP.
Working on an NLP task and need text augmentation? Use JioNLP.
Working on an NLP task and need the radical, pinyin, or traditional form of Chinese characters? Use JioNLP.
In short, JioNLP offers a bundle of preprocessing and parsing tools for NLP tasks that are accurate, efficient, and easy to use.
Main functions include: cleaning text (removing HTML tags, abnormal characters, and redundant characters; converting full-width characters to half-width); extracting emails, QQ numbers, phone numbers, parenthesized text, ID card numbers, IP addresses, URLs, money amounts with their currency, and numbers; parsing time expressions; extracting keyphrases; loading Chinese dictionaries; and augmenting Chinese text.
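To make one of the cleaning steps above concrete, here is a minimal standard-library sketch of full-width to half-width conversion. It is not JioNLP's implementation, and the function name `to_half_width` is hypothetical; it only relies on the Unicode layout, where the full-width ASCII variants U+FF01 to U+FF5E sit exactly 0xFEE0 above their half-width counterparts and U+3000 is the full-width space.

```python
def to_half_width(text: str) -> str:
    """Convert full-width (full-angle) characters to half-width ones.

    Illustrative sketch only; the name is hypothetical, not a JioNLP API.
    """
    converted = []
    for ch in text:
        code = ord(ch)
        if code == 0x3000:
            # Ideographic (full-width) space becomes a plain ASCII space.
            converted.append(" ")
        elif 0xFF01 <= code <= 0xFF5E:
            # Full-width ASCII variants are offset by 0xFEE0.
            converted.append(chr(code - 0xFEE0))
        else:
            # Everything else, including Chinese characters, is left alone.
            converted.append(ch)
    return "".join(converted)


print(to_half_width("ＪｉｏＮＬＰ　２０２１！"))
```

Chinese punctuation such as "，" (U+FF0C) also falls in this range and becomes ",", which may or may not be desirable depending on the downstream task.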
jio.keyphrase.extract_keyphrase: extract keyphrases from a Chinese text
>>> import jionlp as jio
>>> text = '浑水创始人:七月开始调查贝壳,因为“好得难以置信” 2021年12月16日,做空机构浑水在社交媒体上公开表示,正在做空美股上市公司贝壳...'
>>> keyphrases = jio.keyphrase.extract_keyphrase(text)
>>> print(keyphrases)
# ['浑水创始人', '开始调查贝壳', '做空机构浑水', '美股上市公司贝壳', '美国证监会']
>>> print(jio.keyphrase.extract_keyphrase.__doc__)  # show the function's documentation
jio.parse_money: parse a money string into its amount ('num'), currency ('case'), and precision ('definition': 'accurate' or 'blur')
import jionlp as jio
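text_list = ['约4.287亿美元', '两个亿卢布', '六十四万零一百四十三元一角七分', '3000多欧元', '三五佰块钱', '七百到九百亿泰铢']
moneys = [jio.parse_money(text) for text in text_list]
# print each input next to its parsed result
for text, money in zip(text_list, moneys):
    print(f'{text}: {money}')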
# 约4.287亿美元: {'num': '428700000.00', 'case': '美元', 'definition': 'blur'}
# 两个亿卢布: {'num': '200000000.00', 'case': '卢布', 'definition': 'accurate'}
# 六十四万零一百四十三元一角七分: {'num': '640143.17', 'case': '元', 'definition': 'accurate'}
# 3000多欧元: {'num': ['3000.00', '4000.00'], 'case': '欧元', 'definition': 'blur'}
# 三五佰块钱: {'num': ['300.00', '500.00'], 'case': '元', 'definition': 'blur'}
# 七百到九百亿泰铢: {'num': ['70000000000.00', '90000000000.00'], 'case': '泰铢', 'definition': 'blur'}
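The results above store 'num' either as a single string or as a two-element list for ranges. A small helper can normalize both shapes into a numeric (low, high) pair. This is a sketch that works on plain dictionaries copied from the output above, so it runs without JioNLP installed; `money_range` is a hypothetical name, not part of the library.

```python
from typing import Tuple


def money_range(result: dict) -> Tuple[float, float]:
    """Return (low, high) as floats from a parse_money-style result dict.

    Hypothetical helper: assumes 'num' is a numeric string or a list of two.
    """
    num = result["num"]
    if isinstance(num, (list, tuple)):
        low, high = (float(n) for n in num)
    else:
        # A single amount is treated as a range of zero width.
        low = high = float(num)
    return low, high


# Result dictionaries copied from the example output above.
samples = [
    {"num": "640143.17", "case": "元", "definition": "accurate"},
    {"num": ["3000.00", "4000.00"], "case": "欧元", "definition": "blur"},
]
for sample in samples:
    print(sample["case"], money_range(sample))
```

For real monetary arithmetic, `decimal.Decimal` would be a safer choice than `float`; floats are used here only to keep the sketch short.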
jio.parse_time: parse a given time string
import time
import jionlp as jio
# Results for calls that use time.time() or no time_base depend on when the code is run.
res = jio.parse_time('今年9月', time_base={'year': 2021})
print(res)
# {'type': 'time_span', 'definition': 'accurate', 'time': ['2021-09-01 00:00:00', '2021-09-30 23:59:59']}
res = jio.parse_time('零三年元宵节晚上8点半', time_base=time.time())
print(res)
# {'type': 'time_point', 'definition': 'accurate', 'time': ['2003-02-15 20:30:00', '2003-02-15 20:30:59']}
res = jio.parse_time('一万个小时')
print(res)
# {'type': 'time_delta', 'definition': 'accurate', 'time': {'hour': 10000.0}}
res = jio.parse_time('100天之后', time.time())
print(res)
# {'type': 'time_span', 'definition': 'blur', 'time': ['2021-10-22 00:00:00', 'inf']}
res = jio.parse_time('四月十三', lunar_date=False)
print(res)
# {'type': 'time_point', 'definition': 'accurate', 'time': ['2022-04-13 00:00:00', '2022-04-13 23:59:59']}
res = jio.parse_time('每周五下午4点', time.time(), period_results_num=2)
print(res)
# {'type': 'time_period', 'definition': 'accurate', 'time': {'delta': {'day': 7},
# 'point': {'time': [['2021-07-16 16:00:00', '2021-07-16 16:59:59'],
# ['2021-07-23 16:00:00', '2021-07-23 16:59:59']], 'string': '周五下午4点'}}}
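For 'time_point' and 'time_span' results, the 'time' field is a pair of 'YYYY-MM-DD HH:MM:SS' strings, where 'inf' marks an open end. The sketch below turns such a pair into `datetime` objects. It operates on dictionaries copied from the output above, so it does not need JioNLP; `to_datetimes` is a hypothetical helper name, and mapping 'inf' to `None` is a design choice of this sketch, not library behavior.

```python
from datetime import datetime
from typing import Optional, Tuple

TIME_FORMAT = "%Y-%m-%d %H:%M:%S"


def to_datetimes(result: dict) -> Tuple[Optional[datetime], Optional[datetime]]:
    """Convert a time_point/time_span result into (start, end) datetimes.

    Hypothetical helper: an open end written as 'inf' is returned as None.
    """
    if result["type"] not in ("time_point", "time_span"):
        raise ValueError(f"unsupported result type: {result['type']}")

    def parse(value: str) -> Optional[datetime]:
        return None if value == "inf" else datetime.strptime(value, TIME_FORMAT)

    start, end = result["time"]
    return parse(start), parse(end)


# Result dictionary copied from the example output above.
span = {"type": "time_span", "definition": "blur",
        "time": ["2021-10-22 00:00:00", "inf"]}
print(to_datetimes(span))
```

Results of type 'time_delta' and 'time_period' have a different shape (nested dictionaries), so the helper rejects them rather than guessing.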
$ git clone https://github.com/dongrixinyu/JioNLP
$ cd ./JioNLP
$ pip install .
Import jionlp and check the main functions and their documentation:
>>> import jionlp as jio
>>> jio.help() # input the keywords, such as “回译”, which means back translation
>>> dir(jio)
>>> print(jio.extract_parentheses.__doc__)
On Linux, a command-line replacement for jio.help() is also available.
A star (⭐) marks a standout feature.

| Features | Function name | Description | Star |
|---|---|---|---|
| help search tool | help | if you are not sure which features JioNLP offers, this tool lets you search them by keyword | |
| time semantic parser | parse_time | get the timestamp and span of a given time text | ⭐ |
| keyphrase extraction | extract_keyphrase | extract the keyphrases of a given text | ⭐ |
| extractive summary | extract_summary | extract the summary of a given text | |
| stopwords filter | remove_stopwords | delete the stopwords from a word list generated from a text | ⭐ |
| sentence splitter | split_sentence | split a text into sentences | ⭐ |
| location parser | parse_location | get the province, city, county, town, and village name of a location text | ⭐ |
| telephone number parser | phone_location, cell_phone_location, landline_phone_location | get the province, city, and communication operator of a telephone number | |
| news location recognizer | recognize_location | get the country, province, city, and county names in a news text | ⭐ |
| solar lunar date conversion | lunar2solar, solar2lunar | convert a lunar (solar) date to the solar (lunar) date | |
| ID cards parser | parse_id_card | get the province, city, county, birthday, gender, and check code of a given Chinese ID card number | ⭐ |
| idiom solitaire | idiom_solitaire | a word game that chains Chinese idioms so that the first character of each idiom has the same pronunciation as the last character of the previous one | |
| traditional chars to simplified chars | tra2sim | convert traditional characters to their simplified versions | |
| simplified chars to traditional chars | sim2tra | convert simplified characters to their traditional versions | |
| characters to pinyin | pinyin | get the pinyin of Chinese characters to add pronunciation info to the NLP model input | ⭐ |
| characters to radical | char_radical | get the radical info of Chinese characters to add to the NLP model input | ⭐ |
| money numbers to chars | money_num2char | get the Chinese character form of a given money number | |
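The "check code" mentioned for the ID card parser is the 18th character of a Chinese resident ID number, computed from the first 17 digits with the ISO 7064 MOD 11-2 scheme. The following sketch shows that computation with the standard library only. It is not JioNLP's `parse_id_card`; the function names are hypothetical, and the ID number used is purely illustrative.

```python
# Position weights and check characters of the MOD 11-2 checksum
# used by 18-digit Chinese resident ID numbers.
WEIGHTS = (7, 9, 10, 5, 8, 4, 2, 1, 6, 3, 7, 9, 10, 5, 8, 4, 2)
CHECK_CHARS = "10X98765432"


def id_card_check_char(first17: str) -> str:
    """Compute the check character from the first 17 digits (hypothetical name)."""
    if len(first17) != 17 or not (first17.isascii() and first17.isdigit()):
        raise ValueError("expected exactly 17 ASCII digits")
    total = sum(int(digit) * weight for digit, weight in zip(first17, WEIGHTS))
    return CHECK_CHARS[total % 11]


def is_valid_id_card(id_card: str) -> bool:
    """Check only the length and checksum, not the region code or birthday."""
    if len(id_card) != 18:
        return False
    body, check = id_card[:17], id_card[17].upper()
    if not (body.isascii() and body.isdigit()):
        return False
    return id_card_check_char(body) == check


print(is_valid_id_card("11010519491231002X"))
```

A full parser would additionally look up the first six digits in an administrative region table and validate the embedded birth date; this sketch covers the checksum step alone.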
| Features | Function name | Description | Star |
|---|---|---|---|
| back translation | BackTranslation | get augmented text via back translation | ⭐ |
| swap char position | swap_char_position | get augmented text via swapping the position of adjacent chars | |
| homophone substitution | homophone_substitution | replace chars with the same pronunciation to get augmented text | ⭐ |
| randomly add & delete chars | random_add_delete | add and delete chars randomly in the text to get augmented text | |
| NER entity replacement | replace_entity | replace the entity of the text via dictionary to get augmented text | ⭐ |
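As an illustration of the simplest augmentation in the table, swapping adjacent characters, here is a standard-library sketch. It is not the library's `swap_char_position`; the function name and the `swap_ratio` and `seed` parameters are assumptions made for this example.

```python
import random
from typing import Optional


def swap_adjacent_chars(text: str, swap_ratio: float = 0.1,
                        seed: Optional[int] = None) -> str:
    """Return a copy of text with some adjacent character pairs swapped.

    Hypothetical sketch: each position is swapped with its right neighbor
    with probability swap_ratio.
    """
    rng = random.Random(seed)
    chars = list(text)
    i = 0
    while i < len(chars) - 1:
        if rng.random() < swap_ratio:
            chars[i], chars[i + 1] = chars[i + 1], chars[i]
            # Skip the swapped pair so no character moves more than one place.
            i += 2
        else:
            i += 1
    return "".join(chars)


original = "自然语言处理很有趣"
print(swap_adjacent_chars(original, swap_ratio=0.3, seed=0))
```

Because characters only move by one position, the augmented text stays close to the original, which is usually what is wanted when the label must remain valid.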
3. Key info extraction and parsing with regular expressions
6. Named Entity Recognition (NER) auxiliary tools
| Features | Function name | Description | Star |
|---|---|---|---|
| extract money entity | extract_money | extract money entity text from the given text | ⭐ |
| extract time entity | extract_time | extract time entity text from the given text | ⭐ |
| Lexicon NER | LexiconNER | get entities from the text via dictionary | ⭐ |
| entity to tag | entity2tag | convert entity info to tags for sequence labeling | |
| tag to entity | tag2entity | convert sequence labeling tags to entities | |
| char token to word token | char2word | convert char token data to word token data | |
| word token to char token | word2char | convert word token data to char token data | |
| entity compare | entity_compare | compare the predicted entities with the gold entities | ⭐ |
| NER acceleration of prediction | TokenSplitSentence, TokenBreakLongSentence, TokenBatchBucket | speed up NER prediction | ⭐ |
| split dataset | analyse_dataset | split a dataset into training, validation, and test parts and analyse the KL divergence between them | ⭐ |
| entity collector | collect_dataset_entities | collect all entities from a labeled dataset to build a dictionary | |
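The entity-to-tag conversion listed above can be sketched as follows, using the plain BIO scheme. This is not the library's `entity2tag`: the function name is hypothetical, and both the entity format (a dict with 'text', 'type', and a half-open 'offset' pair) and the tag scheme are assumptions chosen for the example.

```python
from typing import Dict, List, Tuple


def entities_to_bio_tags(text: str, entities: List[Dict]) -> Tuple[List[str], List[str]]:
    """Convert entity spans to per-character BIO tags (hypothetical sketch).

    Each entity is assumed to look like
    {'text': '贝壳', 'type': 'Company', 'offset': [start, end]}
    with end exclusive.
    """
    tags = ["O"] * len(text)
    for entity in entities:
        start, end = entity["offset"]
        # First character of the entity gets B-, the rest get I-.
        tags[start] = "B-" + entity["type"]
        for i in range(start + 1, end):
            tags[i] = "I-" + entity["type"]
    return list(text), tags


text = "浑水做空贝壳"
entities = [
    {"text": "浑水", "type": "Company", "offset": [0, 2]},
    {"text": "贝壳", "type": "Company", "offset": [4, 6]},
]
chars, tags = entities_to_bio_tags(text, entities)
for char, tag in zip(chars, tags):
    print(char, tag)
```

The sketch assumes entities do not overlap; with overlapping spans, later entities would silently overwrite earlier tags.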
| Features | Function name | Description | Star |
|---|---|---|---|
| Naive Bayes words analysis | analyse_freq_words | analyse the word frequencies of different classes with naive Bayes | ⭐ |
| split dataset | analyse_dataset | split a dataset into training, validation, and test parts and analyse the KL divergence between them | ⭐ |
9. Chinese Word Segmentation (CWS)

| Features | Function name | Description | Star |
|---|---|---|---|
| word to tag | cws.word2tag | convert the words list to a list of tags for CWS | |
| tag to word | cws.tag2word | convert the list of tags to a words list for CWS | |
| compute F1 | cws.f1 | compute F1 value of the CWS models | |
| CWS dataset corrector | cws.CWSDCWithStandardWords | correct the CWS datasets with dictionaries | |
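To show what an F1 score for word segmentation measures, the sketch below compares a predicted segmentation with a gold one by converting each word list to character spans and counting exact span matches. It is a generic word-level F1, not necessarily how `cws.f1` is computed, and the function names are hypothetical.

```python
from typing import List, Set, Tuple


def words_to_spans(words: List[str]) -> Set[Tuple[int, int]]:
    """Map a segmented word list to a set of (start, end) character spans."""
    spans = set()
    start = 0
    for word in words:
        end = start + len(word)
        spans.add((start, end))
        start = end
    return spans


def cws_f1(gold_words: List[str], pred_words: List[str]) -> float:
    """Word-level F1: a predicted word counts only if its span matches exactly."""
    gold = words_to_spans(gold_words)
    pred = words_to_spans(pred_words)
    correct = len(gold & pred)
    if correct == 0:
        return 0.0
    precision = correct / len(pred)
    recall = correct / len(gold)
    return 2 * precision * recall / (precision + recall)


gold = ["我", "喜欢", "北京"]
pred = ["我", "喜", "欢", "北京"]
print(round(cws_f1(gold, pred), 4))
```

Here two of the four predicted words match the three gold words, giving precision 0.5 and recall 2/3. Comparing spans rather than word strings avoids counting a word as correct when it appears at the wrong position.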
NLP preprocessing and parsing are essential but time-consuming, especially for Chinese. This library bundles tools for these tedious jobs so that you can focus on training models.
If you have suggestions or run into bugs, please raise an issue on GitHub.
You are welcome to join the WeChat group for NLP techniques.
To join, add the WeChat ID: dongrixinyu89
If this tool is useful for your work, please give it a star ⭐ on GitHub.
Or scan the PayPal or WeChat QR code to donate (●'◡'●) Thanks ~~