Using pre-trained BERT embeddings with EmbedRank for unsupervised keyword extraction.
Create a conda environment with Python 3.7
conda create --name keyword_extraction python=3.7
Activate the environment
conda activate keyword_extraction
Install requirements
sh install_dependencies.sh
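The usage example below loads spaCy's en_core_web_lg model. If install_dependencies.sh does not already download it (an assumption about the script's contents), you can fetch it with a small snippet like this:

```python
# Hypothetical helper: make sure the large English spaCy model used in the
# usage example is available (equivalent to `python -m spacy download en_core_web_lg`).
import spacy
from spacy.cli import download

try:
    spacy.load("en_core_web_lg")
except OSError:
    download("en_core_web_lg")
```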
List of released pre-trained BERT models:

Model | Details |
---|---|
BERT-Base, Uncased | 12-layer, 768-hidden, 12-heads, 110M parameters |
BERT-Large, Uncased | 24-layer, 1024-hidden, 16-heads, 340M parameters |
BERT-Base, Cased | 12-layer, 768-hidden, 12-heads, 110M parameters |
BERT-Large, Cased | 24-layer, 1024-hidden, 16-heads, 340M parameters |
BERT-Base, Multilingual Cased (New) | 104 languages, 12-layer, 768-hidden, 12-heads, 110M parameters |
BERT-Base, Multilingual Cased (Old) | 102 languages, 12-layer, 768-hidden, 12-heads, 110M parameters |
BERT-Base, Chinese | Chinese Simplified and Traditional, 12-layer, 768-hidden, 12-heads, 110M parameters |
Download and unzip a pre-trained checkpoint, e.g. BERT-Base, Cased
wget https://storage.googleapis.com/bert_models/2018_10_18/cased_L-12_H-768_A-12.zip
unzip cased_L-12_H-768_A-12.zip
Start the BERT embedding service
sh run_bert_service.sh
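run_bert_service.sh is assumed here to wrap bert-as-service (the bert_serving client is used in the example below). If you prefer starting the server from Python instead of the shell script, a minimal sketch using the bert-serving-server API looks like this; the model directory and worker count are assumptions you should adapt:

```python
# Minimal sketch: start bert-as-service over the unzipped BERT checkpoint.
# Requires the bert-serving-server package; -model_dir must point to the
# directory produced by `unzip cased_L-12_H-768_A-12.zip`.
from bert_serving.server import BertServer
from bert_serving.server.helper import get_args_parser

args = get_args_parser().parse_args([
    '-model_dir', 'cased_L-12_H-768_A-12',
    '-num_worker', '1',  # number of parallel BERT workers (assumption)
])
server = BertServer(args)
server.start()
```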
Usage example:

import spacy
from bert_serving.client import BertClient
from model.embedrank_transformers import EmbedRankTransformers

if __name__ == '__main__':
    bc = BertClient(output_fmt='list')
    nlp = spacy.load("en_core_web_lg", disable=['ner'])
    fi = EmbedRankTransformers(nlp=nlp,
                               dnn=bc,
                               perturbation='replacement',
                               emb_method='subtraction',
                               mmr_beta=0.55,
                               top_n=10,
                               alias_threshold=0.8)

    text = """
Evaluation of existing and new feature recognition algorithms. 2. Experimental
results
For pt.1 see ibid., p.839-851. This is the second of two papers investigating
the performance of general-purpose feature detection techniques. The
first paper describes the development of a methodology to synthesize
possible general feature detection face sets. Six algorithms resulting
from the synthesis have been designed and implemented on a SUN
Workstation in C++ using ACIS as the geometric modelling system. In
this paper, extensive tests and comparative analysis are conducted on
the feature detection algorithms, using carefully selected components
from the public domain, mostly from the National Design Repository. The
results show that the new and enhanced algorithms identify face sets
that previously published algorithms cannot detect. The tests also show
that each algorithm can detect, among other types, a certain type of
feature that is unique to it. Hence, most of the algorithms discussed
in this paper would have to be combined to obtain complete coverage
"""
    marked_target, keywords, keyword_relevance = fi.fit(text)
    print(marked_target)
    print(f'Keywords: {keywords}')
    print(f'Keyword Relevance: {keyword_relevance}')

    print(fi.extract_keywords(text))
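For intuition about the mmr_beta and top_n arguments above: EmbedRank-style extractors score candidate phrases by embedding similarity to the document and then pick a diverse top-N with Maximal Marginal Relevance (MMR). The following is an illustrative sketch only, not the EmbedRankTransformers implementation; the function names, cosine similarity, and the use of plain numpy arrays (e.g. vectors returned by BertClient.encode) are assumptions:

```python
import numpy as np

def cosine(a, b):
    # Cosine similarity with a small epsilon to avoid division by zero.
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

def mmr_select(doc_emb, cand_embs, candidates, beta=0.55, top_n=10):
    """Greedy MMR: beta weighs relevance to the document against
    redundancy with already selected keyphrases."""
    relevance = [cosine(e, doc_emb) for e in cand_embs]
    selected, remaining = [], list(range(len(candidates)))
    while remaining and len(selected) < top_n:
        def mmr_score(i):
            redundancy = max((cosine(cand_embs[i], cand_embs[j]) for j in selected), default=0.0)
            return beta * relevance[i] - (1 - beta) * redundancy
        best = max(remaining, key=mmr_score)
        selected.append(best)
        remaining.remove(best)
    return [(candidates[i], relevance[i]) for i in selected]
```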
You can evaluate the model on many different datasets using the script below; see here for more details. (WARNING: if run_evaluation fails at line 149 in build_printable, printable[qrel] = pd.DataFrame(raw, columns=['app', *(table.columns.levels[1].get_values())[:-1]]), replace the .get_values() method with .values, or downgrade pandas to a version that still has it.)
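For reference, the fix described above is a one-line change in the evaluation script (the surrounding names printable, qrel, raw and table come from that script; only the .get_values() to .values swap is the point):

```python
# Before (fails on newer pandas, where Index.get_values() was removed):
# printable[qrel] = pd.DataFrame(raw, columns=['app', *(table.columns.levels[1].get_values())[:-1]])

# After:
printable[qrel] = pd.DataFrame(raw, columns=['app', *(table.columns.levels[1].values)[:-1]])
```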
python -m run_evaluation
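The tables below report precision, recall and F1 over the top-N extracted keyphrases (plus mean average precision). As a reminder of what F1_5, P_10, recall_5, etc. mean, here is a minimal sketch of set-based metrics at N, assuming exact matching between extracted and gold keyphrases (the actual evaluation script may normalize or stem phrases differently):

```python
def scores_at_n(extracted, gold, n):
    """Set-based precision/recall/F1 over the top-n extracted keyphrases."""
    top, gold = set(extracted[:n]), set(gold)
    tp = len(top & gold)
    p = tp / max(len(top), 1)
    r = tp / max(len(gold), 1)
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1

# Example: P_5 / recall_5 / F1_5 for a single document
print(scores_at_n(
    ['feature detection', 'algorithms', 'acis', 'face sets', 'geometric modelling'],
    ['feature detection', 'face sets', 'national design repository'],
    n=5,
))
```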
Evaluation on Inspec
| | Models | F1_10 | F1_15 | F1_5 | F1_all | P_10 | P_15 | P_5 | map_10 | map_15 | map_5 | map_all | recall_10 | recall_15 | recall_5 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | RAKE | 0.206600 bl | 0.220100 bl | 0.152400 bl | 0.220100 bl | 0.250400 bl | 0.216900 bl | 0.282300 bl | 0.100100 bl | 0.115100 bl | 0.070500 bl | 0.115100 bl | 0.188100 bl | 0.236900 bl | 0.110300 bl |
1 | YAKE | 0.176300 ▼ | 0.187800 ▼ | 0.144500 ▼ | 0.187800 ▼ | 0.208300 ▼ | 0.181400 ▼ | 0.261700 ▼ | 0.092000 ▼ | 0.104000 ▼ | 0.072700 | 0.104000 ▼ | 0.165800 ▼ | 0.214100 ▼ | 0.105400 ᐁ |
2 | MultiPartiteRank | 0.186600 ▼ | 0.201300 ▼ | 0.156000 | 0.201300 ▼ | 0.221000 ▼ | 0.190600 ▼ | 0.285600 | 0.101700 | 0.114100 | 0.081100 ▲ | 0.114100 | 0.171200 ▼ | 0.216600 ▼ | 0.113000 |
3 | TopicalPageRank | 0.226800 ▲ | 0.241000 ▲ | 0.174100 ▲ | 0.241000 ▲ | 0.272700 ▲ | 0.233700 ▲ | 0.319600 ▲ | 0.116500 ▲ | 0.133500 ▲ | 0.084200 ▲ | 0.133500 ▲ | 0.206600 ▲ | 0.257900 ▲ | 0.126100 ▲ |
4 | TopicRank | 0.177900 ▼ | 0.186800 ▼ | 0.149000 | 0.186800 ▼ | 0.211100 ▼ | 0.175300 ▼ | 0.272300 | 0.093800 ▼ | 0.103000 ▼ | 0.075100 ᐃ | 0.103000 ▼ | 0.161300 ▼ | 0.195600 ▼ | 0.107800 |
5 | SingleRank | 0.224200 ▲ | 0.237900 ▲ | 0.170900 ▲ | 0.237900 ▲ | 0.269600 ▲ | 0.231400 ▲ | 0.313500 ▲ | 0.114400 ▲ | 0.131200 ▲ | 0.082600 ▲ | 0.131200 ▲ | 0.204800 ▲ | 0.256300 ▲ | 0.123800 ▲ |
6 | TextRank | 0.123500 ▼ | 0.127200 ▼ | 0.097500 ▼ | 0.127200 ▼ | 0.140900 ▼ | 0.106500 ▼ | 0.177800 ▼ | 0.050600 ▼ | 0.052900 ▼ | 0.040900 ▼ | 0.052900 ▼ | 0.102100 ▼ | 0.113100 ▼ | 0.068900 ▼ |
7 | KPMiner | 0.013400 ▼ | 0.013400 ▼ | 0.013300 ▼ | 0.013400 ▼ | 0.011700 ▼ | 0.007800 ▼ | 0.022900 ▼ | 0.006600 ▼ | 0.006600 ▼ | 0.006600 ▼ | 0.006600 ▼ | 0.008400 ▼ | 0.008400 ▼ | 0.008200 ▼ |
8 | TFIDF | 0.135900 ▼ | 0.153800 ▼ | 0.100400 ▼ | 0.153800 ▼ | 0.157100 ▼ | 0.146000 ▼ | 0.176400 ▼ | 0.059300 ▼ | 0.069900 ▼ | 0.043900 ▼ | 0.069900 ▼ | 0.129700 ▼ | 0.178100 ▼ | 0.074100 ▼ |
9 | KEA | 0.123000 ▼ | 0.134900 ▼ | 0.095200 ▼ | 0.134900 ▼ | 0.142700 ▼ | 0.128700 ▼ | 0.166600 ▼ | 0.053600 ▼ | 0.061300 ▼ | 0.041300 ▼ | 0.061300 ▼ | 0.117400 ▼ | 0.156100 ▼ | 0.070500 ▼ |
10 | EmbedRank | 0.258400 ▲ | 0.275100 ▲ | 0.204900 ▲ | 0.275100 ▲ | 0.314700 ▲ | 0.266800 ▲ | 0.384200 ▲ | 0.144400 ▲ | 0.165200 ▲ | 0.106200 ▲ | 0.165200 ▲ | 0.231900 ▲ | 0.288500 ▲ | 0.146700 ▲ |
11 | SIFRank | 0.265200 ▲ | 0.276800 ▲ | 0.198700 ▲ | 0.276800 ▲ | 0.323000 ▲ | 0.270800 ▲ | 0.368300 ▲ | 0.143600 ▲ | 0.163800 ▲ | 0.099900 ▲ | 0.163800 ▲ | 0.238500 ▲ | 0.291400 ▲ | 0.143100 ▲ |
12 | SIFRankPlus | 0.257700 ▲ | 0.275000 ▲ | 0.197100 ▲ | 0.275000 ▲ | 0.311500 ▲ | 0.268300 ▲ | 0.364000 ▲ | 0.142700 ▲ | 0.164400 ▲ | 0.102400 ▲ | 0.164400 ▲ | 0.233000 ▲ | 0.290300 ▲ | 0.142200 ▲ |
13 | EmbedRankBERT | 0.226400 ▲ | 0.226400 ▲ | 0.169800 ▲ | 0.226400 ▲ | 0.271900 ▲ | 0.181300 ▼ | 0.314700 ▲ | 0.112900 ▲ | 0.112900 | 0.081100 ▲ | 0.112900 | 0.202800 ▲ | 0.202800 ▼ | 0.122500 ▲ |
14 | EmbedRankSentenceBERT | 0.237200 ▲ | 0.246900 ▲ | 0.191500 ▲ | 0.246900 ▲ | 0.288100 ▲ | 0.235500 ▲ | 0.357800 ▲ | 0.130400 ▲ | 0.144800 ▲ | 0.097800 ▲ | 0.144800 ▲ | 0.214700 ▲ | 0.257000 ▲ | 0.137700 ▲ |
Evaluation on SemEval2017
| | Models | F1_10 | F1_15 | F1_5 | F1_all | P_10 | P_15 | P_5 | map_10 | map_15 | map_5 | map_all | recall_10 | recall_15 | recall_5 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | RAKE | 0.216700 bl | 0.246500 bl | 0.140200 bl | 0.246500 bl | 0.299600 bl | 0.272200 bl | 0.309500 bl | 0.093700 bl | 0.114600 bl | 0.058200 bl | 0.114600 bl | 0.179000 bl | 0.240200 bl | 0.093700 bl |
1 | YAKE | 0.171900 ▼ | 0.199500 ▼ | 0.114000 ▼ | 0.199500 ▼ | 0.235900 ▼ | 0.219300 ▼ | 0.249100 ▼ | 0.073400 ▼ | 0.088400 ▼ | 0.049900 ▼ | 0.088400 ▼ | 0.143300 ▼ | 0.196300 ▼ | 0.076600 ▼ |
2 | MultiPartiteRank | 0.213100 | 0.238600 | 0.161600 ▲ | 0.238600 | 0.297000 | 0.264200 | 0.358600 ▲ | 0.106400 ▲ | 0.125700 ᐃ | 0.077000 ▲ | 0.125700 ᐃ | 0.175600 | 0.231900 | 0.108100 ▲ |
3 | TopicalPageRank | 0.253100 ▲ | 0.289400 ▲ | 0.173000 ▲ | 0.289400 ▲ | 0.350900 ▲ | 0.319300 ▲ | 0.382200 ▲ | 0.124600 ▲ | 0.152900 ▲ | 0.081500 ▲ | 0.152900 ▲ | 0.208700 ▲ | 0.281400 ▲ | 0.115900 ▲ |
4 | TopicRank | 0.203300 ᐁ | 0.222400 ▼ | 0.159600 ▲ | 0.222400 ▼ | 0.285600 | 0.247600 ▼ | 0.357800 ▲ | 0.100500 | 0.116500 | 0.075300 ▲ | 0.116500 | 0.166300 ᐁ | 0.213400 ▼ | 0.106200 ▲ |
5 | SingleRank | 0.248100 ▲ | 0.286300 ▲ | 0.170000 ▲ | 0.286300 ▲ | 0.343800 ▲ | 0.316400 ▲ | 0.373200 ▲ | 0.120700 ▲ | 0.149300 ▲ | 0.078600 ▲ | 0.149300 ▲ | 0.204500 ▲ | 0.278000 ▲ | 0.114000 ▲ |
6 | TextRank | 0.132800 ▼ | 0.149300 ▼ | 0.091300 ▼ | 0.149300 ▼ | 0.185000 ▼ | 0.158400 ▼ | 0.206500 ▼ | 0.050100 ▼ | 0.057100 ▼ | 0.035400 ▼ | 0.057100 ▼ | 0.107000 ▼ | 0.134700 ▼ | 0.060700 ▼ |
7 | KPMiner | 0.032200 ▼ | 0.032200 ▼ | 0.032000 ▼ | 0.032200 ▼ | 0.034100 ▼ | 0.022900 ▼ | 0.066900 ▼ | 0.016300 ▼ | 0.016400 ▼ | 0.016100 ▼ | 0.016400 ▼ | 0.018900 ▼ | 0.019100 ▼ | 0.018700 ▼ |
8 | TFIDF | 0.166900 ▼ | 0.180200 ▼ | 0.131500 | 0.180200 ▼ | 0.235500 ▼ | 0.200900 ▼ | 0.297400 | 0.076700 ▼ | 0.087500 ▼ | 0.058100 | 0.087500 ▼ | 0.137200 ▼ | 0.175400 ▼ | 0.087600 |
9 | KEA | 0.151800 ▼ | 0.160200 ▼ | 0.122200 ▼ | 0.160200 ▼ | 0.214000 ▼ | 0.178400 ▼ | 0.276300 ᐁ | 0.069400 ▼ | 0.077400 ▼ | 0.053800 | 0.077400 ▼ | 0.124700 ▼ | 0.156200 ▼ | 0.081600 ▼ |
10 | EmbedRank | 0.252200 ▲ | 0.286200 ▲ | 0.182300 ▲ | 0.286200 ▲ | 0.352300 ▲ | 0.316800 ▲ | 0.406500 ▲ | 0.131800 ▲ | 0.158600 ▲ | 0.090600 ▲ | 0.158600 ▲ | 0.206800 ▲ | 0.276400 ▲ | 0.121700 ▲ |
11 | SIFRank | 0.286700 ▲ | 0.322300 ▲ | 0.196600 ▲ | 0.322300 ▲ | 0.397200 ▲ | 0.356600 ▲ | 0.431600 ▲ | 0.150300 ▲ | 0.184400 ▲ | 0.097700 ▲ | 0.184400 ▲ | 0.235800 ▲ | 0.311600 ▲ | 0.131700 ▲ |
12 | SIFRankPlus | 0.273400 ▲ | 0.314500 ▲ | 0.189300 ▲ | 0.314500 ▲ | 0.378100 ▲ | 0.347400 ▲ | 0.412600 ▲ | 0.140200 ▲ | 0.174300 ▲ | 0.092300 ▲ | 0.174300 ▲ | 0.225500 ▲ | 0.304800 ▲ | 0.127200 ▲ |
13 | EmbedRankBERT | 0.234000 ▲ | 0.234000 ᐁ | 0.155500 ▲ | 0.234000 ᐁ | 0.324500 ▲ | 0.216400 ▼ | 0.345200 ▲ | 0.105700 ▲ | 0.105700 ᐁ | 0.068700 ▲ | 0.105700 ᐁ | 0.191300 ▲ | 0.191300 ▼ | 0.103800 ▲ |
14 | EmbedRankSentenceBERT | 0.252700 ▲ | 0.281700 ▲ | 0.172100 ▲ | 0.281700 ▲ | 0.351300 ▲ | 0.307600 ▲ | 0.383800 ▲ | 0.122600 ▲ | 0.146400 ▲ | 0.079500 ▲ | 0.146400 ▲ | 0.207600 ▲ | 0.268000 ▲ | 0.114400 ▲ |
7. SIFRank evaluation scores (the evaluation script is taken from the original SIFRank repo)
Evaluation results on Inspec
| | Models | F1.10 | F1.15 | F1.5 | P.10 | P.15 | P.5 | R.10 | R.15 | R.5 | time |
---|---|---|---|---|---|---|---|---|---|---|---|
0 | EmbedRankBERT | 0.3085 | 0.3374 | 0.2191 | 0.3068 | 0.2823 | 0.3248 | 0.3103 | 0.4192 | 0.1653 | 228.411 |
1 | EmbedRankSentenceBERT | 0.3263 | 0.3463 | 0.2539 | 0.3245 | 0.2897 | 0.3764 | 0.3282 | 0.4302 | 0.1916 | 101.341 |
2 | EmbedRank | 0.3271 | 0.339 | 0.2678 | 0.327 | 0.2868 | 0.3973 | 0.3272 | 0.4143 | 0.202 | 29.233 |
3 | SIFRank | 0.3444 | 0.3499 | 0.254 | 0.3437 | 0.2949 | 0.3769 | 0.3451 | 0.4302 | 0.1916 | 264.976 |
4 | SIFRankPlus | 0.3176 | 0.3421 | 0.2408 | 0.317 | 0.2883 | 0.3572 | 0.3182 | 0.4206 | 0.1816 | 265.089 |
Evaluation results on SemEval2017
| | Models | F1.10 | F1.15 | F1.5 | P.10 | P.15 | P.5 | R.10 | R.15 | R.5 | time |
---|---|---|---|---|---|---|---|---|---|---|---|
0 | EmbedRankTransformers | 0.2211 | 0.2745 | 0.137 | 0.3018 | 0.2957 | 0.3055 | 0.1745 | 0.2562 | 0.0883 | 293.412 |
1 | EmbedRankSentenceBERT | 0.2416 | 0.2863 | 0.1605 | 0.3298 | 0.3084 | 0.3578 | 0.1906 | 0.2672 | 0.1034 | 118.951 |
2 | EmbedRank | 0.2586 | 0.2962 | 0.1807 | 0.3536 | 0.3199 | 0.4037 | 0.2038 | 0.2758 | 0.1164 | 27.3123 |
3 | SIFRank | 0.2917 | 0.3328 | 0.1918 | 0.399 | 0.3592 | 0.4285 | 0.2299 | 0.31 | 0.1236 | 354.831 |
4 | SIFRankPlus | 0.2719 | 0.3185 | 0.1827 | 0.3719 | 0.3438 | 0.4081 | 0.2143 | 0.2966 | 0.1177 | 352.871 |
8. SIFRank evaluation scores (from the original source) plus my models' scores
F1 scores at N=5 (first N extracted keywords)
Models | Inspec | SemEval2017 | DUC2001 |
---|---|---|---|
TFIDF | 11.28 | 12.70 | 9.21 |
YAKE | 15.73 | 11.84 | 10.61 |
TextRank | 24.39 | 16.43 | 13.94 |
SingleRank | 24.69 | 18.23 | 21.56 |
TopicRank | 22.76 | 17.10 | 20.37 |
PositionRank | 25.19 | 18.23 | 24.95 |
Multipartite | 23.05 | 17.39 | 21.86 |
RVA | 21.91 | 19.59 | 20.32 |
EmbedRankBERT | 23.31 | 14.60 | N/A |
EmbedRankSentenceBERT | 25.39 | 16.05 | N/A |
EmbedRank d2v | 27.20 | 20.21 | 21.74 |
SIFRank | 29.11 | 22.59 | 24.27 |
SIFRank+ | 28.49 | 21.53 | 30.88 |
https://github.com/hanxiao/bert-as-service
https://monkeylearn.com/keyword-extraction/
https://arxiv.org/pdf/1801.04470.pdf
https://github.com/liaad/keep
https://github.com/LIAAD/KeywordExtractor-Datasets