Tool for information extraction from Russian texts
This tool includes the following modules:
- Term extraction
- Relation extraction
- Entity linking
- Aspect extraction
To install:
git clone https://github.com/iis-research-team/Terminator.git
Before using the tool, download the following files:
- For term extraction, download the weights file from here and put it in terms_extractor/dl_extractor/weights.
- For relation extraction:
   2.1. Download the config file from here.
   2.2. Download the model file from here.
   2.3. Download the model arguments file from here.
   Put all three files in relation_extractor/dl_relation_extractor/weights.
- For entity linking:
   3.1. Download the preprocessed Wikidata dump from here, unzip it, and put it in entity_linker/wikidata_dump.
   3.2. Download the fastText model from here and put it in entity_linker/fasttext_model.
- For aspect extraction, download the weights file from here and put it in aspect_extractor/weights.
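Before running any module, it can help to check that the downloaded files ended up in the right places. The following is a minimal sketch, not part of the tool itself: `REQUIRED_PATHS` and `find_missing_paths` are hypothetical names, and the script assumes it is run from the repository root.

```python
from pathlib import Path

# Locations named in the download steps above, relative to the repository root.
REQUIRED_PATHS = [
    "terms_extractor/dl_extractor/weights",
    "relation_extractor/dl_relation_extractor/weights",
    "entity_linker/wikidata_dump",
    "entity_linker/fasttext_model",
    "aspect_extractor/weights",
]


def is_ready(path):
    """A directory is ready if it has at least one entry; a file if it exists."""
    if path.is_dir():
        return any(path.iterdir())
    return path.is_file()


def find_missing_paths(repo_root="."):
    """Return the required locations that are absent or empty (hypothetical helper)."""
    root = Path(repo_root)
    return [rel for rel in REQUIRED_PATHS if not is_ready(root / rel)]


if __name__ == "__main__":
    for rel in find_missing_paths():
        print(f"missing or empty: {rel}")
```

The check only looks at whether each location exists and is non-empty; it does not verify file names or contents.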
This module extracts terms from the raw text.
from terms_extractor.combined_extractor.combined_extractor import CombinedExtractor
combined_extractor = CombinedExtractor()
text = 'Научные вычисления включают прикладную математику (особенно численный анализ), вычислительную технику ' \
'(особенно высокопроизводительные вычисления) и математическое моделирование объектов изучаемых научной ' \
'дисциплиной.'
result = combined_extractor.extract(text)
for token, tag in result:
    print(f'{token} -> {tag}')
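Since the extractor returns one tag per token, multi-word terms have to be reassembled from consecutive tokens. The sketch below shows one way to do that. It assumes a BIO tagging scheme with the hypothetical labels `B-TERM`, `I-TERM` and `O`; the tags the extractor actually emits may differ, so check its output and adjust the defaults. `group_terms` is an illustrative helper, not part of the tool.

```python
def group_terms(tagged_tokens, begin_tag="B-TERM", inside_tag="I-TERM"):
    """Join consecutive tagged tokens into multi-word terms.

    ASSUMPTION: tags follow a BIO scheme with the hypothetical labels above.
    """
    terms, current = [], []
    for token, tag in tagged_tokens:
        if tag == begin_tag:
            # A new term starts; flush the previous one, if any.
            if current:
                terms.append(" ".join(current))
            current = [token]
        elif tag == inside_tag and current:
            # Continue the term that is currently open.
            current.append(token)
        else:
            # Any other tag (or a stray inside tag) closes the open term.
            if current:
                terms.append(" ".join(current))
            current = []
    if current:
        terms.append(" ".join(current))
    return terms


# Hand-written sample in the assumed format, not real extractor output.
sample = [
    ("Научные", "B-TERM"), ("вычисления", "I-TERM"), ("включают", "O"),
    ("прикладную", "B-TERM"), ("математику", "I-TERM"),
]
print(group_terms(sample))
```

With real output, the call would be `group_terms(result)` after `result = combined_extractor.extract(text)`.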
This module extracts the relation between two terms. The input text must mark the two terms with the special tokens <e1>...</e1> and <e2>...</e2>.
Example of relation extraction:
from relation_extractor.combined_relation_extractor.combined_relation_extractor import CombinedRelationExtractor
combined_extractor = CombinedRelationExtractor()
sample = '<e1>Модель</e1> используется в методе генерации и определения форм слов для решения ' \
'<e2>задач морфологического синтеза</e2> и анализа текстов.'
relation = combined_extractor.extract(sample)
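The marked-up input can be produced from plain text and a pair of terms. The sketch below is one way to do it; `mark_entities` is a hypothetical helper, and it assumes that both terms occur verbatim in the text, do not overlap, and that the first term appears before the second.

```python
def mark_entities(text, first_term, second_term):
    """Wrap the first occurrence of each term in <e1>/<e2> tags.

    ASSUMPTION: first_term precedes second_term and the two do not overlap.
    """
    start1 = text.find(first_term)
    if start1 == -1:
        raise ValueError(f"term not found: {first_term!r}")
    end1 = start1 + len(first_term)

    # Search for the second term only after the end of the first one.
    start2 = text.find(second_term, end1)
    if start2 == -1:
        raise ValueError(f"term not found after the first term: {second_term!r}")
    end2 = start2 + len(second_term)

    return (
        text[:start1] + "<e1>" + first_term + "</e1>"
        + text[end1:start2] + "<e2>" + second_term + "</e2>"
        + text[end2:]
    )


plain = ('Модель используется в методе генерации и определения форм слов для решения '
         'задач морфологического синтеза и анализа текстов.')
marked = mark_entities(plain, 'Модель', 'задач морфологического синтеза')
print(marked)
```

The resulting string has the same form as `sample` in the example above and can be passed to `extract`.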
This module links terms with entities in Wikidata. It requires extracted terms and their context as input.
from entity_linker.entity_linker import RussianEntityLinker
ru_el = RussianEntityLinker()
term = 'язык программирования Python'
context = ['язык программирования Python', 'использовался', 'в']
print(ru_el.get_linked_mention(term, context))
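In the example above, the context is a list that contains the term itself as a single element together with neighbouring tokens. The sketch below builds such a list from a tokenized sentence. It is only an illustration: `build_context` and the `window` parameter are hypothetical, the sample sentence is made up, and the context size the linker works best with is not stated here.

```python
def build_context(tokens, term, window=2):
    """Return the term plus up to `window` tokens on each side.

    ASSUMPTION: the term already appears in `tokens` as a single element.
    Raises ValueError if it does not.
    """
    index = tokens.index(term)
    start = max(0, index - window)
    return tokens[start:index + window + 1]


term = 'язык программирования Python'
# Hypothetical tokenized sentence with the term kept as one element.
tokens = [term, 'использовался', 'в', 'проекте']
context = build_context(tokens, term, window=2)
print(context)
```

The result can then be passed as the second argument of `get_linked_mention`.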
This module extracts aspects from the raw text.
from aspect_extractor import AspectExtractor
extractor = AspectExtractor()
text = "Определена модель для визуализации связей между объектами и их атрибутами в различных процессах. " \
"На основании модели разработан универсальный абстрактный компонент графического пользовательского интерфейса и приведены примеры его программной реализации. " \
"Также проведена апробация компонента для решения прикладной задачи по извлечению информации из документов."
result = extractor.extract(text)
for token, tag in result:
    print(f'{token} -> {tag}')
RuSERRC is a dataset of Russian scientific texts annotated with terms, aspects, linked entities, and relations.
If you find this repository useful, feel free to cite our papers:
Bruches E., Tikhobaeva O., Dementyeva Y., Batura T. TERMinator: A System for Scientific Texts Processing. In Proceedings of the 29th International Conference on Computational Linguistics (COLING 2022). International Committee on Computational Linguistics. 2022. pp. 3420–3426.
@inproceedings{terminator2022,
title={{TERM}inator: A System for Scientific Texts Processing},
author={Bruches, Elena and Tikhobaeva, Olga and Dementyeva, Yana and Batura, Tatiana},
booktitle={Proceedings of the 29th International Conference on Computational Linguistics},
year={2022},
pages={3420--3426}
}
Bruches E., Mezentseva A., Batura T. A system for information extraction from scientific texts in Russian. Data Analytics and Management in Data Intensive Domains. DAMDID/RCDL 2021. Communications in Computer and Information Science. Springer, Cham, 2022. vol. 1620. pp. 234–245.
@inproceedings{ruserrc,
title={A system for information extraction from scientific texts in Russian},
author={Bruches, Elena and Mezentseva, Anastasia and Batura, Tatiana},
booktitle={Data Analytics and Management in Data Intensive Domains. DAMDID/RCDL 2021. Communications in Computer and Information Science},
volume={1620},
pages={234--245},
year={2022}
}