We use Conceptualizer to extract colexification patterns directly from a parallel corpus. This repository contains a more efficient version of Conceptualizer (forward pass + backward pass), introduced in the Conceptualizer paper.
.
├── README.md
├── association_finder.py
├── eva
│ ├── baseline_vectors
│ │ ├── other_word_vectors_process.ipynb
│ │ ├── sentence_classification.ipynb
│ │ ├── sentence_retrieval.ipynb
│ │ └── train_sentence_ID_script.py
│ ├── colexification_patterns
│ │ ├── clics_neighbors_dict.pickle
│ │ ├── conceptualizer_100_neighbors_dict.pickle
│ │ ├── conceptualizer_10_neighbors_dict.pickle
│ │ ├── conceptualizer_1_neighbors_dict.pickle
│ │ ├── conceptualizer_20_neighbors_dict.pickle
│ │ ├── conceptualizer_50_neighbors_dict.pickle
│ │ ├── conceptualizer_5_neighbors_dict.pickle
│ │ └── eva_colexification.ipynb
│ ├── round_trip
│ │ ├── round_trip.py
│ │ └── round_trip_min_langs.py
│ ├── sentence_classification
│ │ ├── sentence_classification.ipynb
│ │ └── sentence_classification.py
│ └── sentence_retrieval
│ ├── sentence_retrieval.ipynb
│ ├── sentence_retrieval.py
│ └── test_ids.txt
├── network_related
│ ├── __init__.py
│ ├── eflomal_network_builder.py
│ ├── eflomal_training.py
│ ├── iso2area.pickle
│ ├── iso2family.pickle
│ ├── network_builder.py
│ ├── train_different_min_language_embedding.py
│ └── updated
│ ├── NetworkAnalysis-Basic.ipynb
│ └── NetworkAnalysis-LanguageFamilies-Areas.ipynb
├── processing_concepts.py
├── processing_concepts_eng.py
├── processing_ngrams.py
└── processing_parallel.py
(1) Preprocess the parallel data to obtain all ngrams for each verse in all languages:
python -u processing_ngrams.py --updated_ngrams true --ignore_case true
(2) Create the parallel data required for the subsequent computation:
python -u processing_parallel.py --updated_ngrams true --src_lang eng
(3) Obtain valid concepts and their statistics (for English, we use lemmata as concepts):
python -u processing_concepts.py --updated_ngrams true --src_lang eng --ignore_case true
(4) Run the following command to extract colexifications with Conceptualizer:
python -u association_finder.py --updated_ngrams true --do_lemmatize true --lemmatizer_type spacy --ignore_case true --src_lang eng --use_multiprocessing true
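If you want to run all four steps unattended, a small driver like the following should work. This is just a convenience sketch; the commands and flags are copied verbatim from the steps above.
import subprocess

# Commands and flags copied verbatim from steps (1)-(4) above.
steps = [
    ["python", "-u", "processing_ngrams.py", "--updated_ngrams", "true", "--ignore_case", "true"],
    ["python", "-u", "processing_parallel.py", "--updated_ngrams", "true", "--src_lang", "eng"],
    ["python", "-u", "processing_concepts.py", "--updated_ngrams", "true", "--src_lang", "eng", "--ignore_case", "true"],
    ["python", "-u", "association_finder.py", "--updated_ngrams", "true", "--do_lemmatize", "true",
     "--lemmatizer_type", "spacy", "--ignore_case", "true", "--src_lang", "eng",
     "--use_multiprocessing", "true"],
]
for cmd in steps:
    subprocess.run(cmd, check=True)  # abort the pipeline if a step fails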
The first time, change into the network_related directory and run Python code like the following:
cd network_related
from network_builder import ConceptNetwork
considered_lang = 'all'
net = ConceptNetwork(involved_lang=considered_lang, load_directed_graph_from_path=False, use_updated=True)
net.store_net()
The code above stores a directed network based on the colexification patterns of all languages. Use the following code to load the network directly from disk (provided you have stored it using the code above):
net = ConceptNetwork(involved_lang=considered_lang, load_directed_graph_from_path=True, use_updated=True)
Specify the minimum number of languages (e.g., 50) required for a colexification edge to be included in the networks (both ColexNet and ColexNet+ are undirected) and run the following code:
colexnet = net.to_undirected(aggregate_type='union', minimum_number_of_langs=50)
colexnet_plus = net.expand_graph(minimum_number_of_langs=50)
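For a quick sanity check of the two resulting graphs, something like the following can be used, assuming the returned objects expose a networkx-style interface (an assumption; this README does not state the graph type):
# Assumes networkx-style graph objects (an assumption, not stated above).
print("ColexNet:", colexnet.number_of_nodes(), "nodes,", colexnet.number_of_edges(), "edges")
print("ColexNet+:", colexnet_plus.number_of_nodes(), "nodes,", colexnet_plus.number_of_edges(), "edges")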
To visualize the communities in ColexNet, please refer to ./network_related/updated/NetworkAnalysis-Basic.ipynb.
Alternatively, please visit our online demo for visualizations of communities and the concepts in ColexNet.
Run the following commands to train multilingual embeddings on ColexNet+ for different hyperparameters (1, 5, 10, 20, 50, and 100 as the minimum number of languages required to preserve a colexification edge):
cd network_related
python -u train_different_min_language_embedding.py
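To poke at the resulting vectors, a sketch like the following may work, assuming the embeddings are saved in word2vec text format; the file name and vocabulary entry below are hypothetical, so check the training script for the actual output path and key format.
from gensim.models import KeyedVectors

# Hypothetical path and key: both are assumptions, not taken from this repository.
vecs = KeyedVectors.load_word2vec_format("colexnet2vec_min50.txt")
print(vecs.most_similar("water", topn=5))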
You can find our published embeddings and networks here.
We use four strong baselines in our work: sentence_id, clique_word, nt_word and eflomal-aligned.
The embeddings of clique_word and nt_word can be downloaded here.
Run the following code to train sentence_id embeddings:
cd eva
cd baseline_vectors
python -u train_sentence_ID_script.py
To generate eflomal-aligned embeddings:
(1) First run the following code (inside the network_related directory, where network_builder.py lives) to create an alignment graph and store it on disk:
from network_builder import EflomalAlignmentNetwork
net = EflomalAlignmentNetwork(load_graph_from_path=False)
net.store_net()
net.store_vocab()
(2) Then run the following command to train node embeddings:
cd network_related
python -u eflomal_training.py
The ground-truth colexification patterns from CLICS can be found at ./eva/colexification_patterns/clics_neighbors_dict.pickle, and the colexification patterns identified in ColexNet can be found at ./eva/colexification_patterns/conceptualizer_50_neighbors_dict.pickle (for a minimum-language hyperparameter of 50, for example). Please then refer to ./eva/colexification_patterns/eva_colexification.ipynb for the evaluation code.
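For a quick look at the pattern files outside the notebook, a sketch like the following may help, assuming each pickle maps a concept to a collection of neighboring concepts (an assumption based on the file names):
import pickle

with open('./eva/colexification_patterns/clics_neighbors_dict.pickle', 'rb') as f:
    clics = pickle.load(f)
with open('./eva/colexification_patterns/conceptualizer_50_neighbors_dict.pickle', 'rb') as f:
    predicted = pickle.load(f)

# For concepts present in both, report the overlap of neighbor sets
# (assumes dict-of-neighbor-collections structure, as hedged above).
for concept in sorted(set(clics) & set(predicted)):
    gold, pred = set(clics[concept]), set(predicted[concept])
    print(concept, len(gold & pred), len(gold), len(pred))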
The code for round-trip translation can be found in ./eva/round_trip.
Run the following to reproduce the round-trip translation results for the sentence_id, clique_word, nt_word, eflomal-aligned, and ColexNet2Vec (minimum of 50 languages, by default) embeddings:
python -u round_trip.py
Run the following to reproduce the round-trip translation results for ColexNet2Vec embeddings under different hyperparameters (1, 5, 10, 20, 50, and 100 as the minimum number of languages required to preserve a colexification edge):
python -u round_trip_min_langs.py
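For intuition, the round-trip idea can be sketched as follows (a toy numpy illustration, not the repository's script): map a source word to its nearest neighbor in the target-language embedding matrix, map that word back, and count a hit if the original word is recovered.
import numpy as np

def nearest_index(vec, mat):
    # Index of the row of mat most cosine-similar to vec.
    sims = mat @ vec / (np.linalg.norm(mat, axis=1) * np.linalg.norm(vec) + 1e-9)
    return int(np.argmax(sims))

def round_trip_hit(i, src_mat, tgt_mat):
    # Source word i -> nearest target word -> nearest source word; hit if we return to i.
    j = nearest_index(src_mat[i], tgt_mat)
    k = nearest_index(tgt_mat[j], src_mat)
    return k == i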
The code for sentence retrieval can be found in ./eva/sentence_retrieval.
Run the following to process the sentence retrieval dataset:
python -u sentence_retrieval.py
Please then refer to ./eva/sentence_retrieval/sentence_retrieval.ipynb for the evaluation code for ColexNet2Vec, and to ./eva/baseline_vectors/sentence_retrieval.ipynb for the evaluation code for the other embeddings.
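As a rough illustration of the retrieval setup (a toy sketch under the assumption that sentences are represented by averaged word vectors; the notebooks define the actual procedure): given a query sentence in one language, rank candidate sentences in another language by cosine similarity.
import numpy as np

def sentence_vec(tokens, vecs, dim):
    # Average the vectors of in-vocabulary tokens; zero vector if none are covered.
    hits = [vecs[t] for t in tokens if t in vecs]
    return np.mean(hits, axis=0) if hits else np.zeros(dim)

def retrieve(query_vec, candidate_mat):
    # Index of the candidate sentence most cosine-similar to the query.
    sims = candidate_mat @ query_vec / (
        np.linalg.norm(candidate_mat, axis=1) * np.linalg.norm(query_vec) + 1e-9)
    return int(np.argmax(sims))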
The code for sentence classification can be found in ./eva/sentence_classification.
Run the following to process the sentence classification dataset:
python -u sentence_classification.py
Please then refer to ./eva/sentence_classification/sentence_classification.ipynb for the evaluation code for ColexNet2Vec, and to ./eva/baseline_vectors/sentence_classification.ipynb for the evaluation code for the other embeddings.
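As a rough illustration of the classification setup (a toy sketch assuming averaged word vectors as sentence features and a linear classifier; the notebooks define the actual procedure):
import numpy as np
from sklearn.linear_model import LogisticRegression

def featurize(sentences, vecs, dim):
    # One averaged word vector per tokenized sentence; zeros if nothing is in vocabulary.
    feats = []
    for tokens in sentences:
        hits = [vecs[t] for t in tokens if t in vecs]
        feats.append(np.mean(hits, axis=0) if hits else np.zeros(dim))
    return np.vstack(feats)

# train_sents/test_sents (token lists) and y_train/y_test are hypothetical splits:
# clf = LogisticRegression(max_iter=1000).fit(featurize(train_sents, vecs, dim), y_train)
# print(clf.score(featurize(test_sents, vecs, dim), y_test))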
Please cite [1] and [2] if you find the resources in this repository useful.
[1] Y. Liu, H. Ye, L. Weissweiler, R. Pei, H. Schuetze. Crosslingual Transfer Learning for Low-Resource Languages Based on Multilingual Colexification Graphs. Findings of EMNLP 2023.
@inproceedings{liu-etal-2023-crosslingual-transfer,
title = "Crosslingual Transfer Learning for Low-Resource Languages Based on Multilingual Colexification Graphs",
author = "Liu, Yihong and
Ye, Haotian and
Weissweiler, Leonie and
Pei, Renhao and
Schuetze, Hinrich",
editor = "Bouamor, Houda and
Pino, Juan and
Bali, Kalika",
booktitle = "Findings of the Association for Computational Linguistics: EMNLP 2023",
month = dec,
year = "2023",
address = "Singapore",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2023.findings-emnlp.562",
doi = "10.18653/v1/2023.findings-emnlp.562",
pages = "8376--8401",
}
[2] Y. Liu, H. Ye, L. Weissweiler, P. Wicke, R. Pei, R. Zangenfeind, H. Schuetze. A Crosslingual Investigation of Conceptualization in 1335 Languages. ACL 2023.
@inproceedings{liu-etal-2023-crosslingual,
title = "A Crosslingual Investigation of Conceptualization in 1335 Languages",
author = {Liu, Yihong and
Ye, Haotian and
Weissweiler, Leonie and
Wicke, Philipp and
Pei, Renhao and
Zangenfeind, Robert and
Sch{\"u}tze, Hinrich},
booktitle = "Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)",
month = jul,
year = "2023",
address = "Toronto, Canada",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2023.acl-long.726",
doi = "10.18653/v1/2023.acl-long.726",
pages = "12969--13000",
}