A Multimodal French Corpus of Aligned Speech, Text, and Pictogram Sequences for Speech-to-Pictogram Machine Translation
This repository contains the scripts to run the formalism from the LREC-Coling 2024 paper "A Multimodal French Corpus of Aligned Speech, Text, and Pictogram Sequences for Speech-to-Pictogram Machine Translation".
⭐ Do not hesitate to test it and report any bugs, feedback, or results.
The goal of our PictoGrammar model is to translate a speech transcription (or any text) into a sequence of Arasaac pictograms.
PictoGrammar relies on a set of models: (1) SpaCy to tokenize, lemmatize, and POS-tag the input, (2) a Named Entity Recognition (NER) model based on CamemBERT, and (3) a Word Sense Disambiguation (WSD) model.
- Python >= 3.9
- PyTorch >= 1.10
- Transformers >= 4.26
- SpaCy French model -- fr_dep_news_trf
- WSD model -- https://github.com/macairececile/nwsd
git clone https://github.com/macairececile/picto_grammar.git
cd picto_grammar/
pip install -r requirements.txt
git clone https://github.com/macairececile/nwsd.git
export PYTHONPATH=$PYTHONPATH:/path_to_nwsd/nwsd/src
Then, download the WSD model from https://cloud.univ-grenoble-alpes.fr/s/XECiw4gmEbGDprD and decompress it into the picto_grammar/data/ folder.
The repository is organized in 4 folders:
- src/ -- python scripts to run the grammar.
- img/ -- images.
- data/ -- folder with the data used in the paper.
- examples/ -- folder with examples of output files generated by the grammar.
- Input data format : a .csv file with two tab-separated columns (see example file in examples/input.csv)
clips | text |
---|---|
cefc-tcof-Acc_del_07-118 | mh il y a pas longtemps j'ai revu une tante |
cefc-tcof-Acc_del_07-112 | oh ben ouais euh enfin c'est je sais |
cefc-tcof-Acc_del_07-166 | tu dis euh un pneu de voiture |
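As a minimal sketch of how such an input file could be produced and read back, the snippet below uses Python's standard `csv` module with a tab delimiter. The column names `clips` and `text` come from the table above; the helper names `write_input` and `read_input` are hypothetical, and the exact way `src/grammar.py` loads the file is not shown here, so treat this as an illustration of the format only.

```python
import csv
import io


def write_input(rows, handle):
    """Write (clip_id, text) pairs as a tab-separated file with a header row."""
    writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
    writer.writerow(["clips", "text"])
    writer.writerows(rows)


def read_input(handle):
    """Read the tab-separated file back into a list of dicts keyed by column name."""
    return list(csv.DictReader(handle, delimiter="\t"))


rows = [
    ("cefc-tcof-Acc_del_07-118", "mh il y a pas longtemps j'ai revu une tante"),
    ("cefc-tcof-Acc_del_07-166", "tu dis euh un pneu de voiture"),
]

# An in-memory buffer stands in for a file on disk.
buffer = io.StringIO()
write_input(rows, buffer)
buffer.seek(0)
records = read_input(buffer)
```

To write to disk instead, replace the buffer with `open("my_input.csv", "w", newline="", encoding="utf-8")` (the file name is a placeholder).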
- Output data format :
A .csv file with 5 tab-separated columns (see example file in examples/output.csv):
clips | text | text_process | pictos | tokens |
---|---|---|---|---|
cefc-tcof-Acc_del_07-118 | mh il y a pas longtemps j'ai revu une tante | mh il y a pas longtemps j'ai revu une tante | [9839, 9001, 5526, 37678, 6632, 37163, 6564, 8474, 30276] | passé il_y_a non longtemps me une_autre_fois voir une tante |
cefc-tcof-Acc_del_07-112 | oh ben ouais euh enfin c'est je sais | oh ben oui euh enfin c'est je sais | [5584, 7095, 36480, 6632, 16885] | oui celui-là être me savoir |
cefc-tcof-Acc_del_07-166 | tu dis euh un pneu de voiture | tu dis euh un pneu de voiture | [6625, 9693, 2627, 37072, 7074, 2339] | toi dire un pneu de voiture |
A .html file to visualize the generated pictogram sequence per utterance (see example file in examples/out.html).
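The output .csv can be post-processed with a few lines of Python. The sketch below assumes that the `pictos` column is serialized as a Python-style list of integers and that `tokens` is whitespace-separated with one token per pictogram identifier, which holds for the three example rows above but is not guaranteed by this README. The function name `parse_output` is hypothetical.

```python
import ast
import csv
import io


def parse_output(handle):
    """Yield one dict per utterance, pairing each token with its pictogram id."""
    for row in csv.DictReader(handle, delimiter="\t"):
        # Assumption: the pictos column holds a list literal such as "[6625, 9693]".
        pictos = ast.literal_eval(row["pictos"])
        tokens = row["tokens"].split()
        if len(pictos) != len(tokens):
            raise ValueError(
                f"{row['clips']}: {len(pictos)} pictogram ids for {len(tokens)} tokens"
            )
        yield {"clip": row["clips"], "pairs": list(zip(tokens, pictos))}


# One row copied from the example table, written as tab-separated text.
sample = (
    "clips\ttext\ttext_process\tpictos\ttokens\n"
    "cefc-tcof-Acc_del_07-166\ttu dis euh un pneu de voiture\t"
    "tu dis euh un pneu de voiture\t[6625, 9693, 2627, 37072, 7074, 2339]\t"
    "toi dire un pneu de voiture\n"
)
parsed = list(parse_output(io.StringIO(sample)))
```

Note that filler words such as "euh" produce no pictogram, so the token sequence is shorter than the original utterance.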
python src/grammar.py --wn_file "data/dico/index.sense" --no_transl "data/dico/no_translation.csv" --wsd "data/wsd_model/" --lexicon "data/dico/lexique.csv" --data "examples/input.csv" --out "examples/out.csv" --tags "data/dico/tags.csv"
An out.html file will be generated to see the output sequence.
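For readers who want to build their own visualization from the `pictos` and `tokens` columns, here is a rough sketch of rendering one utterance as HTML. It is not the layout produced by `src/grammar.py`. The function name `render_utterance` is hypothetical, and the image URL pattern is an assumption that should be checked against the ARASAAC documentation before use.

```python
from html import escape

# Assumption: ARASAAC serves pictogram images under this URL pattern.
ARASAAC_URL = "https://static.arasaac.org/pictograms/{id}/{id}_300.png"


def render_utterance(clip, tokens, pictos, url_template=ARASAAC_URL):
    """Return an HTML fragment with one captioned image per (token, pictogram id) pair."""
    cells = "".join(
        f'<figure><img src="{url_template.format(id=p)}" alt="{escape(t)}">'
        f"<figcaption>{escape(t)}</figcaption></figure>"
        for t, p in zip(tokens, pictos)
    )
    return f"<section><h2>{escape(clip)}</h2>{cells}</section>"


fragment = render_utterance(
    "cefc-tcof-Acc_del_07-166",
    ["toi", "dire", "un", "pneu", "de", "voiture"],
    [6625, 9693, 2627, 37072, 7074, 2339],
)
```

Passing a different `url_template` makes it possible to point at locally downloaded images instead of a remote server.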
@inproceedings{macaire_lrec2024,
title = {A Multimodal French Corpus of Aligned Speech, Text, and Pictogram Sequences for Speech-to-Pictogram Machine Translation},
author = {Macaire, C{\'e}cile and Dion, Chlo{\'e} and Arrigo, Jordan and Lemaire, Claire and Esperan{\c c}a-Rodier, Emmanuelle and Lecouteux, Benjamin and Schwab, Didier},
url = {https://hal.science/hal-04534234},
booktitle = {LREC-Coling},
address = {Turin, Italy},
year = {2024},
month = May,
keywords = {Pictograms ; Speech ; Machine Translation},
pdf = {https://hal.science/hal-04534234/file/1210_Paper_LREC_Coling_Macaire.pdf},
hal_id = {hal-04534234}
}