A Multimodal French Corpus of Aligned Speech, Text, and Pictogram Sequences for Speech-to-Pictogram Machine Translation

This repository contains the scripts to run the grammar described in the LREC-Coling 2024 paper "A Multimodal French Corpus of Aligned Speech, Text, and Pictogram Sequences for Speech-to-Pictogram Machine Translation".

⭐ Feel free to test it and report any bugs, feedback, or results.

🔍 Overview

The goal of PictoGrammar is to translate a speech transcription (or any text) into a sequence of ARASAAC pictograms.

PictoGrammar relies on three components: (1) SpaCy to tokenize, lemmatize, and POS-tag the input, (2) a Named Entity Recognition (NER) model based on CamemBERT, and (3) a Word Sense Disambiguation (WSD) model.

⚙️ Requirements and Installation

Python >= 3.9
PyTorch >= 1.10
Transformers >= 4.26
SpaCy French model -- fr_dep_news_trf
WSD model -- https://github.com/macairececile/nwsd

git clone https://github.com/macairececile/picto_grammar.git
cd picto_grammar/
pip install -r requirements.txt
git clone https://github.com/macairececile/nwsd.git
export PYTHONPATH=$PYTHONPATH:/path_to_nwsd/nwsd/src

Then, download the WSD model from https://cloud.univ-grenoble-alpes.fr/s/XECiw4gmEbGDprD and decompress it into the picto_grammar/data/ folder.

📉 Running PictoGrammar

The repository is organized in four folders:

src/ -- python scripts to run the grammar.
img/ -- images.
data/ -- folder with the data used in the paper.
examples/ -- folder with examples of output files generated by the grammar.

Data format

Input data format: a .csv file with two tab-separated columns, clips (utterance identifier) and text (transcription); see the example file in examples/input.csv.

clips	text
cefc-tcof-Acc_del_07-118	mh il y a pas longtemps j'ai revu une tante
cefc-tcof-Acc_del_07-112	oh ben ouais euh enfin c'est je sais
cefc-tcof-Acc_del_07-166	tu dis euh un pneu de voiture

Output data format:

A .csv file with five tab-separated columns (see example file in examples/output.csv):

clips	text	text_process	pictos	tokens
cefc-tcof-Acc_del_07-118	mh il y a pas longtemps j'ai revu une tante	mh il y a pas longtemps j'ai revu une tante	[9839, 9001, 5526, 37678, 6632, 37163, 6564, 8474, 30276]	passé il_y_a non longtemps me une_autre_fois voir une tante
cefc-tcof-Acc_del_07-112	oh ben ouais euh enfin c'est je sais	oh ben oui euh enfin c'est je sais	[5584, 7095, 36480, 6632, 16885]	oui celui-là être me savoir
cefc-tcof-Acc_del_07-166	tu dis euh un pneu de voiture	tu dis euh un pneu de voiture	[6625, 9693, 2627, 37072, 7074, 2339]	toi dire un pneu de voiture

A .html file to visualize the generated pictogram sequence per utterance (see example file in examples/out.html).
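The .csv output can be post-processed with a few lines of Python. The sketch below assumes that the pictos column holds a Python-style list of pictogram identifiers and that it lines up one-to-one with the space-separated tokens column, as is the case in the example rows above; the grammar itself may not guarantee this for every utterance. The function name read_translations is hypothetical.

```python
import ast
import csv
import io

# One row copied from the example output above (tab-separated).
SAMPLE = (
    "clips\ttext\ttext_process\tpictos\ttokens\n"
    "cefc-tcof-Acc_del_07-166\ttu dis euh un pneu de voiture\t"
    "tu dis euh un pneu de voiture\t[6625, 9693, 2627, 37072, 7074, 2339]\t"
    "toi dire un pneu de voiture\n"
)


def read_translations(handle):
    """Return (clip_id, [(token, picto_id), ...]) for each output row.

    Hypothetical helper, not part of the repository.
    """
    reader = csv.DictReader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    results = []
    for row in reader:
        # literal_eval safely parses the "[6625, 9693, ...]" string into a list.
        picto_ids = ast.literal_eval(row["pictos"])
        tokens = row["tokens"].split()
        # Assumption: one pictogram identifier per token, as in the examples.
        if len(picto_ids) != len(tokens):
            raise ValueError(f"{row['clips']}: pictos and tokens differ in length")
        results.append((row["clips"], list(zip(tokens, picto_ids))))
    return results


translations = read_translations(io.StringIO(SAMPLE))
for clip_id, pairs in translations:
    print(clip_id, pairs)
```

Pairing each token with its identifier makes it easier to inspect which pictogram was chosen for which lemma or multi-word expression.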

Use the grammar

python src/grammar.py --wn_file "data/dico/index.sense" --no_transl "data/dico/no_translation.csv" --wsd "data/wsd_model/" --lexicon "data/dico/lexique.csv" --data "examples/input.csv" --out "examples/out.csv" --tags "data/dico/tags.csv"

An out.html file is also generated to visualize the output pictogram sequences.
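When the grammar has to be run on several input files, the command can be assembled from Python instead of being typed by hand. The following is only a sketch: the flags and resource paths are copied from the command above, while build_grammar_command and run_grammar are hypothetical helper names, and the script is assumed to be launched from the repository root.

```python
import subprocess
import sys

# Resource paths copied from the command shown above.
RESOURCES = {
    "--wn_file": "data/dico/index.sense",
    "--no_transl": "data/dico/no_translation.csv",
    "--wsd": "data/wsd_model/",
    "--lexicon": "data/dico/lexique.csv",
    "--tags": "data/dico/tags.csv",
}


def build_grammar_command(data_path, out_path):
    """Build the argument list for src/grammar.py (hypothetical helper)."""
    command = [sys.executable, "src/grammar.py"]
    for flag, value in RESOURCES.items():
        command += [flag, value]
    command += ["--data", data_path, "--out", out_path]
    return command


def run_grammar(data_path, out_path):
    """Run the grammar; assumes the current directory is the repository root."""
    # check=True raises CalledProcessError if the script exits with an error.
    subprocess.run(build_grammar_command(data_path, out_path), check=True)


command = build_grammar_command("examples/input.csv", "examples/out.csv")
print(" ".join(command))
```

Calling run_grammar("examples/input.csv", "examples/out.csv") would then execute the same command as above, provided the requirements and the WSD model are installed.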

📝 Citation

@inproceedings{macaire_lrec2024,
  title = {A Multimodal French Corpus of Aligned Speech, Text, and Pictogram Sequences for Speech-to-Pictogram Machine Translation},
  author = {Macaire, C{\'e}cile and Dion, Chlo{\'e} and Arrigo, Jordan and Lemaire, Claire and Esperan{\c c}a-Rodier, Emmanuelle and Lecouteux, Benjamin and Schwab, Didier},
  url = {https://hal.science/hal-04534234},
  booktitle = {LREC-Coling},
  address = {Turin, Italy},
  year = {2024},
  month = May,
  keywords = {Pictograms ; Speech ; Machine Translation},
  pdf = {https://hal.science/hal-04534234/file/1210_Paper_LREC_Coling_Macaire.pdf},
  hal_id = {hal-04534234}
}
