Skip to content

Latest commit

 

History

History
23 lines (14 loc) · 948 Bytes

README.md

File metadata and controls

23 lines (14 loc) · 948 Bytes

OccGen_dataset

OccGen dataset and its metadata, we have two usecases for our toolkit. The translations_HR_data.csv which represents the parallel data of EN, ES, AR, RU. The translations_LR_data.csv which represents the parallel data of EN, SW.

All the previous data is annotated and is ready to be used for evaluation.

The OccGen toolkit is available in this github link: https://github.com/mt-upc/OccGen_toolkit/

Training data is released under the following links:

En-Ar 122k: https://drive.google.com/file/d/18SP18I-9wb0H_ibrdP9eBc3hVfgk4q9X/view?usp=sharing

En-Es 1M: https://drive.google.com/file/d/1wahfqTiEgD89wHqnZuq2X4syDk15EgTz/view?usp=sharing

En-Ru 433k:https://drive.google.com/file/d/16KnpUWtbCkoXOEIwUJKFuYqK8woQPZDg/view?usp=sharing

Gender distribution in our training data can be seen in the below heatmap:

heatmap_genders

Gender definitions extracted from Wikipedia.