This program was created to address specific data processing tasks, including:
-
Generating Excel Files: Can take JSONL files containing multilingual data and produce separate Excel files for each language, simplifying data analysis and presentation.
-
Creating Separate JSONL Files: Can filter and split JSONL data based on specified categories (e.g., test, train, dev) and generate separate JSONL files for each category and language.
-
Generating Translation Data: Can merge data from different languages, extract translations from English (en) to other languages (xx), and create a structured JSONL file for easy access and analysis.
In this section, you will set up the python 3 environment and work with the MASSIVE dataset
Task 1 : Build a python3 project with the structure of projects installing the necessary dependencies in preffered IDE (pycharm, visual studio) then import the MASSIVE dataset https://github.com/alexa/massive
Task 2 : generate "en-xx.xlxs" files for all languages, using id, utt and annot_utt. Recursion is not used due to its heavy time complexity.
Task 3 : have the flags running the solution in the run_script.sh
In this question, you will be manipulating JSON files to produce required outputs:
Task 1: generate seperate JSONL files for English (en), Swahili (sw) and German (de) with test, train and dev.
Task 2: generate a single JSON file showing all the translations from en to xx with id and utt for all the train sets(pretty print your json file structure)
-
python >= 3.11
-
pandas
-
absl-py
- Clone the repo
git clone https://github.com/StevenJoel06/bwaku.git
- Ensure you have Python installed on your machine as well as a pip. Confirm with
python -V pip -v
Ensure that the version is 3.10 and above And 22 for pip
- Install Pandas
pip install pandas
- Install absl flags
pip install absl-py
- To generate the output files run the generator.sh file