Introduction

This program was created to address specific data processing tasks, including:

Generating Excel Files: Takes JSONL files containing multilingual data and produces a separate Excel file for each language, simplifying data analysis and presentation.
Creating Separate JSONL Files: Filters and splits JSONL data by category (e.g., test, train, dev) and generates a separate JSONL file for each category and language.
Generating Translation Data: Merges data from different languages, extracts translations from English (en) to the other languages (xx), and creates a structured JSON file for easy access and analysis.

Project Tasks

Question 1: python3 environment setup

In this section, you will set up the Python 3 environment and work with the MASSIVE dataset.

Task 1: Build a Python 3 project with a standard project structure, install the necessary dependencies in your preferred IDE (PyCharm, Visual Studio), then import the MASSIVE dataset from https://github.com/alexa/massive

Task 2: Generate "en-xx.xlsx" files for all languages, using the id, utt and annot_utt fields. Recursion is avoided because of its runtime overhead.
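One way to approach Task 2 is sketched below. It is not the repository's actual implementation: the function names, the file paths, and the en_/xx_ column prefixes are hypothetical, and it assumes each language file is a JSONL file whose records share ids with the English file. Records are paired with a plain loop rather than recursion. pandas is imported only inside the writer, which also needs an Excel engine such as openpyxl.

```python
import json

# Fields named in the task; "id" is the join key.
FIELDS = ("utt", "annot_utt")


def read_jsonl(path):
    """Yield one dict per non-empty line of a JSONL file."""
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:
                yield json.loads(line)


def pair_with_english(en_path, xx_path, fields=FIELDS):
    """Join English and target-language records on id, iteratively."""
    english = {rec["id"]: rec for rec in read_jsonl(en_path)}
    rows = []
    for rec in read_jsonl(xx_path):
        en = english.get(rec["id"])
        if en is None:
            continue  # no English counterpart for this id
        row = {"id": rec["id"]}
        for name in fields:
            row[f"en_{name}"] = en[name]   # hypothetical column naming
            row[f"xx_{name}"] = rec[name]
        rows.append(row)
    return rows


def write_xlsx(rows, out_path):
    """Write the paired rows to an Excel workbook."""
    import pandas as pd  # requires an Excel engine such as openpyxl

    pd.DataFrame(rows).to_excel(out_path, index=False)
```

Assuming the dataset was unpacked under a data directory, a call could look like write_xlsx(pair_with_english("data/en-US.jsonl", "data/sw-KE.jsonl"), "output/en-sw.xlsx"), repeated for each language file.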

Task 3: Provide the flags that run the solution in run_script.sh.

Question 2: Working with files

In this question, you will be manipulating JSON files to produce required outputs:

Task 1: Generate separate JSONL files for English (en), Swahili (sw) and German (de), split into test, train and dev.
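The split described in this task can be sketched as follows. This is an illustration under assumptions, not the repository's code: it assumes every record carries a "partition" field holding test, train or dev (as in the MASSIVE data files), and the function name and output file naming are hypothetical.

```python
import json
from collections import defaultdict
from pathlib import Path


def split_by_partition(src_path, out_dir, lang, key="partition"):
    """Write <lang>-<partition>.jsonl for every partition found in src_path.

    Returns a dict mapping each partition name to the file written for it.
    """
    buckets = defaultdict(list)
    with open(src_path, encoding="utf-8") as handle:
        for line in handle:
            if line.strip():
                record = json.loads(line)
                buckets[record[key]].append(record)

    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    written = {}
    for partition, records in buckets.items():
        target = out_dir / f"{lang}-{partition}.jsonl"  # hypothetical naming
        with open(target, "w", encoding="utf-8") as out:
            for record in records:
                # ensure_ascii=False keeps non-Latin text readable in the file
                out.write(json.dumps(record, ensure_ascii=False) + "\n")
        written[partition] = target
    return written
```

Calling it once per language, for example split_by_partition("data/de-DE.jsonl", "output", "de"), would produce de-test.jsonl, de-train.jsonl and de-dev.jsonl when all three partitions are present.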

Task 2: Generate a single JSON file showing all the translations from en to xx, with id and utt, for all the train sets (pretty-print the JSON file structure).
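A minimal sketch of the translation file follows. The output layout (one object per id with an "en" entry plus one entry per language code), the function names, and the "partition" field used to select the train set are assumptions made for illustration; the repository may structure its output differently.

```python
import json


def load_split_utts(path, split="train", key="partition"):
    """Map id -> utt for the records of one split in a JSONL file."""
    utts = {}
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            if line.strip():
                rec = json.loads(line)
                if rec[key] == split:
                    utts[rec["id"]] = rec["utt"]
    return utts


def build_translations(en_path, other_paths):
    """Pair each English train utterance with its translations.

    other_paths maps a language code (e.g. "sw") to that language's file.
    """
    english = load_split_utts(en_path)
    others = {lang: load_split_utts(p) for lang, p in other_paths.items()}
    result = []
    for rec_id, en_utt in english.items():
        entry = {"id": rec_id, "en": en_utt}
        for lang, utts in others.items():
            if rec_id in utts:  # skip ids missing from this language
                entry[lang] = utts[rec_id]
        result.append(entry)
    return result


def write_pretty_json(data, out_path):
    """Dump the data as indented, human-readable JSON."""
    with open(out_path, "w", encoding="utf-8") as out:
        json.dump(data, out, ensure_ascii=False, indent=4)
```

The indent argument of json.dump is what produces the pretty-printed structure the task asks for, and ensure_ascii=False keeps non-Latin utterances readable instead of escaping them.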

Prerequisites

python >= 3.11
pandas
absl-py

Installation Process

Clone the repo

git clone https://github.com/StevenJoel06/bwaku.git

Ensure you have Python and pip installed on your machine. Confirm with

python -V
pip -V

Ensure that Python is version 3.10 or above and pip is version 22 or above.

Install Pandas

pip install pandas

Install absl flags

pip install absl-py

To generate the output files, run the generator.sh file.
