Skip to content

A project that analyses jsonl files to generate and create new excel and jsonl files

Notifications You must be signed in to change notification settings

Tetioo/Language-processor

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

29 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

Introduction

This program was created to address specific data processing tasks, including:

  • Generating Excel Files: Can take JSONL files containing multilingual data and produce separate Excel files for each language, simplifying data analysis and presentation.

  • Creating Separate JSONL Files: Can filter and split JSONL data based on specified categories (e.g., test, train, dev) and generate separate JSONL files for each category and language.

  • Generating Translation Data: Can merge data from different languages, extract translations from English (en) to other languages (xx), and create a structured JSONL file for easy access and analysis.

Project Tasks

Question 1: python3 environment setup

In this section, you will set up the python 3 environment and work with the MASSIVE dataset

Task 1 : Build a python3 project with the structure of projects installing the necessary dependencies in preffered IDE (pycharm, visual studio) then import the MASSIVE dataset https://github.com/alexa/massive

Task 2 : generate "en-xx.xlxs" files for all languages, using id, utt and annot_utt. Recursion is not used due to its heavy time complexity.

Task 3 : have the flags running the solution in the run_script.sh

Question 2: Working with files

In this question, you will be manipulating JSON files to produce required outputs:

Task 1: generate seperate JSONL files for English (en), Swahili (sw) and German (de) with test, train and dev.

Task 2: generate a single JSON file showing all the translations from en to xx with id and utt for all the train sets(pretty print your json file structure)

Prerequisites

  • python >= 3.11

  • pandas

  • absl-py

Installation Process

  1. Clone the repo

git clone https://github.com/StevenJoel06/bwaku.git

  1. Ensure you have Python installed on your machine as well as a pip. Confirm with

python -V pip -v

Ensure that the version is 3.10 and above And 22 for pip

  1. Install Pandas

pip install pandas

  1. Install absl flags

pip install absl-py

  1. To generate the output files run the generator.sh file

About

A project that analyses jsonl files to generate and create new excel and jsonl files

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages