"Follow the 2 steps listed below in order -
Create the (intermediate) manifest file using code_switching_manifest_creation.py. Its usage is as follows:
python code_switching_manifest_creation.py --manifest_language1 <absolute path of Language 1's manifest file> --manifest_language2 <absolute path of Language 2's manifest file> --manifest_save_path --id_language1 <language code for language 1 (e.g. en)> --id_language2 <language code for language 2 (e.g. es)> --max_sample_duration_sec --min_sample_duration_sec --dataset_size_required_hrs
Estimated runtime for dataset_size_required_hrs=10,000 is ~2 mins
Create the synthetic audio data and the corresponding manifest file using code_switching_audio_data_creation.py. Its usage is as follows:
python code_switching_audio_data_creation.py --manifest_path <absolute path to intermediate CS manifest generated in step 1> --audio_save_folder_path --manifest_save_path --audio_normalized_amplitude --cs_data_sampling_rate --sample_beginning_pause_msec --sample_joining_pause_msec <pause to be added between segments while joining, in milli seconds> --sample_end_pause_msec --is_lid_manifest <boolean to create manifest in the multi-sample lid format for the text field, true by default> --workers
Example of the multi-sample LID format: [{"str": "esta muestra ", "lang": "es"}, {"str": "was generated synthetically", "lang": "en"}]
Estimated runtime for generating a 10,000 hour corpus is ~40 hrs with a single worker"
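For what it's worth, the multi-sample LID text field quoted above is plain JSON, so it can be read back with the standard json module. A minimal sketch using the example entry from the quoted docs (the text_field string is just that example, corrected to valid JSON):

```python
import json

# A "text" field in the multi-sample LID format, as produced by
# code_switching_audio_data_creation.py when --is_lid_manifest is true.
text_field = (
    '[{"str": "esta muestra ", "lang": "es"},'
    ' {"str": "was generated synthetically", "lang": "en"}]'
)

segments = json.loads(text_field)

# Reconstruct the full code-switched transcript...
full_text = "".join(seg["str"] for seg in segments)

# ...and the per-segment language tags.
langs = [seg["lang"] for seg in segments]

print(full_text)  # esta muestra was generated synthetically
print(langs)      # ['es', 'en']
```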
After following these two steps, how do I configure train_ds and validation_ds in the config below to use the created synthetic code-switched dataset and manifest?
name: "FastConformer-CTC-BPE"

model:
  sample_rate: 16000
  log_prediction: true # enables logging sample predictions in the output during training
  ctc_reduction: 'mean_volume'
  skip_nan_grad: false

  train_ds:
    manifest_filepath: ???
    sample_rate: ${model.sample_rate}
    batch_size: 16 # you may increase batch_size if your memory allows
    shuffle: true
    num_workers: 8
    pin_memory: true
    max_duration: 16.7
    min_duration: 0.1
    # tarred datasets
    is_tarred: false
    tarred_audio_filepaths: null
    shuffle_n: 2048
    # bucketing params
    bucketing_strategy: "fully_randomized"
    bucketing_batch_size: null
    is_code_switched: true
    code_switched:
      min_duration: 12
      max_duration: 20
      min_monolingual: 0.3
      probs: [0.5, 0.5]
      force_monochannel: true
      sampling_scales: 0.75
      seed: 123

  validation_ds:
    manifest_filepath: ???
    sample_rate: ${model.sample_rate}
    batch_size: 16 # you may increase batch_size if your memory allows
    shuffle: false
    use_start_end_token: false
    num_workers: 8
    pin_memory: true
    is_code_switched: true
    code_switched:
      min_duration: 12
      max_duration: 20
      min_monolingual: 0.3
      probs: [0.5, 0.5]
      force_monochannel: true
      sampling_scales: 0.75
      seed: 123

  test_ds:
    manifest_filepath: null
    sample_rate: ${model.sample_rate}
    batch_size: 16
    shuffle: false
    use_start_end_token: false
    num_workers: 8
    pin_memory: true

  tokenizer:
    type: agg
    langs:
      en:
        dir: ???
        type: bpe
      ar:
        dir: ???
        type: bpe
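For context, my current guess is that the two manifest_filepath fields simply point at the manifests written by code_switching_audio_data_creation.py via --manifest_save_path, roughly like the fragment below (the paths are placeholders of my own, not from the docs), but I'm unsure whether the code_switched sub-config values also need to change for a pre-generated synthetic dataset:

```yaml
# Guess only; paths are hypothetical placeholders
model:
  train_ds:
    manifest_filepath: /data/cs_synth/train_manifest.json
  validation_ds:
    manifest_filepath: /data/cs_synth/dev_manifest.json
```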