AudioAugmentor

Python library for augmenting audio data

EXAMPLE 1

This library is designed to augment audio data for machine learning purposes. It combines several tools and libraries for audio data augmentation and provides a unified interface that can be used to apply a large set of audio augmentations in one place.

The library is designed to be used with the PyTorch machine learning framework. It can also augment plain audio waveforms on their own, as well as local audio datasets stored on disk.

This library combines the following libraries and tools: audiomentations, torch-audiomentations, torchaudio, pyroomacoustics and ffmpeg-python.

Available augmentations

The table below shows which library is used to apply each audio augmentation or codec.

audiomentations torch-audiomentations torchaudio pyroomacoustics ffmpeg-python
AddBackgroundNoise
AddColoredNoise
AddGaussianNoise
AddShortNoises
AdjustDuration
AirAbsorption
ApplyImpulseResponse
BandPassFilter
BandStopFilter
ClippingDistortion
FrequencyMasking
Volume / Gain
GainTransition
HighPassFilter
HighShelfFilter
Limiter
LoudnessNormalization
LowPassFilter
LowShelfFilter
Mp3Compression
MelSpectrogram
Normalize
Padding
PeakNormalization
PeakingFilter
PitchShift
PolarityInversion
Time inversion
ApplyRIR (RoomSimulator)
SevenBandParametricEQ
Shift
Speed
Spectrogram
TanhDistortion
TimeMasking
TimeStretch
ac3
adpcm_ima_wav
adpcm_ms
adpcm_yamaha
eac3
flac
libmp3lame
mp2
pcm_alaw
pcm_f32le
pcm_mulaw
pcm_s16le
pcm_s24le
pcm_s32le
pcm_u8
wmav1
wmav2
g726
gsm
amr

Usage

For a more complex example, see the example Colab notebook above or the Jupyter notebook AudioAugmentor_Usage_Example.ipynb in the examples directory of this repository.

Note: AudioAugmentor was mainly tested with Python 3.11.8 on Fedora 38. (Google Colab uses Python 3.10 and Ubuntu.)

0. You need to install the library and necessary packages first

!!!You may need to run the following commands with sudo!!!

If so, install these packages manually in a terminal.

pip install -U pip
pip install AudioAugmentor
dnf install -y sox                # FEDORA
dnf install -y sox-devel          # FEDORA
dnf install -y ffmpeg             # FEDORA
# apt-get install -y sox          # UBUNTU
# apt-get install -y libsox-dev   # UBUNTU
# apt-get install -y ffmpeg       # UBUNTU
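After installation, it can help to confirm that the command-line tools are visible from Python before running any augmentation. The following is a minimal sketch using only the standard library. The tool names sox and ffmpeg come from the install commands above; the helper name find_tools is made up for this example and is not part of AudioAugmentor.

```python
import shutil


def find_tools(names=("sox", "ffmpeg")):
    """Return a mapping of tool name -> absolute path, or None if not on PATH."""
    return {name: shutil.which(name) for name in names}


if __name__ == "__main__":
    for tool, path in find_tools().items():
        # shutil.which returns None when the executable cannot be found.
        status = path if path else "NOT FOUND (install it with dnf or apt-get)"
        print(f"{tool}: {status}")
```

If a tool is reported as not found, install it with the package manager commands listed above.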

1. Import necessary libraries

import torch
import torchaudio
import numpy as np
import audiomentations as AA
from IPython.display import Audio, display

from AudioAugmentor import transf_gen
from AudioAugmentor import sox_parser
from AudioAugmentor import core
from AudioAugmentor import rir_setup
from AudioAugmentor import torchaudio_transf_wrapper as TTW

2. Define the augmentations you want to apply to your audio data.

There are three ways to define the augmentations:

a) Use the transf_gen.transf_gen function to generate a list of transformations.

See the table of supported transformations and the examples for every augmentation to find out which parameters each augmentation method needs.

You can enter augmentation parameters as a string or as a dictionary.

PitchShift='sample_rate=16000, n_steps=[1, 1.5, 0.1], p=1.0'

PitchShift={'sample_rate': 16000, 'n_steps': [1, 1.5, 0.1], 'p': 1.0}

sampling_rate = 16000  # example sample rate in Hz; referenced by the examples in this README

transformations = transf_gen.transf_gen(verbose=True,
                                        PitchShift='sample_rate=16000, n_steps=[1, 1.5, 0.1], p=1.0',
                                        Speed={'orig_freq': 16000, 'factor': [0.9, 1.5, 0.1], 'p': 1},
                                        LowPassFilter={'min_cutoff_freq': 700, 'max_cutoff_freq': 800, 'sample_rate': sampling_rate, 'p': 1},
)
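To see that the string form and the dictionary form carry the same information, the sketch below converts one into the other using Python's own parser. This is only an illustration: it is not how transf_gen parses its arguments internally (that is not documented here), and the helper name params_from_string is made up for this example.

```python
import ast


def params_from_string(text):
    """Parse 'a=1, b=[1, 2], c="x"' into a dict of Python literals."""
    # Wrap the text in a dummy call so Python's parser splits the keyword arguments.
    call = ast.parse(f"_({text})", mode="eval").body
    # literal_eval only accepts literals (numbers, strings, lists, ...), never arbitrary code.
    return {kw.arg: ast.literal_eval(kw.value) for kw in call.keywords}


as_string = 'sample_rate=16000, n_steps=[1, 1.5, 0.1], p=1.0'
as_dict = {'sample_rate': 16000, 'n_steps': [1, 1.5, 0.1], 'p': 1.0}

print(params_from_string(as_string) == as_dict)  # True
```

The dictionary form is the only option when a value is not a plain literal, for example an object such as AA.PolarityInversion().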

b) Use pseudo SoX command. SoX command must be in this format:

--sox="norm gain 0 highpass 1000 phaser 0.5 0.6 1 0.45 0.6 -s"

(Use this form when no codec should be applied after the SoX effects.)

OR

--sox="norm gain 20 highpass 300 phaser 0.5 0.6 1 0.45 0.6 -s" amr audio_bitrate 4.75k

(Use this form to apply a codec after the SoX effects. The codec is entered in the form codec_name codec_parameter_name codec_parameter_value directly after the SoX effects command.)

example_sox = '--sox="norm gain 20 highpass 300 phaser 0.5 0.6 1 0.45 0.6 -s" amr audio_bitrate 4.75k'
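The structure of a pseudo SoX command can be made explicit by splitting it into its three parts: the SoX effects, the optional codec name, and the optional codec parameters. The sketch below does this with the standard library's shlex module. It is not the library's sox_parser, and the function name split_pseudo_sox is made up for this example.

```python
import shlex


def split_pseudo_sox(command):
    """Split a pseudo SoX command into (effects, codec, codec_params)."""
    # shlex keeps the quoted effects string together as a single token.
    tokens = shlex.split(command)
    if not tokens or not tokens[0].startswith("--sox="):
        raise ValueError('command must start with --sox="..."')
    effects = tokens[0][len("--sox="):].split()
    codec = tokens[1] if len(tokens) > 1 else None
    rest = tokens[2:]
    if len(rest) % 2:
        raise ValueError("codec parameters must come in name/value pairs")
    codec_params = dict(zip(rest[0::2], rest[1::2]))
    return effects, codec, codec_params


effects, codec, params = split_pseudo_sox(
    '--sox="norm gain 20 highpass 300 phaser 0.5 0.6 1 0.45 0.6 -s" amr audio_bitrate 4.75k'
)
print(effects[:3], codec, params)
```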

c) Use a file with multiple pseudo SoX commands. A random SoX command from this file will be chosen and applied to your data.

The file must be loaded using the sox_parser.load_sox_file function.

sox_file_content_to_write = '''--sox="norm gain 0 highpass 1000 phaser 0.5 0.6 1 0.45 0.6 -s"
#--sox="norm gain 20 highpass 300 phaser 0.5 0.6 1 0.45 0.6 -s"
--sox="norm gain 20 highpass 300 phaser 0.5 0.6 1 0.45 0.6 -s" gsm
--sox="norm gain 20 highpass 300 phaser 0.5 0.6 1 0.45 0.6 -s" amr audio_bitrate 4.75k
'''
with open('sox_file_example.txt', 'w') as f:
    f.write(sox_file_content_to_write)

sox_file_content = sox_parser.load_sox_file('sox_file_example.txt')
print('SOX FILE LOADED:', sox_file_content, type(sox_file_content))
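The random selection described above can be pictured with the following sketch. It assumes that lines starting with # are treated as comments and skipped, which is inferred from the example file and not confirmed by the library documentation. The function name pick_sox_command is made up for this example, and the real return type of load_sox_file may differ.

```python
import random


def pick_sox_command(file_text, rng=random):
    """Pick one non-empty, non-commented line at random."""
    candidates = [
        line.strip()
        for line in file_text.splitlines()
        # Assumption: '#' marks a disabled command, as in the example file.
        if line.strip() and not line.lstrip().startswith("#")
    ]
    if not candidates:
        raise ValueError("no usable SoX commands found")
    return rng.choice(candidates)


example_text = '''--sox="norm gain 0 highpass 1000 phaser 0.5 0.6 1 0.45 0.6 -s"
#--sox="norm gain 20 highpass 300 phaser 0.5 0.6 1 0.45 0.6 -s"
--sox="norm gain 20 highpass 300 phaser 0.5 0.6 1 0.45 0.6 -s" gsm
'''
print(pick_sox_command(example_text, random.Random(0)))
```

Passing a seeded random.Random instance makes the choice reproducible, which is convenient when debugging an augmentation pipeline.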

3. Apply augmentations

a) Pass the generated transformations list, a single SoX command, or the loaded SoX file content when initializing the Collator class.

Use the resulting instance as the collate_fn argument of PyTorch's DataLoader.

collate_fn = core.Collator(
    transformations=transformations, device='cpu', sox_effects=None, sample_rate=sampling_rate, verbose=True,
    #transformations=None, device='cpu', sox_effects='--sox="norm gain 20 highpass 300 phaser 0.5 0.6 1 0.45 0.6 -s" amr audio_bitrate 4.75k', sample_rate=sampling_rate, verbose=False,
    #transformations=None, device='cpu', sox_effects=sox_file_content, sample_rate=sampling_rate, verbose=False,
)

dataset = torchaudio.datasets.LIBRISPEECH("../data", url="train-clean-100", download=True)
aug_dataloader = torch.utils.data.DataLoader(
    dataset,
    batch_size=1,
    num_workers=0,
    collate_fn=collate_fn,
)
augmented_record_from_dataset = next(iter(aug_dataloader))
display(Audio(augmented_record_from_dataset[0].squeeze(0).squeeze(0).squeeze(0).cpu(), rate=sampling_rate))
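PyTorch's DataLoader calls collate_fn with a list of dataset samples and returns whatever that callable produces. The sketch below is a plain-Python stand-in that illustrates this contract. It is not the library's Collator, whose output layout may differ; the names SimpleCollator and halve are made up for this example, and lists stand in for tensors so the snippet runs without PyTorch.

```python
class SimpleCollator:
    """Callable that applies `augment` to the waveform of every sample in a batch."""

    def __init__(self, augment):
        self.augment = augment

    def __call__(self, batch):
        # Assumption: each sample is a tuple whose first element is the waveform.
        # The remaining fields (sample rate, transcript, ...) pass through unchanged.
        return [(self.augment(sample[0]),) + tuple(sample[1:]) for sample in batch]


def halve(waveform):
    """Toy augmentation: scale every sample by 0.5."""
    return [0.5 * x for x in waveform]


collate = SimpleCollator(halve)
batch = [([0.2, -0.4], 16000, "HELLO"), ([1.0], 16000, "WORLD")]
print(collate(batch))
```

Because the augmentation runs inside collate_fn, every batch drawn from the DataLoader is augmented on the fly and the stored dataset stays untouched.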

OR

b) Pass the generated transformations list, a single SoX command, or the loaded SoX file content when initializing the AugmentWaveform class, then apply the augmentations to an audio signal.

augment = core.AugmentWaveform(
    transformations=transformations, device='cpu', sox_effects=None, sample_rate=16000, verbose=False,
    #transformations=None, device='cpu', sox_effects='--sox="norm gain 20 highpass 300 phaser 0.5 0.6 1 0.45 0.6 -s" amr audio_bitrate 4.75k', sample_rate=16000, verbose=False,
    #transformations=None, device='cpu', sox_effects=sox_file_content, sample_rate=16000, verbose=False,
)
# Load test wav file
signal, fs = torchaudio.load('../data/test.wav')
# Apply transformations
waveform = augment(signal.numpy()[0])
display(Audio(waveform, rate=fs))

c) Pass the generated transformations list, a single SoX command, or the loaded SoX file content when initializing the AugmentLocalAudioDataset class, then apply the augmentations to a local audio dataset.

augment = core.AugmentLocalAudioDataset(
    transformations=transformations, device='cpu', sox_effects=None, sample_rate=16000, verbose=False,
    #transformations=None, device='cpu', sox_effects='--sox="norm gain 20 highpass 300 phaser 0.5 0.6 1 0.45 0.6 -s" amr audio_bitrate 4.75k', sample_rate=16000, verbose=False,
    #transformations=None, device='cpu', sox_effects=sox_file_content, sample_rate=16000, verbose=False,
)
augment(input_dir='../data/test-input-folder', output_dir='../data/test-output-folder')
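Before augmenting a whole folder, it can be useful to preview which files would be processed and where the results could go. The sketch below builds such a plan with pathlib. It is hypothetical: which file extensions AugmentLocalAudioDataset accepts and how it names its outputs are not documented in this README, so the *.wav pattern, the mirrored layout and the function name plan_outputs are all assumptions.

```python
from pathlib import Path


def plan_outputs(input_dir, output_dir, pattern="*.wav"):
    """Map each matching input file to the same relative path under output_dir."""
    input_dir, output_dir = Path(input_dir), Path(output_dir)
    return {
        src: output_dir / src.relative_to(input_dir)
        # rglob walks sub-directories too, so nested speaker folders are included.
        for src in sorted(input_dir.rglob(pattern))
    }


if __name__ == "__main__":
    plan = plan_outputs("../data/test-input-folder", "../data/test-output-folder")
    for src, dst in plan.items():
        print(src, "->", dst)
```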

EXAMPLES OF AVAILABLE AUGMENTATIONS

!!!Pass the following examples as arguments to the transf_gen.transf_gen function to generate a list of transformations!!!

For example:

transformations = transf_gen.transf_gen(verbose=True,
                                        AddBackgroundNoise=f'background_paths="../data/musan/noise/free-sound", min_snr_in_db=10, max_snr_in_db=20, p=1, sample_rate={sampling_rate}',
                                        AddColoredNoise=f'min_snr_in_db=9, max_snr_in_db=10, p=1, sample_rate={sampling_rate}',
                                        )

You can enter augmentation parameters as a string or as a dictionary.

PitchShift='sample_rate=16000, n_steps=[1, 1.5, 0.1], p=1.0'

PitchShift={'sample_rate': 16000, 'n_steps': [1, 1.5, 0.1], 'p': 1.0}

⬆️ AddBackgroundNoise docs

AddBackgroundNoise=f'''background_paths="../data/musan/noise/free-sound",
                       min_snr_in_db=10, 
                       max_snr_in_db=20, 
                       p=1, 
                       sample_rate={sampling_rate}''',

⬆️ AddColoredNoise docs

AddColoredNoise=f'''min_snr_in_db=9,
                    max_snr_in_db=10, 
                    p=1, 
                    sample_rate={sampling_rate}''',
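The min_snr_in_db and max_snr_in_db parameters bound the signal-to-noise ratio of the mixed result. As a reminder of what an SNR in decibels means, the sketch below computes the gain a noise signal needs so that the ratio of signal RMS to noise RMS matches a target value, using SNR_dB = 20 * log10(RMS_signal / RMS_noise). This is the textbook definition, not necessarily the exact procedure used by the underlying libraries, and the function names are made up for this example.

```python
import math


def rms(samples):
    """Root mean square of a sequence of samples."""
    return math.sqrt(sum(x * x for x in samples) / len(samples))


def noise_gain_for_snr(signal, noise, snr_db):
    """Gain to apply to `noise` so that signal RMS / scaled noise RMS equals snr_db."""
    target_noise_rms = rms(signal) / (10 ** (snr_db / 20))
    return target_noise_rms / rms(noise)


signal = [1.0, -1.0, 1.0, -1.0]   # RMS = 1.0
noise = [0.5, -0.5, 0.5, -0.5]    # RMS = 0.5
gain = noise_gain_for_snr(signal, noise, snr_db=20)
print(round(gain, 6))  # 0.2, i.e. the noise ends up 20 dB below the signal
```

A lower SNR therefore means louder noise: at 10 dB the noise RMS is about a third of the signal RMS, at 20 dB it is a tenth.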

⬆️ AddGaussianNoise docs

AddGaussianNoise={'min_amplitude': 0.001, 
                  'max_amplitude': 0.015, 
                  'p': 1},

⬆️ AddShortNoises docs

AddShortNoises={'sounds_path': "../data/musan/noise/free-sound",
                'min_snr_in_db': 3.0,
                'max_snr_in_db': 30.0,
                'noise_rms': "relative_to_whole_input",
                'min_time_between_sounds': 2.0,
                'max_time_between_sounds': 8.0,
                'noise_transform': AA.PolarityInversion(),
                'p': 1.0},

⬆️ AdjustDuration docs

AdjustDuration={'duration_seconds': 3.5, 
                'padding_mode': 'silence', 
                'p': 1},

⬆️ AirAbsorption docs

AirAbsorption={'min_distance': 10.0, 
               'max_distance': 50.0, 
               'min_humidity': 80.0, 
               'max_humidity': 90.0, 
               'min_temperature': 10.0, 
               'max_temperature': 20.0, 
               'p': 1.0},

⬆️ ApplyImpulseResponse docs

ApplyImpulseResponse=f'''ir_paths="../data/Rir.wav", 
                         p=1, 
                         sample_rate={sampling_rate}''',

⬆️ BandPassFilter docs

BandPassFilter=f'''min_center_frequency=200, 
                   max_center_frequency=4000, 
                   min_bandwidth_fraction=0.5, 
                   max_bandwidth_fraction=1.99, 
                   sample_rate={sampling_rate}, 
                   p=1''',

⬆️ BandStopFilter docs

BandStopFilter=f'''min_center_frequency=200, 
                   max_center_frequency=4000, 
                   min_bandwidth_fraction=0.5, 
                   max_bandwidth_fraction=1.99, 
                   sample_rate={sampling_rate}, 
                   p=1''',

⬆️ ClippingDistortion docs

ClippingDistortion={'min_percentile_threshold': 10, 
                    'max_percentile_threshold': 30, 
                    'p': 1},

⬆️ FrequencyMasking docs

FrequencyMasking={'freq_mask_param': 80},

⬆️ Volume / Gain docs

Vol={'gain': [2.5, 3, 0.1], 
     'p': 1.0},

⬆️ GainTransition docs

GainTransition={'min_gain_db': 30, 
                'max_gain_db': 40, 
                'min_duration': 5, 
                'max_duration': 16, 
                'duration_unit': 'seconds', 
                'p': 1},

⬆️ HighPassFilter docs

HighPassFilter=f'''min_cutoff_freq=700,
                   max_cutoff_freq=800,
                   sample_rate={sampling_rate},
                   p=1''',

⬆️ HighShelfFilter docs

HighShelfFilter={'min_center_freq': 2000, 
                 'max_center_freq': 5000, 
                 'min_gain_db': 10.0, 
                 'max_gain_db': 16.0, 
                 'min_q': 0.5, 
                 'max_q': 1.0, 
                 'p': 1},

⬆️ Limiter docs

Limiter='''min_threshold_db=-24, 
           max_threshold_db=-2,
           min_attack=0.0005, 
           max_attack=0.025, 
           min_release=0.05, 
           max_release=0.7, 
           threshold_mode="relative_to_signal_peak", 
           p=1''',

⬆️ LoudnessNormalization docs

LoudnessNormalization={'min_lufs': -31, 
                       'max_lufs': -13, 
                       'p': 1},

⬆️ LowPassFilter docs

LowPassFilter={'min_cutoff_freq': 700, 
               'max_cutoff_freq': 800, 
               'sample_rate': sampling_rate, 
               'p': 1},

⬆️ LowShelfFilter docs

LowShelfFilter={'min_center_freq': 20, 
                'max_center_freq': 600, 
                'min_gain_db': -16.0, 
                'max_gain_db': 16.0, 
                'min_q': 0.5, 
                'max_q': 1.0, 
                'p': 1},

⬆️ Mp3Compression docs

Mp3Compression={'min_bitrate': 8, 
                'max_bitrate': 8, 
                'backend': 'pydub', 
                'p': 1},

⬆️ MelSpectrogram docs

MelSpectrogram={'sample_rate': 16000},

⬆️ Normalize docs

Normalize={'p': 1},

⬆️ Padding docs

Padding={'mode': 'silence', 
         'min_fraction': 0.02, 
         'max_fraction': 0.8, 
         'pad_section': 'start', 
         'p': 1},

⬆️ PeakNormalization docs

PeakNormalization={'p': 1, 
                   'sample_rate': sampling_rate},

⬆️ PeakingFilter docs

PeakingFilter={'min_center_freq': 51, 
               'max_center_freq': 7400, 
               'min_gain_db': -22, 
               'max_gain_db': 22, 
               'min_q': 0.5, 
               'max_q': 1.0, 
               'p': 1},

⬆️ PitchShift docs

PitchShift={'sample_rate': 16000, 
            'n_steps': [1, 1.5, 0.1],
            'bins_per_octave': 12, 
            'n_fft': 512, 
            'win_length': 512, 
            'hop_length': 512//4, 
            'p': 1.0},
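To get a feeling for the size of a shift, note that with bins_per_octave steps per octave, shifting by n_steps multiplies every frequency by 2 ** (n_steps / bins_per_octave). The sketch below evaluates this standard relation for a few values. The function name pitch_ratio is made up for this example, and the sketch does not interpret the three-element n_steps list used above.

```python
def pitch_ratio(n_steps, bins_per_octave=12):
    """Frequency multiplier produced by shifting `n_steps` steps."""
    return 2.0 ** (n_steps / bins_per_octave)


for steps in (1, 1.5, 12):
    ratio = pitch_ratio(steps)
    # A 440 Hz tone is used purely as a reference point.
    print(f"{steps:>4} steps -> x{ratio:.4f} (440 Hz becomes {440 * ratio:.1f} Hz)")
```

With the default of 12 bins per octave, one step is a semitone (about a 5.9 % increase in frequency) and 12 steps double the frequency.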

⬆️ PolarityInversion docs

PolarityInversion={'p': 1, 
                   'sample_rate': sampling_rate},

⬆️ Time inversion docs

TimeInversion={'p': 1, 
               'sample_rate': sampling_rate},

⬆️ ApplyRIR

# Use this to see available materials you can use as walls_mat, floor_mat and ceiling_mat argument
# from AudioAugmentor import rir_setup
# rir_setup.get_all_materials_info()

# Option 1: parameters for generating a room with random geometry
rir_kwargs = {
    'audio_sample_rate': 16000,
    'x_range': (0, 100), 
    'y_range': (0, 100), 
    'num_vertices_range': (3, 6),
    'mic_height': 1.5,
    'source_height': 1.5,
    'walls_mat': 'curtains_cotton_0.5',
    'room_height': 2.0,
    'max_order': 3,
    'floor_mat': 'carpet_cotton',
    'ceiling_mat': 'hard_surface',
    'ray_tracing': True,
    'air_absorption': True,
}
# Option 2: parameters for a specific room (use this dict instead of the one above)
rir_kwargs = {
    'audio_sample_rate': 16000,
    'corners_coord': [[0, 0], [0, 3], [5, 3], [5, 1], [3, 1], [3, 0]],
    'walls_mat': 'curtains_cotton_0.5',
    'room_height': 2.0,
    'max_order': 3,
    'floor_mat': 'carpet_cotton',
    'ceiling_mat': 'hard_surface',
    'ray_tracing': True,
    'air_absorption': True,
    'source_coord': [[1.0], [1.0], [0.5]],
    'microphones_coord': [[3.5], [2.0], [0.5]],
}
transformations = transf_gen.transf_gen(verbose=True,
                                        ApplyRIR=rir_kwargs,
                                        )
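When specifying a room by hand, a quick geometry check can catch mistakes such as a microphone placed outside the walls. The sketch below computes the floor area with the shoelace formula and tests whether a point lies inside the floor plan with ray casting. It assumes that corners_coord lists [x, y] corners in order around the room and that source_coord and microphones_coord hold x, y and z in that order; these are assumptions based on the example values, and the function names are made up for this illustration.

```python
def polygon_area(corners):
    """Shoelace formula; corners are [x, y] pairs in order around the floor plan."""
    total = 0.0
    for (x1, y1), (x2, y2) in zip(corners, corners[1:] + corners[:1]):
        total += x1 * y2 - x2 * y1
    return abs(total) / 2.0


def inside_polygon(x, y, corners):
    """Ray casting: count how many edges a ray from (x, y) towards +x crosses."""
    inside = False
    for (x1, y1), (x2, y2) in zip(corners, corners[1:] + corners[:1]):
        if (y1 > y) != (y2 > y):  # edge spans the ray's height
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


corners = [[0, 0], [0, 3], [5, 3], [5, 1], [3, 1], [3, 0]]
room_height = 2.0
source = [[1.0], [1.0], [0.5]]
microphone = [[3.5], [2.0], [0.5]]

print("floor area:", polygon_area(corners))             # 13.0
print("volume:", polygon_area(corners) * room_height)   # 26.0
for name, (xs, ys, zs) in (("source", source), ("microphone", microphone)):
    ok = inside_polygon(xs[0], ys[0], corners) and 0 <= zs[0] <= room_height
    print(name, "inside room:", ok)                      # True for both
```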

⬆️ SevenBandParametricEQ docs

SevenBandParametricEQ={'min_gain_db': -10, 
                       'max_gain_db': 10, 
                       'p': 1},

⬆️ Shift docs

Shift={'min_shift': 1, 
       'max_shift': 2, 
       'p': 1, 
       'sample_rate': sampling_rate},

⬆️ Speed docs

Speed={'orig_freq': 16000, 
       'factor': [0.9, 1.5, 0.1], 
       'p': 1},

⬆️ Spectrogram docs

Spectrogram={'sample_rate': 16000},

⬆️ TanhDistortion docs

TanhDistortion={'min_distortion': 0.1, 
                'max_distortion': 0.8, 
                'p': 1},

⬆️ TimeMasking docs

TimeMasking={'time_mask_param': 80},

⬆️ TimeStretch docs

TimeStretch='''min_rate=0.9, 
               max_rate=1.1, 
               p=0.2, 
               leave_length_unchanged=False''',

⬆️ Codecs using torchaudio

You can select just one. No need to use them all. :)

transformations = transf_gen.transf_gen(verbose=True,
                                        ac3=True,
                                        adpcm_ima_wav=True,
                                        adpcm_ms=True,
                                        adpcm_yamaha=True,
                                        eac3=True,
                                        flac=True,
                                        libmp3lame=True,
                                        mp2=True,
                                        pcm_alaw=True,
                                        pcm_f32le=True,
                                        pcm_mulaw=True,
                                        pcm_s16le=True,
                                        pcm_s24le=True,
                                        pcm_s32le=True,
                                        pcm_u8=True,
                                        wmav1=True,
                                        wmav2=True,
                                        )

⬆️ g726

g726={'audio_bitrate': '40k'},

⬆️ gsm

gsm=True,

⬆️ amr

amr={'audio_bitrate': '4.75k'},
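The audio_bitrate values above are strings with a k suffix. Assuming they follow the ffmpeg convention, where k stands for 1000, the sketch below converts such a string into bits per second. How AudioAugmentor forwards this value internally is not documented here, and the helper name bitrate_to_bps is made up for this example.

```python
from decimal import Decimal

# Assumption: SI suffixes as used by ffmpeg (k = 1000, M = 1000000).
_SUFFIXES = {"k": 1_000, "M": 1_000_000}


def bitrate_to_bps(text):
    """Convert a bitrate string such as '4.75k' into an integer number of bits per second."""
    text = text.strip()
    if text and text[-1] in _SUFFIXES:
        # Decimal avoids floating-point rounding for values like 4.75.
        return int(Decimal(text[:-1]) * _SUFFIXES[text[-1]])
    return int(Decimal(text))


print(bitrate_to_bps("40k"))    # 40000
print(bitrate_to_bps("4.75k"))  # 4750
```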