Welcome to the repository of the audio captioning baseline system for the DCASE challenge of 2020.
Here you can find the complete code of the baseline system, consisting of:
- the caption evaluation part,
- the dataset pre-processing/feature extraction part,
- the data handling part for Pytorch library, and
- the deep neural network (DNN) method part
Parts 1, 2, and 3, also exist in separate repositories. Caption evaluation tools for audio captioning can also be found here. Code for dataset pre-processing/feature extraction for Clotho dataset can also be found here. Finally, code for handling the Clotho data (i.e. extracted features and one-hot encoded words) for PyTorch library (i.e. PyTorch DataLoader for Clotho data) can also be found here.
This repository is maintained by K. Drossos.
- Too long - Didn't read (TL-DR)
- Setting up the code
- Preparing the data
- Use the baseline system
- Explanation of settings
If you are familiar with most of the stuff and you want to use this system fast as possible, do the following:
- Install all dependencies from the corresponding files.
- Make sure that your system has Java installed and enabled.
- Download the data from Zenodo and place them in the data directory.
- Run the baseline system.
If you want or need a bit more details, then read the following sections.
To start using the audio captioning DCASE 2020 baseline system, firstly you have to set-up the code. Please note bold that the code in this repository is tested with Python 3.7.
To set-up the code, you have to do the following:
- Clone this repository.
- Use either pip or conda to install dependencies
Use the following command to clone this repository at your terminal:
$ git clone git@github.com:audio-captioning/dacse-2020-baseline.git
The above command will create the directory dacse-2020-baseline
and populate
it with the contents of this repository. The dacse-2020-baseline
directory
will be called root directory for the rest of this README file.
For installing the dependencies, there are two ways. You can either use conda or pip.
To use conda, you can issue the following command at your terminal (when you are in the root directory):
$ conda create --name audio-captioning-baseline --file requirements_conda.yaml
The above command will create a new environment called audio-captioning-baseline
, which
will have set-up all the dependencies for the audio captioning DCASE 2020 baseline. To
activate the audio-captioning-baseline
environment, you can issue the following command"
$ conda activate audio-captioning-baseline
Now, you are ready to proceed to the following steps.
If you do not use anaconda/conda, you can use the default Python package manager to install the dependencies of the audio captioning DCASE 2020 baseline. To do so, you have to issue the following command at the terminal (when you are inside the root directory):
$ pip install -r requirements_pip.txt
The above command will install the required packages using pip. Now you are ready to go to the following steps.
After setting-up the code for the audio captioning DCASE 2020 baseline system, you have to obtain the Clotho dataset, place it to the proper directory, and do the feature extraction.
Clotho dataset is freely available online at the Zenodo platform. You can find Clotho at
You should download all .7z
files and the .csv
files with the captions. That is, you have
do download the following files from Zenodo:
clotho_audio_development.7z
clotho_audio_evaluation.7z
clotho_captions_development.csv
clotho_captions_evaluation.csv
After downloading the files, you should place them in the data
directory, in your root directory.
You should create two directories in your data
directory:
- The first is called
clotho_audio_files
, and - second is called
clotho_csv_files
.
Then, you have to expand the 7z
files. There are many options on how to do this. We do not want
to promote different software and/or packages, so you can just search on Google about how to expand
7zip
files at your operating system.
After you expand the 7z
files, you should have two directories created. The first is
development
and it will be created by teh clotho_audio_development.7z
file. The second
is evaluation, and it will be created by the clotho_audio_evaluation.7z
file. Finally, you should
have the following files and directories at your root/data
directory:
development
directoryevaluation
directoryclotho_captions_development.csv
fileclotho_captions_evaluation.csv
file
The development
directory contains 2163 audio files and the evaluation
directory 1045 audio
files. You should move the development
and evaluation
directories in the data/clotho_audio_files
directory, and the clotho_captions_development.csv
and clotho_captions_evaluation.csv
files in
the data/clotho_csv_files
directory. Thus, there should be the following structure in your data
directory:
data/
| - clotho_audio_files/
| | - development/
| | - evaluation/
| - clotho_csv_files/
| |- clotho_captions_development.csv
| |- clotho_captions_evaluation.csv
Now, you can use the baseline system to extract the features and create the dataset.
This baseline system implements all the necessary processes in order to use the Clotho data,
optimize a deep neural network (DNN), and predict and evaluate captions. Each process has
some corresponding settings that can be modified in order to fine tune the baseline system.
In the following subsection, the default settings will be used.
To create the dataset, you can either run the script processes/dataset.py
using
the command:
$ python processes/dataset.py
or run the baseline system using the main.py
script. In any case, the dataset creation will
start.
The dataset creation is a lengthy process, mainly due to the checking of the data. That is, the dataset creation has two steps:
- Firstly a split is created (e.g. development or evaluation), and then
- the data for the split are validated.
You can select if you want to have the validation of the data by altering the validate_dataset
parameter at the settings/dataset_creation.yaml
file.
The result of the dataset creation process will be the creation of the directories:
data/data_splits
,data/data_splits/development
,data/data_splits/evaluation
, anddata/pickles
The directories in data/data_splits
have the input and output examples for the optimization
and assessment of the baseline DNN. The data/pickles
directory holds the pickle
files that
have the frequencies of the words and characters (so one can use weights in the objective function)
and the correspondence of words and characters with indices.
Note bold: Once you have created the dataset, there is no need to create it every time. That is, after you create the dataset using the baseline system, then you can set
workflow:
dataset_creation: No
at the settings/main_settings.yaml
file.
To conduct an experiment using the baseline DNN, you can use the main.py
script. In case that you
do have previously created the input/output examples (using the above mentioned procedure), then you
can skip the dataset creation by altering the value of the dataset_creation
in the
settings/main_settings.yaml
file. To use the main.py
script, you can issue the command:
$ python main.py
Alternatively, you can use the process/method.py
by:
$ python processes/method.py
The above commands will start the process of optimizing the baseline DNN, using the data that were created in the create the dataset section.
To use the pre-trained model, you have first to obtain the pre-trained weights. The pre-trained weights are freely available online at Zenodo
Note bold: To use the caption evaluate tools you need to have Java installed and enabled.
To evaluate the predictions, you have first to have a optimized (i.e. trained) model (i.e. a DNN). You can obtain this DNN directly from training process (i.e. you do first training and then evaluation) or you can use some pre-trained weights.
To use some pre-trained weights, you have to specify the name of the file having the weights
at the settings/dirs_and_files.yaml
file. Also, you have to indicate that you will use a pre-trained
model (at the settings/model_baseline.yaml
file) and indicate that you want to do evaluation
of the DNN (at the settings/method_baseline.yaml
file).
Please note bold: Before being able to run the code for the evaluation of the predictions,
you have first to run the script get_stanford_models.sh
in the coco_caption
directory.
There are different settings for the baseline system, associated with the creation of the dataset, the data, the outputs of the baseline system, and (of course) the optimization of the baseline DNN.
All these settings can be found in the settings
directory. This directory has (by default)
the following files:
main_settings.yaml
dirs_and_files.yaml
dataset_creation.yaml
model_baseline.yaml
method_baseline.yaml
The parameters in the above .yaml
files are explained in the following sections.
In the settings/main_settings.yaml
file, you can find the following settings:
workflow:
dataset_creation: Yes
dnn_training: yes
dnn_evaluation: yes
dataset_creation_settings: !include dataset_creation.yaml
feature_extraction_settings: !include feature_extraction.yaml
dnn_training_settings: !include method_baseline.yaml
dirs_and_files: !include dirs_and_files.yaml
The settings at the workflow
block, correspond to the different processes that the baseline
system can do; the creation of the dataset (dataset_creation
), the optimization of the DNN
(dnn_training
), and the evaluation of captions (dnn_evaluation
). By indicating
a yes
or a no
, you can switch on (with yes
) or off (with no
) the processes. For example,
the
workflow:
dataset_creation: no
dnn_training: yes
dnn_evaluation: no
means that the baseline system will not create the dataset and will not evaluate captions, but it will do the optimization of the DNN.
The rest settings serve to indicate the files that hold the settings for each of the processes.
For example, the dataset_creation.yaml
file holds the settings for the dataset creation. An
exception is the dirs_and_files
field, which indicates which file holds the settings for inputs
and outputs of the baseline system.
The settings/dirs_and_files.yaml
file, holds the following settings:
root_dirs:
outputs: 'outputs'
data: 'data'
dataset:
development: &dev 'development'
evaluation: &eva 'evaluation'
features_dirs:
output: 'data_splits'
development: *dev
evaluation: *eva
audio_dirs:
downloaded: 'clotho_audio_files'
output: 'data_splits_audio'
development: *dev
evaluation: *eva
annotations_dir: 'clotho_csv_files'
pickle_files_dir: 'pickles'
files:
np_file_name_template: 'clotho_file_{audio_file_name}_{caption_index}.npy'
words_list_file_name: 'words_list.p'
words_counter_file_name: 'words_frequencies.p'
characters_list_file_name: 'characters_list.p'
characters_frequencies_file_name: 'characters_frequencies.p'
model:
model_dir: 'models'
checkpoint_model_name: 'dcase_model_baseline.pt'
pre_trained_model_name: 'dcase_model_pre_trained.pytorch'
logging:
logger_dir: 'logging'
caption_logger_file: 'captions_baseline.txt'
These are the necessary settings, to specify the input and output directories for the
baseline system. Specifically, the root_dirs
has the root directories for the outputs
(outputs
) and the data (data
). The root directory of the outputs will be used in order
to output the checkpoints of the baseline DNN and the logging files. The root directory for
the data will be used as the root directory where the data are, e.g. the data
(at root/data
)
directory mentioned in section setting up the data.
The dataset
has the directory and file names used when creating and accessing the dataset
files. The development
and evaluation
are the name of the directories that will hold
the corresponding (in each case) development and evaluation (respectively) data. These
names can be overridden in the corresponding entries (for example, in the dataset/audio_dirs
for the audio directories).
The dataset/features_dirs
has the directory names that:
- parent directory for the ready-to-use features -
output
- the development features -
development
- the evaluation features -
evaluation
The dataset/audio
has the directory names that:
- the downloaded audio will be -
downloaded
- the dataset files (i.e. the input/output examples) will be -
output
- the name of the directory that will hold the development data -
development
- the name of the directory that will hold the evaluation data -
evaluation
The annotations_dir
and pickle_files_dir
hold the names of the directories where the
csv files will be (annotations_dir
) and where the pickle
files will be placed (pickle_files_dir
).
The files
entry at the dataset
block, has:
- the file names of the
pickle
fileswords_list_file_name
,words_counter_file_name
,characters_list_file_name
, andcharacters_frequencies_file_name
)
- and the template string of the file name of the input/output examples of the dataset
(
np_file_name_template
).
The model
entry, has:
- the name of the directory (in the
root_dirs/outputs
) where the checkpoints of the baseline DNN will be saved -model_dir
, - the template file name of the checkpoint file -
checkpoint_model_name
, and - the file name that the baseline DNN will look as pre-trained model -
pre_trained_model_name
.
Finally, the logging
entry has:
- the directory where the logging files will be placed -
logger_dir
, - and the base file name of the logging -
caption_logger_file
.
The file settings/dataset_creation.yaml
has:
workflow:
create_dataset: Yes
validate_dataset: No
annotations:
development_file: 'clotho_captions_development.csv'
evaluation_file: 'clotho_captions_evaluation.csv'
audio_file_column: 'file_name'
captions_fields_prefix: 'caption_{}'
use_special_tokens: Yes
nb_captions: 5
keep_case: No
remove_punctuation_words: Yes
remove_punctuation_chars: Yes
use_unique_words_per_caption: No
use_unique_chars_per_caption: No
audio:
sr: 44100
to_mono: Yes
max_abs_value: 1.
The workflow
block is to indicate the execution or not of the dataset creation and validation.
That is:
- if the dataset will be created -
create_dataset
, and - if the data for each split will be validated (this is a lengthy process) -
validate_dataset
The annotations
block holds the settings needed for accessing and processing the annotations
(i.e. the csv files) of Clotho. That is:
- the name of the file holding the annotations of the development split -
development_file
- the name of the file holding the annotations of the evaluation split -
evaluation_file
- the name of the column in the annotations file that has the file name of the corresponding audio file - 'audio_file_column'
- the prefix of the column name of the columns (at the csv files) that have the
captions -
captions_fields_prefix
- indication of the special tokens (i.e.
<sos>
and<eos>
) will be used -use_special_tokens
- the amount of captions per each audio file -
nb_captions
- indication if the letter case. i.e. keep capital letters as capitals (
Yes
) or turn them to small case (No
) will be kept -keep_case
- if the punctuation will be removed from the word tokens -
remove_punctuation_words
- if the punctuation will be removed from the character tokens -
remove_punctuation_chars
- take into account unique words per audio file when counting the frequency of
appearance of each word -
use_unique_words_per_caption
- take into account unique characters per audio file when counting the frequency of
appearance of each character -
use_unique_chars_per_caption
The audio
block holds settings for processing the audio data:
- the sampling frequency that the audio data will be resampled (if needed) to -
sr
- indication if the audio files should be turned to mono -
to_mono
- maximum absolute value for normalizing the audio data -
max_abs_value
The file settings/model_baseline.yaml
holds the settings for the baseline DNN:
use_pre_trained_model: No
encoder:
input_dim_encoder: 64
hidden_dim_encoder: 256
output_dim_encoder: 256
dropout_p_encoder: .25
decoder:
output_dim_h_decoder: 256
nb_classes: # Empty, to be filled automatically.
dropout_p_decoder: .25
max_out_t_steps: 22
The use_pre_trained_model
flag indicates if a pre-trained model will be used. If
this flag is set to Yes
, then the name of the file with the weights of the pre-trained
model has to be specified in the settings/dirs_and_files.yaml
file.
The encoder
block has the settings for the encoder of the baseline DNN:
- the input dimensionality to the first layer of the encoder -
input_dim_encoder
- the hidden output dimensionality of the first and second layers of the encoder -
hidden_dim_encoder
- the output dimensionality of the third layer of the encoder -
output_dim_encoder
- the dropout probability for the encoder -
dropout_p_encoder
Similarly, the decoder
block holds the settings for the decoder of the baseline DNN:
- the output dimensionality of the RNN of the decoder -
output_dim_h_decoder
- the amount of classes for the classifier (it is filled automatically by the
baseline system) -
nb_classes
- the dropout probability for the decoder -
dropout_p_decoder
- the maximum output time-steps for the decoder -
max_out_t_steps
The file settings/method_baseline.yaml
holds the settings for the training and
evaluation procedure of the baseline DNN:
model: !include model_baseline.yaml
data:
input_field_name: 'features'
output_field_name: 'words_ind'
load_into_memory: No
batch_size: 16
shuffle: Yes
num_workers: 0
drop_last: Yes
training:
nb_epochs: 300
patience: 10
loss_thr: !!float 1e-4
optimizer:
lr: !!float 1e-4
grad_norm:
value: !!float 1.
norm: 2
force_cpu: No
text_output_every_nb_epochs: !!int 10
nb_examples_to_sample: 100
The model
specifies the file where the settings of the baseline DNN are (i.e. the
settings/model_baseline.yaml
file).
The data
block has the settings for handling the data at the training/evaluation processes:
- the name of the field of the numpy object that holds the input values -
input_field_name
- the name of the field of the numpy object that holds the output/target values -
output_field_name
- indication if the whole dataset should be loaded into the memory during training/evaluation
processes -
load_into_memory
- the size of the batch -
batch_size
- indication for shuffling the training data -
shuffle
- the amount of workers that the PyTorch DataLoader will use -
num_workers
- indication if the last (incomplete) batch will be used or not -
drop_last
The training
block holds settings for the training process:
- the maximum amount of epochs -
nb_epochs
- the amount of epochs for patience -
patience
- the threshold at the loss for considering that the loss is the same or changed -
loss_thr
- the settings for the optimizer (just the learning rate) -
optimizer
- settings for clipping the gradient norm -
grad_norm
- indication if the training should necessarily be on the CPU (e.g. for debugging) -
force_cpu
- indication of every how many epochs there should be an output of predicted captions -
text_output_every_nb_epochs
- how many examples to use for outputting the captions -
nb_examples_to_sample