This tutorial is aimed at helping people run GOPT inference on any data. There are some restrictions on "any data", which I will explain along the way.
- Install Kaldi and GOPT. Kaldi builds can sometimes be fussy; if you meet any problems, use a docker image of Kaldi instead.
- Download the original speechocean762 dataset to your disk.
- Copy `speechocean762` and rename the copy to your dataset's name. In this example, I use `test_dataset`.
- There are several hacks we need to apply to make our `test_dataset` runnable.
- First, delete the unnecessary .wav files and replace them with your own.
- Update all the files in both `test` and `train`, including `spk2age`, `spk2gender`, `spk2utt`, `text`, `utt2spk`, and `wav.scp`. Remember, the words in `text` need to be capitalized.
- Because I only have one wav file, all of these files will be one line each. For example, I set the speaker ID to `0001` and the utt_id to `test`. So:
  - `wav.scp` contains `test WAVE/SPEAKER0001/test.wav`
  - `spk2utt` contains `0001 test`
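For completeness, here is what the remaining one-line files could look like. This is only an illustrative sketch: the transcript reuses the "FAN WORKS" example from later in this tutorial, and the gender and age values are placeholders you should replace with your speaker's actual values.

```
text:        test FAN WORKS
utt2spk:     test 0001
spk2gender:  0001 m
spk2age:     0001 25
```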
If you are dealing with multiple recordings, remember that Kaldi requires these files to be sorted consistently. It is recommended to use the same ID scheme speechocean762 uses. Here is my code example of doing this (my dataset contains the columns `id_user`, `word` (the transcript), and `file_name` (a path)). The `lexicon_dict_l` maps each word to its suffix-annotated phones; if you are interested, the generation of this dict is shown at the end of this tutorial.

```python
scp_file = "wav.scp"
spk2utt = "spk2utt"
utt2spk = "utt2spk"
text = "text"

scp_str = ""
spk2utt_str = ""
utt2spk_str = ""
text_str = ""
text_phone = ""

# Zero-padded IDs ("0000" ... "9999") keep Kaldi's required sort order stable.
id_list = ["%04d" % x for x in range(10000)]
speaker_hist = set()

dataset.sort_values(by="id_user", ascending=True, inplace=True)

j = 0
holder = ""   # accumulates the current speaker's utterances for spk2utt
cache = ""    # the previous speaker ID
for _, line in dataset.iterrows():
    speaker_hist.add(line.id_user)
    spk = id_list[len(speaker_hist)]
    if cache == "":
        cache = spk
    utt = spk + id_list[j]
    j += 1
    scp_str = scp_str + utt + ' ' + f"WAVE/{line.file_name}\n"
    # When the speaker changes, flush the previous speaker's spk2utt line.
    if cache != spk:
        spk2utt_str = spk2utt_str + holder + '\n'
        holder = ""
    if len(holder) == 0:
        holder = spk + " " + utt
    else:
        holder = holder + " " + utt
    cache = spk
    utt2spk_str = utt2spk_str + utt + " " + spk + '\n'
    text_str = text_str + utt + " " + line.word.upper() + '\n'
    text_phone = text_phone + utt + ".0\t" + lexicon_dict_l[line.word.upper()] + '\n'
else:
    # for/else: after the loop finishes, flush the last speaker's line.
    spk2utt_str = spk2utt_str + holder + '\n'

with open(scp_file, 'w') as f:
    f.write(scp_str)
with open(spk2utt, 'w') as f:
    f.write(spk2utt_str)
with open(utt2spk, 'w') as f:
    f.write(utt2spk_str)
with open(text, 'w') as f:
    f.write(text_str)
with open("text-phone", 'w') as f:
    f.write(text_phone)
```
- In `resource/text-phone`, delete the unnecessary lines and add your own. Each line begins with `<utt_id>.<n>`, which denotes the n-th word of your text, followed by the corresponding phones of that word. To be specific, look up your words in `resource/lexicon.txt`; for instance, the word FAN would be `F AE0 N`. Then append the suffix `_B` to the first phone of each word and `_E` to the last phone, and append `_I` to all phones in between. If a word has only one phone, append the suffix `_S`. For example, if my text is "FAN WORKS", the final result in text-phone is:

```
test.0 F_B AE0_I N_E
test.1 W_B ER0_I K_I S_E
```

If you don't do this right, you will get stuck at stage 7. A sketch that generates these lines automatically follows below.
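If you would rather not write these lines by hand, here is a minimal sketch that generates them for a single utterance. It assumes the `lexicon_dict_l` mapping (word -> suffix-annotated phones) built by the lexicon-cleaning snippet at the end of this tutorial:

```python
# Minimal sketch: generate text-phone lines for one utterance.
# Assumes lexicon_dict_l maps e.g. "FAN" -> "F_B AE0_I N_E".
utt_id = "test"
transcript = "FAN WORKS"

lines = []
for n, word in enumerate(transcript.upper().split()):
    lines.append(f"{utt_id}.{n}\t{lexicon_dict_l[word]}")

with open("text-phone", "w") as f:
    f.write("\n".join(lines) + "\n")
```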
- Download and extract all the tars from https://kaldi-asr.org/models/m13.
- In `gop_speechocean762/s5/run.sh`, change lines 38-42 to point at your extracted results. In my case, I use:

```bash
librispeech_eg=../../librispeech/s5
model=$librispeech_eg/exp/chain_cleaned/tdnn_1d_sp
ivector_extractor=$librispeech_eg/exp/nnet3_cleaned/extractor
lang=$librispeech_eg/data/lang_test_tgsmall
```

Also, change `stage` to `2` to avoid the download, and change `nj` to the number of your examples; in my case, I set it to 1.
- After running run.sh, confirm that you have non-empty feat files in `gop_train` and `gop_test` (see the sanity-check sketch below). Mine looks like this:

```
(gopt) yifan@XXX:~/develop/kaldi/egs/gop_speechocean762/s5/exp/gop_train$ tree .
.
├── feat.1.ark
├── feat.1.scp
├── feat.scp
├── gop.1.ark
├── gop.1.scp
├── gop.scp
└── log
    └── compute_gop.1.log
```
- Execute the following steps from the original GOPT guide:

```bash
kaldi_path=your_kaldi_path
cd $gopt_path
mkdir -p data/raw_kaldi_gop/librispeech
cp src/extract_kaldi_gop/{extract_gop_feats.py,extract_gop_feats_word.py} ${kaldi_path}/egs/gop_speechocean762/s5/local/
cd ${kaldi_path}/egs/gop_speechocean762/s5
```
- Now we need to change the original GOPT files.
- First, in `extract_gop_feats.py`, delete the `continue` at https://github.com/YuanGongND/gopt/blob/master/src/extract_kaldi_gop/extract_gop_feats.py#L54. (PS: `label` in this file is spelled `lable`. Ummm, you can't unsee it.)
- In the same file, because we do not have scores, change https://github.com/YuanGongND/gopt/blob/master/src/extract_kaldi_gop/extract_gop_feats.py#L61 to `lables.append([ph])`. A sketch of the two edits follows below.
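To summarize the two edits in one place (a paraphrased sketch, not a verbatim diff of the repo file; the repo's `lables` spelling is kept):

```python
# extract_gop_feats.py, around the two linked lines (sketch):

# Edit 1 (around L54): delete the `continue` that skips utterances
# without human scores, since our data has no scores at all.

# Edit 2 (around L61): store only the phone id, not the phone plus a score:
lables.append([ph])
```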
- Run the edited script with `python local/extract_gop_feats.py`. Skip `extract_gop_feats_word.py`; it is not needed for inference.
- Continue with:

```bash
cd $gopt_path
cp -r ${kaldi_path}/egs/gop_speechocean762/s5/gopt_feats/* data/raw_kaldi_gop/<your dataset name>
```
- Change another GOPT file, `src/prep_data/gen_seq_data_phn.py`. Because we no longer have scores, all we want to keep is the phone label. We also need to replace the hardcoded paths with <your dataset name>. You can debug it yourself; here are my edited results:

```python
# -*- coding: utf-8 -*-
# @Time    : 9/19/21 11:13 PM
# @Author  : Yuan Gong
# @Affiliation : Massachusetts Institute of Technology
# @Email   : yuangong@mit.edu
# @File    : gen_seq_data_phn.py

# Generate sequence phone input and label for seq2seq models from raw Kaldi GOP features.

import numpy as np

def load_feat(path):
    file = np.loadtxt(path, delimiter=',')
    return file

def load_keys(path):
    file = np.loadtxt(path, delimiter=',', dtype=str)
    return file

def load_label(path):
    file = np.loadtxt(path, delimiter=',', dtype=str)
    return file

def process_label(label):
    pure_label = []
    for i in range(0, label.shape[0]):
        pure_label.append(float(label[i, 1]))
    return np.array(pure_label)

def process_feat_seq(feat, keys, labels, phn_dict):
    key_set = []
    for i in range(keys.shape[0]):
        cur_key = keys[i].split('.')[0]
        key_set.append(cur_key)

    feat_dim = feat.shape[1] - 1
    utt_cnt = len(list(set(key_set)))
    print('In total utterance number : ' + str(utt_cnt))

    # Pad all sequence to 50 because the longest sequence of the so762 dataset is shorter than 50.
    seq_feat = np.zeros([utt_cnt, 50, feat_dim])

    # -1 means n/a, padded token
    # [utt, seq_len, 0] is the phone label, and the [utt, seq_len, 1] is the score label
    seq_label = np.zeros([utt_cnt, 50, 2]) - 1

    # the key format is utt_id.phn_id
    prev_utt_id = keys[0].split('.')[0]

    row = 0
    for i in range(feat.shape[0]):
        cur_utt_id, cur_tok_id = keys[i].split('.')[0], int(keys[i].split('.')[1])
        # if a new sequence, start a new row of the feature vector.
        if cur_utt_id != prev_utt_id:
            row += 1
            prev_utt_id = cur_utt_id

        # The first element is the phone label.
        seq_feat[row, cur_tok_id, :] = feat[i, 1:]

        # [utt, seq_len, 0] is the phone label
        print(labels)
        seq_label[row, cur_tok_id, 0] = phn_dict[labels[i]]
        # [utt, seq_len, 1] is the score label, range from 0-2
        # seq_label[row, cur_tok_id, 1] = labels[i, 1]

    return seq_feat, seq_label

def gen_phn_dict(label):
    phn_dict = {}
    phn_idx = 0
    for i in range(label.shape[0]):
        if label[i] not in phn_dict:
            phn_dict[label[i]] = phn_idx
            phn_idx += 1
    return phn_dict

# generate sequence training data
tr_feat = load_feat('../../data/raw_kaldi_gop/test_dataset/tr_feats.csv')
tr_keys = load_keys('../../data/raw_kaldi_gop/test_dataset/tr_keys_phn.csv')
tr_label = load_label('../../data/raw_kaldi_gop/test_dataset/tr_labels_phn.csv')
phn_dict = gen_phn_dict(tr_label)
print(phn_dict)
tr_feat, tr_label = process_feat_seq(tr_feat, tr_keys, tr_label, phn_dict)
print(tr_feat.shape)
print(tr_label.shape)
np.save('../../data/seq_data_test_dataset/tr_feat.npy', tr_feat)
np.save('../../data/seq_data_test_dataset/tr_label_phn.npy', tr_label)

# generate sequence test data
te_feat = load_feat('../../data/raw_kaldi_gop/test_dataset/te_feats.csv')
te_keys = load_keys('../../data/raw_kaldi_gop/test_dataset/te_keys_phn.csv')
te_label = load_label('../../data/raw_kaldi_gop/test_dataset/te_labels_phn.csv')
te_feat, te_label = process_feat_seq(te_feat, te_keys, te_label, phn_dict)
print(te_feat.shape)
print(te_label.shape)
np.save('../../data/seq_data_test_dataset/te_feat.npy', te_feat)
np.save('../../data/seq_data_test_dataset/te_label_phn.npy', te_label)
```
- The last step requires you to run the lines below. Skip the word and utterance scripts.

```bash
mkdir data/seq_data_<your dataset name>
cd src/prep_data
python gen_seq_data_phn.py
```
- Finally, in `gopt/data/<your dataset name>`, you will find the two files needed for inference: `te_feat.npy` and `te_label_phn.npy`. But remember, `te_label_phn.npy` has slots for both `phn` and `scores`, and we never generated scores (they are not needed); only the phone slot is meaningful. So, to run the inference, use the following. (PS: to simplify things, my train dataset is the same as my test dataset.)

```python
import os
import sys

import numpy as np
import torch

sys.path.append(os.path.abspath('../src/'))
from models import GOPT

gopt = GOPT(embed_dim=24, num_heads=1, depth=3, input_dim=84)
# GOPT is trained with DataParallel, so it needs to be wrapped in DataParallel
# even if you only have a single GPU or a CPU.
gopt = torch.nn.DataParallel(gopt)
sd = torch.load('gopt_librispeech/best_audio_model.pth', map_location='cpu')
gopt.load_state_dict(sd, strict=True)

input_feat = np.load("te_feat.npy")
input_phn = np.load("te_label_phn.npy")

gopt = gopt.float()
gopt.eval()
with torch.no_grad():
    t_input_feat = torch.from_numpy(input_feat[:, :, :])
    t_phn = torch.from_numpy(input_phn[:, :, 0])
    u1, u2, u3, u4, u5, p, w1, w2, w3 = gopt(t_input_feat.float(), t_phn.float())
```
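The phone-level predictions come back in `p`. Here is a minimal sketch for reading them out, assuming `p` has shape `[batch, seq_len, 1]` and that padded positions are marked by `-1` in the phone labels (as set up in `gen_seq_data_phn.py` above):

```python
# Sketch: print per-phone scores, skipping padded positions (phn == -1).
scores = p.squeeze(-1).numpy()
mask = input_phn[:, :, 0] >= 0
for utt_idx in range(scores.shape[0]):
    print(f"utterance {utt_idx}:", scores[utt_idx][mask[utt_idx]])
```

Interpret the raw outputs relative to the 0-2 phone-accuracy range used for the so762 score labels.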
Good Luck!
PS:
- Suggestion: if your text contains words that are not in the lexicon, I recommend replacing the content of `lexicon.txt` in speechocean762 with the `librispeech-lexicon.txt` from http://www.openslr.org/11/.
- Example code for cleaning `librispeech-lexicon.txt` (a usage spot-check follows below):

```python
with open("librispeech-lexicon.txt", 'r') as f:
    lexicon_raw = f.read()

rows = lexicon_raw.splitlines()
clean_rows = [row.split() for row in rows]

lexicon_dict_l = dict()
for row in clean_rows:
    c_row = row.copy()
    key = c_row.pop(0)
    if len(c_row) == 1:
        c_row[0] = c_row[0] + '_S'
    if len(c_row) >= 2:
        c_row[0] = c_row[0] + '_B'
        c_row[-1] = c_row[-1] + '_E'
    if len(c_row) > 2:
        for i in range(1, len(c_row) - 1):
            c_row[i] = c_row[i] + '_I'
    val = " ".join(c_row)
    lexicon_dict_l[key] = val
```
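A quick spot-check of the resulting dict (the stress digits depend on the lexicon version, so treat the expected values as illustrative):

```python
print(lexicon_dict_l["FAN"])    # e.g. 'F_B AE0_I N_E'
print(lexicon_dict_l["WORKS"])  # e.g. 'W_B ER0_I K_I S_E'
```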