AWS Transcribe does a reasonable job doing speach to text with speaker id but the json output is crazy.
Recording meetings and other conversations via the computer is easy but when you go back to figure out who said what it is painful. AWS Transcribe allows you to take a mp3 file and convert it to text. AWS Transcribe also will offer to detect the speaks and will identify each word with a speaker_label. The problem is that the JSON file they hand back is a bit of a bear to try and decypher.
This script takes a AWS Transcribe JSON output file and processes it into a number of useful formats. JSON files for further processing and Audacity Labels for visaulization and correcting the in evitable errors that you will see in text-to-speach.
The script will create two outputs.
{
"accountId": "333333333333",
"jobName": "BillTedExcel",
"processed_time": "2022-03-15 20:35:56.777924",
"type": "words",
"results": [
{
"content": "now",
"end_time": "3.83",
"speaker_label": "spk_0",
"start_time": "3.14",
"type": "pronunciation"
},
{
"content": "a",
"end_time": "3.94",
"speaker_label": "spk_0",
"start_time": "3.84",
"type": "pronunciation"
},
{
"content": "motion",
"end_time": "4.33",
"speaker_label": "spk_0",
"start_time": "3.94",
"type": "pronunciation"
},
{
"content": "picture",
"end_time": "4.71",
"speaker_label": "spk_0",
"start_time": "4.33",
"type": "pronunciation"
},
{
"content": ",",
"end_time": "4.71001",
"speaker_label": "spk_0",
"start_time": "4.71",
"type": "punctuation"
}
]
}
{
"accountId": "333333333333",
"jobName": "BillTedExcel",
"processed_time": null,
"results": [
{
"content": " now a motion picture, so grand and so magnificent And so vast. It spans 7000 years. no",
"end_time": "11.67",
"speaker_label": "spk_0",
"start_time": "3.14"
},
{
"content": "way.",
"end_time": "12.6001",
"speaker_label": "spk_1",
"start_time": "11.68"
},
{
"content": "Yes way. But it starts with Bill. Bill? S. Preston who was joan of Arc and",
"end_time": "18.53",
"speaker_label": "spk_0",
"start_time": "12.61"
}
]
}
The OSS software Audacity has the capabiltiy to add a label track where time intervals that sink with the audio track can be imported. The imported label track is usefull for visualization and selecting audio segments.
Three label formats can be output.
3.14 11.67 spk_0
11.68 12.6001 spk_1
12.61 18.53 spk_0
18.54 23.03001 spk_1
23.04 24.67001 spk_0
24.68 27.48 spk_1
27.48 33.73 spk_0
33.73 47.62001 spk_1
49.39 53.09 spk_0
53.1 53.99 spk_1
55.02 55.93001 spk_0
56.07 58.39 spk_1
59.1 79.76001 spk_0
80.54 81.26001 spk_1
81.26001 91.66001 spk_0
92.54 102.63 spk_1
105.45 109.65 spk_0
3.14 11.67 spk_0>>> now a motion picture, so grand and so magnificent And so vast. It spans 7000 years. no
11.68 12.6001 spk_1>>>way.
12.61 18.53 spk_0>>>Yes way. But it starts with Bill. Bill? S. Preston who was joan of Arc and
18.54 23.03001 spk_1>>>Ted Noah's wife were in danger of flunking most heinously tomorrow.
23.04 24.67001 spk_0>>>A force from the future.
24.68 27.48 spk_1>>>Can we go anywhere we want at any time? You can do
27.48 33.73 spk_0>>>anything you want is putting history at their fingertips. Let's reach out and touch someone. Okay, they're traveling through
33.73 47.62001 spk_1>>>time. How's it going? Royal ugly dudes, put them in the iron meet Excellent, execute them booties. What's going on?
49.39 53.09 spk_0>>>And they're making a big impression. Historical
3.14 3.83 spk_0 >0> now
3.84 3.94 spk_0 >1> a
3.94 4.33 spk_0 >2> motion
4.33 4.71 spk_0 >3> picture
4.71 4.71001 spk_0 >4> ,
4.71001 4.94 spk_0 >5> so
4.94 5.47 spk_0 >6> grand
5.48 5.57 spk_0 >7> and
5.58 6.58 spk_0 >8> so
6.58 7.25 spk_0 >9> magnificent
7.81 8.03 spk_0 >10> And
8.03 8.25 spk_0 >11> so
8.25 8.8 spk_0 >12> vast
8.8 8.8001 spk_0 >13> .
9.01 9.18 spk_0 >14> It
9.18 9.7 spk_0 >15> spans
9.7 10.65 spk_0 >16> 7000
10.65 11.2 spk_0 >17> years
11.2 11.2001 spk_0 >18> .
Just copy over the process_trans.py file. It's Python3. No modules to install.
# First argument path to JSON file. Second is file base name for outputs.
$ python process_trans.py example_input/BillTedExcel.json BillTedOuput
The audio file that was used for the example was a bit extreme in that it had music and cut scene dialog all over the place. The result is that the speaker labeling isn't that good. Even in more controled environments if there are speakers that are talking over each other there will be label errors. The Word JSON object allows these error to be corrected more easily. In the future a more automated way of making corrrection to this JSON file will be added.
The script can consume this file as well (same syntax as above) and then be used to create the transcription and Audacity Label files.
Using Audacity and the Debbug labels is very convienent to fix some errors as you can edit the labels and they can be exported as well. In the future the script will be extended to allow the edited and exported labels to be read back into the script to fix the content of the JSON files and other label files.
Enjoy!