<!DOCTYPE html>
<html lang="en" data-theme="light">
<head>
<meta charset="UTF-8" />
<meta name="viewport" content="width=device-width, initial-scale=1.0" />
<link rel="stylesheet" href="https://cdn.jsdelivr.net/npm/bulma@1.0.2/css/bulma.min.css" />
<link rel="stylesheet" href="assets/css/styles.css" />
<title>DeSTA2: Developing Instruction-Following Speech Language Model Without Speech Instruction-Tuning Data</title>
</head>
<body>
<section class="hero">
<div class="container">
<div class="hero-body">
<p class="title is-1">
DeSTA2: Developing Instruction-Following Speech Language Model Without Speech Instruction-Tuning Data
</p>
<div class="content is-medium">
<p class="subtitle">
Ke-Han Lu, Zhehuai Chen, Szu-Wei Fu, Chao-Han Huck Yang, Jagadeesh Balam, Boris Ginsburg, Yu-Chiang Frank
Wang, Hung-yi Lee
</p>
<p class="subtitle">National Taiwan University, NVIDIA</p>
<a href="https://arxiv.org/pdf/2409.20007">📑 Paper</a> |
<a href="https://kehanlu.github.io/DeSTA2/">🌐 Website</a> |
<a href="https://github.com/kehanlu/DeSTA2">👩💻 Github</a> |
<a href="https://huggingface.co/DeSTA-ntu/DeSTA2-8B-beta">🤗 Model</a> |
<a href="https://huggingface.co/datasets/DeSTA-ntu/DeSTA2-Llama3-8B-Instruct">🤗 Dataset</a>
<!-- <p>Recent end-to-end speech language models (SLMs) have expanded upon the capabilities of large language
models (LLMs) by incorporating pre-trained speech models. However, these SLMs often undergo
extensive speech instruction-tuning to bridge the gap between speech and text modalities. This
requires significant annotation efforts and risks catastrophic forgetting of the original language
capabilities. In this work, we present a simple yet effective automatic process for creating
speech-text pair data that carefully injects speech paralinguistic understanding abilities into SLMs
while preserving the inherent language capabilities of the text-based LLM. Our model demonstrates
general capabilities for speech-related tasks without the need for speech instruction tuning data,
achieving impressive performance on Dynamic-SUPERB and AIR-Bench-Chat benchmarks. Furthermore, our
model exhibits the ability to follow complex instructions derived from LLMs, such as specific output
formatting and chain-of-thought reasoning. Our approach not only enhances the versatility and
effectiveness of SLMs but also reduces reliance on extensive annotated datasets, paving the way for
more efficient and capable speech understanding systems.
</p> -->
</div>
<img src="assets/images/figure1.png" style="border-radius: 50px" alt="" />
</div>
</div>
</section>
<section class="section">
<div class="container">
<div class="box">
<div class="content is-medium">
<h2 class="title is-4">✨✨ Key features</h2>
<ul>
<li>
We propose an automatic approach to developing speech language models <em><b>without</b></em> using speech
instruction-tuning data, injecting speech understanding capabilities into text-based LLMs while
preserving their original versatility.
</li>
<li>
We train the end-to-end speech language model with carefully constructed
<a href="#dataset-construction">speech captions</a> to avoid the catastrophic forgetting problem.
</li>
<li>
The model shows strong <a href="#instruction-following">instruction-following ability</a> and
demonstrates robust speech understanding capability on the
<a href="#dynamic-superb">Dynamic-SUPERB</a> and
<a href="#AIR-Bench-Chat">AIR-Bench-Chat</a> benchmarks.
</li>
</ul>
</div>
</div>
</div>
</section>
<section class="section">
<div class="container">
<div class="box">
<div class="content">
<h2 class="title is-2" id="instruction-following">Instruction-following Ability</h2>
<p>
Our model not only demonstrates strong speech understanding capabilities but also excels at following
complex instructions, such as adhering to specific output formats and performing chain-of-thought reasoning.
</p>
</div>
<audio controls>
<source src="assets/audios/7_1_d7.wav" type="audio/ogg" />
</audio>
<table class="table is-fullwidth">
<thead>
<tr>
<th>Instruction</th>
<th>Model Output</th>
</tr>
</thead>
<tbody>
<tr>
<td>
What is the gender of the speaker?
<b>Your entire response should be in English, capital letters only.</b>
</td>
<td>THE GENDER OF THE SPEAKER IS FEMALE.</td>
</tr>
<tr>
<td>
What is the emotion of the speaker?
<b>Your entire response should be in English, capital letters only.</b>
</td>
<td>HAPPINESS</td>
</tr>
<tr>
<td>What is the gender of the speaker? <b>Answer in JSON format.</b></td>
<td>{"gender": "Female" }</td>
</tr>
<tr>
<td>What is the emotion conveyed in the speech? <b>Think step by step.</b></td>
<td>
Let's break it down step by step!<br />1. We start with the phrase "Thank you for calling". This
implies that a request has been made, and the speaker is expressing gratitude.<br />2. The next phrase
is "Goodbye". This is a normal way to end a conversation, and it's a polite way to part ways.<br />Considering
these two phrases together, we can infer that the speaker is happy to end the conversation and is not
expecting anything further. The tone is friendly and grateful, suggesting that the speaker has had a
positive interaction.<br />Therefore, the emotion conveyed in the speech is happiness!
</td>
</tr>
<tr>
<td><b>Analyze the speech,</b> then tell me what is the emotion conveyed in the speech?</td>
<td>
Let's break down the speech:<br />* "Thank you for calling": This phrase is a polite expression of
gratitude for the person's call.<br />* "Goodbye": This is a final farewell, indicating that the
conversation is ending.<br />The tone of the speech is polite and friendly, as the speaker is thanking
the person for calling and bidding them farewell.<br />The emotion conveyed in this speech is
happiness. The speaker seems cheerful and appreciative, expressing gratitude for the call and ending
the conversation on a positive note.
</td>
</tr>
<tr>
<td>Calculate the words per second rate, <b>think step by step.</b></td>
<td>
Let's break it down step by step:<br />1. The speech lasts for 2 seconds: [00:00:00 - 00:00:02]<br />2.
The number of words is: "Thank you for calling. Goodbye." (5 words)<br />3. To find the words per
second rate, we need to divide the number of words by the time it took to say the words:<br />Words
per second = Total number of words / Time taken (in seconds)<br />= 5 words / 2 seconds<br />= 2.5
words per second<br />Therefore, the words per second rate is 2.5 words
</td>
</tr>
<tr>
<td>
How many speakers are there in the audio? Can you tell me the emotion of each speaker? Can you also
provide a brief summary of the audio? Is there any background noise? How do you feel today? Can you
give me a response to the audio?
</td>
<td>
There is only 1 speaker in the audio.<br />
The speaker is a female, and her emotion is happiness.<br />
The audio is a brief goodbye message, where the speaker says "Thank you for calling. Goodbye."<br />
There is no background noise in the audio.<br />
As a machine, I don't have emotions, so I don't feel anything.<br />
As for a response to the audio, I can say: "You're welcome, and have a great day!"
</td>
</tr>
<tr>
<td>
請描述這段聲音,聲音聽起來如何?內容說了什麼? 以中文回答。<br />
<em>(Translation: Please describe this audio. How does it sound? What is being said? Answer in Chinese.)</em><br /><br /><em>
Note: our model was trained only on English data, but it generalizes to other languages thanks to
the multilingual capability of Llama3! 😊
</em>
</td>
<td>
這段聲音是一個女性的聲音,聽起來是很愉快的。她說了一句「Thank you for calling.
Goodbye」,語氣溫柔、柔和,沒有任何激動或激情的感覺。整個聲音聽起來很平靜、很自然,沒有任何明顯的感覺或情感色彩。<br />
<em>(Translation: This is a female voice, and it sounds very cheerful. She says "Thank you for calling.
Goodbye" in a gentle, soft tone, without any agitation or excitement. The whole voice sounds calm and
natural, without any pronounced emotional coloring.)</em>
</td>
</tr>
<tr>
<td>
<b>Multi-turn question-answering</b><br />
<em style="background-color: #f0f0f0">First question</em> <br />
Transcribe the audio. <br /><br />
<em style="background-color: #f0f0f0">Second question</em> <br />
Who is the speaker talking to?
</td>
<td>
<em style="background-color: #f0f0f0">First response</em><br />
Here is the transcription of the audio clip: <br />
"Thank you for calling. Goodbye." <br />
<br />
<em style="background-color: #f0f0f0">Second response</em><br />
The speaker is speaking to someone who has called them, but it's not specified who that person is. The
conversation appears to be a brief and abrupt goodbye, suggesting that the speaker may not have had a
conversation with the caller or may have been interrupted.
</td>
</tr>
</tbody>
</table>
</div>
</div>
</section>
<section class="section">
<div class="container">
<div class="box">
<h2 class="title is-2">Speech Understanding Ability</h2>
<div class="content is-medium">
<h2 class="title is-4" id="dynamic-superb">Dynamic-SUPERB</h2>
<p>
<a href="https://ieeexplore.ieee.org/document/10448257">Dynamic-SUPERB</a> is a crowdsourced benchmark
comprising 48 speech-related classification tasks with human-created instructions and answers.
</p>
</div>
<div class="table-container">
<table class="table is-fullwidth">
<thead>
<tr>
<th>Audio</th>
<th>Instruction</th>
<th>Model Output</th>
<th>Ground Truth</th>
</tr>
</thead>
<tbody>
<tr>
<td>
<audio controls>
<source src="assets/audios/LanguageIdentification_VoxForge_0000_de143-43.flac" type="audio/ogg" />
</audio>
<p>LanguageIdentification</p>
</td>
<td>
Which language can you recognize from this audio recording? The answer could be German, English,
Spanish, Italian, Russian, or French.
</td>
<td>
Based on the audio recording, I recognize the language as German. The phrase "Die hätte man dicht
machen müssen" is a grammatically correct and idiomatic German sentence, and the speaker's accent
and pronunciation also suggest a German speaker.
</td>
<td>German</td>
</tr>
<tr>
<td>
<audio controls>
<source
src="assets/audios/NoiseDetection_LJSpeech_MUSAN-Music_0199_music_LJSpeech-1.1_16k_LJ050-0033.wav"
type="audio/ogg"
/>
</audio>
<p>NoiseDetection</p>
</td>
<td>
Determine whether the sentence is a noise-free utterance or a noisy one with music disturbance. The
answer could be clean or noisy.
</td>
<td>
Based on the provided information, I would classify the sentence as "noisy". The noise level is
moderate (15db), and there is a reverberation time of 60ms, which suggests that there is some
background noise or music disturbance present.
</td>
<td>Noisy</td>
</tr>
<tr>
<td>
<audio controls>
<source
src="assets/audios/AccentClassification_AccentdbExtended_0193_british_s01_176.wav"
type="audio/ogg"
/>
</audio>
<p>AccentClassification</p>
</td>
<td>
The task involves detecting variations in pronunciation and intonation. The answer could be
american, australian, bangla, british, indian, malayalam, odiya, telugu, or welsh.
</td>
<td>
Based on the accent and pronunciation, I would identify the speaker as having a British accent.
</td>
<td>British</td>
</tr>
<tr>
<td>
<audio controls>
<source
src="assets/audios/EmotionRecognition_MultimodalEmotionlinesDataset_0026_dia382_utt0.wav"
type="audio/ogg"
/>
</audio>
<p>EmotionRecognition</p>
</td>
<td>
Recognize and categorize the emotions expressed in spoken language. The answer could be anger,
disgust, sadness, joy, neutral, surprise, or fear.
</td>
<td>
Based on the audio clip, the emotion expressed is "joy". The speaker's tone and words ("Here she is,
future fashion superstar") convey a sense of excitement and happiness, indicating a positive
emotional state.
</td>
<td>Joy</td>
</tr>
<tr>
<td>
<audio controls>
<source
src="assets/audios/DialogueEmotionClassification_DailyTalk_0196_7_1_d756.wav"
type="audio/ogg"
/>
</audio>
<p>DialogueEmotionClassification</p>
</td>
<td>
Analyze the dialogue and identify its corresponding emotion. The answer could be anger, disgust,
fear, sadness, happiness, surprise, or no emotion.
</td>
<td>
Based on the dialogue, the corresponding emotion is sadness. The speaker is apologizing and
expressing regret for causing trouble, which is a characteristic of a sad tone.
</td>
<td>Sadness</td>
</tr>
<tr>
<td>
<audio controls>
<source src="assets/audios/SpeechTextMatching_LJSpeech_0001_LJ001-0107.wav" />
</audio>
<p>SpeechTextMatching</p>
</td>
<td>
Verify if the speech and text are congruent. The answer could be yes or no.<br />Text: "guidelines
for an experimental program to develop more detailed criteria"
</td>
<td>
The answer is NO.<br />The speech is about saying a few words on the principles of design and
typography, but the text is about guidelines for an experimental program to develop more detailed
criteria, which is a different topic.
</td>
<td>No</td>
</tr>
<tr>
<td>
<audio controls>
<source
src="assets/audios/SpeakerVerification_LibriSpeech-TestClean_0046_3575-170457-0038.flac"
/>
</audio>
<p>SpeakerVerification</p>
</td>
<td>
Scrutinize the first audio clip to establish whether the second one is delivered by the same
speaker. The answer could be yes or no.
</td>
<td>
Based on the provided audio clip, I would say NO, the second audio clip is not delivered by the same
speaker. The tone, pitch, and emotion expressed in the first audio clip (neutral, calm, and
introspective) are different from what I would expect from the same speaker. The first audio clip
has a more contemplative and reflective tone, whereas the second audio clip is more dramatic and
emotive.
</td>
<td>No</td>
</tr>
</tbody>
</table>
</div>
<div class="content is-medium">
<h2 class="title is-4" id="AIR-Bench-Chat">AIR-Bench-Chat</h2>
<p>
<a href="https://arxiv.org/abs/2402.07729">AIR-Bench-Chat</a> is an LLM-generated benchmark derived from
speech metadata that focuses on open-ended questions.
</p>
</div>
<div class="table-container">
<table class="table is-fullwidth">
<thead>
<tr>
<th>Audio</th>
<th>Instruction</th>
<th>Model Output</th>
<th>Ground Truth</th>
</tr>
</thead>
<tbody>
<tr>
<td>
<audio controls>
<source src="assets/audios/Ses01M_script01_1_F014.wav" />
</audio>
</td>
<td>
Based on the speaker's emotional state, what could be the nature of the proof being requested?
</td>
<td>
Based on the speaker's angry tone and high-pitched voice, it is likely that the proof being
requested is related to a dispute or controversy. The speaker is demanding evidence to support a
claim or argument, and is likely feeling frustrated or upset that the other party has not provided
sufficient evidence to back up their claims.
</td>
<td>
Given the speaker's anger, the proof likely relates to a serious allegation or an issue that
significantly impacts the speaker.
</td>
</tr>
<tr>
<td>
<audio controls>
<source src="assets/audios/Ses04F_impro02_M004.wav" />
</audio>
</td>
<td>
What emotion is the speaker likely trying to convey to the listener by stating a fear of their
departure?
</td>
<td>
Based on the given information, the speaker is likely trying to convey a sense of sadness to the
listener by stating a fear of their departure. The low pitch, slow speaking speed, and very low
volume suggest a somber and melancholic tone, which is consistent with a sad emotion.
</td>
<td>The speaker is trying to convey a sense of sadness and attachment to the listener.</td>
</tr>
<tr>
<td>
<audio controls>
<source src="assets/audios/Ses01F_script03_1_F029.wav" />
</audio>
</td>
<td>What could be the next step for the manager after witnessing such a scene?</td>
<td>
Based on the manager's tone and emotion, it seems that she is amused and entertained by the scene.
Given her high pitch and volume, she might be trying to emphasize the absurdity of the situation.
Considering her slow speaking speed, she might be savoring the moment and enjoying the spectacle.<br /><br />As
a next step, the manager could:<br />* Laugh and join in on the fun, possibly even taking a photo or
video to capture the moment.<br />* Try to intervene and calm the situation down, but in a playful
and lighthearted manner, acknowledging the absurdity of the situation.<br />* Use the opportunity to
make a humorous
</td>
<td>The manager might need to intervene or address the situation occurring on the floor.</td>
</tr>
<tr>
<td>
<audio controls>
<source src="assets/audios/MUL0608_120.98_148.92.wav" />
</audio>
</td>
<td>What did the second person inquire about after the first person's request?</td>
<td>The second person inquired about the price range of the restaurant.</td>
<td>The second person inquired about any specifications on the price range.</td>
</tr>
</tbody>
</table>
</div>
</div>
</div>
</section>
<section class="section">
<div class="container">
<div class="box">
<h1 class="title is-2">Method</h1>
<div class="content">
<h2 class="title is-3" id="dataset-construction">Stage 1: Dataset Consturction</h2>
<p>
In our dataset construction process, we first extract and format all textual attributes from the speech
into a structured "seed transcript." This seed transcript is then combined with the prompt "What can you
hear from the audio?" and fed into Llama3. The response generated by Llama3 becomes our training target
for the end-to-end speech language model.
</p>
<p>
There are 12 attributes: gender, age, accent, emotion, pitch, volume, speaking speed, SNR level,
C50 value, duration, intent, and spoken text.
</p>
<img src="assets/images/dataset_construction.png" alt="" />
<div class="table-container">
<table class="table is-fullwidth">
<thead>
<tr>
<th style="color: orange">Seed Transcript</th>
<th>Llama3-Instruct Response</th>
<th>Corresponding Audio</th>
</tr>
</thead>
<tbody>
<tr>
<td>
[00:00:01 - 00:00:02]: "These days are long gone." (Gender:Male, Accent: Scottish, Age: 20,
Emotion: neutral, Duration: 2s)
</td>
<td>
From the audio, I can hear:<br />* A male voice with a Scottish accent<br />* The voice is
approximately 20 years old<br />* The tone is neutral, indicating a lack of strong emotions<br />*
The audio duration is 2 seconds
</td>
<td>
<audio controls>
<source src="assets/audios/p284_159.wav" />
</audio>
<p>VCTK-Corpus</p>
</td>
</tr>
<tr>
<td>
[00:00:01 - 00:00:04]: "At the top level, the salaries are far too high." (Gender:Male, Accent:
English, Age: 23, Emotion: neutral, Duration: 4s)
</td>
<td>
From the audio, we can hear a male speaker with an English accent (possibly British English)
saying:<br />"At the top level, the salaries are far too high."<br />The speaker is a 23-year-old
male with a neutral emotion. The audio lasts for 4 seconds.
</td>
<td>
<audio controls>
<source src="assets/audios/p287_162.wav" />
</audio>
<p>VCTK-Corpus</p>
</td>
</tr>
<tr>
<td>
[00:00:01 - 00:00:04]: "Oh, lovely. Where's the wedding going to be?" (Gender:Male, Emotion:
happiness, Action: question, Duration: 4s)
</td>
<td>
From the audio, I can hear a male speaker asking a question with a tone of happiness, specifically
the words: "Oh, lovely. Where's the wedding going to be?"
</td>
<td>
<audio controls>
<source src="assets/audios/4_0_d47.wav" />
</audio>
<p>Dailytalk</p>
</td>
</tr>
<tr>
<td>
[00:00:01 - 00:00:07]: "A new pride taught me mine ego, and that teach I unto men: no longer to
thrust one's head into the sand of celestial things, but to carry it freely, a terrestrial head,
which giveth meaning to the earth!" (Gender:Male, Emotion: cheerful, Pitch: Normal, Volume:
Normal, Speaking speed: Fast, Duration: 7s)
</td>
<td>
From the given audio, we can hear a male speaker with a cheerful tone, speaking at a normal pitch
and volume. The speaking speed is relatively fast, and the duration is 7 seconds.<br />The audio
appears to be a philosophical or inspirational quote, with the speaker using poetic language to
convey a message. The text itself is a condensed and somewhat abstract statement, but it seems to
be encouraging people to be more grounded and down-to-earth, rather than getting lost in lofty or
abstract concepts.
</td>
<td>
<audio controls>
<source src="assets/audios/0_000307.wav" />
</audio>
<p>PromptTTS</p>
</td>
</tr>
</tbody>
</table>
</div>
</div>
<div class="container">
<div class="content">
<h2 class="title is-3">Stage 2: Model training</h2>
<p>
During the model training phase, we utilize a pre-trained Whisper model to extract speech features and
generate text transcriptions from the audio input. These speech features and the text embeddings (from
the transcription and chat template) are concatenated before being fed into Llama3. We keep the
template structure consistent with the one used in dataset construction to prevent any mismatch
between the two stages. Finally, the model is trained to generate speech
descriptions, effectively learning to interpret and articulate the nuances of spoken content based on
both acoustic features and textual information.
</p>
<img src="assets/images/model_training.png" alt="" />
</div>
</div>
</div>
</div>
</section>
</body>
</html>