
Streaming audio mic data to aws transcribe in node #33

Open
geofflittle opened this issue Jun 4, 2020 · 2 comments

@geofflittle

I'm attempting to write a Node application that transcribes audio from a microphone via AWS' streaming transcription service; it's heavily based on the amazon-transcribe-websocket-static example. What I have so far can be found in this repository (it's small).

Unfortunately, the above doesn't work. I believe the bug is in how the data provided by the microphone stream is transformed before being passed to the writable transcriber stream, because I've confirmed that the other two components of the app work:

  1. I've written a piece of the app that pipes the mic to the speakers, which proves the mic stream works as expected.
  2. When I send requests over the WebSocket to the transcription service, it sends non-exceptional (albeit empty) responses back, proving that the transcription service client works as expected.

As a side note, I'm not familiar with handling audio data or with encoding (decoding?) it to PCM. I'm not even positive whether what the mic stream gives me is PCM, or whether I need to decode from or encode to PCM before providing it to the transcription service. All of this is to say, I'm pretty sure the byte handling is the issue.
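For concreteness, here's a minimal sketch of the shape I'm going for, assuming the `mic` npm package for capture (my actual repo may differ, and `sendAudioEvent` is a hypothetical stand-in for the WebSocket send):

```js
const mic = require("mic");

// Ask the capture layer (arecord/sox under the hood) for exactly the format
// the streaming service expects: 16 kHz, mono, signed 16-bit little-endian PCM.
const micInstance = mic({
  rate: "16000",
  channels: "1",
  bitwidth: "16",
  encoding: "signed-integer",
  endian: "little",
});

const micInputStream = micInstance.getAudioStream();

micInputStream.on("data", (chunk) => {
  // `chunk` is a Buffer of raw PCM bytes; wrap it in an event-stream-encoded
  // AudioEvent and send it over the WebSocket.
  sendAudioEvent(chunk); // hypothetical helper for the WebSocket send
});

micInstance.start();
```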

Any help getting this sorted would be greatly appreciated.

Thanks,
Geoff

@bfla

bfla commented Jun 5, 2020

I'm not sure what the issue is with your Node.js code, but I've been adapting the code examples in this repository for my own project and learned a few things that might help:

  • Mozilla has a useful guide that explains audio data and encoding; it helped me wrap my head around the topic (I had no prior experience with audio data).
  • After spending some time in the Web Audio API docs, I couldn't find anything that states exactly what output the browser microphone produces, but I believe each channel is just a stream of raw audio data: a series of numbers, each representing the sound's amplitude at a specific point in time.
  • I think the pcmEncode function in this repo takes the output from the browser microphone and converts the samples to 16-bit integers (instead of 32-bit floats). It's not easy to find in the AWS Transcribe docs, but it appears that AWS Transcribe Streaming is designed to accept 16-bit PCM audio chunks (PCM data can also be represented as 32-bit floats or integers).
  • The downsample function takes the browser microphone output and reduces the number of audio samples per second, because speech recognition typically uses a sample rate of 16000 samples per second, whereas a higher rate like 44100 is more conventional for general audio. (A sketch of both functions follows this list.)
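In case it helps, here's roughly the shape of those two functions (a paraphrase for illustration, not a copy of the repo's source):

```js
// Convert Float32 samples in [-1, 1] (Web Audio's native format) to
// 16-bit signed little-endian PCM, which is what Transcribe streaming accepts.
function pcmEncode(input) {
  const buffer = new ArrayBuffer(input.length * 2);
  const view = new DataView(buffer);
  for (let i = 0; i < input.length; i++) {
    const s = Math.max(-1, Math.min(1, input[i])); // clamp to avoid overflow
    view.setInt16(i * 2, s < 0 ? s * 0x8000 : s * 0x7fff, true); // true = little-endian
  }
  return buffer;
}

// Naive downsampling: keep roughly every Nth sample to go from the capture
// rate (e.g. 44100 Hz) down to the rate the recognizer expects (e.g. 16000 Hz).
function downsampleBuffer(buffer, inputSampleRate = 44100, outputSampleRate = 16000) {
  if (outputSampleRate === inputSampleRate) return buffer;
  const ratio = inputSampleRate / outputSampleRate;
  const newLength = Math.round(buffer.length / ratio);
  const result = new Float32Array(newLength);
  for (let i = 0; i < newLength; i++) {
    result[i] = buffer[Math.floor(i * ratio)];
  }
  return result;
}
```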

Sorry I can't solve your problem but hopefully that context is helpful.

@evgenyfadeev

evgenyfadeev commented Aug 28, 2020

You need to send binary-encoded linear audio data in 16-bit little-endian signed-integer format. I found this by trial and error.

Unfortunately, the streaming Transcribe service documentation doesn't say whether the data should be signed or unsigned, or whether the integers should be big- or little-endian.
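In Node, that packing looks something like this (a sketch; input samples assumed to be floats in [-1, 1]):

```js
// Pack Float32 samples into 16-bit little-endian signed integers,
// the only format the streaming service accepted for me.
function toSigned16LE(samples) {
  const buf = Buffer.alloc(samples.length * 2);
  for (let i = 0; i < samples.length; i++) {
    const s = Math.max(-1, Math.min(1, samples[i])); // clamp
    buf.writeInt16LE(Math.round(s < 0 ? s * 0x8000 : s * 0x7fff), i * 2);
  }
  return buf;
}
```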
