
Streaming audio mic data to aws transcribe in node #33

Open
geofflittle opened this issue Jun 4, 2020 · 2 comments

@geofflittle

I'm attempting to write a Node application that transcribes audio from a microphone via AWS' streaming transcription service; it's heavily based on the amazon-transcribe-websocket-static example. What I have so far can be found in this repository (it's small).

Unfortunately, the above doesn't work. I believe the bug is in how the data provided by the microphone stream is transformed before being passed to the writable transcriber stream, because I've confirmed that the other two components of the app work:

  1. I've written a piece of the app that pipes the mic to the speakers, which proves the mic stream works as expected.
  2. When I send requests over the WebSocket to the transcription service, it sends non-exceptional (albeit empty) responses back, proving that the transcription service client works as expected.

As a side note, I'm not familiar with handling audio data or with encoding (decoding?) it to PCM. I'm not even positive whether what the mic stream gives me is PCM, or whether I need to decode from or encode to PCM before providing it to the transcription service. All of this is to say, I'm pretty sure the byte handling is the issue.
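For concreteness, here's a minimal sketch of the shape I'm going for, assuming the `mic` npm package for capture (my actual repo may differ, and `sendAudioEvent` is a hypothetical stand-in for the WebSocket send):

```js
const mic = require("mic");

// Ask the capture layer (arecord/sox under the hood) for exactly the format
// the streaming service expects: 16 kHz, mono, signed 16-bit little-endian PCM.
const micInstance = mic({
  rate: "16000",
  channels: "1",
  bitwidth: "16",
  encoding: "signed-integer",
  endian: "little",
});

const micInputStream = micInstance.getAudioStream();

micInputStream.on("data", (chunk) => {
  // `chunk` is a Buffer of raw PCM bytes; wrap it in an event-stream-encoded
  // AudioEvent and send it over the WebSocket.
  sendAudioEvent(chunk); // hypothetical helper for the WebSocket send
});

micInstance.start();
```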

Any help getting this sorted would be greatly appreciated.

Thanks,
Geoff

@bfla

bfla commented Jun 5, 2020

I'm not sure what the issue is with your Node.js code, but I've been adapting the code examples in this repository for my own project and learned a few things that might help:

  • Mozilla has a useful guide that explains audio data and encoding; it helped me wrap my head around the topic (I had no prior experience with audio data).
  • After spending some time in the Web Audio API docs, I couldn't find anything that states exactly what output the browser microphone produces, but I believe each channel is just a stream of raw audio data: a series of numbers, each representing the sound's amplitude at a specific point in time.
  • I think the pcmEncode function in this repo takes the output from the browser microphone and converts the samples to 16-bit integers (instead of 32-bit floats). It's not easy to find in the AWS Transcribe docs, but it appears that AWS Transcribe Streaming is designed to accept 16-bit PCM audio chunks (PCM data can also be represented as 32-bit floats or integers).
  • The downsample function takes the browser microphone output and reduces the number of audio samples per second, because speech recognition typically uses a sample rate of 16000 samples per second, whereas a higher rate like 44100 is more conventional for general audio. (A sketch of both functions follows this list.)
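In case it helps, here's roughly the shape of those two functions (a paraphrase for illustration, not a copy of the repo's source):

```js
// Convert Float32 samples in [-1, 1] (Web Audio's native format) to
// 16-bit signed little-endian PCM, which is what Transcribe streaming accepts.
function pcmEncode(input) {
  const buffer = new ArrayBuffer(input.length * 2);
  const view = new DataView(buffer);
  for (let i = 0; i < input.length; i++) {
    const s = Math.max(-1, Math.min(1, input[i])); // clamp to avoid overflow
    view.setInt16(i * 2, s < 0 ? s * 0x8000 : s * 0x7fff, true); // true = little-endian
  }
  return buffer;
}

// Naive downsampling: keep roughly every Nth sample to go from the capture
// rate (e.g. 44100 Hz) down to the rate the recognizer expects (e.g. 16000 Hz).
function downsampleBuffer(buffer, inputSampleRate = 44100, outputSampleRate = 16000) {
  if (outputSampleRate === inputSampleRate) return buffer;
  const ratio = inputSampleRate / outputSampleRate;
  const newLength = Math.round(buffer.length / ratio);
  const result = new Float32Array(newLength);
  for (let i = 0; i < newLength; i++) {
    result[i] = buffer[Math.floor(i * ratio)];
  }
  return result;
}
```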

Sorry I can't solve your problem but hopefully that context is helpful.

@evgenyfadeev

evgenyfadeev commented Aug 28, 2020

You need to send binary-encoded linear audio data in 16-bit little-endian signed-integer format. I found this by trial and error.

Unfortunately, the streaming Transcribe service documentation doesn't say whether the data should be signed or unsigned, or whether the integers should be big- or little-endian.
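In Node, that packing looks something like this (a sketch; input samples assumed to be floats in [-1, 1]):

```js
// Pack Float32 samples into 16-bit little-endian signed integers,
// the only format the streaming service accepted for me.
function toSigned16LE(samples) {
  const buf = Buffer.alloc(samples.length * 2);
  for (let i = 0; i < samples.length; i++) {
    const s = Math.max(-1, Math.min(1, samples[i])); // clamp
    buf.writeInt16LE(Math.round(s < 0 ? s * 0x8000 : s * 0x7fff), i * 2);
  }
  return buf;
}
```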
