This post-call batch analytics component uses the Ingestion Client, a tool that transcribes your audio files without any development effort. The Ingestion Client monitors your dedicated Azure Storage container so that new audio files are transcribed automatically as soon as they land.
Think of this tool as an automated and scalable transcription solution for all audio files in your Azure Storage. It is a quick and effortless way to transcribe your audio files, or simply to explore transcription.
We created an ingestion layer (a transcription client) that helps you set up a full-blown, scalable, and secure transcription pipeline. Through an ARM template deployment, all the resources necessary to seamlessly process your audio files are configured and turned on.
The Ingestion Client is optimized to use the capabilities of the Azure Speech infrastructure. It uses Azure resources to orchestrate transcription requests to the Azure Speech service using audio files as they appear in your dedicated storage containers. You can also set up additional processing beyond transcription, such as sentiment analysis and other text analytics.
The following diagram shows the structure of this tool as defined by the ARM template.
When a file lands in a storage container, an Event Grid event signals the completed upload. The event is filtered and pushed to a Service Bus topic. Code in Azure Functions, triggered by a timer, picks up the event and creates a transcription request using the Azure Speech services batch pipeline. When the transcription request has been submitted, an event is placed in another queue in the same Service Bus resource. A different Azure Function, triggered by this event, monitors the transcription status. When transcription completes, the Azure Function copies the transcript into the same container from which the audio file was obtained.
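The batch submission step described above can be sketched in a few lines. This is an illustrative Python sketch (not the Ingestion Client source) of the JSON body such a request carries under the Speech to Text API v3.0; the display name and default property values are assumptions.

```python
# Illustrative sketch (not the Ingestion Client source) of the JSON body a
# batch transcription request carries under the Speech to Text API v3.0.
# The display name and default property values are assumptions.
def build_transcription_request(audio_sas_urls, locale="en-US"):
    """Build the body for POST /speechtotext/v3.0/transcriptions."""
    return {
        "contentUrls": audio_sas_urls,            # SAS URLs of the uploaded audio
        "locale": locale,                         # language model for transcription
        "displayName": "ingestion-client-batch",  # assumed, for illustration only
        "properties": {
            "diarizationEnabled": False,
            "wordLevelTimestampsEnabled": False,
            "punctuationMode": "DictatedAndAutomatic",
            "profanityFilterMode": "Masked",
        },
    }

body = build_transcription_request(
    ["https://myaccount.blob.core.windows.net/audio-input/call1.mp3?sv=..."]
)
```

The actual Azure Function fills these fields from the settings you choose in the ARM template form later in this guide.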
The remaining features are applied on demand. By deploying additional resources through the ARM template, you can choose to run analytics on the transcript, produce reports, or redact personally identifiable information (PII).
This solution can transcribe audio files automatically and at scale.
Since source code is provided in this repo, you can customize the Ingestion Client.
This tool follows these best practices:
- Optimizes the number of audio files included in each transcription request so that the shortest possible SAS TTL can be used.
- Uses optimized retry logic to handle smooth scaling and transient HTTP 429 (throttling) errors.
- Runs Azure Functions economically, ensuring minimal execution costs.
- Distributes load across available regions using a round-robin algorithm.
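The last two practices can be illustrated with a minimal Python sketch of round-robin region selection combined with retry-on-429 backoff. The region list and the `submit` callable are placeholders, not the Ingestion Client's actual code.

```python
import itertools
import time

# Illustrative sketch of two of the practices above: round-robin region
# selection and retrying transient HTTP 429 responses with exponential
# backoff. The region list and the `submit` callable are placeholders.
regions = ["westeurope", "eastus", "southeastasia"]  # assumed region list
region_cycle = itertools.cycle(regions)

def next_region():
    """Return the next region in round-robin order to spread the load."""
    return next(region_cycle)

def submit_with_retry(submit, max_retries=5, base_delay=1.0):
    """Call `submit(region)` until it stops returning HTTP 429, backing off."""
    for attempt in range(max_retries):
        status = submit(next_region())
        if status != 429:
            return status
        time.sleep(base_delay * 2 ** attempt)  # exponential backoff
    raise RuntimeError("still throttled after retries")
```

Exponential backoff keeps the client from hammering a throttled endpoint, while the round-robin cycle spreads submissions across the regions you have keys for.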
Follow these steps to set up and run the tool using ARM templates.
An Azure account and an Azure Speech services key are needed to run the Ingestion Client.
NOTE: You need to create a Speech resource with a paid (S0) key. A free key will not work. Optionally, for analytics, you can create a Text Analytics resource too.
If the above link does not work try the following steps:
- Go to Azure portal
- Click on Create a Resource.
- Type Speech and select Speech.
- On the Speech resource, click Create.
- You will find the subscription key under Keys.
- You will also need the region, so make a note of that too.
- You need to choose the operating mode (described in the next section).
To test, we recommend you use Microsoft Azure Storage Explorer.
Audio files can be processed either with the Speech to Text API v3.0 for batch processing, or with our Speech SDK for real-time processing. This section lists the differences to help you choose an operating mode.
In batch mode, audio files are processed in batches. The Azure Function creates a transcription request periodically with all the files that have been requested up to that point. If the number of files is large then many requests will be raised. Consider the following about batch mode:
- Low Azure Function costs. Two Azure Functions coordinate the process and run for milliseconds.
- Diarization and Sentiment. Offered in Batch Mode only.
- Higher Latency. Transcripts are scheduled and executed based on the capacity of the cluster; real-time mode takes priority.
- Multiple Audio Formats are supported.
- You will need to deploy the Batch ARM Template from the repository for this operating mode.
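The batching behavior described above, where many pending files become several transcription requests, can be sketched simply. The cap of 1,000 files per request below is an assumed value for illustration, not a documented limit.

```python
# Sketch of how queued audio files could be grouped into batch transcription
# requests. The cap of 1,000 files per request is an assumed value for
# illustration, not a documented limit.
def chunk_into_requests(files, max_files_per_request=1000):
    """Split the pending files into one or more batch request payloads."""
    return [files[i:i + max_files_per_request]
            for i in range(0, len(files), max_files_per_request)]

batches = chunk_into_requests([f"call{i}.mp3" for i in range(2500)])
# 2,500 pending files become three requests of 1,000, 1,000 and 500 files
```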
Refer to the instructions here if you want to try Real Time Mode. It incurs higher Azure Function costs and supports only the .wav PCM audio format.
Batch mode processes transcription requests on a best-effort basis, using the compute you request when the transcription is scheduled; available compute is allocated directly.
In real-time mode, each Azure Speech resource is allocated a default of 100 concurrent connections, which is the maximum number of parallel audio transcription streams. Customers can request a higher limit. To avoid throttling, control the cadence of new audio file uploads to Azure Storage, because each upload triggers a real-time transcription; throttling occurs when the concurrency limit is reached.
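One way to stay under that concurrency limit is to gate transcription streams with a bounded semaphore, as in the sketch below. This is illustrative only; `do_transcribe` is a placeholder callable, not an SDK function.

```python
import threading

# Illustrative sketch of respecting the real-time concurrency limit: a
# bounded semaphore caps simultaneous transcription streams at the default
# of 100 concurrent connections. `do_transcribe` is a placeholder callable,
# not an SDK function.
MAX_CONCURRENT_STREAMS = 100
stream_slots = threading.BoundedSemaphore(MAX_CONCURRENT_STREAMS)

def transcribe_stream(audio_file, do_transcribe):
    """Run one real-time transcription, blocking while all slots are in use."""
    with stream_slots:  # slot is released when the transcription finishes
        return do_transcribe(audio_file)
```

Because a slot is released as each transcription finishes, uploads beyond the limit simply wait instead of being throttled with HTTP 429 responses.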
The batch and real-time ARM templates are nearly the same. The main differences are the lack of diarization and sentiment options in Real Time Mode, as well as of downstream post-processing through SQL. With that in mind, follow the instructions below to deploy the resources from the ARM template.
- In the Azure portal, click Create a Resource. In the search box, type template deployment, and select the Template deployment resource.
- On the screen that appears, click the Create button.
- You will be creating Azure resources from the ARM template we provide. Click the Build your own template in the editor link.
- Load the template by clicking Load file. Alternatively, you could copy/paste the template in the editor.
- Once the template text is loaded you will be able to read and edit the template. Do NOT attempt any edits at this stage. You need to save the template you loaded, so click the Save button.
Saving the template will result in the screen below. You will need to fill in the form provided. It is important that all the information is correct. Let us look at the form and go through each field.
FILL OUT ALL REQUIRED FIELDS HIGHLIGHTED IN THE RED BOX. The SQL Admin username cannot be 'admin'; use sqladmin or similar.
NOTE: Use short, descriptive names in the form for your resource group. Long resource group names can result in deployment errors.
- Pick the Azure Subscription Id where you will create the resources.
- Either pick or create a resource group. (It would be better to have all the Ingestion Client resources within the same resource group, so we suggest you create a new resource group.)
- Pick a region. This can be the same region as your Azure Speech key.
The following settings all relate to the resources and their attributes:
- Give your storage account a name. You will be using a new storage account rather than an existing one.
The following two steps are optional. If you omit them, the tool uses the base model to obtain transcripts. If you have created a custom Speech model, enter it here.
Transcripts are obtained by polling the service, which has an associated cost. The following setting lets you limit that cost by telling your Azure Function how often to fire.
- Enter the polling frequency. In many scenarios, polling a couple of times a day is sufficient.
- Enter the locale of the audio. This tells us which language model to use to transcribe your audio.
- Enter your Azure Speech subscription key and locale information.
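As an aside on the polling frequency: Azure Functions timer triggers are configured with six-field NCRONTAB expressions ({second} {minute} {hour} {day} {month} {day-of-week}), so the frequency you enter ultimately becomes a schedule string. The helper below is our own illustration, not part of the template.

```python
# Hypothetical helper showing how a polling frequency in hours maps to the
# six-field NCRONTAB expression an Azure Functions timer trigger expects.
def polling_schedule(hours):
    """Return an NCRONTAB expression that fires every `hours` hours."""
    if not 1 <= hours <= 23:
        raise ValueError("expected between 1 and 23 hours between polls")
    return f"0 0 */{hours} * * *"

print(polling_schedule(12))  # twice a day -> "0 0 */12 * * *"
```

A lower frequency means fewer function executions and lower polling cost, at the price of transcripts appearing in the output container later.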
The rest of the settings relate to the transcription request. You can read more about those in How to use batch transcription.
- Select a profanity option.
- Select a punctuation option.
- Select to Add Diarization [all locales] [Batch Template Only].
- Select to Add Word Level Timestamps [all locales] [Batch Template Only].
If you want to perform Text Analytics, add those credentials.
- Add the Text Analytics key.
- Add the Text Analytics region.
- Add Sentiment [Batch Template Only].
- Add Personally Identifiable Information (PII) Redaction [Batch Template Only].
NOTE: The ARM template also allows you to customize the PII categories through the PiiCategories variable (e.g., to only redact person names and organizations set the value to "Person,Organization"). A full list of all supported categories can be found in the PII Entity Categories.
If you want further analytics, we can map the transcript JSON we produce to a DB schema. [Batch Template Only]
- Enter the SQL DB credential login.
- Enter the SQL DB credential password.
You can feed that data to your custom PowerBI script or take the scripts included in this repository. Follow the PowerBI guide to set it up.
Press Create to create the resources. This typically takes 1-2 minutes. The resources are listed below.
If a Consumption Plan (Y1) was selected for the Azure Functions, make sure that the functions are synced with the other resources (see Trigger syncing for further details).
To do so, click on your StartTranscription function in the portal and wait until your function shows up:
Do the same for the FetchTranscription function:
Important: Until you restart both Azure Functions, you may see errors.
Upload audio files to the newly created audio-input container; results are added to the json-result-output and test-results-output containers. Once they are done, you can test your account.
Use Microsoft Azure Storage Explorer to test uploading files to your new account. Transcription is asynchronous and usually takes about half the duration of the audio track to complete.
The structure of your newly created storage account will look like the picture below.
There are several containers to distinguish between the various outputs. We suggest (for the sake of keeping things tidy) to follow the pattern and use the audio-input container as the only container for uploading your audio.
By default, the ARM template uses the newest version of the Ingestion Client which can be found in this repository. To use a custom version, edit the paths to the binaries inside the deployment template to point to a custom published version. You can find our published binaries here.
To publish a new version, you can use Visual Studio, right-click on the project, click Publish and follow the instructions.
Although you do not need to download or change the code, you can still download it from GitHub:
git clone https://github.com/Azure-Samples/cognitive-services-speech-sdk
cd cognitive-services-speech-sdk/samples/ingestion/ingestion-client
The created resources, their pricing, and corresponding plans (where applicable) are:
- Storage Pricing, Simple Storage
- Service Bus Pricing, Standard
- Azure Functions Pricing, Premium / Consumption
- Key Vault Pricing
Optionally:
The following example illustrates the cost distribution to help set expectations.
Assume a scenario where we are trying to transcribe 1,000 mp3 files with an average length of 10 minutes and size of 10 MB, each landing individually in the storage container over the course of a business day.
- Speech Transcription Costs are: 10k mins = $166.60
- Service Bus Costs are: 1k events landing in 'CreateTranscriptionQueue' and another 1k in 'FetchTranscriptionQueue' = $0.324/day (standing charge) for up to 13M messages/month
- Storage Costs are: write operations are $0.0175 per 10,000, and read operations $0.0014 per 10,000 = ($0.0175 + $0.0014)/10 (for 1,000 files) = $0.00189
- Azure Functions For Consumption, the costs are: The first 400,000 GB/s of execution and 1,000,000 executions are free = $0.00. For Premium functions, the base cost is: 2 instances in EP1 x 1 hour: $0.43
- Key Vault Costs are: $0.03 per 10,000 transactions (for the above scenario, one transaction is required per file) = $0.003
The total for the above scenario would be $166.60, with the majority of the cost being on transcription. If Premium functions are hosted, there is an additional cost of $310.54 per month.
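The arithmetic above can be reproduced in a few lines. All rates below are the ones quoted in this scenario, not authoritative Azure pricing.

```python
# Reproducing the example cost breakdown above. All rates are the ones
# quoted in this scenario, not authoritative Azure pricing.
files, minutes_per_file = 1000, 10

transcription = files * minutes_per_file / 10_000 * 166.60  # $166.60 per 10k minutes
storage = (0.0175 + 0.0014) / 10                            # 1,000 ops at per-10k rates
key_vault = 0.03 / 10_000 * files                           # one transaction per file
service_bus = 0.324                                         # daily standing charge

total = transcription + storage + key_vault + service_bus
print(round(total, 2))  # ~166.93/day, dominated by transcription
```

Everything except the transcription charge is effectively noise in this scenario, which is why the summary above rounds the total to the transcription cost alone.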
We hope the above scenario gives you an idea of the cost distribution. Of course, costs will vary depending on your scenario and usage pattern. You can also use our Azure Calculator to better understand pricing.