@rogerbarreto rogerbarreto commented Feb 3, 2025

ADR - Introducing Speech To Text Abstraction

Problem Statement

The project requires the ability to transcribe and translate speech audio into text. This work is a proof of concept that validates the ISpeechToTextClient abstraction against different transcription and translation APIs, giving the project a consistent interface to use.

Note

The names used for the proposed abstractions below are open and can be changed at any time given a bigger consensus.

Considered Options

Option 1: Generic Multi Modality Abstraction IModelClient<TInput, TOutput> (Discarded)

This option would have provided a single generic abstraction for all models, including audio transcription. However, it would have made the abstraction too generic, and it raised the following concerns during the meeting:

  • Usability Concerns:

    The generic interface could make the API less intuitive and harder to use, as users would not be guided towards the specific options they need.

  • Naming and Clarity:

    Generic names like "complete streaming" do not convey the specific functionality, making it difficult for users to understand what the method does. Specific names like "transcribe" or "generate song" would be clearer.

  • Implementation Complexity:

    Implementing a generic interface would still require concrete implementations for each permutation of input and output types, which could be complex and cumbersome.

  • Specific Use Cases:

    Different services have specific requirements and optimizations for their modalities, which may not be effectively captured by a generic interface.

  • Future Proofing vs. Practicality:

    While a generic interface aims to be future-proof, it may not be practical for current needs and could lead to an explosion of permutations that are not all relevant.

  • Separation of Streaming and Non-Streaming:

    There was a concern about separating streaming and non-streaming interfaces, as it could complicate the API further.

Option 2: Speech to Text Abstraction ISpeechToTextClient (Preferred)

This option would provide a specific abstraction for audio transcription and audio translations, which would be more intuitive and easier to use. The specific interface would allow for better optimization and customization for each service.

Initially, we considered having two interfaces, one for the streaming API and another for the non-streaming API. After some discussion, it was decided to have a single interface, similar to what we have in IChatClient.

Note

Further modality abstractions will mostly follow this as a standard moving forward.

public interface ISpeechToTextClient : IDisposable
{
    Task<SpeechToTextResponse> GetTextAsync(
        Stream audioSpeechStream, 
        SpeechToTextOptions? options = null, 
        CancellationToken cancellationToken = default);

    IAsyncEnumerable<SpeechToTextResponseUpdate> GetStreamingTextAsync(
        Stream audioSpeechStream,
        SpeechToTextOptions? options = null,
        CancellationToken cancellationToken = default);
}
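
The two call patterns exposed by the interface can be sketched as follows. This is a minimal illustration, not the PR's implementation: the option, response and update types are reduced to simplified stand-ins, and FakeSpeechToTextClient is a hypothetical in-memory client that only reports how many bytes it read.

```csharp
#nullable enable
using System;
using System.Collections.Generic;
using System.IO;
using System.Runtime.CompilerServices;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical client, used only to exercise the call pattern.
using var client = new FakeSpeechToTextClient();
using var audio = new MemoryStream(new byte[] { 1, 2, 3, 4 });

// Non-streaming: one call, one complete response.
SpeechToTextResponse response = await client.GetTextAsync(audio);
Console.WriteLine(response.Text);

// Streaming: updates are consumed as the client produces them.
audio.Position = 0;
await foreach (SpeechToTextResponseUpdate update in client.GetStreamingTextAsync(audio))
{
    Console.WriteLine(update.Text);
}

// Simplified stand-ins for the proposed types (assumptions, not the real definitions).
public sealed class SpeechToTextOptions { public string? ModelId { get; set; } }
public sealed class SpeechToTextResponse { public string Text { get; init; } = ""; }
public sealed class SpeechToTextResponseUpdate { public string Text { get; init; } = ""; }

public interface ISpeechToTextClient : IDisposable
{
    Task<SpeechToTextResponse> GetTextAsync(
        Stream audioSpeechStream,
        SpeechToTextOptions? options = null,
        CancellationToken cancellationToken = default);

    IAsyncEnumerable<SpeechToTextResponseUpdate> GetStreamingTextAsync(
        Stream audioSpeechStream,
        SpeechToTextOptions? options = null,
        CancellationToken cancellationToken = default);
}

public sealed class FakeSpeechToTextClient : ISpeechToTextClient
{
    public Task<SpeechToTextResponse> GetTextAsync(
        Stream audioSpeechStream,
        SpeechToTextOptions? options = null,
        CancellationToken cancellationToken = default)
        => Task.FromResult(new SpeechToTextResponse { Text = $"{audioSpeechStream.Length} bytes received" });

    public async IAsyncEnumerable<SpeechToTextResponseUpdate> GetStreamingTextAsync(
        Stream audioSpeechStream,
        SpeechToTextOptions? options = null,
        [EnumeratorCancellation] CancellationToken cancellationToken = default)
    {
        // Read the audio in small chunks and emit one update per chunk.
        var buffer = new byte[2];
        int read;
        while ((read = await audioSpeechStream.ReadAsync(buffer, cancellationToken)) > 0)
        {
            yield return new SpeechToTextResponseUpdate { Text = $"chunk of {read} bytes" };
        }
    }

    public void Dispose() { }
}
```

Keeping both methods on one interface means a caller can switch between the two patterns without changing the client it depends on, which mirrors the IChatClient shape mentioned above.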

Inputs:

  • Stream audioSpeechStream allows audio data to be streamed to the service.

    This API enables the use of large audio files or real-time transcription (without having to load the full file in memory) and can easily be extended to support different audio input types, such as a single DataContent or a Stream instance.

    It supports scenarios like:

    • A single audio payload held in memory (no upstream audio streaming)
    • One audio source streamed in multiple audio content chunks - Real-time Transcription
    • Single or multiple referenced audio URIs - Batch Transcription

    DataContent type input extension

    // Non-Streaming API
    public static Task<SpeechToTextResponse> GetTextAsync(
        this ISpeechToTextClient client,
        DataContent audioSpeechContent, 
        SpeechToTextOptions? options = null, 
        CancellationToken cancellationToken = default);
    
    // Streaming API
    public static IAsyncEnumerable<SpeechToTextResponseUpdate> GetStreamingTextAsync(
        this ISpeechToTextClient client,
        DataContent audioSpeechContent, 
        SpeechToTextOptions? options = null, 
        CancellationToken cancellationToken = default);

  • SpeechToTextOptions, analogous to the existing ChatOptions, allows additional options such as language, model, or other parameters to be passed to the service on both the streaming and non-streaming APIs.

    public class SpeechToTextOptions
    {
        /// <summary>Gets or sets the model ID for the speech to text.</summary>
        public string? ModelId { get; set; }
    
        /// <summary>Gets or sets the language of source speech.</summary>
        public string? SpeechLanguage { get; set; }
    
        /// <summary>Gets or sets the language for the target generated text.</summary>
        public string? TextLanguage { get; set; }
    
        /// <summary>Gets or sets the sample rate of the speech input audio.</summary>
        public int? SpeechSampleRate { get; set; }
    
        /// <summary>Gets or sets any additional properties associated with the options.</summary>
        public AdditionalPropertiesDictionary? AdditionalProperties { get; set; }
    
        /// <summary>Produces a clone of the current <see cref="SpeechToTextOptions"/> instance.</summary>
        /// <returns>A clone of the current <see cref="SpeechToTextOptions"/> instance.</returns>
        public virtual SpeechToTextOptions Clone();
    }

    • ModelId is a unique identifier for the model to use for transcription.

    • SpeechLanguage is the language of the audio content.

    • TextLanguage is the language of the generated text.

    • SpeechSampleRate is the sample rate of the audio content. Real-time speech to text generation requires a specific sample rate.
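
One way the DataContent extension could delegate to the Stream-based method is sketched below. It is an assumption-laden illustration rather than the code in this PR: DataContent is reduced to a stand-in that exposes its bytes as ReadOnlyMemory<byte>, the interface and options are trimmed to what the example needs, and FakeClient is hypothetical.

```csharp
#nullable enable
using System;
using System.IO;
using System.Threading;
using System.Threading.Tasks;

// Usage with a hypothetical fake client and illustrative option values.
var client = new FakeClient();
var content = new DataContent(new byte[] { 1, 2, 3 }, "audio/wav");
var options = new SpeechToTextOptions { ModelId = "example-model", SpeechLanguage = "en" };

SpeechToTextResponse response = await client.GetTextAsync(content, options);
Console.WriteLine(response.Text);

public static class SpeechToTextClientExtensions
{
    // Wraps the in-memory audio bytes in a read-only stream and forwards to the Stream overload.
    public static async Task<SpeechToTextResponse> GetTextAsync(
        this ISpeechToTextClient client,
        DataContent audioSpeechContent,
        SpeechToTextOptions? options = null,
        CancellationToken cancellationToken = default)
    {
        ArgumentNullException.ThrowIfNull(client);
        ArgumentNullException.ThrowIfNull(audioSpeechContent);

        // ToArray copies the bytes; a real implementation might avoid the copy.
        using var stream = new MemoryStream(audioSpeechContent.Data.ToArray(), writable: false);
        return await client.GetTextAsync(stream, options, cancellationToken).ConfigureAwait(false);
    }
}

// Simplified stand-ins (assumptions, not the real definitions).
public sealed class DataContent
{
    public DataContent(ReadOnlyMemory<byte> data, string mediaType)
    {
        Data = data;
        MediaType = mediaType;
    }

    public ReadOnlyMemory<byte> Data { get; }
    public string MediaType { get; }
}

public sealed class SpeechToTextOptions
{
    public string? ModelId { get; set; }
    public string? SpeechLanguage { get; set; }
}

public sealed class SpeechToTextResponse { public string Text { get; init; } = ""; }

public interface ISpeechToTextClient
{
    Task<SpeechToTextResponse> GetTextAsync(
        Stream audioSpeechStream,
        SpeechToTextOptions? options = null,
        CancellationToken cancellationToken = default);
}

public sealed class FakeClient : ISpeechToTextClient
{
    public Task<SpeechToTextResponse> GetTextAsync(
        Stream audioSpeechStream,
        SpeechToTextOptions? options = null,
        CancellationToken cancellationToken = default)
        => Task.FromResult(new SpeechToTextResponse
        {
            Text = $"{audioSpeechStream.Length} bytes, language {options?.SpeechLanguage ?? "unspecified"}",
        });
}
```

Because the extension only adapts the input, every client that implements the Stream overload gets DataContent support without extra work.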

Outputs:

  • SpeechToTextResponse is used by the non-streaming API. Analogous to the existing ChatResponse, it provides the generated text and additional information about the speech response.

    public class SpeechToTextResponse
    {
        [JsonConstructor]
        public SpeechToTextResponse();
    
        public SpeechToTextResponse(IList<AIContent> contents);
    
        public SpeechToTextResponse(string? content);
    
        /// <summary>Gets or sets the ID of the generated text response.</summary>
        public string? ResponseId { get; set; }
    
        /// <summary>Gets or sets the model ID used in the creation of the speech to text.</summary>
        public string? ModelId { get; set; }
    
        /// <summary>Gets or sets the start time of the text segment associated with this response in relation to the full audio speech length.</summary>
        public TimeSpan? StartTime { get; set; }
    
        /// <summary>Gets or sets the end time of the text segment associated with this response in relation to the full audio speech length.</summary>
        public TimeSpan? EndTime { get; set; }
    
        /// <summary>Gets or sets the raw representation of the speech to text completion from an underlying implementation.</summary>
        [JsonIgnore]
        public object? RawRepresentation { get; set; }
    
        /// <summary>Gets or sets any additional properties associated with the speech to text completion.</summary>
        public AdditionalPropertiesDictionary? AdditionalProperties { get; set; }
    
        /// <summary>Gets or sets the generated content items.</summary>
        [AllowNull]
        public IList<AIContent> Contents { get; set; }
    
        /// <summary>Gets the text of this speech to text response.</summary>
        [JsonIgnore]
        public string Text => Contents?.ConcatText() ?? string.Empty;
    }
    • ResponseId is a unique identifier for the response.

    • ModelId is a unique identifier for the model used for transcription.

    • StartTime and EndTime are the timestamps at which the text starts and ends, relative to the length of the speech audio.

      e.g. if the audio starts with instrumental music for the first 30 seconds before any speech, the transcription should start at 30 seconds; the same applies to the end time.

Note

TimeSpan is used to represent the timestamps because it is more intuitive and easier to work with; services report time in milliseconds, ticks, or other formats.
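
As a small illustration of that normalization, the sketch below converts three hypothetical raw values, each describing a segment that starts 30 seconds into the audio, into the same TimeSpan. The variable names and the units attributed to each service are assumptions made for the example.

```csharp
using System;

// Hypothetical raw values as different services might report the same start offset.
double startMilliseconds = 30_000;   // a service reporting milliseconds
long startTicks = 300_000_000;       // a service reporting 100-nanosecond ticks
double startSeconds = 30.0;          // a service reporting fractional seconds

TimeSpan fromMilliseconds = TimeSpan.FromMilliseconds(startMilliseconds);
TimeSpan fromTicks = TimeSpan.FromTicks(startTicks);
TimeSpan fromSeconds = TimeSpan.FromSeconds(startSeconds);

// All three normalize to the same value, so callers never need to know the original unit.
bool allEqual = fromMilliseconds == fromTicks && fromTicks == fromSeconds;
Console.WriteLine(allEqual);
Console.WriteLine(fromMilliseconds);
```

Each client implementation would do this conversion once when mapping the service payload to StartTime and EndTime.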

  • SpeechToTextResponseUpdate is used by the streaming API. Analogous to the existing ChatResponseUpdate, it provides the speech to text result as multiple update chunks that represent the generated content as well as any important information about the processing.

    public class SpeechToTextResponseUpdate
    {
        [JsonConstructor]
        public SpeechToTextResponseUpdate();
    
        public SpeechToTextResponseUpdate(IList<AIContent> contents);
    
        public SpeechToTextResponseUpdate(string? content);
    
        /// <summary>Gets or sets the kind of the generated text update.</summary>
        public SpeechToTextResponseUpdateKind Kind { get; set; } = SpeechToTextResponseUpdateKind.TextUpdating;
    
        /// <summary>Gets or sets the ID of the generated text response of which this update is a part.</summary>
        public string? ResponseId { get; set; }
    
        /// <summary>Gets or sets the start time of the text segment associated with this update in relation to the full audio speech length.</summary>
        public TimeSpan? StartTime { get; set; }
    
        /// <summary>Gets or sets the end time of the text segment associated with this update in relation to the full audio speech length.</summary>
        public TimeSpan? EndTime { get; set; }
    
        /// <summary>Gets or sets the model ID used in the creation of the speech to text of which this update is a part.</summary>
        public string? ModelId { get; set; }
    
        /// <summary>Gets or sets the raw representation of the generated text update from an underlying implementation.</summary>
        [JsonIgnore]
        public object? RawRepresentation { get; set; }
    
        /// <summary>Gets or sets additional properties for the update.</summary>
        public AdditionalPropertiesDictionary? AdditionalProperties { get; set; }
    
        /// <summary>Gets the text of this speech to text response.</summary>
        [JsonIgnore]
        public string Text => Contents?.ConcatText() ?? string.Empty;
    
        /// <summary>Gets or sets the generated content items.</summary>
        [AllowNull]
        public IList<AIContent> Contents { get; set; }
    }
    • ResponseId is a unique identifier for the speech to text response.

    • StartTime and EndTime are the timestamps at which the given transcribed chunk starts and ends, relative to the audio length.

      e.g. if the audio starts with instrumental music for the first 30 seconds before any speech, the transcription chunk will be flushed with a StartTime of 30 seconds, and its EndTime will mark the last word of the chunk.

Note

TimeSpan is used to represent the timestamps because it is more intuitive and easier to work with; services report time in milliseconds, ticks, or other formats.

    • Contents is a list of AIContent objects that represent the transcription result. Most use cases will have a single TextContent object, whose text can be retrieved from the Text property, similar to Text in ChatMessage.

    • Kind is a struct, similar to ChatRole.

      [JsonConverter(typeof(Converter))]
      public readonly struct SpeechToTextResponseUpdateKind : IEquatable<SpeechToTextResponseUpdateKind>
      {
          /// <summary>Gets when the generated text session is opened.</summary>
          public static SpeechToTextResponseUpdateKind SessionOpen { get; } = new("sessionopen");
      
          /// <summary>Gets when a non-blocking error occurs during speech to text updates.</summary>
          public static SpeechToTextResponseUpdateKind Error { get; } = new("error");
      
          /// <summary>Gets when the text update is in progress, without waiting for silence.</summary>
          public static SpeechToTextResponseUpdateKind TextUpdating { get; } = new("textupdating");
      
          /// <summary>Gets when the text was generated after a small period of silence.</summary>
          public static SpeechToTextResponseUpdateKind TextUpdated { get; } = new("textupdated");
      
          /// <summary>Gets when the generated text session is closed.</summary>
          public static SpeechToTextResponseUpdateKind SessionClose { get; } = new("sessionclose");
      
          // Similar implementation to ChatRole
      }

      General Update Kinds:

      • SessionOpen - When the transcription session is opened.

      • TextUpdating - When the speech to text is in progress, without waiting for silence. (Preferably for UI updates)

        Different APIs use different names for this, e.g.:

        • AssemblyAI: PartialTranscriptReceived
        • Whisper.net: SegmentData
        • Azure AI Speech: RecognizingSpeech

      • TextUpdated - When a speech to text block is complete after a small period of silence.

        Different APIs use different names for this, e.g.:

        • AssemblyAI: FinalTranscriptReceived
        • Whisper.net: N/A (Not supported by the internal API)
        • Azure AI Speech: RecognizedSpeech

      • SessionClose - When the transcription session is closed.

      • Error - When an error occurs during the speech to text process.

        Errors can happen during streaming and normally won't block the ongoing process, but they can carry more detailed information about what went wrong. For this reason, instead of throwing an exception, the error can be reported as part of the ongoing stream using a dedicated content type, ErrorContent.

        public class ErrorContent : AIContent
        {
            /// <summary>The error message.</summary>
            private string _message;
        
            /// <summary>Initializes a new instance of the <see cref="ErrorContent"/> class with the specified message.</summary>
            /// <param name="message">The message to store in this content.</param>
            [JsonConstructor]
            public ErrorContent(string message)
            {
                _message = Throw.IfNull(message);
            }
        
            /// <summary>Gets or sets the error message.</summary>
            public string Message
            {
                get => _message;
                set => _message = Throw.IfNull(value);
            }
        
            /// <summary>Gets or sets the error code.</summary>
            public string? ErrorCode { get; set; }
        
            /// <summary>Gets or sets the error details.</summary>
            public string? Details { get; set; }
        }
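
A consumer of the streaming API might branch on Kind roughly as sketched below. The types are simplified stand-ins for the ones proposed above (the kind is modeled as a record struct, which needs C# 10 or later), and FakeUpdatesAsync is a hypothetical update sequence in place of a real client.

```csharp
#nullable enable
using System;
using System.Collections.Generic;
using System.Linq;
using System.Threading.Tasks;

var finalSegments = new List<string>();
var errors = new List<string>();

await foreach (SpeechToTextResponseUpdate update in FakeUpdatesAsync())
{
    if (update.Kind == SpeechToTextResponseUpdateKind.TextUpdating)
    {
        // Interim text: suitable for refreshing a UI, not for storing.
        Console.WriteLine($"(partial) {update.Text}");
    }
    else if (update.Kind == SpeechToTextResponseUpdateKind.TextUpdated)
    {
        finalSegments.Add(update.Text);
    }
    else if (update.Kind == SpeechToTextResponseUpdateKind.Error)
    {
        // Non-blocking error: record it and keep consuming the stream.
        errors.AddRange(update.Contents.OfType<ErrorContent>().Select(e => e.Message));
    }
}

string finalText = string.Join(" ", finalSegments);
Console.WriteLine(finalText);
Console.WriteLine(errors.Count);

// Hypothetical update sequence standing in for a real streaming client.
static async IAsyncEnumerable<SpeechToTextResponseUpdate> FakeUpdatesAsync()
{
    await Task.Yield();
    yield return new SpeechToTextResponseUpdate(SpeechToTextResponseUpdateKind.SessionOpen);
    yield return new SpeechToTextResponseUpdate(SpeechToTextResponseUpdateKind.TextUpdating, new TextContent("hello wor"));
    yield return new SpeechToTextResponseUpdate(SpeechToTextResponseUpdateKind.Error, new ErrorContent("audio chunk dropped"));
    yield return new SpeechToTextResponseUpdate(SpeechToTextResponseUpdateKind.TextUpdated, new TextContent("hello world"));
    yield return new SpeechToTextResponseUpdate(SpeechToTextResponseUpdateKind.SessionClose);
}

// Simplified stand-ins (assumptions, not the real definitions).
public readonly record struct SpeechToTextResponseUpdateKind(string Value)
{
    public static SpeechToTextResponseUpdateKind SessionOpen { get; } = new("sessionopen");
    public static SpeechToTextResponseUpdateKind Error { get; } = new("error");
    public static SpeechToTextResponseUpdateKind TextUpdating { get; } = new("textupdating");
    public static SpeechToTextResponseUpdateKind TextUpdated { get; } = new("textupdated");
    public static SpeechToTextResponseUpdateKind SessionClose { get; } = new("sessionclose");
}

public abstract class AIContent { }

public sealed class TextContent : AIContent
{
    public TextContent(string text) => Text = text;
    public string Text { get; }
}

public sealed class ErrorContent : AIContent
{
    public ErrorContent(string message) => Message = message;
    public string Message { get; }
}

public sealed class SpeechToTextResponseUpdate
{
    public SpeechToTextResponseUpdate(SpeechToTextResponseUpdateKind kind, params AIContent[] contents)
    {
        Kind = kind;
        Contents = contents;
    }

    public SpeechToTextResponseUpdateKind Kind { get; }
    public IReadOnlyList<AIContent> Contents { get; }

    // Concatenates the text of all TextContent items, ignoring other content types.
    public string Text => string.Concat(Contents.OfType<TextContent>().Select(c => c.Text));
}
```

Reporting the error as content keeps the enumeration alive, so the final TextUpdated segment is still delivered after the failed chunk.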

      Specific API categories:

@rogerbarreto rogerbarreto changed the title from "M.E.AI - Audio transcription abstraction (WIP) - Missing UT + IT" to "M.E.AI.Abstractions - Audio transcription abstraction (WIP) - Missing UT + IT" on Feb 3, 2025
@rogerbarreto
Contributor Author

@dotnet-policy-service agree company="Microsoft"

@dotnet-comment-bot
Collaborator

‼️ Found issues ‼️

Project                                   Coverage Type  Expected  Actual
Microsoft.Extensions.AI.Abstractions      Line           83        69.66 🔻
Microsoft.Extensions.AI.Abstractions      Branch         83        66.2 🔻
Microsoft.Gen.MetadataExtractor           Line           98        57.35 🔻
Microsoft.Gen.MetadataExtractor           Branch         98        62.5 🔻
Microsoft.Extensions.AI.Ollama            Line           80        78.25 🔻

🎉 Good job! The coverage increased 🎉
Update MinCodeCoverage in the project files.

Project                                   Expected  Actual
Microsoft.Extensions.AI                   88        89
Microsoft.Extensions.AI.OpenAI            77        78

Full code coverage report: https://dev.azure.com/dnceng-public/public/_build/results?buildId=938431&view=codecoverage-tab

@RussKie RussKie added the area-ai Microsoft.Extensions.AI libraries label Feb 4, 2025
@dotnet-comment-bot
Collaborator

‼️ Found issues ‼️

Project                                   Coverage Type  Expected  Actual
Microsoft.Extensions.Caching.Hybrid       Line           86        82.77 🔻
Microsoft.Extensions.AI.Abstractions      Line           83        81.95 🔻
Microsoft.Extensions.AI.Abstractions      Branch         83        73.8 🔻
Microsoft.Gen.MetadataExtractor           Line           98        57.35 🔻
Microsoft.Gen.MetadataExtractor           Branch         98        62.5 🔻
Microsoft.Extensions.AI.Ollama            Line           80        78.25 🔻

🎉 Good job! The coverage increased 🎉
Update MinCodeCoverage in the project files.

Project                                   Expected  Actual
Microsoft.Extensions.AI.OpenAI            77        78
Microsoft.Extensions.AI                   88        89

Full code coverage report: https://dev.azure.com/dnceng-public/public/_build/results?buildId=942860&view=codecoverage-tab

@dotnet-comment-bot
Collaborator

‼️ Found issues ‼️

Project                                   Coverage Type  Expected  Actual
Microsoft.Extensions.AI.Ollama            Line           80        78.11 🔻
Microsoft.Extensions.Caching.Hybrid       Line           86        82.92 🔻
Microsoft.Extensions.AI.OpenAI            Line           77        74.23 🔻
Microsoft.Extensions.AI.OpenAI            Branch         77        63.08 🔻
Microsoft.Extensions.AI.Abstractions      Line           83        81.36 🔻
Microsoft.Extensions.AI.Abstractions      Branch         83        74.51 🔻
Microsoft.Gen.MetadataExtractor           Line           98        57.35 🔻
Microsoft.Gen.MetadataExtractor           Branch         98        62.5 🔻

🎉 Good job! The coverage increased 🎉
Update MinCodeCoverage in the project files.

Project                                   Expected  Actual
Microsoft.Extensions.AI.AzureAIInference  91        92
Microsoft.Extensions.AI                   88        89

Full code coverage report: https://dev.azure.com/dnceng-public/public/_build/results?buildId=945523&view=codecoverage-tab

@dotnet-comment-bot
Collaborator

‼️ Found issues ‼️

Project                                   Coverage Type  Expected  Actual
Microsoft.Extensions.AI.Abstractions      Branch         83        81.05 🔻
Microsoft.Extensions.Caching.Hybrid       Line           86        82.77 🔻
Microsoft.Extensions.AI.Ollama            Line           80        78.11 🔻
Microsoft.Extensions.AI.OpenAI            Branch         77        70.56 🔻
Microsoft.Extensions.AI                   Line           88        80.31 🔻
Microsoft.Extensions.AI                   Branch         88        87.64 🔻
Microsoft.Gen.MetadataExtractor           Line           98        57.35 🔻
Microsoft.Gen.MetadataExtractor           Branch         98        62.5 🔻

🎉 Good job! The coverage increased 🎉
Update MinCodeCoverage in the project files.

Project                                   Expected  Actual
Microsoft.Extensions.AI.AzureAIInference  91        92

Full code coverage report: https://dev.azure.com/dnceng-public/public/_build/results?buildId=945918&view=codecoverage-tab

@luisquintanilla
Contributor

cc: @Swimburger for visibility. Feedback is appreciated. Thanks!

@rogerbarreto rogerbarreto changed the title from "M.E.AI.Abstractions - Audio transcription abstraction (WIP) - Missing UT + IT" to "M.E.AI.Abstractions - Speech to Text Abstraction (WIP) - Missing UT + IT" on Feb 23, 2025
@dotnet-comment-bot
Collaborator

‼️ Found issues ‼️

Project                                   Coverage Type  Expected  Actual
Microsoft.Extensions.AI.Abstractions      Branch         82        78.82 🔻
Microsoft.Extensions.AI                   Line           89        79.8 🔻
Microsoft.Extensions.AI                   Branch         89        86.67 🔻

🎉 Good job! The coverage increased 🎉
Update MinCodeCoverage in the project files.

Project                                   Expected  Actual
Microsoft.Gen.MetadataExtractor           57        70

Full code coverage report: https://dev.azure.com/dnceng-public/public/_build/results?buildId=960384&view=codecoverage-tab

@rogerbarreto rogerbarreto marked this pull request as ready for review February 26, 2025 09:48
@rogerbarreto rogerbarreto requested review from a team as code owners February 26, 2025 09:48
@rogerbarreto rogerbarreto changed the title from "M.E.AI.Abstractions - Speech to Text Abstraction (WIP) - Missing UT + IT" to "M.E.AI.Abstractions - Speech to Text Abstraction" on Feb 26, 2025
Member

@stephentoub stephentoub left a comment

Nice. Thanks!

@stephentoub stephentoub merged commit aaa1a25 into dotnet:main Apr 2, 2025
6 checks passed
@github-actions github-actions bot locked and limited conversation to collaborators May 3, 2025