Multi-Provider Transcription Support (REST/Batch Path)#31
Multi-Provider Transcription Support (REST/Batch Path)#31Amygos wants to merge 11 commits intonethesis:mainfrom
Conversation
Add provider abstraction to support multiple transcription backends. Introduce TranscriptionProvider base class and TranscriptionResult dataclass in new transcription package. This enables flexible provider selection and cleaner separation of concerns while maintaining backward compatibility with existing Deepgram integration. The abstraction supports: - Environment-based provider selection (TRANSCRIPTION_PROVIDER) - Per-request provider override via API parameters - Consistent interface across different backends
Move Deepgram-specific transcription logic from api.py into dedicated DeepgramProvider class. Preserve all existing functionality including parameter mapping, response parsing, and timeout handling. This maintains full backward compatibility while enabling other providers to follow the same pattern.
Replace hardcoded Deepgram logic with generic provider interface. Update get_transcription endpoint to support provider selection via environment variable or per-request parameter. Maintain all existing error handling, persistence, and enrichment workflows while enabling seamless provider switching.
Add 12 new unit tests covering provider factory, Deepgram provider, and VoxTral provider implementations. Tests verify correct API calls, response parsing, parameter handling, and error scenarios. Update existing api.py tests to work with new provider abstraction. Achieve 100% coverage for transcription module.
Update configuration documentation to explain provider selection via TRANSCRIPTION_PROVIDER environment variable. Add examples for using VoxTral provider and per-request provider override. Document required API keys for each provider and supported parameters.
Update copilot-instructions.md to reflect new transcription package and multi-provider support in REST/batch path. Note that real-time streaming path remains Deepgram-only for now. Include details about provider-agnostic post-processing like channel name replacement.
Add MISTRAL_API_KEY environment variable to Containerfile with empty default. Add TRANSCRIPTION_PROVIDER env var defaulting to deepgram. Maintain backward compatibility while enabling VoxTral usage in containerized deployments.
Ensure the transcription/ package is copied to the container image in both the builder and final stages of the multi-stage build. This allows the container to properly import the transcription provider abstraction when the application starts. Previously, the package was missing from the container, causing import errors when running the container. - Add COPY transcription /tmp/transcription to builder stage - Add COPY --from=builder /tmp/transcription /app/transcription to final stage - Maintains multi-stage build optimization for minimal image size
There was a problem hiding this comment.
Pull request overview
Adds a transcription provider abstraction to the REST/batch /api/get_transcription endpoint so callers can use either Deepgram or VoxTral (Mistral) via env var default or per-request override, and updates tests/docs/container packaging accordingly.
Changes:
- Introduces
transcription/provider interface + factory (get_provider) with Deepgram and VoxTral REST implementations. - Refactors
api.pyto route transcription through the provider abstraction and apply provider-agnostic post-processing/error handling. - Updates tests, README, and container build to reflect multi-provider support and new configuration.
Reviewed changes
Copilot reviewed 10 out of 10 changed files in this pull request and generated 6 comments.
Show a summary per file
| File | Description |
|---|---|
api.py |
Refactors /api/get_transcription to select a provider via get_provider() and handle provider-agnostic processing/errors. |
transcription/base.py |
Defines the provider interface and TranscriptionResult DTO. |
transcription/__init__.py |
Adds the provider factory (get_provider) and exports the abstraction. |
transcription/deepgram.py |
Implements Deepgram REST transcription provider. |
transcription/voxtral.py |
Implements VoxTral (Mistral) REST transcription provider + diarization formatting. |
tests/test_api.py |
Updates API endpoint tests to patch the provider abstraction and Deepgram REST calls. |
tests/test_transcription.py |
Adds unit tests for provider factory and both provider implementations. |
README.md |
Documents provider selection (env + per-request) and provider-specific parameters. |
Containerfile |
Copies transcription/ into the image and adds provider-related env vars. |
.github/copilot-instructions.md |
Documents architecture, conventions, and testing practices for the project. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| raw_transcription = raw_transcription.replace("Channel 0:", f"{channel0_name}:") | ||
| raw_transcription = raw_transcription.replace("Speaker 0:", f"{channel0_name}:") | ||
| if channel1_name: | ||
| raw_transcription = raw_transcription.replace("Channel 1:", f"{channel1_name}:") | ||
| raw_transcription = raw_transcription.replace("Speaker 1:", f"{channel1_name}:") |
There was a problem hiding this comment.
Channel/speaker renaming won’t catch Deepgram’s uppercase labels (e.g., "SPEAKER 0:" / "SPEAKER 1:"), so channel0_name / channel1_name won’t apply for the transcript format used in tests and likely production. Consider handling both "Speaker" and "SPEAKER" (or doing a case-insensitive replace / regex anchored on the label).
| raw_transcription = raw_transcription.replace("Channel 0:", f"{channel0_name}:") | |
| raw_transcription = raw_transcription.replace("Speaker 0:", f"{channel0_name}:") | |
| if channel1_name: | |
| raw_transcription = raw_transcription.replace("Channel 1:", f"{channel1_name}:") | |
| raw_transcription = raw_transcription.replace("Speaker 1:", f"{channel1_name}:") | |
| raw_transcription = raw_transcription.replace("Channel 0:", f"{channel0_name}:") | |
| raw_transcription = raw_transcription.replace("CHANNEL 0:", f"{channel0_name}:") | |
| raw_transcription = raw_transcription.replace("Speaker 0:", f"{channel0_name}:") | |
| raw_transcription = raw_transcription.replace("SPEAKER 0:", f"{channel0_name}:") | |
| if channel1_name: | |
| raw_transcription = raw_transcription.replace("Channel 1:", f"{channel1_name}:") | |
| raw_transcription = raw_transcription.replace("CHANNEL 1:", f"{channel1_name}:") | |
| raw_transcription = raw_transcription.replace("Speaker 1:", f"{channel1_name}:") | |
| raw_transcription = raw_transcription.replace("SPEAKER 1:", f"{channel1_name}:") |
| except Exception: | ||
| logger.exception("Failed to update transcript state=failed") | ||
| raise HTTPException(status_code=500, detail="Failed to parse transcription response.") | ||
| raise HTTPException(status_code=500, detail=f"An error occurred: {str(e)}") |
There was a problem hiding this comment.
The generic exception handler returns the raw exception message to the client (detail=f"An error occurred: {str(e)}"), which can leak internal details. Prefer a generic client-facing message (and keep the full exception only in logs), similar to the earlier behavior.
| raise HTTPException(status_code=500, detail=f"An error occurred: {str(e)}") | |
| raise HTTPException(status_code=500, detail="Unexpected error while processing transcription") |
| def get_provider(name: str | None = None) -> TranscriptionProvider: | ||
| """ | ||
| Get a transcription provider instance. | ||
|
|
||
| Args: | ||
| name: Provider name ("deepgram" or "voxtral"). If None, uses TRANSCRIPTION_PROVIDER env var. | ||
|
|
||
| Returns: | ||
| TranscriptionProvider instance | ||
|
|
||
| Raises: | ||
| ValueError: If provider name is unknown or required API key is missing | ||
| """ | ||
| if name is None: | ||
| name = os.getenv("TRANSCRIPTION_PROVIDER", "deepgram").strip().lower() | ||
|
|
||
| if name == "deepgram": | ||
| api_key = os.getenv("DEEPGRAM_API_KEY", "").strip() | ||
| if not api_key: | ||
| raise ValueError("DEEPGRAM_API_KEY is required for Deepgram provider") | ||
| return DeepgramProvider(api_key=api_key) | ||
| elif name == "voxtral": | ||
| api_key = os.getenv("MISTRAL_API_KEY", "").strip() |
There was a problem hiding this comment.
get_provider() normalizes the env var value but doesn’t normalize an explicitly provided name, so values like "Deepgram"/" Voxtral " will be treated as unknown. Consider applying .strip().lower() to name before the provider switch for consistent behavior.
|
|
||
| # VoxTral parameters | ||
| data = { | ||
| "model": params.get("model", "voxtral-mini-latest"), |
There was a problem hiding this comment.
model is set via params.get("model", "voxtral-mini-latest"); if the request passes model= (empty string), this will send an empty model to the API instead of defaulting. Prefer params.get("model") or "voxtral-mini-latest" (and similarly for other optional string params if applicable).
| "model": params.get("model", "voxtral-mini-latest"), | |
| "model": params.get("model") or "voxtral-mini-latest", |
|
|
||
| response.raise_for_status() | ||
|
|
||
| result = response.json() |
There was a problem hiding this comment.
response.json() can raise a JSON decode ValueError if the provider returns non-JSON (HTML error page, proxy, etc.). Consider wrapping JSON parsing in a try/except and raising a clearer ValueError (including a short body preview) so api.py returns a more actionable error message.
| result = response.json() | |
| try: | |
| result = response.json() | |
| except ValueError as e: | |
| body_preview = response.text[:500] if response.text else "" | |
| raise ValueError( | |
| f"Failed to decode Deepgram JSON response " | |
| f"(status {response.status_code}, " | |
| f"content-type {response.headers.get('Content-Type')}): {e}; " | |
| f"body preview: {body_preview!r}" | |
| ) |
| response.raise_for_status() | ||
|
|
||
| result = response.json() | ||
|
|
||
| # Parse VoxTral response | ||
| # Response format: { "text": "...", "language": "...", "segments": [...], "model": "..." } | ||
| raw_transcription = result.get("text", "").strip() |
There was a problem hiding this comment.
response.json() can raise a JSON decode ValueError if VoxTral returns non-JSON. Consider catching JSON parse errors here and raising a clearer ValueError (possibly with a short response body preview) so callers get a consistent, actionable parse-failure message.
VoXtral transcripts were missing speaker labels because diarization was opt-in, requiring explicit diarize=true parameter. This made the behavior inconsistent with user expectations and different from Deepgram's multichannel approach. Change diarization from opt-in to opt-out. Now enabled by default unless explicitly disabled with diarize=false. This ensures speaker information is always included in transcripts, matching the common use case for call transcription systems. The API response includes speaker fields in segments only when diarization is enabled. Without it, only plain text is returned with no way to distinguish between speakers. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Add test case to ensure diarization is enabled by default when no parameters are provided. This prevents regression of the speaker label functionality. The test verifies both that the API request includes diarize=True and that the formatted output contains speaker labels, covering the full code path from parameter handling to output formatting. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
VoXtral API returns speaker information in the 'speaker_id' field (e.g., 'speaker_1', 'speaker_2'), not 'speaker'. The code was checking for the wrong field name, causing speaker labels to be omitted from transcripts even when diarization was enabled. Update the speaker detection logic to check for both 'speaker_id' (real API format) and 'speaker' (backward compatibility). Change the output format to display the speaker_id value directly instead of reformatting it, matching the API's naming convention. Discovered during testing with real Mistral API key and audio file. The API response clearly shows speaker_id field in segments when diarization is enabled. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
|
There are some issues with the diarization, the speaker seems not to be labeled correctly. |
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 10 out of 10 changed files in this pull request and generated 4 comments.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| lines.append(f"\n{speaker}: {text}") | ||
| last_speaker = speaker | ||
| else: | ||
| # Continue current speaker's text | ||
| if lines: | ||
| lines.append(text) | ||
| else: |
There was a problem hiding this comment.
_format_diarized_transcript adds a leading \n to each new speaker line and then joins with \n, which creates extra blank lines in the output. Also, when the speaker doesn’t change, the code appends text as a new line rather than continuing the current speaker’s line, which doesn’t match the comment (“Continue current speaker’s text”). Consider removing the leading newline and concatenating/merging same-speaker segments onto the current line instead of adding a new line.
| lines.append(f"\n{speaker}: {text}") | |
| last_speaker = speaker | |
| else: | |
| # Continue current speaker's text | |
| if lines: | |
| lines.append(text) | |
| else: | |
| lines.append(f"{speaker}: {text}") | |
| last_speaker = speaker | |
| else: | |
| # Continue current speaker's text on the same line | |
| if lines: | |
| # Merge with the last line instead of adding a new one | |
| lines[-1] = f"{lines[-1]} {text}" | |
| else: | |
| # No existing line yet (e.g., missing speaker info) |
| uniqueid = (input_params.get("uniqueid") or "").strip() | ||
| channel0_name = (input_params.get("channel0_name") or "").strip() | ||
| channel1_name = (input_params.get("channel1_name") or "").strip() | ||
| provider_name = (input_params.get("provider") or "").strip().lower() or None |
There was a problem hiding this comment.
The provider request parameter introduces new behavior (per-request provider override), but there’s no API-level test exercising the provider=voxtral path. Adding a test that sets provider=voxtral and mocks transcription.voxtral.httpx.AsyncClient would validate provider selection + response handling end-to-end for the endpoint.
| - `channel0_name`, `channel1_name`: rename diarization labels in the returned transcript (replaces `Channel 0:` / `Channel 1:` or `Speaker 0:` / `Speaker 1:`) | ||
|
|
||
| Deepgram parameters: | ||
| - Most Deepgram `/v1/listen` parameters may be provided as query/form fields and are passed through to Deepgram. | ||
| Provider-specific parameters: | ||
| - **Deepgram**: Most Deepgram `/v1/listen` parameters may be provided as query/form fields (e.g., `model`, `language`, `diarize`, `punctuate`) | ||
| - **VoxTral**: Supports `model` (default: `voxtral-mini-latest`), `language`, `diarize`, `temperature`, `context_bias`, `timestamp_granularities` |
There was a problem hiding this comment.
Docs say channel0_name/channel1_name renames diarization labels by replacing Channel 0/1: or Speaker 0/1:. VoxTral currently formats diarization labels as speaker_1:/speaker_2: (lowercase + underscore), so these overrides won’t apply for VoxTral unless labels are normalized or the replacement logic is expanded. Please clarify provider-specific behavior here (and/or adjust the code) so users aren’t misled.
| COPY transcription /tmp/transcription | ||
|
|
||
| # Install dependencies | ||
| RUN pip install --no-cache-dir --no-warn-script-location --user -r /tmp/requirements.txt |
There was a problem hiding this comment.
Container builds install only requirements.txt, but the new transcription providers (and api.py) import/use httpx directly. httpx is currently only listed in requirements-dev.txt, so the runtime image may rely on a transitive dependency (or fail if that transitive dep changes). Add httpx to requirements.txt to make the container/runtime dependency explicit.
This pull request introduces support for multiple transcription providers (Deepgram and VoxTral) in the
/api/get_transcriptionendpoint, refactors the codebase to use a provider abstraction, and updates documentation and containerization to reflect these changes. The most important changes are grouped below.Transcription Provider Abstraction & API Changes:
api.pyto use a provider abstraction (get_providerfromtranscription/), enabling selection between Deepgram and VoxTral via theTRANSCRIPTION_PROVIDERenvironment variable or a per-requestproviderparameter. The endpoint is now provider-agnostic and handles errors/messages generically. [1] [2] [3] [4]Containerization & Environment Variables:
Containerfileto copy thetranscription/directory into the container and added new environment variables forTRANSCRIPTION_PROVIDERandMISTRAL_API_KEY(for VoxTral support). [1] [2] [3]Documentation Updates:
README.mdto document the new provider abstraction, configuration options for both Deepgram and VoxTral, and details on using theproviderparameter in API requests. Examplecurlcommands and parameter documentation were updated accordingly. [1] [2] [3].github/copilot-instructions.mddescribing the architecture, data flows, conventions, and testing practices for the project.Testing Improvements:
These changes collectively make the transcription service more flexible, maintainable, and ready for multi-provider support.