Skip to content

Multi-Provider Transcription Support (REST/Batch Path)#31

Open
Amygos wants to merge 11 commits intonethesis:mainfrom
Amygos:main
Open

Multi-Provider Transcription Support (REST/Batch Path)#31
Amygos wants to merge 11 commits intonethesis:mainfrom
Amygos:main

Conversation

@Amygos
Copy link
Member

@Amygos Amygos commented Feb 16, 2026

This pull request introduces support for multiple transcription providers (Deepgram and VoxTral) in the /api/get_transcription endpoint, refactors the codebase to use a provider abstraction, and updates documentation and containerization to reflect these changes. The most important changes are grouped below.

Transcription Provider Abstraction & API Changes:

  • Refactored api.py to use a provider abstraction (get_provider from transcription/), enabling selection between Deepgram and VoxTral via the TRANSCRIPTION_PROVIDER environment variable or a per-request provider parameter. The endpoint is now provider-agnostic and handles errors/messages generically. [1] [2] [3] [4]
  • Added provider-specific post-processing for channel/speaker label renaming and improved error handling for provider failures and response parsing.

Containerization & Environment Variables:

  • Updated Containerfile to copy the transcription/ directory into the container and added new environment variables for TRANSCRIPTION_PROVIDER and MISTRAL_API_KEY (for VoxTral support). [1] [2] [3]

Documentation Updates:

  • Expanded README.md to document the new provider abstraction, configuration options for both Deepgram and VoxTral, and details on using the provider parameter in API requests. Example curl commands and parameter documentation were updated accordingly. [1] [2] [3]
  • Added a comprehensive .github/copilot-instructions.md describing the architecture, data flows, conventions, and testing practices for the project.

Testing Improvements:

  • Updated tests to patch the new provider abstraction and ensure correct mocking of provider responses, reflecting the new modular design. [1] [2]

These changes collectively make the transcription service more flexible, maintainable, and ready for multi-provider support.

Add provider abstraction to support multiple transcription backends.
Introduce TranscriptionProvider base class and TranscriptionResult
dataclass in new transcription package. This enables flexible provider
selection and cleaner separation of concerns while maintaining
backward compatibility with existing Deepgram integration.

The abstraction supports:
- Environment-based provider selection (TRANSCRIPTION_PROVIDER)
- Per-request provider override via API parameters
- Consistent interface across different backends
Move Deepgram-specific transcription logic from api.py into
dedicated DeepgramProvider class. Preserve all existing functionality
including parameter mapping, response parsing, and timeout handling.
This maintains full backward compatibility while enabling other
providers to follow the same pattern.
Replace hardcoded Deepgram logic with generic provider interface.
Update get_transcription endpoint to support provider selection via
environment variable or per-request parameter. Maintain all existing
error handling, persistence, and enrichment workflows while enabling
seamless provider switching.
Add 12 new unit tests covering provider factory, Deepgram provider,
and VoxTral provider implementations. Tests verify correct API calls,
response parsing, parameter handling, and error scenarios. Update
existing api.py tests to work with new provider abstraction.
Achieve 100% coverage for transcription module.
Update configuration documentation to explain provider selection via
TRANSCRIPTION_PROVIDER environment variable. Add examples for using
VoxTral provider and per-request provider override. Document required
API keys for each provider and supported parameters.
Update copilot-instructions.md to reflect new transcription package
and multi-provider support in REST/batch path. Note that real-time
streaming path remains Deepgram-only for now. Include details about
provider-agnostic post-processing like channel name replacement.
Add MISTRAL_API_KEY environment variable to Containerfile with empty
default. Add TRANSCRIPTION_PROVIDER env var defaulting to deepgram.
Maintain backward compatibility while enabling VoxTral usage in
containerized deployments.
Ensure the transcription/ package is copied to the container image in both
the builder and final stages of the multi-stage build. This allows the
container to properly import the transcription provider abstraction when
the application starts. Previously, the package was missing from the
container, causing import errors when running the container.

- Add COPY transcription /tmp/transcription to builder stage
- Add COPY --from=builder /tmp/transcription /app/transcription to final stage
- Maintains multi-stage build optimization for minimal image size
Copy link

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Adds a transcription provider abstraction to the REST/batch /api/get_transcription endpoint so callers can use either Deepgram or VoxTral (Mistral) via env var default or per-request override, and updates tests/docs/container packaging accordingly.

Changes:

  • Introduces transcription/ provider interface + factory (get_provider) with Deepgram and VoxTral REST implementations.
  • Refactors api.py to route transcription through the provider abstraction and apply provider-agnostic post-processing/error handling.
  • Updates tests, README, and container build to reflect multi-provider support and new configuration.

Reviewed changes

Copilot reviewed 10 out of 10 changed files in this pull request and generated 6 comments.

Show a summary per file
File Description
api.py Refactors /api/get_transcription to select a provider via get_provider() and handle provider-agnostic processing/errors.
transcription/base.py Defines the provider interface and TranscriptionResult DTO.
transcription/__init__.py Adds the provider factory (get_provider) and exports the abstraction.
transcription/deepgram.py Implements Deepgram REST transcription provider.
transcription/voxtral.py Implements VoxTral (Mistral) REST transcription provider + diarization formatting.
tests/test_api.py Updates API endpoint tests to patch the provider abstraction and Deepgram REST calls.
tests/test_transcription.py Adds unit tests for provider factory and both provider implementations.
README.md Documents provider selection (env + per-request) and provider-specific parameters.
Containerfile Copies transcription/ into the image and adds provider-related env vars.
.github/copilot-instructions.md Documents architecture, conventions, and testing practices for the project.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines +200 to +204
raw_transcription = raw_transcription.replace("Channel 0:", f"{channel0_name}:")
raw_transcription = raw_transcription.replace("Speaker 0:", f"{channel0_name}:")
if channel1_name:
raw_transcription = raw_transcription.replace("Channel 1:", f"{channel1_name}:")
raw_transcription = raw_transcription.replace("Speaker 1:", f"{channel1_name}:")
Copy link

Copilot AI Feb 16, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Channel/speaker renaming won’t catch Deepgram’s uppercase labels (e.g., "SPEAKER 0:" / "SPEAKER 1:"), so channel0_name / channel1_name won’t apply for the transcript format used in tests and likely production. Consider handling both "Speaker" and "SPEAKER" (or doing a case-insensitive replace / regex anchored on the label).

Suggested change
raw_transcription = raw_transcription.replace("Channel 0:", f"{channel0_name}:")
raw_transcription = raw_transcription.replace("Speaker 0:", f"{channel0_name}:")
if channel1_name:
raw_transcription = raw_transcription.replace("Channel 1:", f"{channel1_name}:")
raw_transcription = raw_transcription.replace("Speaker 1:", f"{channel1_name}:")
raw_transcription = raw_transcription.replace("Channel 0:", f"{channel0_name}:")
raw_transcription = raw_transcription.replace("CHANNEL 0:", f"{channel0_name}:")
raw_transcription = raw_transcription.replace("Speaker 0:", f"{channel0_name}:")
raw_transcription = raw_transcription.replace("SPEAKER 0:", f"{channel0_name}:")
if channel1_name:
raw_transcription = raw_transcription.replace("Channel 1:", f"{channel1_name}:")
raw_transcription = raw_transcription.replace("CHANNEL 1:", f"{channel1_name}:")
raw_transcription = raw_transcription.replace("Speaker 1:", f"{channel1_name}:")
raw_transcription = raw_transcription.replace("SPEAKER 1:", f"{channel1_name}:")

Copilot uses AI. Check for mistakes.
except Exception:
logger.exception("Failed to update transcript state=failed")
raise HTTPException(status_code=500, detail="Failed to parse transcription response.")
raise HTTPException(status_code=500, detail=f"An error occurred: {str(e)}")
Copy link

Copilot AI Feb 16, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The generic exception handler returns the raw exception message to the client (detail=f"An error occurred: {str(e)}"), which can leak internal details. Prefer a generic client-facing message (and keep the full exception only in logs), similar to the earlier behavior.

Suggested change
raise HTTPException(status_code=500, detail=f"An error occurred: {str(e)}")
raise HTTPException(status_code=500, detail="Unexpected error while processing transcription")

Copilot uses AI. Check for mistakes.
Comment on lines +14 to +36
def get_provider(name: str | None = None) -> TranscriptionProvider:
"""
Get a transcription provider instance.

Args:
name: Provider name ("deepgram" or "voxtral"). If None, uses TRANSCRIPTION_PROVIDER env var.

Returns:
TranscriptionProvider instance

Raises:
ValueError: If provider name is unknown or required API key is missing
"""
if name is None:
name = os.getenv("TRANSCRIPTION_PROVIDER", "deepgram").strip().lower()

if name == "deepgram":
api_key = os.getenv("DEEPGRAM_API_KEY", "").strip()
if not api_key:
raise ValueError("DEEPGRAM_API_KEY is required for Deepgram provider")
return DeepgramProvider(api_key=api_key)
elif name == "voxtral":
api_key = os.getenv("MISTRAL_API_KEY", "").strip()
Copy link

Copilot AI Feb 16, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

get_provider() normalizes the env var value but doesn’t normalize an explicitly provided name, so values like "Deepgram"/" Voxtral " will be treated as unknown. Consider applying .strip().lower() to name before the provider switch for consistent behavior.

Copilot uses AI. Check for mistakes.

# VoxTral parameters
data = {
"model": params.get("model", "voxtral-mini-latest"),
Copy link

Copilot AI Feb 16, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

model is set via params.get("model", "voxtral-mini-latest"); if the request passes model= (empty string), this will send an empty model to the API instead of defaulting. Prefer params.get("model") or "voxtral-mini-latest" (and similarly for other optional string params if applicable).

Suggested change
"model": params.get("model", "voxtral-mini-latest"),
"model": params.get("model") or "voxtral-mini-latest",

Copilot uses AI. Check for mistakes.

response.raise_for_status()

result = response.json()
Copy link

Copilot AI Feb 16, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

response.json() can raise a JSON decode ValueError if the provider returns non-JSON (HTML error page, proxy, etc.). Consider wrapping JSON parsing in a try/except and raising a clearer ValueError (including a short body preview) so api.py returns a more actionable error message.

Suggested change
result = response.json()
try:
result = response.json()
except ValueError as e:
body_preview = response.text[:500] if response.text else ""
raise ValueError(
f"Failed to decode Deepgram JSON response "
f"(status {response.status_code}, "
f"content-type {response.headers.get('Content-Type')}): {e}; "
f"body preview: {body_preview!r}"
)

Copilot uses AI. Check for mistakes.
Comment on lines +104 to +110
response.raise_for_status()

result = response.json()

# Parse VoxTral response
# Response format: { "text": "...", "language": "...", "segments": [...], "model": "..." }
raw_transcription = result.get("text", "").strip()
Copy link

Copilot AI Feb 16, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

response.json() can raise a JSON decode ValueError if VoxTral returns non-JSON. Consider catching JSON parse errors here and raising a clearer ValueError (possibly with a short response body preview) so callers get a consistent, actionable parse-failure message.

Copilot uses AI. Check for mistakes.
Amygos and others added 3 commits February 16, 2026 23:21
VoXtral transcripts were missing speaker labels because diarization
was opt-in, requiring explicit diarize=true parameter. This made the
behavior inconsistent with user expectations and different from
Deepgram's multichannel approach.

Change diarization from opt-in to opt-out. Now enabled by default
unless explicitly disabled with diarize=false. This ensures speaker
information is always included in transcripts, matching the common
use case for call transcription systems.

The API response includes speaker fields in segments only when
diarization is enabled. Without it, only plain text is returned
with no way to distinguish between speakers.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Add test case to ensure diarization is enabled by default when no
parameters are provided. This prevents regression of the speaker
label functionality.

The test verifies both that the API request includes diarize=True
and that the formatted output contains speaker labels, covering the
full code path from parameter handling to output formatting.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
VoXtral API returns speaker information in the 'speaker_id' field
(e.g., 'speaker_1', 'speaker_2'), not 'speaker'. The code was
checking for the wrong field name, causing speaker labels to be
omitted from transcripts even when diarization was enabled.

Update the speaker detection logic to check for both 'speaker_id'
(real API format) and 'speaker' (backward compatibility). Change
the output format to display the speaker_id value directly instead
of reformatting it, matching the API's naming convention.

Discovered during testing with real Mistral API key and audio file.
The API response clearly shows speaker_id field in segments when
diarization is enabled.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@Amygos
Copy link
Member Author

Amygos commented Feb 17, 2026

There are some issues with the diarization, the speaker seems not to be labeled correctly.

Copy link

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 10 out of 10 changed files in this pull request and generated 4 comments.


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines +146 to +152
lines.append(f"\n{speaker}: {text}")
last_speaker = speaker
else:
# Continue current speaker's text
if lines:
lines.append(text)
else:
Copy link

Copilot AI Feb 17, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

_format_diarized_transcript adds a leading \n to each new speaker line and then joins with \n, which creates extra blank lines in the output. Also, when the speaker doesn’t change, the code appends text as a new line rather than continuing the current speaker’s line, which doesn’t match the comment (“Continue current speaker’s text”). Consider removing the leading newline and concatenating/merging same-speaker segments onto the current line instead of adding a new line.

Suggested change
lines.append(f"\n{speaker}: {text}")
last_speaker = speaker
else:
# Continue current speaker's text
if lines:
lines.append(text)
else:
lines.append(f"{speaker}: {text}")
last_speaker = speaker
else:
# Continue current speaker's text on the same line
if lines:
# Merge with the last line instead of adding a new one
lines[-1] = f"{lines[-1]} {text}"
else:
# No existing line yet (e.g., missing speaker info)

Copilot uses AI. Check for mistakes.
uniqueid = (input_params.get("uniqueid") or "").strip()
channel0_name = (input_params.get("channel0_name") or "").strip()
channel1_name = (input_params.get("channel1_name") or "").strip()
provider_name = (input_params.get("provider") or "").strip().lower() or None
Copy link

Copilot AI Feb 17, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The provider request parameter introduces new behavior (per-request provider override), but there’s no API-level test exercising the provider=voxtral path. Adding a test that sets provider=voxtral and mocks transcription.voxtral.httpx.AsyncClient would validate provider selection + response handling end-to-end for the endpoint.

Copilot uses AI. Check for mistakes.
Comment on lines +147 to +151
- `channel0_name`, `channel1_name`: rename diarization labels in the returned transcript (replaces `Channel 0:` / `Channel 1:` or `Speaker 0:` / `Speaker 1:`)

Deepgram parameters:
- Most Deepgram `/v1/listen` parameters may be provided as query/form fields and are passed through to Deepgram.
Provider-specific parameters:
- **Deepgram**: Most Deepgram `/v1/listen` parameters may be provided as query/form fields (e.g., `model`, `language`, `diarize`, `punctuate`)
- **VoxTral**: Supports `model` (default: `voxtral-mini-latest`), `language`, `diarize`, `temperature`, `context_bias`, `timestamp_granularities`
Copy link

Copilot AI Feb 17, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Docs say channel0_name/channel1_name renames diarization labels by replacing Channel 0/1: or Speaker 0/1:. VoxTral currently formats diarization labels as speaker_1:/speaker_2: (lowercase + underscore), so these overrides won’t apply for VoxTral unless labels are normalized or the replacement logic is expanded. Please clarify provider-specific behavior here (and/or adjust the code) so users aren’t misled.

Copilot uses AI. Check for mistakes.
Comment on lines +18 to 21
COPY transcription /tmp/transcription

# Install dependencies
RUN pip install --no-cache-dir --no-warn-script-location --user -r /tmp/requirements.txt
Copy link

Copilot AI Feb 17, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Container builds install only requirements.txt, but the new transcription providers (and api.py) import/use httpx directly. httpx is currently only listed in requirements-dev.txt, so the runtime image may rely on a transitive dependency (or fail if that transitive dep changes). Add httpx to requirements.txt to make the container/runtime dependency explicit.

Copilot uses AI. Check for mistakes.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant

Comments