Added agent to get video transcripts. #72

Open
wants to merge 15 commits into
base: main
Changes from 6 commits
86 changes: 86 additions & 0 deletions backend/director/agents/transcription.py
@@ -0,0 +1,86 @@
import logging
from director.agents.base import BaseAgent, AgentResponse, AgentStatus
from director.core.session import TextContent, MsgStatus
from director.tools.videodb_tool import VideoDBTool

logger = logging.getLogger(__name__)

class TranscriptionAgent(BaseAgent):
    def __init__(self, session=None, **kwargs):
        self.agent_name = "video_transcription"
        self.description = (
            "This is an agent to get transcripts of videos"
        )
        self.parameters = self.get_parameters()
        super().__init__(session=session, **kwargs)

    def run(self, collection_id: str, video_id: str, timestamp_mode: bool = False, time_range: int = 2) -> AgentResponse:
        """
        Transcribe a video and optionally format it with timestamps.

        :param str collection_id: The collection_id where given video_id is available.
        :param str video_id: The id of the video for which the transcription is required.
        :param bool timestamp_mode: Whether to include timestamps in the transcript.
        :param int time_range: Time range for grouping transcripts in minutes (default: 2 minutes).
        :return: AgentResponse with the transcription result.
        :rtype: AgentResponse
        """
        self.output_message.actions.append("Trying to get the video transcription...")
        output_text_content = TextContent(
            agent_name=self.agent_name,
            status_message="Processing the transcription...",
Collaborator

Could you please keep the ellipses (..) to a max of two in all status messages and actions?

        )
        self.output_message.content.append(output_text_content)
        self.output_message.push_update()

        videodb_tool = VideoDBTool(collection_id=collection_id)

        try:
            transcript_text = videodb_tool.get_transcript(video_id)
        except Exception:
            logger.error("Transcript not found. Indexing spoken words...")
            self.output_message.actions.append("Indexing spoken words...")
            self.output_message.push_update()
            videodb_tool.index_spoken_words(video_id)
            transcript_text = videodb_tool.get_transcript(video_id)
ankit-v2-3 marked this conversation as resolved.

        if timestamp_mode:
            self.output_message.actions.append("Formatting transcript with timestamps...")
            grouped_transcript = self._group_transcript_with_timestamps(
                transcript_text, time_range
            )
            output_text = grouped_transcript
        else:
            output_text = transcript_text

        output_text_content.text = output_text
        output_text_content.status = MsgStatus.success
        output_text_content.status_message = "Transcription completed successfully."
Contributor

A message like "Here is your transcription" would be better, since we are using that as the title of the transcription.

Author

Makes sense. I’ve made the change in the latest commit.

        self.output_message.publish()

        return AgentResponse(
            status=AgentStatus.SUCCESS,
            message="Transcription successful.",
            data={"video_id": video_id, "transcript": output_text},
        )

    def _group_transcript_with_timestamps(self, transcript_text: str, time_range: int) -> str:
Contributor

The grouping logic is not correct.

If you test it, you will get only one block, like this:
[image]

Reason: there are no newlines in the transcription text, and even if there were, treating each line as representing the given range (2 minutes by default) would be wrong.

The correct way would be to use the transcription dictionary that the VideoDB tool returns; it has timing information, unlike the transcription text being used here.
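
For illustration, the suggestion above could be sketched roughly as follows. This is only a sketch under stated assumptions: it assumes the timed transcript is a list of dictionaries with `start` (seconds) and `text` keys, which this thread does not confirm, and `group_segments` and `format_mmss` are hypothetical helper names rather than part of the PR or the VideoDB tool.

```python
from typing import Dict, List, Union

# Assumed segment shape: {"start": seconds, "end": seconds, "text": "..."}.
Segment = Dict[str, Union[float, str]]


def format_mmss(seconds: int) -> str:
    """Render a number of seconds as MM:SS."""
    minutes, secs = divmod(seconds, 60)
    return f"{minutes:02d}:{secs:02d}"


def group_segments(segments: List[Segment], time_range: int = 2) -> str:
    """Group timed segments into windows of `time_range` minutes."""
    window = time_range * 60  # window length in seconds
    buckets: Dict[int, List[str]] = {}
    for seg in segments:
        text = str(seg["text"]).strip()
        if not text:
            continue  # skip empty segments
        # The window index is derived from the segment's start time,
        # not from its position in the text.
        index = int(float(seg["start"]) // window)
        buckets.setdefault(index, []).append(text)

    lines = []
    for index in sorted(buckets):
        start = index * window
        label = f"[{format_mmss(start)} - {format_mmss(start + window)}]"
        joined = " ".join(buckets[index])
        lines.append(f"{label} {joined}")
    return "\n".join(lines)


sample = [
    {"start": 0.0, "end": 1.5, "text": "Hello"},
    {"start": 61.0, "end": 62.0, "text": "world"},
    {"start": 125.0, "end": 126.0, "text": "again"},
]
print(group_segments(sample, time_range=2))
```

With the sample above, the first two segments fall into the `[00:00 - 02:00]` window and the third into `[02:00 - 04:00]`, because grouping is driven by each segment's start time rather than by line position.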

Author

@ashish-spext I’m having a hard time understanding the structure of the transcription dictionary returned by the VideoDB tool.

I think calling get_transcript() with text=False will return the transcript details.

However, I can’t see the effect of my changes or test the app because of API key limitations:
[image]

Could you please provide more information about how timing information is stored in the transcription dictionary?

Author

This issue can be resolved by adding free LLM models. Merging this PR will fix the problem and it would be helpful for solving similar issues in the future.

Collaborator

@sarfarazsiddiquii We have resolved this issue by adding an OpenAI proxy. An OpenAI key is no longer required. Please pull the latest changes from the main branch, ensure that no OpenAI key is present in the .env file, and test the transcript.

Author

@sarfarazsiddiquii, Nov 25, 2024

@ankit-v2-3, thank you for the update; the code is now testable.

I’ve fixed the grouping logic in the latest commit. The agent now groups the transcription text into 2-minute intervals by default, unless a different time interval is specified.

Let me know if any changes are required.

[image]

        """
        Group transcript into specified time ranges with timestamps.

        :param str transcript_text: The raw transcript text.
        :param int time_range: Time range for grouping in minutes.
        :return: Grouped transcript with timestamps.
        :rtype: str
        """
        lines = transcript_text.split("\n")
        grouped_transcript = []
        current_time = 0

        for i, line in enumerate(lines):
            if i % time_range == 0 and line.strip():
                timestamp = f"[{current_time:02d}:00 - {current_time + time_range:02d}:00]"
                grouped_transcript.append(f"{timestamp} {line.strip()}")
                current_time += time_range

        return "\n".join(grouped_transcript)
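
To see why the review flagged this version, the line-based logic can be copied into a standalone function and run on sample strings. This is a sketch for experimentation only; `group_by_line_index` is a hypothetical name, and the sample strings are made up.

```python
def group_by_line_index(transcript_text: str, time_range: int) -> str:
    # Standalone copy of the line-based grouping shown in the diff above.
    lines = transcript_text.split("\n")
    grouped = []
    current_time = 0
    for i, line in enumerate(lines):
        # Only every `time_range`-th line is kept; the rest are skipped.
        if i % time_range == 0 and line.strip():
            timestamp = f"[{current_time:02d}:00 - {current_time + time_range:02d}:00]"
            grouped.append(f"{timestamp} {line.strip()}")
            current_time += time_range
    return "\n".join(grouped)


# Text without newlines collapses into a single block.
single = group_by_line_index("a long transcript with no newlines", 2)
print(single)

# With newlines, lines at odd indexes are silently dropped.
multi = group_by_line_index("first\nsecond\nthird", 2)
print(multi)
```

The first call yields one block labelled `[00:00 - 02:00]` regardless of the video's length, and the second call loses the line `second` entirely, since the line index is used both as a filter and as a stand-in for time.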
2 changes: 2 additions & 0 deletions backend/director/handler.py
@@ -20,6 +20,7 @@
from director.agents.meme_maker import MemeMakerAgent
from director.agents.dubbing import DubbingAgent
from director.agents.composio import ComposioAgent
from director.agents.transcription import TranscriptionAgent


from director.core.session import Session, InputMessage, MsgStatus
@@ -53,6 +54,7 @@ def __init__(self, db, **kwargs):
            SlackAgent,
            MemeMakerAgent,
            DubbingAgent,
            TranscriptionAgent,
            ComposioAgent,
        ]
