Beckn-gemini bot enhancement - adding support for voice / video/ image input #115

Open
10 tasks
emmayank opened this issue Oct 9, 2024 · 0 comments
Labels
enhancement New feature or request

emmayank commented Oct 9, 2024

Description

Enhance the Beckn-Gemini Bot to accept multimodal input (voice, video, and image) in addition to text. Because Gemini is a multimodal model, the bot should let users provide input in whichever format suits them during a conversation, making the experience more flexible and accessible. At any point in the conversation, the bot should detect the input mode and respond accordingly. For example, when asked for a 6-digit OTP, a user may prefer to send a voice message instead of typing it, or a user may make a request by voice, such as "मैं सौर ऊर्जा खरीदना चाहता हूं" (Hindi for "I want to buy solar energy").

Goals

  • Add support for voice input, allowing users to send voice messages instead of text.
  • Enable the bot to process video input (if applicable for specific use cases) and respond appropriately.
  • Integrate image recognition capabilities so the bot can understand and respond to images shared by the user (e.g., an image of a bill or document).
  • Allow users to seamlessly switch between input modes (text, voice, video, image) at any point during the conversation.
  • Implement detection for different input types and ensure the bot responds appropriately, regardless of the mode used.
  • Test across various scenarios, ensuring that the bot can handle and respond to voice, video, and image inputs accurately.
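
As a rough illustration of the input-type detection goal above, the sketch below classifies an incoming message by its MIME type. It assumes the messaging channel exposes a MIME type for each attachment; the function name `detect_input_mode` and the mode labels are hypothetical and not part of the existing bot.

```python
from typing import Optional

# Hypothetical mapping from MIME type prefix to the bot's input mode.
_PREFIX_TO_MODE = {
    "audio/": "voice",
    "video/": "video",
    "image/": "image",
    "text/": "text",
}


def detect_input_mode(mime_type: Optional[str]) -> str:
    """Return 'voice', 'video', 'image', 'text', or 'unsupported'.

    A message with no attachment (mime_type is None or empty) is
    treated as plain text.
    """
    if not mime_type:
        return "text"
    # Drop parameters such as '; codecs=opus' and normalize case.
    base = mime_type.split(";", 1)[0].strip().lower()
    for prefix, mode in _PREFIX_TO_MODE.items():
        if base.startswith(prefix):
            return mode
    return "unsupported"
```

A handler could call this once per incoming message and route to a mode-specific processor, which would let users switch modes mid-conversation without any session-level state about the previous mode.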

Expected Outcome

  • The bot will support multimodal inputs such as voice, video, and images, providing a flexible and accessible user experience.
  • Users can switch input modes (e.g., voice to text, image to voice) without breaking the flow of conversation.
  • The bot detects the input mode and responds appropriately to voice messages, video, or images.

Acceptance Criteria

  • The bot successfully detects and processes voice, video, and image inputs during the conversation.
  • Users can switch between input modes (voice, text, video, image) seamlessly during the conversation without any disruptions.
  • Voice input is correctly recognized and converted to actionable information (e.g., detecting a spoken OTP, or handling a voice request like "मैं सौर ऊर्जा खरीदना चाहता हूं", Hindi for "I want to buy solar energy").
  • The functionality is tested across multiple scenarios and input types to ensure reliability and accuracy.
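
To make the OTP criterion above more concrete, here is a minimal sketch of pulling a 6-digit OTP out of a voice transcript. It assumes the speech-to-text step has already produced a transcript string, and it only handles digits and English digit words; the name `extract_otp` is hypothetical, and other languages would need their own word lists.

```python
import re
from typing import Optional

# English digit words only; this is an assumption for the sketch.
_DIGIT_WORDS = {
    "zero": "0", "one": "1", "two": "2", "three": "3", "four": "4",
    "five": "5", "six": "6", "seven": "7", "eight": "8", "nine": "9",
}
_WORD_RE = re.compile(r"\b(" + "|".join(_DIGIT_WORDS) + r")\b", re.IGNORECASE)
# A run of digits, optionally separated by single spaces or hyphens.
_DIGIT_RUN_RE = re.compile(r"\d(?:[\s-]?\d)*")


def extract_otp(transcript: str, length: int = 6) -> Optional[str]:
    """Return the first digit run of exactly `length` digits, or None."""
    # Replace spoken digit words ("four") with digits ("4").
    normalized = _WORD_RE.sub(
        lambda m: _DIGIT_WORDS[m.group(0).lower()], transcript
    )
    for match in _DIGIT_RUN_RE.finditer(normalized):
        digits = re.sub(r"\D", "", match.group(0))
        if len(digits) == length:
            return digits
    return None
```

Requiring an exact length means that shorter or longer numbers in the transcript are ignored rather than truncated, so the bot can ask the user to repeat the OTP instead of acting on a partial match.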

Mockups / Wireframes

NA

Product Name

Beckn-Gemini Bot

Domain

Multimodal AI / Conversational AI

Tech Skills Needed

  • Voice and Speech Recognition (NLP)
  • Image/Video Processing (Computer Vision)
  • Multimodal Input Integration
  • Chatbot Development

Complexity

High

Category

Bot Enhancement

Sub Category

Multimodal Input Support

Project View

Beckn-Gemini Bot

Project Name

Beckn-Gemini Bot Multimodal Enhancement

emmayank added the "enhancement" (New feature or request) label on Oct 9, 2024
emmayank changed the title from "Beckn-gemini bot enhahcement - adding support for voice / video/ image input" to "Beckn-gemini bot enhancement - adding support for voice / video/ image input" on Oct 9, 2024