This project is a Q&A chatbot designed to answer questions related to the Global Academy of Technology (GAT) using a combination of large language models (LLMs), Text-Embeddings, Retrieval-Augmented Generation (RAG), and Prompt Engineering Techniques. The chatbot can process both text and audio inputs, providing relevant answers based on the conversation history and preloaded documents.
- Text and Audio Input: Accepts user queries via text input or voice recording.
- Retrieval-Augmented Generation (RAG): Enhances responses using relevant information retrieved from preloaded documents.
- Context-Aware Responses: Utilizes conversation history to provide coherent and contextually appropriate answers.
- Streamlit Interface: User-friendly interface built with Streamlit, featuring options for follow-up mode and search depth customization.
-
Clone the Repository
git clone https://github.com/mahadev0811/CollegeChatbot.git cd CollegeChatbot
-
Create a Virtual Environment
python -m venv venv source venv/bin/activate # On Windows use `venv\Scripts\activate`
-
Install Dependencies
pip install -r requirements.txt
-
Configure API Key
- Create a
config.json
file in the root directory with your Google API key:
{ "google_api_key": "YOUR_GOOGLE_API_KEY" }
- Create a
-
Run the Application
streamlit run st_app.py
-
Interact with the Chatbot
- Use the text input box or the audio recorder to submit your queries.
- Adjust settings using the sidebar options:
- Follow-up Questions Mode: Toggle to use conversation history for responses.
- Search Depth: Adjust the number of paragraphs to search in the documents for relevant information.
- st_app.py: Main application script.
- embedding_generator.py: Script for generating embeddings from the data file.
- webscrapper.ipynb: Jupyter notebook for scraping text data from URLs to generate the raw data file.
- config.json: Configuration file for API keys.
- requirements.txt: List of required Python packages.
- data_generation/gat_raw.txt: Raw scraped data containing information about GAT.
- data_generation/gat_refined.txt: Human-supervised and edited version of the raw data.
- gat_embeddings.pkl: Precomputed embeddings for the preloaded document.
To generate the initial raw data file (gat_raw.txt
), use the webscrapper.ipynb
notebook. This notebook scrapes text content from given URLs and formats it appropriately.
To generate embeddings from your data file, use the embedding_generator.py script. This script reads a text file containing the data, generates embeddings using the FlagEmbedding model, and saves the embeddings as a pickle file.
- Ensure your data file (e.g., data_generation/gat_refined.txt) is in the correct format, with paragraphs separated by double newlines (\n\n).
-
Run the embedding_generator.py script with the path to your data file as an argument:
python embedding_generator.py --data_file data_generation/gat_refined.txt
-
The script will generate embeddings for the paragraphs in the data file and save them as a pickle file (
gat_embeddings.pkl
).
- This video shows the chatbot in action, answering questions about GAT:
recordings.mp4
- FlagEmbedding: Custom embedding model used for encoding queries.
- Streamlit: Open-source app framework for ML and data science projects.
- Hugging Face Transformers: Library for state-of-the-art NLP models.
- Google Cloud Speech-to-Text API: Service for converting speech into text.
- Google Generative AI: Used for generating responses.
This project is licensed under the MIT License - see the LICENSE file for details.