🤖 Codebase RAG - Chat with Your Code Using AI

An AI-powered assistant that lets developers interact with their codebase using natural language.

Features • Demo • Tech Stack • Quick Start • Architecture


🌟 Overview

Codebase RAG is a production-ready Retrieval-Augmented Generation (RAG) system that enables developers to:

  • 💬 Chat with their codebase using natural language
  • 🔍 Semantically search across thousands of code files
  • 🤖 Get AI-powered explanations of complex code
  • 📊 Visualize codebase insights with interactive dashboards
  • ⚡ Lightning-fast queries with 11ms average response time

Built with modern ML techniques including vector embeddings, semantic search, and Google's Gemini 2.5 Flash LLM.


✨ Features

🎯 Core Capabilities

  • Natural Language Queries: Ask questions in plain English about your codebase
  • Semantic Code Search: Find relevant code using meaning, not just keywords
  • AI-Powered Explanations: Get detailed explanations of how code works
  • Multi-Language Support: Python, JavaScript, Java, C++, Go, and more
  • Real-time Indexing: Automatically updates as your codebase changes

🚀 Performance

  • 4,364+ code chunks indexed with FAISS vector database
  • 11ms average query response time
  • 45% test coverage with 21/21 tests passing
  • Production-ready with comprehensive error handling

🎨 User Interface

  • Modern, responsive design with smooth animations
  • Interactive dashboard with real-time metrics
  • Code syntax highlighting for better readability
  • Query history to track your interactions

🎬 Demo

Chat Interface

User: "How does Flask routing work in this codebase?"

AI: "In this codebase, Flask routing is implemented using the @app.route() 
decorator to map URL paths to Python functions. The routing system handles 
incoming HTTP requests by matching the URL pattern and executing the 
corresponding view function..."

Key Features in Action

  • 💬 Natural conversations about code functionality
  • 📂 Ingest repositories with one command
  • 💡 Explain code snippets interactively
  • 📊 View analytics on indexed codebase

๐Ÿ› ๏ธ Tech Stack

Backend

  • FastAPI - Modern Python web framework
  • LangChain - LLM application framework
  • FAISS - Facebook AI Similarity Search (vector database)
  • Google Gemini 2.5 Flash - State-of-the-art LLM
  • Tree-sitter - Code parsing and AST generation
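
As a rough illustration of the retrieval layer, the sketch below builds a small FAISS index over 384-dimensional vectors (the embedding dimension reported under Performance Metrics) and runs a nearest-neighbour search. Random vectors stand in for real code embeddings, and the names are illustrative only; the repository's own indexing code lives in backend/retrieval/.

# Minimal FAISS sketch: index 384-dim vectors and look up the nearest chunks.
# Random vectors stand in for real code embeddings; this is not the project's
# indexer (see backend/retrieval/), only the FAISS calls involved.
import faiss
import numpy as np

dim = 384                                   # embedding dimension used by the project
chunks = ["def add(a, b): ...", "class UserRepo: ...", "def fibonacci(n): ..."]

vectors = np.random.rand(len(chunks), dim).astype("float32")   # stand-in embeddings

index = faiss.IndexFlatL2(dim)              # exact L2 search, fine for small corpora
index.add(vectors)

query_vector = np.random.rand(1, dim).astype("float32")
distances, ids = index.search(query_vector, 2)                 # top-2 nearest chunks

for rank, chunk_id in enumerate(ids[0]):
    print(rank, float(distances[0][rank]), chunks[chunk_id])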

Frontend

  • Streamlit - Interactive web interface
  • Plotly - Data visualization
  • Custom CSS - Modern gradient designs

Infrastructure

  • Python 3.12+ - Modern Python features
  • Pytest - Comprehensive testing
  • Docker - Containerization (optional)
  • Git - Version control

🚀 Quick Start

Prerequisites

  • Python 3.12+
  • Git
  • A Google Gemini API key (set in .env as GEMINI_API_KEY)

Installation

  1. Clone the repository
git clone https://github.com/YOUR_USERNAME/codebase-rag.git
cd codebase-rag
  2. Create a virtual environment
python3 -m venv codebase-rag-env
source codebase-rag-env/bin/activate  # On Windows: codebase-rag-env\Scripts\activate
  3. Install dependencies
pip install -r requirements.txt
  4. Configure API keys (a key-loading sketch follows this list)
# Copy example environment file
cp .env.example .env

# Edit .env and add your Gemini API key
# GEMINI_API_KEY=your_api_key_here
  5. Run the system
# Terminal 1: Start API server
python scripts/run_api.py

# Terminal 2: Start frontend
streamlit run frontend/app.py
  6. Open in browser
Frontend: http://localhost:8501
API Docs: http://localhost:8000/docs
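
For step 4, here is a minimal sketch of how the key is typically read from .env at startup, assuming the common python-dotenv package; the project's actual settings live in config/settings.py and may load it differently.

# Sketch of loading GEMINI_API_KEY from .env with python-dotenv.
# Illustrative only; see config/settings.py for the project's real loading code.
import os

from dotenv import load_dotenv

load_dotenv()                                # reads .env in the current directory
api_key = os.getenv("GEMINI_API_KEY")
if not api_key:
    raise RuntimeError("GEMINI_API_KEY is not set; copy .env.example to .env first")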

๐Ÿ“ Project Structure

codebase-rag/
├── backend/
│   ├── api/              # FastAPI REST endpoints
│   │   ├── main.py       # Main API application
│   │   └── models.py     # Pydantic models
│   ├── ingestion/        # Repository loading & processing
│   │   ├── github_loader.py
│   │   └── document_loader.py
│   ├── parsing/          # Code parsing & chunking
│   │   ├── chunker.py
│   │   └── language_detector.py
│   ├── retrieval/        # Vector search & embeddings
│   │   ├── embeddings.py
│   │   ├── vector_store.py
│   │   ├── indexer.py
│   │   └── search.py
│   └── llm/              # LLM integration
│       ├── llm_client.py
│       ├── rag_pipeline.py
│       └── query_constructor.py
├── frontend/             # Streamlit UI
│   └── app.py
├── tests/                # Unit & integration tests
│   ├── test_*.py
│   └── conftest.py
├── data/                 # Data storage
│   └── vector_store/     # FAISS indexes
├── config/               # Configuration
│   └── settings.py
├── scripts/              # Utility scripts
│   └── run_api.py
├── .env.example          # Environment template
├── requirements.txt      # Python dependencies
└── README.md             # This file

๐Ÿ—๏ธ Architecture

System Design

┌─────────────┐
│   Frontend  │ (Streamlit)
│  localhost  │
│    :8501    │
└──────┬──────┘
       │ HTTP Requests
       ▼
┌─────────────┐
│  FastAPI    │ (REST API)
│   Server    │
│  localhost  │
│    :8000    │
└──────┬──────┘
       │
       ├──► 🔍 Query Pipeline
       │    ├─► Vector Search (FAISS)
       │    ├─► Context Retrieval
       │    └─► LLM Generation (Gemini)
       │
       ├──► 📥 Ingestion Pipeline
       │    ├─► Code Loading
       │    ├─► Parsing & Chunking
       │    └─► Vector Indexing
       │
       └──► 💾 Data Layer
            └─► FAISS Vector Store

RAG Pipeline Flow

  1. User Query → Natural language question
  2. Query Enhancement → Expand and optimize query
  3. Vector Search → Find relevant code chunks (FAISS)
  4. Context Building → Assemble relevant code snippets
  5. LLM Generation → Gemini generates contextual answer
  6. Response → AI-powered explanation with sources
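
The sketch below traces these six steps in plain Python. Every helper is a simplified stand-in for the real modules (backend/retrieval/ and backend/llm/), and the Gemini call is stubbed out so the flow can be read and run without an API key; the top_k / top_n defaults mirror the Configuration section.

# Structural sketch of the RAG flow above; all helpers are stand-ins.
from typing import List

def enhance_query(query: str) -> str:
    # Step 2: expand/optimize the query (real logic: backend/llm/query_constructor.py)
    return query + " implementation code"

def vector_search(query: str, top_k: int = 20) -> List[str]:
    # Step 3: FAISS similarity search (real logic: backend/retrieval/search.py)
    return ["# chunk from auth/routes.py", "# chunk from auth/tokens.py"][:top_k]

def build_context(chunks: List[str], top_n: int = 5) -> str:
    # Step 4: keep the top-N chunks and join them into one prompt context
    return "\n\n".join(chunks[:top_n])

def generate_answer(query: str, context: str) -> str:
    # Step 5: this is where the Gemini 2.5 Flash call would go
    return f"(answer to {query!r} grounded in {len(context.splitlines())} context lines)"

def answer(query: str) -> str:
    enhanced = enhance_query(query)          # Step 2
    chunks = vector_search(enhanced)         # Step 3
    context = build_context(chunks)          # Step 4
    return generate_answer(query, context)   # Steps 5-6

print(answer("How does authentication work?"))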

💡 Usage Examples

1. Index a Repository

curl -X POST http://localhost:8000/ingest \
  -H "Content-Type: application/json" \
  -d '{
    "repo_url": "https://github.com/username/repo",
    "branch": "main"
  }'

2. Query Your Codebase

curl -X POST http://localhost:8000/query \
  -H "Content-Type: application/json" \
  -d '{
    "query": "How does authentication work?",
    "language": "python"
  }'

3. Explain Code Snippet

curl -X POST http://localhost:8000/explain \
  -H "Content-Type: application/json" \
  -d '{
    "code": "def fibonacci(n): return n if n < 2 else fibonacci(n-1) + fibonacci(n-2)",
    "language": "python"
  }'
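
The same /query call from Python, assuming only the endpoint and request body shown above plus the standard requests package; the response is printed as raw JSON because its exact schema is defined in backend/api/models.py and not reproduced here.

# Python equivalent of the curl example above, using the requests package.
import requests

payload = {
    "query": "How does authentication work?",
    "language": "python",
}
response = requests.post("http://localhost:8000/query", json=payload, timeout=60)
response.raise_for_status()
print(response.json())   # exact response fields depend on backend/api/models.py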

🧪 Testing

# Run all tests
pytest

# Run with coverage
pytest --cov=backend --cov-report=html

# Run specific test file
pytest tests/test_vector_store.py

# View coverage report
open htmlcov/index.html

Current Test Results:

  • ✅ 21/21 tests passing
  • 📊 45% code coverage
  • ⚡ Fast test execution
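
As a hedged illustration of what an API-level test could look like, the sketch below drives the /query route with FastAPI's TestClient. It assumes backend/api/main.py exposes a FastAPI instance named app (an assumption, not confirmed here) and that an index is already loaded, so treat it as a template rather than one of the existing 21 tests.

# Hypothetical API-level test; assumes backend.api.main exports `app`
# and that a vector index is available when the test runs.
from fastapi.testclient import TestClient

from backend.api.main import app  # assumption: main.py defines `app = FastAPI()`

client = TestClient(app)

def test_query_endpoint_returns_ok():
    resp = client.post(
        "/query",
        json={"query": "How does Flask routing work?", "language": "python"},
    )
    assert resp.status_code == 200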

🔧 Configuration

Key configuration options in config/settings.py:

# Vector Store
CHUNK_SIZE = 512              # Code chunk size
CHUNK_OVERLAP = 50            # Overlap between chunks
VECTOR_DIMENSION = 384        # Embedding dimension

# LLM
GEMINI_MODEL = "gemini-2.5-flash"
MAX_TOKENS = 2048             # Max response tokens
TEMPERATURE = 0.3             # Response creativity

# Retrieval
TOP_K = 20                    # Initial retrieval count
TOP_N = 5                     # Final results to use
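
To make CHUNK_SIZE and CHUNK_OVERLAP concrete, the toy function below slices text into fixed-size overlapping windows: each chunk starts CHUNK_SIZE - CHUNK_OVERLAP characters after the previous one, so neighbouring chunks share CHUNK_OVERLAP characters of context. The project's real chunker (backend/parsing/chunker.py) is Tree-sitter-aware and works on code structure; this only shows how the two numbers interact.

# Toy illustration of CHUNK_SIZE / CHUNK_OVERLAP (not the project's chunker).
def chunk_text(text: str, size: int = 512, overlap: int = 50):
    step = size - overlap                    # how far each new chunk advances
    return [text[i:i + size] for i in range(0, max(len(text) - overlap, 1), step)]

code = "def handler(request):\n    ...\n" * 100
chunks = chunk_text(code, size=512, overlap=50)
print(len(chunks), "chunks;", "overlap shared:", chunks[0][-50:] == chunks[1][:50])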

📈 Performance Metrics

Metric              | Value
--------------------|-----------
Indexed Vectors     | 4,364
Query Time          | ~11ms avg
Index Load Time     | <2s
Embedding Dimension | 384
Test Coverage       | 45%
Tests Passing       | 21/21 ✅

๐Ÿ—บ๏ธ Roadmap

Phase 1: Core Features ✅ (Completed)

  • Vector-based code search
  • Natural language queries
  • AI-powered explanations
  • Modern web interface
  • Real-time indexing

Phase 2: Enhancements 🚧 (In Progress)

  • Multi-repository support
  • Code generation capabilities
  • Team collaboration features
  • GitHub integration
  • VSCode extension

Phase 3: Advanced Features 🔮 (Planned)

  • Architecture visualization
  • Code quality analysis
  • Automated documentation
  • CI/CD integration
  • Enterprise features

๐Ÿค Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

  1. Fork the repository
  2. Create your feature branch (git checkout -b feature/AmazingFeature)
  3. Commit your changes (git commit -m 'Add some AmazingFeature')
  4. Push to the branch (git push origin feature/AmazingFeature)
  5. Open a Pull Request

๐Ÿ“ License

This project is licensed under the MIT License - see the LICENSE file for details.


๐Ÿ™ Acknowledgments

  • Google Gemini - AI language model
  • FAISS - Vector similarity search
  • FastAPI - Modern Python web framework
  • Streamlit - Interactive UI framework
  • Tree-sitter - Code parsing library

📧 Contact

Project Link: https://github.com/Lohith625/codebase-rag


โญ Star this repo if you find it useful!

Made with โค๏ธ and ๐Ÿค– by [Lohith m]
