🔍 Active Visual Search with DINOv3 + Reinforcement Learning

Learning efficient visual search strategies using foundation models and deep RL for robotics applications.

Python 3.8+ · PyTorch · License: MIT


🎯 Project Overview

This project implements an active visual search system where an RL agent learns to efficiently find target objects in large images using minimal observations. The agent controls a limited viewport (like a robot camera) and must intelligently navigate to locate specific objects.

Key Technologies:

  • DINOv3 (Meta AI): State-of-the-art vision foundation model for semantic feature extraction
  • Deep Q-Network (DQN): Value-based deep RL algorithm that learns the search policy
  • CIFAR-10: Dataset for training and evaluation

Real-world Applications:

  • 🤖 Robot visual search (warehouse automation, inspection)
  • 🚁 Drone object detection with limited FOV
  • 📱 Mobile robot navigation with active perception
  • 🏭 Industrial quality inspection

πŸ“ Project Structure

active-visual-search/
├── README.md                           # This file
├── PROJECT_PLAN.md                     # Detailed PRD and roadmap
├── ARCHITECTURE.md                     # Technical architecture
├── PROGRESS.md                         # Session tracking
├── requirements.txt                    # Python dependencies
│
├── config/
│   └── default_config.yaml             # Configuration parameters
│
├── src/
│   ├── environment.py                  # Search environment (Gym-like)
│   ├── models/
│   │   ├── dinov3_encoder.py           # DINOv3 wrapper
│   │   ├── dqn_agent.py                # DQN agent implementation
│   │   └── policy_network.py           # Q-Network architectures
│   ├── training/
│   │   ├── trainer.py                  # Training loop (TODO)
│   │   └── replay_buffer.py            # Experience replay (in dqn_agent.py)
│   ├── evaluation/
│   │   └── evaluator.py                # Evaluation metrics (TODO)
│   └── utils/
│       ├── data_loader.py              # CIFAR-10 loading (TODO)
│       └── visualization.py            # Visualization tools (TODO)
│
├── notebooks/
│   ├── 01_demo_environment.ipynb       # Interactive environment demo
│   ├── 02_visualize_dinov3.ipynb       # DINOv3 features (TODO)
│   └── 03_results_analysis.ipynb       # Training results (TODO)
│
├── experiments/
│   └── logs/                           # TensorBoard logs
│
└── checkpoints/                        # Saved models

🚀 Quick Start

1. Installation

Requirements:

  • Python 3.8+
  • CUDA-capable GPU (recommended: RTX 4090, RTX 3090, etc.)
  • 8GB+ GPU VRAM
# Clone repository
git clone <repository-url>
cd comvis2

# Install dependencies
pip install -r requirements.txt

# Verify installation
python -c "import torch; print(f'PyTorch: {torch.__version__}'); print(f'CUDA: {torch.cuda.is_available()}')"

2. Test Components

# Test DINOv3 encoder
python src/models/dinov3_encoder.py

# Test Q-Network
python src/models/policy_network.py

# Test environment (coming soon)
python src/environment.py

3. Run Demo Notebook

jupyter notebook notebooks/01_demo_environment.ipynb

4. Train Agent (Coming Soon)

python src/training/trainer.py --config config/default_config.yaml

📊 How It Works

Problem Setup

┌──────────────────────────────────────┐
│     512x512 Canvas (Search Space)    │
│                                      │
│    🚗  ✈️     🐱                     │
│                                      │
│           🦌     🐦                  │
│                                      │
│  ┌────────┐                          │
│  │ 64x64  │  ← Agent's viewport      │
│  │ Window │                          │
│  └────────┘                          │
└──────────────────────────────────────┘

Agent's Task: Find the target object (e.g., "airplane" ✈️) in minimum steps.

Actions: Up, Down, Left, Right, "Found!"

Observations: 64x64 RGB viewport + position + target class
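
The loop this implies is a Gym-style step/observe cycle. Below is a minimal sketch of it; the class name, 32-pixel move step, and reward values are illustrative assumptions, and the repo's src/environment.py may differ:

import numpy as np

class SearchEnv:
    """Illustrative Gym-like sketch of the search task described above."""
    ACTIONS = ("up", "down", "left", "right", "found")
    MOVES = {"up": (0, -32), "down": (0, 32), "left": (-32, 0), "right": (32, 0)}

    def __init__(self, canvas, target_xy, target_class, window=64, max_steps=50):
        self.canvas = canvas                  # (512, 512, 3) uint8 search space
        self.target_xy = target_xy            # ground-truth (x, y) of the target
        self.target_class = target_class      # CIFAR-10 class index to find
        self.window, self.max_steps = window, max_steps
        self.x = self.y = self.t = 0

    def step(self, action_idx):
        name = self.ACTIONS[action_idx]
        if name == "found":                   # agent declares the target located
            reward, done = float(self._target_in_view()), True
        else:                                 # move viewport, clamped to bounds
            dx, dy = self.MOVES[name]
            hi = self.canvas.shape[0] - self.window
            self.x = int(np.clip(self.x + dx, 0, hi))
            self.y = int(np.clip(self.y + dy, 0, hi))
            self.t += 1
            reward, done = -0.01, self.t >= self.max_steps  # small step penalty
        view = self.canvas[self.y:self.y + self.window,
                           self.x:self.x + self.window]
        obs = {"view": view, "pos": (self.x, self.y), "target": self.target_class}
        return obs, reward, done, {}

    def _target_in_view(self):
        tx, ty = self.target_xy
        return (self.x <= tx < self.x + self.window
                and self.y <= ty < self.y + self.window)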

Architecture

Current Viewport (64x64x3)
    ↓
DINOv3 Encoder (frozen)
    ↓
Features (384-dim)  +  Position (2D)  +  Target Embedding (384-dim)
    ↓
Q-Network (MLP)
    ↓
Q-values for 5 actions
    ↓
Action Selection (ε-greedy)
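
A PyTorch sketch of the last two stages of this pipeline. The 770-dim input (384 + 2 + 384) and the 5 outputs follow the diagram; the hidden-layer sizes and the names QNetwork / select_action are illustrative assumptions, not necessarily what policy_network.py uses:

import torch
import torch.nn as nn

class QNetwork(nn.Module):
    def __init__(self, feat_dim=384, hidden=256, n_actions=5):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(feat_dim + 2 + feat_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, n_actions),
        )

    def forward(self, view_feat, pos, target_emb):
        # concatenate viewport features, (x, y) position, and target embedding
        return self.mlp(torch.cat([view_feat, pos, target_emb], dim=-1))

def select_action(q_values, epsilon):
    # ε-greedy: random action with probability ε, otherwise greedy argmax
    if torch.rand(()) < epsilon:
        return torch.randint(q_values.shape[-1], ()).item()
    return q_values.argmax(-1).item()

Because the DINOv3 encoder stays frozen, the MLP is the only trainable component, which keeps the parameter count small and training fast.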

🎓 Key Concepts

1. Active Perception

Instead of processing entire high-resolution images (expensive), the agent learns to:

  • Move its viewport intelligently
  • Focus on promising regions
  • Minimize search time (see the back-of-envelope comparison below)
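
To make "expensive" concrete, compare the pixels processed by an active searcher against one full-resolution pass, using the project's own 15-step target from the Expected Results section:

full = 512 * 512            # pixels in the whole canvas
crop = 64 * 64              # pixels per viewport observation
steps = 15                  # target average episode length
print(crop * steps / full)  # ≈ 0.23: the agent sees under 1/4 of the canvas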

2. DINOv3 Features

  • Pretrained self-supervised on ~1.7B curated images (LVD-1689M)
  • Captures rich semantic information
  • No fine-tuning needed (transfer learning)
  • Enables generalization to new objects (see the loading sketch below)
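
A hedged sketch of loading a frozen DINOv3 backbone through Hugging Face transformers. The checkpoint id below is an assumption for illustration; the repo's dinov3_encoder.py may load a different variant or use a different source:

import torch
from transformers import AutoImageProcessor, AutoModel

CKPT = "facebook/dinov3-vits16-pretrain-lvd1689m"  # assumed checkpoint id

processor = AutoImageProcessor.from_pretrained(CKPT)
encoder = AutoModel.from_pretrained(CKPT).eval()
for p in encoder.parameters():        # frozen: no fine-tuning needed
    p.requires_grad = False

@torch.no_grad()
def encode(images):
    """images: list of HxWx3 uint8 arrays -> (N, 384) CLS-token features."""
    inputs = processor(images=images, return_tensors="pt")
    hidden = encoder(**inputs).last_hidden_state
    return hidden[:, 0]               # CLS token, 384-dim for the ViT-S variant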

3. Deep Q-Learning

  • Learns state-action value function Q(s, a)
  • Experience replay for stable training
  • Target network to decorrelate bootstrapped targets from the online network
  • Epsilon-greedy exploration (see the update sketch below)
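
The update rule these bullets describe, as a minimal sketch of the standard DQN algorithm (not necessarily the exact code in dqn_agent.py). Transitions are stored as tensors: (state, action int64, reward float, next_state, done bool):

import random
from collections import deque
import torch
import torch.nn.functional as F

replay_buffer = deque(maxlen=100_000)   # experience replay

def dqn_update(q_net, target_net, optimizer, batch_size=64, gamma=0.95):
    batch = random.sample(replay_buffer, batch_size)
    s, a, r, s2, done = map(torch.stack, zip(*batch))
    q = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)        # Q(s, a)
    with torch.no_grad():  # bootstrap from the slowly-updated target network
        target = r + gamma * target_net(s2).max(1).values * (1 - done.float())
    loss = F.smooth_l1_loss(q, target)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

The target network is a periodically synced copy of the online network, e.g. target_net.load_state_dict(q_net.state_dict()) every few hundred updates.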

📈 Expected Results

Success Metrics:

  • ✅ Success Rate: 70%+ (vs ~20% random)
  • ✅ Average Steps: <15 (vs ~35 random)
  • ✅ Training Time: <1 hour on RTX 4090
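
A sketch of how the first two metrics could be measured. The reset/step/act interface mirrors the environment sketch above and is an assumption, not the repo's exact API:

def evaluate(env, agent, episodes=100, max_steps=50):
    """Roll out greedy episodes; return (success rate, mean steps to success)."""
    successes, steps = 0, []
    for _ in range(episodes):
        obs, done, t = env.reset(), False, 0
        while not done and t < max_steps:
            obs, reward, done, _ = env.step(agent.act(obs, epsilon=0.0))
            t += 1
        if done and reward > 0:        # "found" declared with target in view
            successes += 1
            steps.append(t)
    return successes / episodes, (sum(steps) / len(steps)) if steps else None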

Visualizations:

  • Search trajectory animations
  • Attention heatmaps
  • Success rate curves
  • Search strategy analysis

🛠️ Configuration

Edit config/default_config.yaml to customize:

environment:
  canvas_size: [512, 512]   # Search space
  window_size: [64, 64]     # Viewport size
  max_steps: 50             # Episode timeout

agent:
  learning_rate: 0.0001
  gamma: 0.95
  epsilon_decay: 0.995

training:
  num_episodes: 1000
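
Loading these parameters is a one-liner with PyYAML (assumed to be among the dependencies in requirements.txt); the keys below are exactly those shown above:

import yaml

with open("config/default_config.yaml") as f:
    cfg = yaml.safe_load(f)

canvas_h, canvas_w = cfg["environment"]["canvas_size"]
window_h, window_w = cfg["environment"]["window_size"]
lr = cfg["agent"]["learning_rate"]
print(f"canvas {canvas_h}x{canvas_w}, window {window_h}x{window_w}, lr={lr}")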

πŸ“ Development Roadmap

See PROGRESS.md for detailed session-by-session tracking.

Phase 1: MVP ✅ (Current)

  • Project documentation
  • Environment implementation
  • DINOv3 encoder
  • DQN agent
  • Training loop
  • Demo notebook

Phase 2: Training & Evaluation

  • Complete training pipeline
  • Evaluation metrics
  • Visualization tools
  • First successful training run

Phase 3: Analysis & Optimization

  • Hyperparameter tuning
  • Advanced visualizations
  • Performance analysis
  • Documentation & demos

Future Enhancements

  • Multi-scale search (zoom actions)
  • Continuous control (SAC/TD3)
  • Real robot datasets
  • Multi-object search
  • Text-guided search

📚 Documentation

  • PROJECT_PLAN.md: Comprehensive PRD with requirements, timeline, and success criteria
  • ARCHITECTURE.md: Technical deep dive into system design
  • PROGRESS.md: Session-by-session progress tracking for continuity

🤝 Contributing

This is a research/portfolio project. Suggestions and improvements welcome!


📄 License

MIT License - feel free to use for learning and research.


🙏 Acknowledgments

  • Meta AI for the DINOv3 foundation model
  • OpenAI for the Gym interface inspiration
  • The PyTorch team for an excellent deep learning framework

📧 Contact

For questions or collaboration: [Your contact info]


Status: 🚧 In Development (Phase 1 - MVP)

Last Updated: 2025-11-06
