Learning efficient visual search strategies using foundation models and deep RL for robotics applications.
This project implements an active visual search system where an RL agent learns to efficiently find target objects in large images using minimal observations. The agent controls a limited viewport (like a robot camera) and must intelligently navigate to locate specific objects.
Key Technologies:
- DINOv3 (Meta AI): State-of-the-art vision foundation model for semantic feature extraction
- Deep Q-Network (DQN): Reinforcement learning for learning optimal search policies
- CIFAR-10: Dataset for training and evaluation
Real-world Applications:
- Robot visual search (warehouse automation, inspection)
- Drone object detection with limited FOV
- Mobile robot navigation with active perception
- Industrial quality inspection
```
active-visual-search/
├── README.md                      # This file
├── PROJECT_PLAN.md                # Detailed PRD and roadmap
├── ARCHITECTURE.md                # Technical architecture
├── PROGRESS.md                    # Session tracking
├── requirements.txt               # Python dependencies
│
├── config/
│   └── default_config.yaml        # Configuration parameters
│
├── src/
│   ├── environment.py             # Search environment (Gym-like)
│   ├── models/
│   │   ├── dinov3_encoder.py      # DINOv3 wrapper
│   │   ├── dqn_agent.py           # DQN agent implementation
│   │   └── policy_network.py      # Q-Network architectures
│   ├── training/
│   │   ├── trainer.py             # Training loop (TODO)
│   │   └── replay_buffer.py       # Experience replay (in dqn_agent.py)
│   ├── evaluation/
│   │   └── evaluator.py           # Evaluation metrics (TODO)
│   └── utils/
│       ├── data_loader.py         # CIFAR-10 loading (TODO)
│       └── visualization.py       # Visualization tools (TODO)
│
├── notebooks/
│   ├── 01_demo_environment.ipynb  # Interactive environment demo
│   ├── 02_visualize_dinov3.ipynb  # DINOv3 features (TODO)
│   └── 03_results_analysis.ipynb  # Training results (TODO)
│
├── experiments/
│   └── logs/                      # TensorBoard logs
│
└── checkpoints/                   # Saved models
```
Requirements:
- Python 3.8+
- CUDA-capable GPU (recommended: RTX 4090, RTX 3090, etc.)
- 8GB+ GPU VRAM
```bash
# Clone repository
git clone <repository-url>
cd comvis2

# Install dependencies
pip install -r requirements.txt

# Verify installation
python -c "import torch; print(f'PyTorch: {torch.__version__}'); print(f'CUDA: {torch.cuda.is_available()}')"

# Test DINOv3 encoder
python src/models/dinov3_encoder.py

# Test Q-Network
python src/models/policy_network.py

# Test environment (coming soon)
python src/environment.py

# Run the interactive demo
jupyter notebook notebooks/01_demo_environment.ipynb

# Launch training
python src/training/trainer.py --config config/default_config.yaml
```

```
┌──────────────────────────────────────┐
│   512x512 Canvas (Search Space)      │
│                                      │
│    [bird]    [plane]    [cat]        │
│                                      │
│       [truck]      [ship]            │
│                                      │
│   ┌────────┐                         │
│   │ 64x64  │  ← Agent's viewport     │
│   │ Window │                         │
│   └────────┘                         │
└──────────────────────────────────────┘
```
Agent's Task: Find the target object (e.g., "airplane")
Actions: Up, Down, Left, Right, "Found!"
Observations: 64x64 RGB viewport + position + target class
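The environment contract above can be sketched as a minimal Gym-like class. The names (`SearchEnv`, the 32-pixel movement stride, the step penalty) are illustrative assumptions, not the project's actual API; the real implementation lives in `src/environment.py`.

```python
# Minimal sketch of the Gym-like search environment: a 64x64 viewport
# moves over a 512x512 canvas; "found" ends the episode with a reward
# that depends on whether the target lies inside the viewport.
CANVAS, WINDOW, STRIDE = 512, 64, 32   # stride is an assumed value
ACTIONS = ["up", "down", "left", "right", "found"]

class SearchEnv:
    def __init__(self, target_pos, max_steps=50):
        self.target_pos = target_pos          # (x, y) of the target object
        self.max_steps = max_steps

    def reset(self):
        self.pos = [CANVAS // 2, CANVAS // 2]  # viewport top-left corner
        self.steps = 0
        return tuple(self.pos)

    def _target_visible(self):
        tx, ty = self.target_pos
        x, y = self.pos
        return x <= tx < x + WINDOW and y <= ty < y + WINDOW

    def step(self, action):
        self.steps += 1
        dx = {"left": -STRIDE, "right": STRIDE}.get(action, 0)
        dy = {"up": -STRIDE, "down": STRIDE}.get(action, 0)
        # Clamp the viewport so it stays inside the canvas.
        self.pos[0] = min(max(self.pos[0] + dx, 0), CANVAS - WINDOW)
        self.pos[1] = min(max(self.pos[1] + dy, 0), CANVAS - WINDOW)
        if action == "found":
            reward = 1.0 if self._target_visible() else -1.0
            return tuple(self.pos), reward, True
        done = self.steps >= self.max_steps   # episode timeout
        return tuple(self.pos), -0.01, done   # small per-step penalty
```

The small negative reward per step is what pushes the agent toward short search trajectories.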
```
Current Viewport (64x64x3)
        ↓
DINOv3 Encoder (frozen)
        ↓
Features (384-dim) + Position (2D) + Target Embedding (384-dim)
        ↓
Q-Network (MLP)
        ↓
Q-values for 5 actions
        ↓
Action Selection (ε-greedy)
```
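The pipeline above can be sketched numerically. Here random vectors stand in for the frozen DINOv3 features, and the Q-network is a tiny two-layer MLP; everything except the 770-dim input (384 + 2 + 384) and the 5 action outputs is an assumed placeholder, not the project's actual architecture.

```python
# Sketch of the state -> Q-values -> epsilon-greedy action pipeline.
# The 128-unit hidden layer and random weights are illustrative only.
import numpy as np

rng = np.random.default_rng(0)

def build_state(view_feat, pos_xy, target_emb):
    """Concatenate 384-d view features, 2-d position, 384-d target embedding."""
    return np.concatenate([view_feat, pos_xy, target_emb])  # -> 770-dim

# Tiny MLP Q-network: 770 -> 128 -> 5 actions.
W1, b1 = rng.normal(size=(770, 128)) * 0.05, np.zeros(128)
W2, b2 = rng.normal(size=(128, 5)) * 0.05, np.zeros(5)

def q_values(state):
    h = np.maximum(state @ W1 + b1, 0.0)     # ReLU hidden layer
    return h @ W2 + b2                       # one Q-value per action

def select_action(state, epsilon):
    if rng.random() < epsilon:
        return int(rng.integers(5))          # explore: random action
    return int(np.argmax(q_values(state)))   # exploit: greedy action

state = build_state(rng.normal(size=384), np.zeros(2), rng.normal(size=384))
action = select_action(state, epsilon=0.1)
```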
Instead of processing entire high-resolution images (expensive), the agent learns to:
- Move its viewport intelligently
- Focus on promising regions
- Minimize search time
- Pretrained on 142M images (self-supervised)
- Captures rich semantic information
- No fine-tuning needed (transfer learning)
- Enables generalization to new objects
- Learns state-action value function Q(s, a)
- Experience replay for stable training
- Target network to reduce correlation
- Epsilon-greedy exploration
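Two of the stabilizers listed above can be sketched concretely: a replay buffer that breaks temporal correlation by sampling random minibatches, and a target network that is refreshed only periodically. The `Transition` fields, buffer capacity, and sync period are illustrative assumptions.

```python
# Sketch of experience replay plus periodic target-network sync.
import random
from collections import deque, namedtuple

Transition = namedtuple("Transition", "state action reward next_state done")

class ReplayBuffer:
    def __init__(self, capacity=10_000):
        self.buf = deque(maxlen=capacity)    # oldest transitions fall off

    def push(self, *args):
        self.buf.append(Transition(*args))

    def sample(self, batch_size):
        # Uniform random minibatch -> decorrelated training samples.
        return random.sample(self.buf, batch_size)

    def __len__(self):
        return len(self.buf)

buffer = ReplayBuffer()
for t in range(100):
    buffer.push(t, t % 5, -0.01, t + 1, False)

# Target network as a periodically copied parameter set (hard sync).
online_params = {"step": 0}
target_params = dict(online_params)          # frozen copy for TD targets
sync_every = 50                              # assumed sync period
for update in range(1, 120):
    online_params["step"] = update
    if update % sync_every == 0:
        target_params = dict(online_params)  # periodic hard update
```

Keeping the TD targets fixed between syncs is what reduces the correlation between the predicted and target Q-values during training.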
Success Metrics:
- Success Rate: 70%+ (vs ~20% random)
- Average Steps: <15 (vs ~35 random)
- Training Time: <1 hour on RTX 4090
Visualizations:
- Search trajectory animations
- Attention heatmaps
- Success rate curves
- Search strategy analysis
Edit config/default_config.yaml to customize:

```yaml
environment:
  canvas_size: [512, 512]   # Search space
  window_size: [64, 64]     # Viewport size
  max_steps: 50             # Episode timeout

agent:
  learning_rate: 0.0001
  gamma: 0.95
  epsilon_decay: 0.995

training:
  num_episodes: 1000
```

See PROGRESS.md for detailed session-by-session tracking.
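The `epsilon_decay: 0.995` setting implies exploration anneals multiplicatively per episode. Assuming epsilon starts at 1.0 and is floored at 0.05 (both values are assumptions not present in the config), the schedule looks like this:

```python
# Exploration schedule implied by epsilon_decay: 0.995, assuming an
# initial epsilon of 1.0 and an assumed floor of 0.05.
eps, eps_min, decay = 1.0, 0.05, 0.995

schedule = []
for episode in range(1000):          # num_episodes: 1000
    schedule.append(eps)
    eps = max(eps_min, eps * decay)  # multiplicative decay with a floor

# Epsilon reaches the floor around episode 598, since 0.995**598 < 0.05,
# so roughly the last 40% of training is mostly exploitation.
```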
- Project documentation
- Environment implementation
- DINOv3 encoder
- DQN agent
- Training loop
- Demo notebook
- Complete training pipeline
- Evaluation metrics
- Visualization tools
- First successful training run
- Hyperparameter tuning
- Advanced visualizations
- Performance analysis
- Documentation & demos
- Multi-scale search (zoom actions)
- Continuous control (SAC/TD3)
- Real robot datasets
- Multi-object search
- Text-guided search
- PROJECT_PLAN.md: Comprehensive PRD with requirements, timeline, and success criteria
- ARCHITECTURE.md: Technical deep dive into system design
- PROGRESS.md: Session-by-session progress tracking for continuity
This is a research/portfolio project. Suggestions and improvements welcome!
MIT License - feel free to use for learning and research.
- Meta AI for DINOv3 foundation model
- OpenAI for Gym interface inspiration
- PyTorch team for excellent deep learning framework
For questions or collaboration: [Your contact info]
Status: In Development (Phase 1 - MVP)
Last Updated: 2025-11-06