feat(p2p): add Kubernetes Lease-based P2P coordination #478

Open
slin1237 wants to merge 6 commits into main from feature/p2p-model-distribution-n/6

@slin1237
Collaborator

Implement P2PLeaseManager for coordinating model downloads:

  • Lease acquisition with expired lease takeover
  • Lease renewal for long-running downloads
  • Complete/release lifecycle management
  • Ensures only one node downloads from HuggingFace

Tests cover:

  • Lease acquisition (new, existing, expired)
  • Lease expiration detection
  • Lease name generation with hash truncation
  • Renewal and holder verification

Add BitTorrent library dependency (anacrolix/torrent) and define
constants for P2P model distribution:
- Lease coordination constants (prefix, labels, durations)
- Default configuration values (ports, rates, timeouts)
- Environment variable keys for P2P configuration

Introduce the P2P distributor package with:
- Config struct with validation and defaults
- ConfigFromEnv for environment-based configuration
- Comprehensive Prometheus metrics for P2P operations
  - Download metrics (total, duration, failures)
  - Peer discovery and connection metrics
  - Lease and seeding metrics
  - Metainfo server metrics
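A config struct with defaults, validation, and environment loading might look roughly like the sketch below. Only `LeaseDurationSeconds` and `ConfigFromEnv` are names taken from this PR; the other fields, the `P2P_*` variable names, the default port 6881, and the `getenv` parameter are hypothetical and exist only to keep the example self-contained.

```go
package main

import (
	"errors"
	"fmt"
	"os"
	"strconv"
)

// Config is a cut-down, hypothetical version of a P2P distributor config.
type Config struct {
	Namespace            string
	TorrentPort          int
	LeaseDurationSeconds int
}

// WithDefaults fills optional fields. LeaseDurationSeconds is treated as
// required here, so it gets no default.
func (c Config) WithDefaults() Config {
	if c.TorrentPort == 0 {
		c.TorrentPort = 6881 // assumed default
	}
	return c
}

// Validate rejects missing required fields and out-of-range ports.
func (c Config) Validate() error {
	if c.Namespace == "" {
		return errors.New("namespace is required")
	}
	if c.TorrentPort < 1 || c.TorrentPort > 65535 {
		return fmt.Errorf("invalid torrent port %d", c.TorrentPort)
	}
	if c.LeaseDurationSeconds <= 0 {
		return errors.New("lease duration must be positive")
	}
	return nil
}

// intFromEnv parses an optional integer variable; an unset variable yields 0.
func intFromEnv(getenv func(string) string, key string) (int, error) {
	v := getenv(key)
	if v == "" {
		return 0, nil
	}
	n, err := strconv.Atoi(v)
	if err != nil {
		return 0, fmt.Errorf("%s: %w", key, err)
	}
	return n, nil
}

// ConfigFromEnv takes the lookup function as a parameter so tests can pass a map.
func ConfigFromEnv(getenv func(string) string) (Config, error) {
	port, err := intFromEnv(getenv, "P2P_TORRENT_PORT")
	if err != nil {
		return Config{}, err
	}
	dur, err := intFromEnv(getenv, "P2P_LEASE_DURATION_SECONDS")
	if err != nil {
		return Config{}, err
	}
	c := Config{
		Namespace:            getenv("P2P_NAMESPACE"),
		TorrentPort:          port,
		LeaseDurationSeconds: dur,
	}.WithDefaults()
	return c, c.Validate()
}

func main() {
	cfg, err := ConfigFromEnv(os.Getenv)
	fmt.Println(cfg, err)
}
```

Applying defaults before validation means the validator only has to reason about the final values.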

Implement ModelDistributor for P2P model distribution:
- BitTorrent client management with rate limiting
- Peer discovery via Kubernetes headless service DNS
- Metainfo fetching from peers for torrent coordination
- Model seeding and download operations
- Active torrent tracking with proper cleanup

Fix API compatibility with anacrolix/torrent v1.57.1:
- Use bencode.Marshal instead of info.MarshalBencode
- Use t.Complete().Bool() instead of t.Complete.Bool
- Handle PeerRemoteAddr type assertion for peer addresses

@github-actions bot added the model-agent (Model agent changes), tests (Test changes), and dependencies (Dependency updates) labels on Dec 31, 2025
@slin1237 force-pushed the feature/p2p-model-distribution-n/6 branch 2 times, most recently from b66b6c6 to 167ce39, on December 31, 2025 05:01

Implement MetainfoServer to enable peers to discover available models:
- GET /metainfo/{modelHash} - serve torrent metainfo
- GET /health - health check endpoint
- GET /stats - P2P distribution statistics
- GET /models - list available models with seeding status
- Graceful shutdown support

Fix API compatibility with anacrolix/torrent v1.57.1:
- Add exists() helper function
- Use bencode.Marshal instead of info.MarshalBencode

Add comprehensive tests for the distributor package:
- Config validation tests (valid, missing fields, invalid ports)
- ConfigWithDefaults tests
- Metrics recording tests
- Stats struct tests
- Test helper functions for integration tests

Fix test config to include required LeaseDurationSeconds field.

Implement P2PLeaseManager for coordinating model downloads:
- Lease acquisition with expired lease takeover
- Lease renewal for long-running downloads
- Complete/release lifecycle management
- Ensures only one node downloads from HuggingFace

Tests cover:
- Lease acquisition (new, existing, expired)
- Lease expiration detection
- Lease name generation with hash truncation
- Renewal and holder verification
@slin1237 force-pushed the feature/p2p-model-distribution-n/6 branch from 167ce39 to 1469a8d on December 31, 2025 05:48