Fast and extendable dataset sampling #110

Draft
wants to merge 27 commits into
base: main

jlamypoirier
Collaborator

@jlamypoirier jlamypoirier commented Jan 9, 2025

✨ Description

  • Option to shuffle epochs independently, so we can resume with more training samples (epochs) without changing the ordering of the existing ones. Opt-in for backward compatibility; to become the default eventually if we like it.
  • Distributed dataset sampling/preparation: split the task between devices to make it much faster. Should make it roughly num_gpus times faster for the current scheme (excluding blending), but we might lose the benefit if/once we replace blending of dataset shards with concatenation.
  • Trim sampling indices for the last epoch. This reduces disk usage and speeds up writing, especially when num_epochs << 1.
  • Skip build_sample_idx entirely. Instead, we use the cumsum of document sizes to calculate the sample index on the fly. Actual impact on performance TBD, since I don't know how much of the time is spent on this.
  • TODO: skip pre-computation of blending indices. These are deterministic and not too hard to compute on the fly.
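
To illustrate the independent epoch shuffling idea, here is a minimal standalone sketch, not the code in this PR. The names `epoch_permutation` and `sample_order` and the per-epoch seed derivation are hypothetical. The point is that each epoch's order depends only on `(seed, epoch)`, so asking for more epochs appends to the ordering instead of reshuffling it.

```python
import random


def epoch_permutation(num_documents: int, epoch: int, seed: int) -> list[int]:
    """Document order for one epoch, independent of every other epoch."""
    # Hypothetical scheme: derive a distinct integer seed per epoch.
    rng = random.Random(seed * 1_000_003 + epoch)
    order = list(range(num_documents))
    rng.shuffle(order)
    return order


def sample_order(num_documents: int, num_epochs: int, seed: int) -> list[int]:
    """Concatenate the independently shuffled epochs."""
    order: list[int] = []
    for epoch in range(num_epochs):
        order.extend(epoch_permutation(num_documents, epoch, seed))
    return order


# Extending from 2 to 3 epochs leaves the first 2 epochs untouched.
two_epochs = sample_order(10, 2, seed=0)
three_epochs = sample_order(10, 3, seed=0)
assert three_epochs[: len(two_epochs)] == two_epochs
```

By contrast, a single shuffle over all `num_documents * num_epochs` entries would generally produce a different prefix whenever `num_epochs` changes, which is what breaks resuming.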
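
The on-the-fly sample index lookup can be sketched roughly as follows. This is an illustration under simplifying assumptions, not the PR's implementation: samples are fixed-length, non-overlapping windows over the concatenated token stream, and `build_token_cumsum` and `locate_sample` are hypothetical names. Given the cumulative sum of document sizes, a binary search finds the document and offset where a sample starts, so no per-sample index has to be precomputed or stored.

```python
from bisect import bisect_right
from itertools import accumulate


def build_token_cumsum(document_sizes: list[int]) -> list[int]:
    """cumsum[k] is the number of tokens in documents 0..k-1."""
    return [0, *accumulate(document_sizes)]


def locate_sample(
    cumsum: list[int], sample_index: int, sequence_length: int
) -> tuple[int, int]:
    """Return (document index, token offset) where the sample starts."""
    token_start = sample_index * sequence_length
    if sample_index < 0 or token_start + sequence_length > cumsum[-1]:
        raise IndexError(f"sample {sample_index} is out of range")
    # Last document whose first token is at or before token_start.
    # bisect_right also skips over empty documents.
    document = bisect_right(cumsum, token_start) - 1
    return document, token_start - cumsum[document]


cumsum = build_token_cumsum([5, 3, 8])  # 16 tokens in total
print(locate_sample(cumsum, 2, 4))  # sample 2 starts at token 8
```

Each lookup costs O(log num_documents), against O(1) for a precomputed table, so whether this is a net win depends on how much time the precomputation actually takes.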

🔍 Type of change

Select all that apply:

  • 🐛 Bug fix (non-breaking change that addresses a specific issue)
  • 🚀 New feature (non-breaking change that adds functionality)
  • ⚠️ Breaking change (a change that could affect existing functionality)
  • 📈 Performance improvement/optimization (improves speed, memory usage, or efficiency)
  • 🛠️ Code refactor (non-functional changes that improve code readability, structure, etc.)
  • 📦 Dependency bump (updates dependencies, including Dockerfile or package changes)
  • 📝 Documentation change (updates documentation, including new content or typo fixes)
  • 🔧 Infrastructure/Build change (affects build process, CI/CD, or dependencies)

@jlamypoirier jlamypoirier changed the base branch from main to modular_dataset January 9, 2025 00:16
@jlamypoirier jlamypoirier changed the title Dataset improvements Fast and extendable dataset sampling Jan 9, 2025
Base automatically changed from modular_dataset to main January 22, 2025 05:29
@jlamypoirier jlamypoirier changed the base branch from main to modular_dataset January 23, 2025 21:04
@jlamypoirier jlamypoirier changed the base branch from modular_dataset to main January 28, 2025 00:50