Dataset tweaks #118

Merged
merged 4 commits into main from dataset_tweaks
Jan 16, 2025

Conversation

jlamypoirier
Collaborator

@jlamypoirier jlamypoirier commented Jan 16, 2025

✨ Description

A bunch of small changes for datasets, extracted from #104 to reduce its size:

  • Type hints
  • Decouple Data from Run (cache directory)
  • Decouple Dataset from Data
  • Turn SamplingConfig into a simple dataclass, add arguments so datasets don't need to know about Data
  • Extract Legacy data config
  • Add dataset monitor to replace ad-hoc monitoring in blended dataset
  • Turn fim into an independent dataset wrapper (small breaking change w.r.t. the random state; it should not matter much)
  • Dummy dataset now returns different, deterministic samples (breaking but shouldn't matter)
  • Move things around (merge indexed dataset files, create gpt dataset config file, misc)
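
To illustrate two of the items above (a plain-dataclass `SamplingConfig` that carries everything a dataset needs, and a dummy dataset whose samples differ by index but stay deterministic), here is a minimal sketch. It is not the code from this PR: the field names (`num_samples`, `sequence_length`, `seed`, `cache_directory`), the `DummyDataset` constructor and the per-index seeding scheme are all assumptions made for illustration.

```python
import dataclasses
import pathlib
import random
import typing


@dataclasses.dataclass(frozen=True)
class SamplingConfig:
    # Hypothetical fields: the real SamplingConfig may hold different ones.
    # The point is that the dataset receives plain values (including the
    # cache directory) instead of a reference to a Data or Run object.
    num_samples: int
    sequence_length: int
    seed: int
    cache_directory: typing.Optional[pathlib.Path] = None


class DummyDataset:
    """Illustrative dummy dataset: one distinct, reproducible sample per index."""

    def __init__(self, config: SamplingConfig, vocab_size: int):
        self._config = config
        self._vocab_size = vocab_size

    def __len__(self) -> int:
        return self._config.num_samples

    def __getitem__(self, index: int) -> typing.List[int]:
        if not 0 <= index < len(self):
            raise IndexError(index)
        # Derive the generator from (seed, index): samples differ across
        # indices but are identical on every call and every run.
        rng = random.Random(f"{self._config.seed}:{index}")
        return [
            rng.randrange(self._vocab_size)
            for _ in range(self._config.sequence_length)
        ]
```

With this shape, a dataset can be built and tested from a `SamplingConfig` alone, e.g. `DummyDataset(SamplingConfig(num_samples=4, sequence_length=8, seed=1234), vocab_size=1000)[0]`, without constructing any surrounding data or run object.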

🔍 Type of change

Select all that apply:

  • 🐛 Bug fix (non-breaking change that addresses a specific issue)
  • 🚀 New feature (non-breaking change that adds functionality)
  • ⚠️ Breaking change (a change that could affect existing functionality)
  • 📈 Performance improvement/optimization (improves speed, memory usage, or efficiency)
  • 🛠️ Code refactor (non-functional changes that improve code readability, structure, etc.)
  • 📦 Dependency bump (updates dependencies, including Dockerfile or package changes)
  • 📝 Documentation change (updates documentation, including new content or typo fixes)
  • 🔧 Infrastructure/Build change (affects build process, CI/CD, or dependencies)

@jlamypoirier jlamypoirier marked this pull request as ready for review January 16, 2025 22:54
@jlamypoirier jlamypoirier merged commit fbffa0f into main Jan 16, 2025
2 checks passed
@jlamypoirier jlamypoirier deleted the dataset_tweaks branch January 16, 2025 23:17